
Transformer models: What are they, and how do they work?

Emmanuel Ohiri

Transformers have redefined how machines understand and generate human language, and their influence is now extending into other domains like image processing and even protein folding.

Introduced in 2017 by Vaswani et al. in the groundbreaking paper "Attention Is All You Need," the transformer architecture offered a novel way to handle sequential data, outperforming predecessors such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

In this article, we will dive deep into the world of transformer models—what they are, how they work, their advantages, and their far-reaching applications across industries. First, let’s discuss what transformers are.

Accelerate your NLP pipelines with CUDO Compute. We offer the latest NVIDIA GPUs with out-of-the-box CUDA support so you can build and scale your transformer models with ease. Sign up today!

What are transformer models?

"Attention Is All You Need" marked a significant shift in the approach to processing sequential data and is now regarded as one of the most influential papers in deep learning. Before the introduction of the transformer model, most deep learning architectures for natural language processing (NLP) tasks were based on RNNs and their more advanced variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs).

These architectures were designed to process sequence data by iterating through the input one step at a time while maintaining an evolving internal state. This let them capture dependencies between successive inputs and made them well-suited for tasks like language modeling, translation, and speech recognition.

Read more on recurrent neural networks here.

However, these models faced significant challenges when dealing with long-range dependencies. As sequences grew longer, RNN-based models struggled to retain relevant information from earlier inputs due to the vanishing gradient problem, which made it difficult for gradients to propagate effectively over long sequences.

Additionally, LSTMs and GRUs, while better at managing these dependencies than standard RNNs, still encountered issues with efficiently processing long sequences in a parallelizable way. Their sequential nature meant that inputs had to be processed one after the other, limiting the ability to utilize parallel computation and slowing down training on large datasets.

Transformers address these issues with a fresh approach that models long-range dependencies far more effectively while remaining computationally efficient. The key innovation in transformer models is the attention mechanism, which allows the model to weigh the importance of different words in a sentence, regardless of their position.

Transformer Models Image 1 Source: Paper

Unlike RNNs, which process tokens sequentially, transformers can process an entire sequence in parallel. This attention mechanism revolutionized NLP by improving the ability of models to understand context and dependencies over long distances, thus enabling more accurate predictions and translations.

In the same paper, the concept of self-attention was introduced, where a model not only attends to the input sequence but also focuses on different parts of the sequence relative to each token. When combined with the multi-head attention and feed-forward networks, this self-attention mechanism provided a robust architecture that surpassed previous models in performance and scalability.

By solving the issues of long-range dependencies and computational efficiency, transformer models opened the door to the development of more advanced AI applications, such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), which have since become the foundation of modern NLP.

Now that we know what transformers are, let's discuss their architecture.

Architecture of transformers

The transformer model is based on an encoder-decoder structure, but its most notable innovation lies in the attention mechanism, particularly self-attention. Let's explore the mechanisms that power these models:

1. Attention mechanism

The attention mechanism allows the model to focus on different parts of the input sequence when processing a particular element, much like how humans pay attention to specific words or phrases when understanding a sentence.

Think of it in terms of queries, keys, and values. For example, suppose you have a question (query) about a specific word in a sentence. To find the answer, you search for relevant information (keys) in other parts of the sentence; the information itself is the value.

Transformer Models Image 2 Source: Paper

Here's how attention works in a transformer:

  • Calculate Attention Scores: For each word (query) in the sequence, the model computes attention scores with every other word (key). These scores reflect the relevance of each word to the query word. The higher the score, the more important the key is for understanding the query.
  • Normalize Scores: The attention scores are normalized using a softmax function, producing a probability distribution where the weights sum up to 1, ensuring that the model focuses on the most relevant words.
  • Weighted Sum: The model then calculates a weighted sum of the values, where the weights are the normalized attention scores. The weighted sum represents the contextualized representation of the query word, enriched by information from other relevant words.
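These three steps can be sketched in a few lines of PyTorch. This is a minimal, illustrative version of scaled dot-product attention with made-up shapes; the full multi-head implementation appears later in this article.

import torch
import torch.nn.functional as F

# Toy example: a sequence of 4 tokens, each represented by an 8-dimensional vector.
# In a real model, Q, K, and V come from learned linear projections of the input.
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)  # queries
K = torch.randn(seq_len, d_k)  # keys
V = torch.randn(seq_len, d_k)  # values

# 1. Attention scores: similarity of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # shape: (seq_len, seq_len)

# 2. Normalize: softmax turns each row of scores into weights that sum to 1
weights = F.softmax(scores, dim=-1)

# 3. Weighted sum: each output row mixes the values according to its weights
output = weights @ V  # shape: (seq_len, d_k)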

2. Multi-head attention

While a single attention mechanism is powerful, transformers use multi-head attention to capture a broader range of relationships within the input sequence, akin to having multiple sets of eyes, each focusing on different aspects of the sentence.

Transformer Models Image 3 Source: Paper

In multi-head attention, the model learns multiple sets of query, key, and value matrices. Each set, or "head," focuses on a different kind of relationship between words. For instance, one head might focus on syntactic relationships (subject-verb agreement), while another might focus on semantic relationships (synonyms or antonyms).

The outputs from all the heads are combined to create a comprehensive representation incorporating various aspects of the input sequence.
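Here is a shape-focused sketch of that splitting and recombining, using the sizes from the original paper (d_model = 512, 8 heads) and an illustrative batch and sequence length:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_heads = 512, 8   # sizes used in the original paper
d_k = d_model // num_heads    # 64 dimensions per head
batch, seq_len = 2, 10        # illustrative batch size and sequence length

x = torch.randn(batch, seq_len, d_model)
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model) for _ in range(3))

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
    return t.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

Q, K, V = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# Each head attends independently over its own 64-dimensional slice
weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
per_head = weights @ V  # (batch, num_heads, seq_len, d_k)

# Concatenate the heads back into a single d_model-wide representation
combined = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)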

3. Positional encoding

Unlike RNNs, which process sequences sequentially, transformers process the entire sequence in parallel, significantly boosting efficiency but posing a challenge: preserving the order of words, which is crucial for understanding language.

The solution is positional encoding. Before feeding the input sequence to the transformer, each word's embedding is augmented with a positional encoding, a vector representing its position in the sequence.

Transformer Models Image 4 Source: Machine learning mastery

These encodings can be generated using different methods, such as sinusoidal functions or learned embeddings. The key is to provide the model with information about the word's position, allowing it to understand the sequence's structure despite processing it in parallel.
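The sinusoidal scheme from the original paper, for example, defines each dimension of the positional encoding as:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the token's position in the sequence and i indexes the embedding dimension, so each position receives a distinct pattern of sine and cosine values.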

4. Feed-forward networks

The output is passed through feed-forward networks after the attention mechanism has identified and aggregated relevant information. These networks are fully connected layers that apply non-linear transformations to the attention output, further refining the representation of each word.

They can be considered filters that extract higher-level features and patterns from the contextualized information.
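In the original paper, this position-wise network is simply two linear layers with a ReLU in between:

FFN(x) = max(0, x·W1 + b1)·W2 + b2

applied independently to each position, with an inner dimension (2048) larger than the model dimension (512).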

Read more about feed-forward networks here: Feedforward neural networks: everything you need to know.

5. The encoder-decoder structure

A transformer model follows an encoder-decoder structure. It is like a sophisticated communication system: the encoder meticulously analyzes the input sequence, extracting its meaning and nuances, and the decoder then crafts the desired output sequence based on the encoder's understanding.

Transformer Models Image 5 Source: Jalammar

  • Encoder: The encoder takes the input sequence (e.g., a sentence in English) and transforms it into a rich, contextualized representation. It achieves this through a series of identical layers, each performing an important role in understanding the input.
  • Decoder: The decoder generates the output sequence (e.g., translating the sentence into French). It also consists of stacked layers, mirroring the encoder's structure. However, the decoder incorporates additional mechanisms to ensure the output is generated sequentially and coherently.

While the encoder and decoder share a similar structure, there are key differences in how they utilize attention:

  • Encoder: The encoder solely relies on self-attention, where each word attends to all other words in the same input sequence, allowing the encoder to comprehensively represent the input by considering the relationships between all its elements.
  • Decoder: The decoder uses two types of attention:
      • Masked self-attention: This prevents the decoder from "looking ahead" at future words when generating the output sequence. It ensures that the prediction for each word is based only on the preceding words, maintaining the sequential nature of the output.
      • Encoder-decoder attention: This allows the decoder to attend to all positions in the input sequence, enabling it to focus on the most relevant parts of the input when generating each word in the output.

By combining these specialized attention mechanisms, the decoder effectively uses the encoder's understanding of the input to generate a coherent and accurate output sequence.
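Masked self-attention is typically enforced by blocking out "future" positions in the attention scores before the softmax. Below is a minimal sketch of how such a causal mask can be built in PyTorch (the full implementation later in this article accepts a mask like this):

import torch

seq_len = 5
# Lower-triangular matrix: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.randn(seq_len, seq_len)  # raw attention scores
masked = scores.masked_fill(~causal_mask, float('-inf'))
weights = torch.softmax(masked, dim=-1)  # future positions receive zero weight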

Now that we've explored the transformer's architectural components, let's delve into how they interact to process information.

How transformer models work

We'll use snippets from the PyTorch implementation included at the end of this section to illustrate the key concepts.

1. Input Encoding and Positional Embeddings

Like most neural network models, transformers begin by encoding the input sequence, which could be a sentence, a series of image patches, or any sequential data. Each element in the sequence is converted into a numerical vector called an embedding. In the code below, nn.Embedding maps each input token to its corresponding embedding.

class Encoder(nn.Module):
    def __init__(self, ..., vocab_size, d_model, max_len, ...):
        ...
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        ...

    def forward(self, src, ...):
        x = self.embedding(src)
        x = self.pos_encoding(x)
        ...

Since transformers process sequences in parallel, positional information must be added explicitly. This is handled by PositionalEncoding, which adds a unique vector to each embedding representing its position in the sequence, allowing the model to understand the order of elements despite parallel processing.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        ...
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        ...

This generates a unique vector for each position in the sequence using sinusoidal functions. These positional encodings are then added to the word embeddings, allowing the model to understand the order of words even though it processes them all simultaneously.

2. Multi-Head Attention

Next, multi-head attention allows the model to weigh the importance of different elements in the sequence when processing a particular element.

class MultiHeadAttention(nn.Module):
    ...
    def forward(self, query, key, value, mask=None):
        ...
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
       ...
        attn = F.softmax(scores, dim=-1)
        output = torch.matmul(attn, V)
        ...

It calculates attention scores (scores) by comparing the query with all keys. These scores are then normalized using softmax (attn) to create a probability distribution. Finally, a weighted sum of the values is calculated based on these attention weights (output), allowing the model to focus on the most relevant parts of the sequence for each element.

3. Feed-Forward Network

After the attention mechanism has identified and aggregated relevant information, the output is passed through a feed-forward network consisting of fully connected layers with non-linear activation functions (ReLU in this case). It acts as a filter, extracting higher-level features and patterns from the contextualized information provided by the attention mechanism.

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
       ...
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        ...
    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

4. Encoder-Decoder Structure

Next, the encoder processes the input sequence and generates a contextualized representation, and the decoder then uses that representation to generate the output sequence.

class EncoderLayer(nn.Module):
    def __init__(self, ...):
        ...
        self.attention = MultiHeadAttention(d_model, num_heads)
        ...
    def forward(self, ...):
        ...

The encoder and decoder both utilize the core components we've discussed: self-attention, multi-head attention, and feed-forward networks. However, the decoder also incorporates masked self-attention to prevent it from "looking ahead" at future words when generating the output, maintaining the sequential nature of the output.

5. Output Generation

Finally, the decoder's output is passed through a linear layer and a softmax function to generate probabilities for the next token in the sequence.

class Transformer(nn.Module):
    def __init__(self, ...):
        super(Transformer, self).__init__()
        self.encoder = Encoder(...)
        self.decoder = Decoder(...)
        ...
    def forward(self, ...):
        ...

The process is repeated iteratively until the entire output sequence is generated.

Here is the entire code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model) for batch-first inputs
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the encoding for each position
        return x + self.pe[:, :x.size(1), :]

# Multi-Head Attention Mechanism
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.linear_q = nn.Linear(d_model, d_model)
        self.linear_k = nn.Linear(d_model, d_model)
        self.linear_v = nn.Linear(d_model, d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(self.d_k)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        Q = self.linear_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.linear_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.linear_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        output = torch.matmul(attn, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.fc(output)

# Feed Forward Network
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

# Transformer Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

# Transformer Encoder
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout=0.1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, mask=None):
        x = self.embedding(src) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

# Transformer Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.encoder_attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        self_attn_output = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        enc_attn_output = self.encoder_attention(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(enc_attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

# Transformer Decoder
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout=0.1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, enc_output, src_mask=None, tgt_mask=None):
        x = self.embedding(tgt) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

# Transformer Model
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_output = self.encoder(src, src_mask)
        dec_output = self.decoder(tgt, enc_output, src_mask, tgt_mask)
        output = self.fc_out(dec_output)
        return output

# Example Dataset
class ExampleDataset(Dataset):
    def __init__(self, num_samples, seq_len, vocab_size):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        src = torch.randint(0, self.vocab_size, (self.seq_len,))
        tgt = torch.randint(0, self.vocab_size, (self.seq_len,))
        return src, tgt

# Training the Transformer
if __name__ == "__main__":
    # Parameters
    d_model = 512
    num_heads = 8
    d_ff = 2048
    num_layers = 6
    vocab_size = 10000
    max_len = 100
    dropout = 0.1
    num_epochs = 10
    batch_size = 32
    learning_rate = 0.0001

    # DataLoader
    dataset = ExampleDataset(num_samples=1000, seq_len=10, vocab_size=vocab_size)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Define the Transformer
    transformer = Transformer(d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout)
    optimizer = optim.Adam(transformer.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    # Training Loop
    transformer.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for src, tgt in dataloader:
            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:].contiguous().view(-1)

            # Causal mask so each target position only attends to earlier positions
            tgt_mask = torch.tril(torch.ones(tgt_input.size(1), tgt_input.size(1))).bool()

            optimizer.zero_grad()
            output = transformer(src, tgt_input, tgt_mask=tgt_mask)
            output = output.view(-1, vocab_size)
            loss = criterion(output, tgt_output)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

    # Example Inference
    transformer.eval()
    src = torch.randint(0, vocab_size, (1, 10))  # (batch size, sequence length)
    tgt = torch.randint(0, vocab_size, (1, 1))  # Start token
    with torch.no_grad():
        for _ in range(10):
            # Re-build the causal mask as the generated sequence grows
            tgt_mask = torch.tril(torch.ones(tgt.size(1), tgt.size(1))).bool()
            output = transformer(src, tgt, tgt_mask=tgt_mask)
            next_token = output.argmax(-1)[:, -1]
            tgt = torch.cat([tgt, next_token.unsqueeze(1)], dim=1)
        print("Generated sequence:", tgt.view(-1).tolist())

Note: While this code provides a foundational understanding of how transformers work, in practice you would typically use the transformers library developed by Hugging Face. It offers a vast collection of pre-trained transformer models and easy-to-use tools, letting you build and deploy transformers quickly without implementing every detail from scratch.
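For example, a minimal sketch using Hugging Face's pipeline API (assuming the transformers package is installed; the first call downloads a default pre-trained model):

from transformers import pipeline

# Loads a small pre-trained transformer for sentiment classification
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make long-range context easy to model."))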

By combining these mechanisms, transformers can effectively capture long-range dependencies and process sequences in parallel, significantly improving tasks involving sequential data.

Applications of transformers

Transformer models have found widespread applications, particularly in natural language processing (NLP), but they are also used in other domains like computer vision.

Natural Language Processing (NLP)

In NLP, transformer models are used for various tasks, including machine translation, text generation, and question-answering. Two of the most prominent models are BERT and GPT:

  • BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model designed to understand the context of words in a sentence by looking at both directions (left and right). It is used for tasks like sentence classification, named entity recognition, and sentiment analysis.
  • GPT (Generative Pre-trained Transformer) models, such as GPT-2 and GPT-3, are designed to generate human-like text based on a given input. These models are used in chatbots, language translation, and creative writing.

Computer Vision

While transformers were originally designed for NLP, their architecture has also been adapted for computer vision tasks. The Vision Transformer (ViT) is one model that applies the self-attention mechanism to image patches instead of words. ViT has achieved state-of-the-art performance in image classification, demonstrating that transformers are versatile beyond text-based applications.
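As a rough sketch of the patch-embedding idea (using the ViT-Base settings of 16x16 patches and a 768-dimensional embedding; the projection here is randomly initialized, not a trained model):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)  # one RGB image
patch_size, d_model = 16, 768      # ViT-Base: 16x16 patches, 768-dim embeddings

# A strided convolution cuts the image into non-overlapping patches and embeds each one
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
patches = to_patches(img)                    # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): a "sentence" of 196 patch tokens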

Real-world applications

  • Google search: Google's search engine uses transformer models (e.g., BERT) to understand user queries better and deliver more relevant search results.
  • Machine translation: Models like GPT and BERT are used in translation services such as Google Translate, making translations more accurate and context-aware.
  • Healthcare: Transformers are used in drug discovery and genomics, particularly in protein structure prediction, as demonstrated by DeepMind's AlphaFold, which uses transformer models to predict protein folding structures.

Benefits and limitations of transformers

Benefits

  • Parallelization: One of the biggest advantages of transformer models is their ability to process sequences in parallel, making them much more efficient and scalable for large datasets.
  • Long-Range Dependencies: Transformers handle long-range dependencies better than RNNs, thanks to the self-attention mechanism, which attends to all parts of the sequence simultaneously.
  • Versatility: Transformers have proven effective not only in NLP but also in other areas like computer vision and even protein folding.

Limitations

  • Computationally Intensive: While transformers can process data in parallel, they are also computationally expensive to train, requiring significant hardware resources like GPUs and TPUs.
  • Data Hungry: Transformers require large datasets to achieve state-of-the-art performance, which can be a barrier for smaller organizations or tasks with limited data availability.
  • Training Complexity: The large number of parameters in transformer models can make them difficult to train, often requiring extensive fine-tuning and experimentation.

Conclusion

Transformer models represent a paradigm shift in AI and deep learning. By introducing the attention mechanism and self-attention, transformers have enabled machines to understand complex, sequential data better, making them incredibly effective in natural language processing and computer vision.

You can train your large language models faster and cost-effectively with CUDO Compute's cloud GPUs. Access the latest NVIDIA GPUs, like the NVIDIA H100, from as low as $2.79 per hour. You can also get up to a 35% discount with commitment pricing on our platform. Get started today!
