Part 3: A Comprehensive Deep Dive into the Theory and Mechanisms of Transformer-Based Neural Networks

Youssef CHAFIQUI
11 min read · Feb 25, 2024


Welcome back to the journey through the heart of AI, where we continue our exploration of the transformative world of Transformers.

In this series, we explore the fundamentals, architecture, and inner workings of Transformers, building a holistic understanding before delving into the intricate details of each component.

Here’s a quick summary of the Series:

Part 1 — Introduction to Transformers: A foundational exploration of the basics and the overarching architecture, setting the stage for a deeper dive into the transformative world of Transformers.

Part 2 — Pre-Processing in Transformers: Unraveling the intricacies of pre-processing with a spotlight on Positional Encoding and Embedding. Understand how these components lay the groundwork for the model’s comprehension.

Part 3 (This article) — Attention Mechanism Unveiled: Delving into the heart of Transformer functionality, we peel back the layers to explore the internal mechanisms, with a detailed examination of the central powerhouse — multi-head attention.

Part 4 — Training and Inference with Transformers: Unpacking the training and inference processes within Transformer-based models. Explore how these models learn and make predictions, offering insights into the dynamic world of AI training.

Part 5 — The Titans of Transformers (BERT and GPT): Concluding our series with a spotlight on BERT and GPT, two transformative models that have reshaped the landscape of natural language processing. Dive into their architectures and understand their impact on the field.

In the previous parts, we dissected the components that make Transformers the powerhouses they are. Today, we delve even deeper, as we focus on the attention mechanism, the pulsating heart of the Transformer architecture.

Attention mechanism

The attention mechanism is a pivotal concept in neural network architectures, enhancing the model’s ability to focus on specific portions of input data. It’s widely used in various deep learning tasks, particularly in sequence-to-sequence models like Transformers.

Attention Mechanism in Deep Learning- Scaler Topics

Self-attention is a variant of the attention mechanism that allows elements within the same sequence to attend to each other. It is a key component of the Transformer architecture, enabling the model to capture complex relationships and dependencies within sequential data, and it is particularly effective at modeling long-range dependencies in sequences such as sentences.

Transformer: A Novel Neural Network Architecture for Language Understanding — Google Research Blog

In the example on the left, “it” refers to the word “animal”, suggesting that the animal didn’t cross the street because it was tired. In the example on the right, “it” refers to the word “street”, indicating that the animal didn’t cross because the street was too wide. The distinction in meaning comes from the context provided by the associated adjective (“tired” in the first sentence and “wide” in the second).

In both cases, self-attention allows the model to capture the nuanced relationships between words and understand the specific referent for the pronoun “it” based on the surrounding context and semantics. The attention weights reveal which words the model considers more important for determining the referent of the pronoun in each instance.

Attention in Transformers

Multi-Head Attention

The attention layer in the Transformer architecture takes its inputs in the form of three parameters: Query, Key, and Value. Each of these parameters plays a distinct role in determining how the model attends to different parts of the input sequence.

  • Query (Q): The Query matrix is responsible for generating queries or questions about the input sequence.
  • Key (K): The Key matrix provides information about the positions in the input sequence.
  • Value (V): The Value matrix contains information associated with each position in the input sequence.

The encoder input matrix (embeddings + positional encoding) is used for all three parameters in the self-attention layer of the encoder.

Multi-head means that the input matrix is duplicated into multiple matrices; each copy is processed independently through a separate “head”.

Example:

Input sequence : [“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”]
d_model = 8 and nb_heads = 4 and seq_size = 6

Multi-Head Attention — Linear layers

Linear layers are applied to Q, K, and V separately, each with its own weights.

These weight matrices are of size (d_model x q_size) and are learned during the training phase. The query size is a new parameter that we introduce here; it is equal to q_size = d_model / nb_heads.

Example:

Input sequence : [“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”]
d_model = 8 and nb_heads = 4 and seq_size = 6 and q_size = 8 / 4 = 2
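To make the shapes concrete, here is a minimal NumPy sketch of the per-head linear projections with the example’s dimensions (d_model = 8, nb_heads = 4, seq_size = 6, q_size = 2). The random matrices are purely illustrative stand-ins for the encoder input and for each head’s learned weights:

```python
import numpy as np

d_model, nb_heads, seq_size = 8, 4, 6
q_size = d_model // nb_heads  # 8 / 4 = 2

# Encoder input: embeddings + positional encoding (random stand-in values)
X = np.random.randn(seq_size, d_model)

# One set of projection weights per head (random stand-ins for learned weights)
W_Q = [np.random.randn(d_model, q_size) for _ in range(nb_heads)]
W_K = [np.random.randn(d_model, q_size) for _ in range(nb_heads)]
W_V = [np.random.randn(d_model, q_size) for _ in range(nb_heads)]

# Per-head projections: each Q, K, V has shape (seq_size, q_size) = (6, 2)
Q = [X @ W_Q[h] for h in range(nb_heads)]
K = [X @ W_K[h] for h in range(nb_heads)]
V = [X @ W_V[h] for h in range(nb_heads)]
print(Q[0].shape)  # (6, 2)
```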

Multi-Head Attention — Scaled Dot-Product

The Scaled dot-product attention calculates the similarity between each pair of query and key vectors. This similarity is often computed as the dot product of the query and key vectors.

Scaled dot-product formula (Q, K and V are the outputs of the previous linear layers):

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

QKᵀ is the dot product of the Query matrix (Q) with the transpose of the Key matrix (K). It provides a raw measure of similarity between queries and keys, and this similarity reflects how much attention should be given to each element in the input sequence: a higher dot product implies greater similarity, while a lower dot product suggests lower relevance.

dₖ is the query size (q_size) parameter introduced earlier. The division by √dₖ is called “scaling”; it prevents the dot product from growing too large as the dimensionality of the vectors increases and helps stabilize the gradients during training, making the model more robust.

The Softmax function normalizes the raw scores, ensuring that the resulting values lie in the range [0, 1] and sum to 1, effectively representing a probability distribution. This distribution reflects the model’s certainty or confidence in assigning attention weights to different positions in the input sequence.
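The whole operation can be sketched in a few lines of NumPy. This is an illustrative implementation of the formula above for a single head, not the code of any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]                    # q_size
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_size, seq_size) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention distribution
    return weights @ V                   # (seq_size, q_size) weighted values

# For one head with seq_size = 6 and q_size = 2:
Q = np.random.randn(6, 2)
K = np.random.randn(6, 2)
V = np.random.randn(6, 2)
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 2)
```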

Multi-Head Attention — Concat and linear

The scaled dot-product attention outputs a matrix of dimension (seq_size x q_size) for each head. The Concat layer concatenates the matrices from all heads so that the result returns to the original dimension (seq_size x d_model).

The linear layer applies a weight matrix of dimension (d_model x d_model) to the output.
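A minimal sketch of the concat-and-linear step, again with the example’s dimensions; the per-head matrices and the output weight matrix W_O are random stand-ins for real attention outputs and learned parameters:

```python
import numpy as np

d_model, nb_heads, seq_size = 8, 4, 6
q_size = d_model // nb_heads

# Outputs of scaled dot-product attention: one (seq_size, q_size) matrix per head
heads = [np.random.randn(seq_size, q_size) for _ in range(nb_heads)]

# Concatenate along the feature axis: (seq_size, nb_heads * q_size) = (6, 8)
concat = np.concatenate(heads, axis=-1)

# Final projection with a (d_model, d_model) weight matrix (learned in a real model)
W_O = np.random.randn(d_model, d_model)
output = concat @ W_O
print(output.shape)  # (6, 8) = (seq_size, d_model)
```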

Recap: Multi-Head Attention

The Multi-Head Attention mechanism in the Transformer architecture involves three key matrices: Q, K and V. These matrices play distinct roles in determining how the model attends to different parts of the input sequence.

In the self-attention layer of the encoder, the encoder input matrix (combining embeddings and positional encoding) is used for all three parameters. Multi-Head Attention involves duplicating the input matrix into multiple matrices, each processed independently through a separate “head.”

Linear layers are applied separately for Q, K, and V, with weights learned during training.

The Scaled Dot-Product Attention calculates similarity between query and key vectors using the dot product. Scaling is employed to stabilize gradients, and the Softmax function normalizes raw scores into a probability distribution.

The outputs from each head are concatenated, restoring the original dimension (seq_size x d_model). A linear layer then applies a weight matrix to the concatenated output.

This process allows the model to capture complex relationships within the input sequence through attention mechanisms, enhancing its ability to understand and process information effectively.

Why the use of multiple heads (Multi-head) ?

The attention mechanism allows the model to focus on different parts of the input sequence when making predictions. When multiple heads are used, it means that the attention mechanism is applied multiple times in parallel, each with its own set of weights.

Each attention head has its own set of learned weights, which means it can specialize in capturing specific patterns or dependencies in the data. Separate sections of the Embedding can learn different aspects of the meanings of each word, as it relates to other words in the sequence. This allows the Transformer to capture richer interpretations of the sequence.

In summary, Multi-Head Attention enhances the model’s capacity to capture diverse patterns and attend to different aspects of the input sequence simultaneously.

Masked self-attention

In standard self-attention, each token in a sequence attends to all other tokens in the sequence, including itself.

Masked self-attention is a variant of the self-attention mechanism used in the Transformer architecture, employed on the decoder side of the model during language generation tasks:

The purpose of masked self-attention is to prevent attending to future positions in the sequence during training, ensuring that each position can only attend to its past positions. This is because the model should not have access to information that hasn’t been generated yet.

During training, positions in the sequence that correspond to future tokens are masked. The masking is typically implemented by applying a triangular mask to the attention weights matrix, setting the upper triangular portion to -∞.

Example:

And since softmax maps −∞ to 0, the probability assigned to each element that had a value of −∞ in the input matrix is 0:
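A small NumPy sketch of this masking step, assuming a sequence of 6 tokens and random stand-in scores; np.triu builds the upper-triangular mask of future positions, and the softmax then zeroes them out:

```python
import numpy as np

seq_size = 6
scores = np.random.randn(seq_size, seq_size)   # raw QKᵀ / √dₖ scores (stand-in values)

# Upper-triangular mask (above the diagonal): future positions are set to -inf
mask = np.triu(np.ones((seq_size, seq_size), dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

# Softmax turns every -inf entry into a probability of 0
e = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # lower-triangular weights; each row sums to 1
```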

Encoder-decoder attention

The encoder-decoder attention is used for tasks like machine translation, text summarization, and question-answering. It facilitates the alignment of information between the input (encoder) and output (decoder) sequences.

The encoder-decoder attention enables the decoder to attend to different positions in the encoder’s output sequence:

In encoder-decoder attention, the query Q is derived from the decoder, while the key K and value V are derived from the encoder:

The attention scores are calculated by taking the dot product of the query from the decoder with the key from the encoder. These scores determine how much attention the decoder should give to each position in the encoder’s output sequence.

The encoder-decoder attention enables the model to selectively attend to different positions in the encoder’s output, allowing the decoder to gather pertinent information and context for accurate sequence generation. This mechanism is fundamental in tasks like machine translation, where understanding the entire input sequence is essential for generating a meaningful translation.
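The following NumPy sketch illustrates this flow with hypothetical dimensions (a 6-token encoder output, a 5-token decoder input, and a single head of size 2); all weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

d_model, q_size = 8, 2
enc_seq, dec_seq = 6, 5          # encoder and decoder sequence lengths

enc_output = np.random.randn(enc_seq, d_model)   # encoder output (stand-in values)
dec_hidden = np.random.randn(dec_seq, d_model)   # decoder hidden states (stand-in)

# Per-head projection weights (learned in a real model)
W_Q = np.random.randn(d_model, q_size)
W_K = np.random.randn(d_model, q_size)
W_V = np.random.randn(d_model, q_size)

Q = dec_hidden @ W_Q      # queries come from the decoder   (5, 2)
K = enc_output @ W_K      # keys come from the encoder      (6, 2)
V = enc_output @ W_V      # values come from the encoder    (6, 2)

scores = Q @ K.T / np.sqrt(q_size)               # (5, 6): decoder pos. x encoder pos.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # attention over encoder positions
context = weights @ V                            # (5, 2) context per decoder position
print(context.shape)
```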

Feed Forward Network

The Feed Forward Network (FFN) is a standard neural network consisting of two fully connected layers:

  • The dimension of the input and output layers are equal to d_model.
  • The dimension of the hidden layer is generally equal to (4 * d_model), followed by a ReLU activation function: ReLU(x) = max(0, x).

In the Transformer architecture, the Feed Forward Network (FFN) is typically positioned after the multi-head self-attention layer in each encoder and decoder block:

This placement is a crucial element of the design and contributes to the main role of neural networks, which is to introduce non-linearity. The FFN processes the information obtained from the Multi-Head Attention layer and learns complex relationships within the data.
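A minimal sketch of the FFN with d_model = 8 and a hidden dimension of 4 * d_model, using random stand-ins for the learned weights and biases:

```python
import numpy as np

d_model = 8
d_ff = 4 * d_model        # hidden layer dimension = 32

def relu(x):
    return np.maximum(0, x)

def feed_forward(x, W1, b1, W2, b2):
    # Two fully connected layers with a ReLU in between,
    # applied independently to each position of the sequence
    return relu(x @ W1 + b1) @ W2 + b2

# Random stand-ins for the learned parameters
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

x = np.random.randn(6, d_model)                # output of the attention sub-layer
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 8)
```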

Add & Norm layers

Add & Norm layers are placed after the attention and feed-forward layers; their objective is to preserve essential information and maintain stability throughout the network:

Add layer (Residual or Skip connection): x + Sublayer(x)

  • The purpose of the residual connection is to allow information to flow through the network without being completely transformed by the sublayers. It helps mitigate the vanishing gradient problem, making it easier to train very deep neural networks.

Norm layer (Layer Normalization): LayerNorm(x) = γ · (x − E(x)) / √(Var(x) + ε) + β

  • E(x) = sum(x) / d_model is the mean of the activations over the feature dimension.
  • Var(x) = sum((x − E(x))²) / d_model is their variance.
  • ε (epsilon) is a small constant added to the denominator for numerical stability to avoid division by zero.
  • γ (gamma) and β (beta) are learnable parameters that allow the model to scale and shift the normalized values, introducing a degree of flexibility.

The purpose of layer normalization is to improve the training stability and convergence of deep networks. It helps maintain a consistent distribution of activations throughout the network, making it easier for the model to learn.
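Here is an illustrative NumPy version of the Add & Norm step, assuming normalization over the feature dimension (d_model) with learnable γ and β initialized to 1 and 0:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position's vector over the feature dimension (d_model)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Residual (skip) connection followed by layer normalization
    return layer_norm(x + sublayer_out, gamma, beta)

d_model, seq_size = 8, 6
gamma, beta = np.ones(d_model), np.zeros(d_model)   # learnable in a real model

x = np.random.randn(seq_size, d_model)              # sub-layer input
sublayer_out = np.random.randn(seq_size, d_model)   # e.g. attention or FFN output
print(add_and_norm(x, sublayer_out, gamma, beta).shape)  # (6, 8)
```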

Linear and Softmax layers

The final linear and softmax layers are crucial for generating the output sequence. These layers take the decoder’s contextual embeddings and produce a probability distribution over the target vocabulary for each position in the output sequence.

  • Linear Layer: The linear layer is applied to the decoder output matrix (seq_size x d_model) to transform each position’s embedding into a numerical vector whose size matches the target vocabulary.
  • Softmax Layer: Following the linear layer, the softmax layer is applied to the transformed values. The softmax function is used to convert these values into a probability distribution over the target vocabulary.
    For each position in the output sequence, the softmax function calculates the likelihood of each word in the vocabulary being the next word in the generated sequence.

Example:

Decoder Input: “<BOS>, the, cat, is, black”

Vocabulary: (“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”)

d_model = 8 and decoder_seq_size = 5 and vocab_size = 6

“vocab_size” represents the size of the vocabulary, which is the total number of unique words or tokens that the model can generate in the output sequence.

The linear layer facilitates the transformation of contextual embeddings into a format compatible with the target vocabulary, and the softmax layer converts these transformed values into a probability distribution so that the model can make decisions about the next word to generate.
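A small NumPy sketch of this final stage, using the example’s dimensions (decoder_seq_size = 5, vocab_size = 6) and a random stand-in for the learned projection weights:

```python
import numpy as np

d_model, decoder_seq_size, vocab_size = 8, 5, 6

decoder_output = np.random.randn(decoder_seq_size, d_model)  # decoder contextual embeddings

# Final projection onto the vocabulary (weights are learned in a real model)
W_vocab = np.random.randn(d_model, vocab_size)
logits = decoder_output @ W_vocab                 # (5, 6): one score per vocabulary word

# Softmax per position: probability distribution over the vocabulary
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

vocab = ["<BOS>", "the", "cat", "is", "black", "<EOS>"]
# Most likely next token predicted at the last decoder position
print(vocab[int(probs[-1].argmax())])
```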

Conclusion

In conclusion, the attention mechanism, feed-forward neural network, Add & Norm layers, and the linear and softmax layers are integral components of the Transformer architecture. Together, they enable the model to discern complex patterns, learn relationships, and generate meaningful sequences. The attention mechanism facilitates selective focus, while the feed-forward network introduces non-linearity. Add & Norm layers and residual connections ensure stable gradient flow, and the final linear and softmax layers play a pivotal role in generating output sequences. These components collectively showcase the Transformer’s prowess in handling sequential data, making it a versatile and powerful architecture in various applications.

In the upcoming Part 4, we will delve into the crucial aspects of training and inference with Transformer-based models. This exploration will demystify the processes through which these models learn and make predictions. By unraveling the intricacies of AI training, we aim to provide valuable insights into the dynamic world of machine learning. Stay tuned for a comprehensive understanding of the training and inference mechanisms that drive the capabilities of Transformer architectures, further enriching our journey into the realm of artificial intelligence.
