Part 2: A Comprehensive Deep Dive into the Theory and Mechanisms of Transformer-Based Neural Networks

Youssef CHAFIQUI
6 min read · Jan 17, 2024


Transformers stand as the avant-garde, reshaping the way machines process information. Welcome back to the journey through the heart of AI, where we continue our exploration of the transformative world of Transformers.

In this series of articles, we embark on a journey through the fundamentals, architecture, and internal workings of Transformers. Our approach is top-down, providing a holistic understanding before delving into the intricate details. The upcoming articles will lift the veil on the system’s operations, offering insights into the inner workings of the Transformer architecture.

Here’s a quick summary of the Series:

Part 1 - Introduction to Transformers: A foundational exploration of the basics and the overarching architecture, setting the stage for a deeper dive into the transformative world of Transformers.

Part 2 (This article) - Pre-Processing in Transformers: Unraveling the intricacies of pre-processing with a spotlight on Positional Encoding and Embedding. Understand how these components lay the groundwork for the model’s comprehension.

Part 3 - Attention Mechanism Unveiled: Delving into the heart of Transformer functionality, we peel back the layers to explore the internal mechanisms, with a detailed examination of the central powerhouse — multi-head attention.

Part 4 - Training and Inference with Transformers: Unpacking the training and inference processes within Transformer-based models. Explore how these models learn and make predictions, offering insights into the dynamic world of AI training.

Part 5 - The Titans of Transformers (BERT and GPT): Concluding our series with a spotlight on BERT and GPT, two transformative models that have reshaped the landscape of natural language processing. Dive into their architectures and understand their impact on the field.

In the previous part, we embarked on a foundational exploration, dissecting the components that make Transformers the powerhouses they are. Now, we delve even deeper into the labyrinth, focusing on the pulsating heart of the Transformer architecture.

Transformer sub-layers

Input Self-Attention:

This sub-layer is a crucial component of the Transformer’s encoder. It allows the model to weigh the importance of each word in the input sequence relative to every other word, enabling the Transformer to efficiently capture relationships and dependencies within the input data.
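
To make the idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention that underlies this sub-layer. The query, key, and value matrices below are random stand-ins for the learned projections of the input tokens; the full multi-head mechanism is the subject of Part 3.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_size, d_model = 6, 8
rng = np.random.default_rng(0)

# Stand-ins for the learned query/key/value projections of the input tokens
Q = rng.normal(size=(seq_size, d_model))
K = rng.normal(size=(seq_size, d_model))
V = rng.normal(size=(seq_size, d_model))

# Each row of `weights` says how strongly a token attends to every other token
scores = Q @ K.T / np.sqrt(d_model)   # (seq_size, seq_size)
weights = softmax(scores, axis=-1)    # rows sum to 1
output = weights @ V                  # (seq_size, d_model)
print(weights.shape, output.shape)
```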

Output Masked Self-Attention:

The output masked self-attention sub-layer is primarily associated with the decoder in a Transformer architecture. During the generation of each element in the output sequence, this sub-layer ensures that the model attends only to positions preceding the current position. This masking prevents the model from “cheating” by looking ahead in the sequence, ensuring that predictions are made based on previously generated elements.
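
A minimal sketch of how such a causal (look-ahead) mask might be applied to the attention scores. The random scores and the variable names are illustrative; only the masking pattern itself is the point.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_size = 6
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_size, seq_size))   # stand-in attention scores

# Causal mask: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_size, seq_size), dtype=bool), k=1)
masked_scores = np.where(mask, -1e9, scores)     # future positions get ~ -inf
weights = softmax(masked_scores, axis=-1)        # upper triangle is ~0
print(np.round(weights, 2))
```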

Encoder-Decoder Attention:

The encoder-decoder attention sub-layer allows the decoder to pay attention to different parts of the input sequence (the encoder’s output) while generating the output sequence. By giving the decoder access to the full context of the input, it enables the model to generate accurate and contextually relevant predictions.

Feed Forward Neural Networks:

The feed forward neural network is an essential component, contributing to the model’s ability to capture and process intricate patterns within the input sequence. This network processes the information gathered by the attention mechanisms, injecting non-linearity and enabling the model to capture complex relationships within the data.
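
As a rough illustration, the position-wise feed forward network is just two linear transformations with a non-linearity in between, applied independently to each position. The weights below are random stand-ins for learned parameters, and the inner dimension d_ff = 32 is an arbitrary choice for this example (the original paper uses d_ff = 2048 with d_model = 512).

```python
import numpy as np

seq_size, d_model, d_ff = 6, 8, 32
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_size, d_model))   # stand-in output of the attention sub-layer

# Two linear layers with a ReLU in between, shared across all positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity
out = hidden @ W2 + b2                # back to (seq_size, d_model)
print(out.shape)
```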

Transformer components in details

Tokenization

Tokenization is the initial step, where input sequences are broken down into smaller units or tokens. Tokens serve as the fundamental building blocks for the subsequent processing steps in the Transformer architecture.

Tokenization splits text into words (or sub-words) called tokens. Some other special tokens are also added to the input sequence:

  • <BOS> : Added at the start of the sequence and indicates the Beginning Of the Sequence.
  • <EOS> : Added at the end of the sequence and indicates the End Of the Sequence.

Example:

Tokenization example
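
As a minimal sketch of this step (real tokenizers typically operate on sub-words, e.g. BPE or WordPiece, rather than on whitespace; the function below is purely illustrative):

```python
def tokenize(text):
    # Naive whitespace tokenization plus the special sequence markers
    return ["<BOS>"] + text.lower().split() + ["<EOS>"]

print(tokenize("The cat is black"))
# ['<BOS>', 'the', 'cat', 'is', 'black', '<EOS>']
```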

Embedding

Word embeddings are a technique in NLP that represents words as numerical vectors in a multi-dimensional space. These vectors capture the semantic meaning of words and their relationships, making it easier for computers to work with and understand language.

Figure: word embeddings map words in a corpus of text to vector space (source: researchgate.net)

The key idea behind word embeddings is to represent words in a way that preserves their semantic meaning. This means that similar words are represented by vectors that are close to each other in the vector space.

Word embeddings are learned from large text corpora. During training, the model looks at the context in which words appear. Words that often appear in similar contexts are assigned vectors that are closer in the vector space.

Word embeddings transform each word into a vector of size d_model, so the Input Embedding layer transforms an input sequence of length seq_size into a matrix of shape (seq_size, d_model).

Example: [“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”]
For this example, we have seq_size = 6, and for simplicity, let’s consider d_model = 8.
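
A minimal sketch of this lookup, using a random table in place of the learned embedding weights (in a real model the table is a trained parameter of the network):

```python
import numpy as np

tokens = ["<BOS>", "the", "cat", "is", "black", "<EOS>"]
vocab = {tok: i for i, tok in enumerate(["<BOS>", "<EOS>", "the", "cat", "is", "black"])}

d_model = 8
rng = np.random.default_rng(0)
# (vocab_size, d_model) lookup table; random here, learned in practice
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = np.array([vocab[t] for t in tokens])   # (seq_size,)
embeddings = embedding_table[token_ids]            # (seq_size, d_model)
print(embeddings.shape)   # (6, 8)
```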

Positional Encoding

In an RNN, words are processed one at a time in a sequential manner. However, this sequential processing limits parallelism and makes it challenging to capture long-range dependencies.

In contrast, Transformers process all words in a sequence in parallel, allowing for more efficient computation. This is its major advantage over the RNN architecture, but it means that the position information is lost, and has to be added back in separately.

Positional Encoding allows transformer models to take full advantage of parallelism while maintaining an understanding of the order and position of words in the sequence.

Positional Encoding creates a matrix with the same dimensions as the embeddings matrix (seq_size, d_model).

To populate this matrix, we use the two sinusoidal formulas from the original “Attention Is All You Need” paper:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where “pos” is the position of the word in the sequence (0 .. seq_size - 1).
And 2i / 2i+1 index the embedding dimensions (0 .. d_model - 1), with “i” running from 0 to d_model/2 - 1.
The first formula is used to populate the even indices, while the second is used for the odd indices.
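
A minimal NumPy sketch of how this matrix can be computed from the two formulas above (assuming an even d_model):

```python
import numpy as np

def positional_encoding(seq_size, d_model):
    pe = np.zeros((seq_size, d_model))
    positions = np.arange(seq_size)[:, None]             # (seq_size, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(positions / div)                # even indices
    pe[:, 1::2] = np.cos(positions / div)                # odd indices
    return pe

pe = positional_encoding(seq_size=6, d_model=8)
print(pe.shape)   # (6, 8)
```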

Embedding + Positional Encoding

Word embeddings and positional encoding matrices are added together as an input preprocessing stage. Combining these two matrices creates representations that are both semantically meaningful and positionally aware: each vector in the input matrix encodes both the meaning of a token and its position.

This combined representation is then used as the input to the Transformer model’s self-attention mechanism, allowing the model to consider both the content of words and their positions when making predictions.
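
Putting the previous steps together, the preprocessing stage is a simple element-wise addition of the two (seq_size, d_model) matrices. As before, the random embedding matrix is only a stand-in for the output of a learned Input Embedding layer.

```python
import numpy as np

seq_size, d_model = 6, 8
rng = np.random.default_rng(0)

# Stand-in embedding matrix (in practice, the output of the Input Embedding layer)
embeddings = rng.normal(size=(seq_size, d_model))

# Sinusoidal positional encoding, as in the previous snippet
pe = np.zeros((seq_size, d_model))
positions = np.arange(seq_size)[:, None]
div = 10000 ** (np.arange(0, d_model, 2) / d_model)
pe[:, 0::2] = np.sin(positions / div)
pe[:, 1::2] = np.cos(positions / div)

# The element-wise sum is the actual input to the first encoder layer
x = embeddings + pe
print(x.shape)   # (6, 8)
```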

Conclusion

As we wrap up the first leg of our journey into the intricate world of Transformer components, we’ve laid a sturdy foundation by exploring the pivotal roles of Tokenization, Embedding, and Positional Encoding.

Tokenization initiated the process, breaking down complex input sequences into manageable units. Embedding enriched these tokens with semantic relationships, transforming them into numerical representations. Positional Encoding addressed the sequential challenge, infusing crucial information about the position of each token.

These initial components serve as the bedrock upon which the transformative power of the Transformer architecture unfolds. The symphony of Tokenization, Embedding, and Positional Encoding orchestrates the model’s ability to comprehend and process information in a way that is both sophisticated and groundbreaking.

In the upcoming installment, we’ll venture deeper, unraveling the mysteries of Multi-Head Attention, Feed Forward Networks, Add & Norm Layers, Linear Layers, and the indispensable Softmax Layer. Together, we’ll continue to decode the marvels that shape the landscape of modern artificial intelligence.
