Part 5: A Comprehensive Deep Dive into the Theory and Mechanisms of Transformer-Based Neural Networks
Welcome back to this journey where we delve once more into the core of artificial intelligence, continuing our exploration of the revolutionary realm of Transformers.
Throughout this series, we dissect the fundamentals, architecture, and internal mechanisms of Transformers. Here’s a quick summary of the Series:
Part 1 — Introduction to Transformers: A foundational exploration of the basics and the overarching architecture, setting the stage for a deeper dive into the transformative world of Transformers.
Part 2 — Pre-Processing in Transformers: Unraveling the intricacies of pre-processing with a spotlight on Positional Encoding and Embedding. Understand how these components lay the groundwork for the model’s comprehension.
Part 3 — Attention Mechanism Unveiled: Delving into the heart of Transformer functionality, we peel back the layers to explore the internal mechanisms, with a detailed examination of the central powerhouse — multi-head attention.
Part 4 — Training and Inference with Transformers: Unpacking the training and inference processes within Transformer-based models. Explore how these models learn and make predictions, offering insights into the dynamic world of AI training.
Part 5 (This article) — The Titans of Transformers (BERT and GPT): Concluding our series with a spotlight on BERT and GPT, two transformative models that have reshaped the landscape of natural language processing. Dive into their architectures and understand their impact on the field.
The fifth and last part will focus on two of the most revolutionary Transformer-based models: BERT and GPT. Both models have fundamentally transformed the field of natural language processing, each with a unique approach to understanding and generating human language. While BERT excels in comprehension through its bidirectional architecture, GPT has redefined language generation with its autoregressive design. This article will explore their architectures, training methods, and the groundbreaking impact they’ve had on AI, providing a deeper understanding of their roles in shaping modern NLP tasks.
Transformer-based models
Transformer-based models have become the backbone of modern natural language processing (NLP) due to their unparalleled ability to handle complex language tasks. At the core of these models is the Transformer architecture, which utilizes self-attention mechanisms to process entire sequences of text in parallel, capturing long-range dependencies more efficiently.
Transformer models are divided into three main branches: Encoder, Decoder, and Encoder-Decoder models, representing different ways Transformer architectures are used in NLP tasks.
1. Encoder Models
Encoder-only models primarily focus on tasks that involve understanding and interpreting text, such as classification or sentence understanding. Key examples include BERT (2018) by Google, one of the first widely adopted Transformer-based models, which uses a bidirectional encoder to capture context from both directions in a sentence; RoBERTa (2019) by Meta, a robustly optimized variant of BERT; and ALBERT (2020) by Google, a lighter, parameter-efficient variant of BERT…
2. Decoder Models
Decoder-only models focus on generation tasks, such as text completion or conversation, generating output one token at a time based on the previous tokens. Key examples include the GPT series by OpenAI: autoregressive models that revolutionized text generation, with GPT-3 and GPT-4 becoming famous for their ability to generate human-like text. ChatGPT by OpenAI is built on the GPT architecture but enhanced for conversational AI. PaLM (2022) and LaMDA (2021) by Google were designed for more advanced language understanding and interaction, often powering conversational agents such as Google Bard, which was introduced in 2023 and rebranded as Gemini in February 2024…
3. Encoder-Decoder Models
These models use both encoder and decoder components and are generally used for tasks that require both understanding and generating text, such as translation or summarization. Examples include T5 (2019) and Flan-T5 (2022) by Google, sequence-to-sequence models used for a wide range of tasks, from translation to summarization, and BART (2020) by Meta, which pairs a bidirectional encoder with an autoregressive decoder for improved generation tasks…
BERT (Bidirectional Encoder Representations from Transformers)
BERT, introduced by Google in 2018, is a pioneering Transformer-based model that utilizes an encoder-only architecture. This design allows BERT to excel at understanding the context and meaning of words within a sentence. Unlike models that process text sequentially (either from left to right or right to left), BERT uses a bidirectional approach, meaning it captures the full context of a word by looking at both the words that come before and after it in the sentence.
BERT was pre-trained on a vast amount of unlabeled textual data from sources like BookCorpus and Wikipedia, which provided it with diverse language patterns and contexts. This large-scale pre-training gives BERT a strong understanding of general language, allowing it to be fine-tuned for a variety of specific NLP tasks such as question answering, text classification, sentiment analysis and more...
BERT is pre-trained on two crucial NLP tasks to teach it how to understand and process natural language:
1. Masked Language Modeling (MLM):
- In this task, certain words in a sentence are randomly masked by replacing them with the [MASK] token. The goal of the model is to predict the original masked words based on the context provided by the surrounding words.
- For example, in the sentence “The dog [MASK] in the park,” BERT would predict the missing word “runs” by analyzing both the left and right context. This bidirectional training allows BERT to understand how words relate to each other within a sentence.
In BERT’s Masked Language Modeling (MLM) pre-training task, a specific portion of the input tokens is chosen at random, and these selected tokens are altered in various ways. The model is then tasked with predicting the original tokens at the output layer, allowing it to learn context and improve its understanding of language.
Here’s a breakdown of how this masking strategy works:
15% of the tokens in the input sentence are randomly selected for modification. These selected tokens undergo one of three possible changes:
- 80% of the selected tokens are replaced with the special [MASK] token: For example, the sentence “The dog is black” would become “The dog is [MASK]”, and BERT must predict the masked word (in this case, “black”) during training.
- 10% of the selected tokens are replaced with a random word from the model’s vocabulary: Using the same example, the sentence might change to “The apple is black” by replacing the selected word “dog” with an unrelated word, like “apple.” This teaches BERT to handle situations where the context doesn’t always align perfectly.
- The final 10% of the selected tokens are left unchanged: In this case, the sentence “The dog is black” remains exactly the same for some selected tokens, like “dog,” to maintain variety and help the model learn to predict tokens correctly even when some context remains unchanged.
After applying these modifications, the model attempts to predict the original tokens at the output layer, leveraging the surrounding context provided by the remaining tokens in the sentence. This training process, focused on learning from both left and right contexts, allows BERT to develop a deep understanding of word relationships and the broader meaning of the text.
This mix of masking, replacing, and leaving some tokens unchanged adds randomness and helps the model learn better by not overfitting to specific patterns. It enhances BERT’s ability to handle diverse and real-world language tasks, as it must rely on context to accurately predict the missing tokens.
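To make the 80/10/10 strategy concrete, here is a minimal Python sketch of how such masking could be applied to a tokenized sentence. It is an illustration under simplifying assumptions (whole words rather than WordPiece sub-tokens, a toy vocabulary, and a mask_tokens helper invented for this example), not BERT’s actual pre-processing code.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply BERT-style MLM corruption to a list of tokens.

    Returns the corrupted tokens plus the prediction targets
    (the original token at each selected position, None elsewhere).
    """
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:    # ~85% of tokens are not selected at all
            continue
        targets[i] = token                    # the model must recover the original token
        roll = random.random()
        if roll < 0.8:                        # 80% of selected tokens -> [MASK]
            corrupted[i] = "[MASK]"
        elif roll < 0.9:                      # 10% -> a random word from the vocabulary
            corrupted[i] = random.choice(vocab)
        # remaining 10% -> the token is left unchanged (but still predicted)
    return corrupted, targets

toy_vocab = ["the", "dog", "apple", "is", "black", "runs", "park"]
print(mask_tokens("the dog is black".split(), toy_vocab))
```

Note that even the unchanged 10% remain prediction targets, which pushes the model to build a useful representation for every selected position rather than only for [MASK] tokens.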
2. Next Sentence Prediction (NSP):
- In this task, BERT is given pairs of sentences and is trained to predict whether the second sentence follows the first in a natural sequence. This is framed as a binary classification problem with two possible labels: IsNext (if the second sentence follows the first) and NotNext (if it does not). During pre-training, half of the sentence pairs are genuine consecutive sentences and half pair the first sentence with a random sentence from the corpus.
- For example, given Sentence 1: “The weather is nice today.” and Sentence 2: “Let’s go for a walk.”, BERT would predict that these sentences are related and assign the label IsNext. Conversely, if the second sentence were unrelated (e.g., “I need to finish my homework”), BERT would assign the label NotNext.
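The sketch below shows how such sentence pairs could be assembled for pre-training; following the BERT paper, half of the pairs use the true next sentence and half substitute a random sentence from the corpus. The function name and the simple string packing with [CLS]/[SEP] are illustrative rather than the exact pipeline.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one Next Sentence Prediction training example from a document."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:                      # 50%: keep the real next sentence
        sentence_b, label = doc_sentences[i + 1], "IsNext"
    else:                                          # 50%: substitute a random sentence
        sentence_b, label = random.choice(corpus_sentences), "NotNext"
    # Both sentences are packed into one input sequence with special tokens
    packed = f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"
    return packed, label

document = ["The weather is nice today.", "Let's go for a walk."]
corpus = document + ["I need to finish my homework."]
print(make_nsp_example(document, corpus))
```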
BERT Architecture
BERT is built using stacked encoder layers, with the number of layers and other parameters varying between the base and large versions of the model. The two main variants are:
BERT Base (Trained on 4 TPUs over a period of 4 days):
- 12 layers (Transformer encoders)
- 768 hidden size
- 12 attention heads
- 110 million parameters
BERT Large (Trained on 16 TPUs over 4 days):
- 24 layers (Transformer encoders)
- 1024 hidden size
- 16 attention heads
- 340 million parameters
Each layer in BERT’s architecture consists of the self-attention mechanism and feed-forward neural networks, which allow the model to process input tokens by capturing relationships across the entire input sequence. The larger model (BERT Large) can capture more complex representations of language due to the increased depth and number of parameters.
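As a rough sanity check on these figures, the sketch below reconstructs the parameter counts from the hyperparameters above. It assumes the standard BERT configuration (a WordPiece vocabulary of 30,522 tokens, 512 positions, two segment types, and a feed-forward size of four times the hidden size) and counts only the encoder and pooler, so it is an approximation rather than an official accounting.

```python
def bert_param_count(vocab=30522, hidden=768, layers=12, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (weights and biases)."""
    ffn = 4 * hidden                                                  # feed-forward inner size
    embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden   # + embedding LayerNorm
    attention = 4 * (hidden * hidden + hidden)                        # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * (2 * hidden)           # two LayerNorms per layer
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(f"BERT Base:  ~{bert_param_count() / 1e6:.0f}M parameters")                        # ~109M
print(f"BERT Large: ~{bert_param_count(hidden=1024, layers=24) / 1e6:.0f}M parameters")  # ~335M
```

The results (roughly 109M and 335M) line up with the commonly quoted 110M and 340M figures.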
BERT Fine-tuning
After pre-training, BERT can be fine-tuned on specific downstream tasks by adding a small task-specific layer on top of the pre-trained BERT model. Fine-tuning involves additional training on labeled data for the specific task, during which the task-specific layer’s parameters are learned and BERT’s pre-trained weights are updated end-to-end. The versatility of BERT allows it to be applied to four broad categories of NLP tasks, as shown in the figure:
Sentence Pair Classification Tasks (Figure a):
- BERT takes in two sentences and predicts whether they are related or not, or if one is a logical continuation of the other.
Single Sentence Classification Tasks (Figure b):
- The model receives a single sentence and predicts a classification label, such as sentiment analysis or grammatical correctness.
Question Answering Tasks (Figure c):
- BERT is fine-tuned to locate the start and end span of an answer given a question and a corresponding paragraph.
Single Sentence Tagging Tasks (Figure d):
- BERT assigns a label to each token in the sentence, e.g., identifying named entities such as people or locations (Named Entity Recognition, NER).
In each fine-tuning task, the specific objective (classification, tagging, etc.) is learned by the model on top of the general language understanding gained during pre-training. Fine-tuning BERT for a particular task only requires minimal additional parameters, which makes it an efficient and effective approach for a wide range of NLP applications.
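As a concrete example of the single-sentence classification setting (Figure b), here is a minimal fine-tuning sketch that assumes the Hugging Face transformers library is installed; the checkpoint name, label convention, and learning rate are illustrative choices, and a real setup would loop over a labeled dataset rather than a single example.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # adds a small, randomly initialized head

inputs = tokenizer("This movie was wonderful!", return_tensors="pt")
labels = torch.tensor([1])                      # our convention: 1 = positive, 0 = negative

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)        # forward pass also computes the classification loss
outputs.loss.backward()                         # gradients flow into both the head and BERT itself
optimizer.step()                                # one fine-tuning step
```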
GPT (Generative Pre-trained Transformer)
GPT, introduced by OpenAI in 2018, is a decoder-only Transformer model. Unlike models like BERT, which leverage both left and right contexts using an encoder, GPT processes information sequentially from left to right. This makes it effective for language generation tasks, where predicting the next token based on previous tokens is crucial.
Architecture Overview
The provided diagram illustrates the architecture of GPT, showcasing how it builds upon stacked Transformer decoder blocks. Each block comprises multiple sub-layers that process inputs to generate coherent text. The main components are the following (a compact code sketch of the overall flow appears after the list):
- Embedding Layer: Converts input tokens into dense vector representations.
- Positional Encoding: Adds positional information to the embeddings, allowing the model to understand token order within the input sequence.
- Dropout Layer: Introduced to prevent overfitting by randomly deactivating neurons during training.
- Stacked GPT Blocks (decoder layers): The architecture stacks multiple identical decoder layers; each layer refines the token representations produced by the layer before it.
- Layer Normalization: Normalizes inputs to stabilize training and improve convergence.
- Fully Connected Layer: After the final layer normalization, a linear projection maps each token’s hidden representation to a score (logit) for every token in the vocabulary.
- Softmax Layer: In the final stage, the outputs are converted into probabilities over the vocabulary space using the Softmax function. This allows the model to predict the most likely next token based on previous inputs.
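Putting these pieces together, here is a compact PyTorch sketch of the forward flow just described. It uses nn.TransformerEncoderLayer with a causal mask as a stand-in for a GPT block (GPT blocks contain no cross-attention), so it mirrors the layer order in the diagram rather than reproducing OpenAI’s implementation; all sizes are toy values.

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len, n_layers = 1000, 64, 32, 2

tok_emb = nn.Embedding(vocab_size, d_model)      # Embedding layer
pos_emb = nn.Embedding(max_len, d_model)         # Positional information (learned here)
drop = nn.Dropout(0.1)                           # Dropout layer
blocks = nn.ModuleList([                         # Stacked GPT-style blocks
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True, norm_first=True)
    for _ in range(n_layers)])
ln_f = nn.LayerNorm(d_model)                     # Final layer normalization
lm_head = nn.Linear(d_model, vocab_size)         # Fully connected (output) layer

tokens = torch.randint(0, vocab_size, (1, 10))   # one sequence of 10 token IDs
causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))

x = drop(tok_emb(tokens) + pos_emb(torch.arange(tokens.size(1))))
for block in blocks:
    x = block(x, src_mask=causal_mask)           # each token attends only to itself and the past
probs = torch.softmax(lm_head(ln_f(x)), dim=-1)  # Softmax over the vocabulary
print(probs.shape)                               # torch.Size([1, 10, 1000])
```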
GPT Blocks (Decoder layers): Each decoder layer consists of the following sub-layers (a from-scratch sketch of one block follows the list):
- Layer Normalization: Normalizes the inputs by re-centering and re-scaling, which helps stabilize and speed up training by improving convergence.
- Multi-Head Attention Module: This mechanism allows the model to focus on relevant parts of the previous tokens. The attention is causal, meaning each token can only attend to itself and prior tokens, ensuring predictions do not “cheat” by seeing future tokens.
- Dropout Layer: Prevents overfitting by randomly deactivating connections between neurons during the training phase.
- Residual (Skip) connections: Skip connections help preserve the original input signal by adding the input of each sub-layer to its output. This helps gradients flow efficiently through the network during backpropagation, mitigating the vanishing gradient problem.
- Fully Connected Layer: A feed-forward neural network layer applied to transform the token embeddings.
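Below is a from-scratch PyTorch sketch of a single block containing exactly these sub-layers, in the pre-layer-norm arrangement used from GPT-2 onward; the dimensions, head count, and dropout rate are illustrative.

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)                       # layer normalization
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                              # fully connected (feed-forward) layer
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.drop = nn.Dropout(dropout)                        # dropout layer

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True entries are blocked, so position i sees only positions <= i
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)     # causal multi-head self-attention
        x = x + self.drop(attn_out)                            # residual (skip) connection
        x = x + self.drop(self.ffn(self.ln2(x)))               # second residual connection
        return x

block = GPTBlock()
x = torch.randn(1, 10, 64)          # (batch, sequence length, model dimension)
print(block(x).shape)               # torch.Size([1, 10, 64])
```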
GPT versions
This table highlights the evolution of OpenAI’s GPT-n series, tracking advancements from GPT-1 to GPT-4. Let’s walk through the key characteristics and improvements across these versions:
GPT-1 (2018)
- Architecture: 12-layer, 12-headed Transformer decoder with a linear-softmax layer.
- Parameter Count: 117 million parameters.
- Training Data: 4.5 GB from BookCorpus, containing 7,000 unpublished books.
- Release Date: June 11, 2018.
- Training Cost: 30 days on 8 NVIDIA P600 GPUs (~1 petaFLOP/s-day).
- Significance: Introduced the autoregressive transformer architecture for text generation. This was the foundation for larger language models.
GPT-2 (2019)
- Architecture: Built on GPT-1 with modified normalization techniques.
- Parameter Count: 1.5 billion parameters.
- Training Data: 40 GB of text from WebText (~8 million documents), a corpus of webpages gathered from roughly 45 million outbound links shared on Reddit.
- Release Date: Full version was released on November 5, 2019.
- Training Cost: Tens of petaFLOP/s-days (approximately 1.5e21 FLOP total).
- Significance: Demonstrated large-scale text generation.
GPT-3 (2020)
- Architecture: Similar to GPT-2, but scaled up with more layers and parameters.
- Parameter Count: 175 billion parameters.
- Training Data: 499 billion tokens sourced from CommonCrawl (570 GB), WebText, English Wikipedia, and two books corpora (Books1 and Books2).
- Release Date: May 28, 2020.
- Training Cost: 3640 petaFLOP/s-days (~3.12e23 FLOP total).
- Significance: A breakthrough in natural language understanding and generation, setting new state-of-the-art benchmarks across multiple NLP tasks. GPT-3 became the foundation for products like ChatGPT.
GPT-3.5 (2022)
- Architecture: Undisclosed.
- Parameter Count: Same as GPT-3 (175 billion).
- Training Data: Also undisclosed, but closely related to GPT-3.
- Release Date: March 15, 2022.
- Significance: Enhanced response quality and contextual understanding compared to GPT-3.
GPT-4 (2023)
- Architecture: Undisclosed; introduces multimodality (accepts both text and images as input).
- Parameter Count: Estimated 1.7 trillion parameters.
- Training Data and Cost: Details remain largely undisclosed, but training cost is estimated at 2.1e25 FLOP.
- Release Date: March 14, 2023.
- Significance: Supports multimodal inputs (text and images) and represents the next generation of models, showing marked improvements in reasoning and creative writing tasks.
This overview demonstrates how GPT models have continuously pushed the boundaries of natural language processing and generation, with GPT-4 standing as the most advanced version to date.
Conclusion
In this final part of our series, we explored BERT and GPT, two transformative models that have revolutionized natural language processing. BERT’s bidirectional encoding enables rich contextual understanding, excelling in tasks like question answering and sentiment analysis, while GPT’s autoregressive design empowers it to generate coherent, human-like text. These models highlight the versatility of the Transformer architecture and demonstrate how innovations — whether through fine-tuning or scaling — continue to push the boundaries of AI.
Across the series, we journeyed through the theory and mechanisms of transformer-based neural networks, demonstrating how models based on this technology have redefined the field of NLP. As research continues to focus on scalability, efficiency, and responsible AI, transformers will remain at the forefront of innovation, opening new frontiers. With this series, I hope to have provided you with a comprehensive understanding of these groundbreaking models, their inner workings, and their future potential in the ever-evolving landscape of AI.