Transformer Architecture Explained
The Key Insight: Attention Over Recurrence
Before transformers, sequence processing meant recurrent processing: reading one token at a time, updating a hidden state, passing it to the next step. This sequential nature created two problems. First, it was slow because each step depended on the previous step, preventing parallel computation. Training an RNN on a 1,000-token sequence meant 1,000 sequential operations, even on a GPU with thousands of cores. Second, long-range dependencies degraded because information had to survive through hundreds of sequential state updates.
The transformer's insight is that you do not need to process sequences sequentially. Instead, you can process all positions at once and use a learned attention mechanism to determine which positions are relevant to each other. A word at position 500 can directly attend to a word at position 1 without the information passing through 499 intermediate steps. This is both faster to train (the attention computations for all positions run in parallel) and better at long-range dependencies (any two positions are connected in a single step, so information does not have to survive a long chain of state updates).
Self-Attention: How It Works
Self-attention computes a new representation for each position in the sequence by taking a weighted average of all positions' representations, where the weights are determined by how relevant each position is to the current one.
The mechanism uses three learned projections for each position: a query (Q), a key (K), and a value (V). Think of it as a search system. The query is "what am I looking for?" The key is "what do I contain?" The value is "what information do I provide?" For each position, the attention score between its query and every position's key (its own included) determines how much that position's value contributes to the output.
The computation has four steps. First, project each input vector into Q, K, and V using learned weight matrices. Second, compute attention scores by taking the dot product of each query with all keys, then divide by the square root of the key dimension (to prevent the dot products from growing too large). Third, apply softmax to convert scores into weights that sum to 1. Fourth, compute the output as a weighted sum of the value vectors.
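The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the sizes, the random weights, and the names `softmax` and `self_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    # Step 1: project every input vector into a query, key, and value.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Step 2: dot each query with all keys, scaled by sqrt(key dimension).
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (seq_len, seq_len)
    # Step 3: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Step 4: each output is a weighted sum of the value vectors.
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.sum(axis=-1))
```

Row i of `weights` holds how much position i draws from every position in the sequence, so the matrix is seq_len by seq_len.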
For the sentence "The animal didn't cross the street because it was too tired," a trained model typically assigns a high attention weight between "it" and "animal" and a low weight between "it" and "street." This effectively resolves the pronoun reference, a task that requires understanding which entity "it" refers to. Nobody programs this behavior: the model learns it because linking pronouns to their antecedents is a useful feature for predicting the next word.
Multi-Head Attention
A single attention computation can only capture one type of relationship between positions. But language has many simultaneous relationship types: syntactic (subject-verb agreement), semantic (topic coherence), positional (nearby words are often related), and pragmatic (question-answer pairs).
Multi-head attention runs multiple attention computations in parallel, each with its own learned Q, K, and V projections. If the model has 12 attention heads, it computes 12 separate sets of attention weights and 12 separate weighted sums. These are concatenated and projected through a final weight matrix to produce the layer's output.
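As a rough sketch of the split, attend, concatenate, and project sequence, the NumPy code below runs 12 heads in parallel. The model width of 48 (so each head gets 4 dimensions), the random weights, and the function name are assumptions for illustration; real models use much larger sizes and trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model). Returns the output and the per-head weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every head computes its own attention weights and weighted sum.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 48, 12
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Splitting one wide projection into heads, rather than keeping 12 separate matrices, is a common implementation choice; the two are mathematically equivalent.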
Research into attention head specialization shows that different heads do learn different functions. Some heads consistently track syntactic dependencies (connecting verbs to their subjects). Others track coreference (connecting pronouns to their antecedents). Others respond to positional proximity (attending to the immediately preceding token). This specialization is not programmed; it emerges during training because dividing the work this way helps the model predict the next token.
The Transformer Block
A transformer block consists of two sub-layers, each with a residual connection and layer normalization:
Multi-head self-attention computes context-dependent representations by allowing each position to attend to all others. The residual connection adds the input to the attention output, and layer normalization stabilizes the values.
Position-wise feedforward network applies a two-layer neural network independently to each position. This typically expands the dimension by a factor of 4 (e.g., from 768 to 3,072), applies a GELU activation, then projects back down. The feedforward network is where much of the model's factual knowledge is believed to be stored, as opposed to the attention layers which primarily route information between positions.
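The two sub-layers can be sketched as follows. This is a simplified illustration under several assumptions: a single attention head stands in for multi-head attention, layer normalization has no learned gain or bias, normalization is applied after each residual addition (many recent models normalize before the sub-layer instead), GELU uses its common tanh approximation, and all sizes and weights are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def single_head_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # Expand to d_ff, apply the nonlinearity, project back to d_model.
    # Each position (row) is transformed independently.
    return gelu(x @ w1 + b1) @ w2 + b2

def transformer_block(x, attn_params, ffn_params):
    # Sub-layer 1: self-attention, residual connection, layer norm.
    x = layer_norm(x + single_head_attention(x, *attn_params))
    # Sub-layer 2: feedforward network, residual connection, layer norm.
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
d_ff = 4 * d_model  # the 4x expansion described above
x = rng.normal(size=(seq_len, d_model))
attn_params = tuple(rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
ffn_params = (
    rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model),
)

out = transformer_block(x, attn_params, ffn_params)
print(out.shape)
```

Because the block maps a (seq_len, d_model) array to another array of the same shape, blocks can be stacked by feeding one block's output into the next.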
A full transformer model stacks many of these blocks. The smallest GPT-2 has 12 blocks and the largest has 48, the largest GPT-3 has 96, and larger models may have 120+. Each block refines the representations produced by the previous block, building increasingly abstract and task-relevant features.
Positional Encoding
Self-attention is position-agnostic by default: the attention weight between two tokens depends only on their content, not on where they appear, so reordering the input simply reorders the output in the same way. But word order matters in language ("dog bites man" vs. "man bites dog"), so the model needs position information.
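A small experiment makes this concrete. The sketch below (random weights, made-up sizes, no positional information added) shuffles the input tokens and checks that the attention outputs are the same vectors in the shuffled order, which means the layer cannot distinguish one word order from another.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, no position signal
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([2, 0, 3, 1])                # an arbitrary reordering
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only shuffles the output rows in the same way
# (up to floating-point rounding).
same_up_to_order = np.allclose(out[perm], out_shuffled)
print(same_up_to_order)
```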
The original transformer used sinusoidal positional encodings, adding fixed sine and cosine functions of different frequencies to the input embeddings. Each position gets a unique vector, and the mathematical properties of sinusoids allow the model to learn to attend to relative positions (e.g., "the word 3 positions before me") without explicitly computing them.
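The sinusoidal scheme can be written compactly. The sketch below follows the usual formulation, with sine on even dimensions, cosine on odd dimensions, and wavelengths growing geometrically with a base of 10000. The function name and sizes are assumptions for illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of fixed positional encodings."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 0, 2, 4, ...
    # Each pair of dimensions uses its own frequency.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)

# The encodings are added to the token embeddings before the first block.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(50, 16))     # stand-in embeddings
model_input = token_embeddings + pe
```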
Modern models typically use learned positional embeddings (a separate embedding table indexed by position) or rotary positional encodings (RoPE), which encode position by rotating the query and key vectors. RoPE has the advantage of naturally capturing relative position (how far apart two tokens are) rather than absolute position (what position number each token occupies), and it extends more gracefully to sequence lengths longer than those seen during training.
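To illustrate why rotation captures relative position, the sketch below applies a RoPE-style rotation to one query and one key and compares their dot products at different absolute positions with the same offset. The pairing of adjacent dimensions, the base of 10000, and the names `rope` and `score` are assumptions for this example; real implementations differ in how they lay out the dimension pairs.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles.

    x: (seq_len, d) with d even; positions: (seq_len,)
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)        # one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(query_pos, key_pos):
    q_rot = rope(q, np.array([query_pos]))
    k_rot = rope(k, np.array([key_pos]))
    return float(q_rot[0] @ k_rot[0])

# Same offset (2 positions apart) at different absolute positions.
print(score(3, 1), score(10, 8))
```

The two printed scores should agree up to rounding, since rotating both vectors by position leaves a dot product that depends only on the difference between the two positions.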
Encoder vs. Decoder Transformers
The original transformer paper described an encoder-decoder architecture for machine translation. The encoder processes the input sentence bidirectionally (each position attends to all other positions), and the decoder generates the output sentence autoregressively (each position attends only to previous positions and to the encoder output).
Encoder-only models (BERT, RoBERTa) process the full input bidirectionally, producing rich contextual representations of each token. They excel at understanding tasks: classification, question answering, named entity recognition. BERT's masked language modeling objective trains the model to predict missing words from bidirectional context.
Decoder-only models (GPT, Claude, LLaMA) process input left-to-right with causal masking, where each position can only attend to previous positions. They are trained to predict the next token, making them natural text generators. The same architecture handles understanding tasks by treating them as generation tasks (e.g., generate the answer after reading the question).
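Causal masking is commonly implemented by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. The sketch below shows the idea on a dummy score matrix; the function name is an assumption for illustration.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores (seq_len, seq_len) with future positions hidden."""
    seq_len = scores.shape[-1]
    # True strictly above the diagonal: key position is later than the query.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)                    # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

# With all-equal scores, each position spreads its weight evenly over
# itself and the positions before it.
weights = causal_attention_weights(np.zeros((4, 4)))
print(weights)
```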
Encoder-decoder models (T5, BART) use both components. The encoder processes the full input, and the decoder generates the output while attending to the encoder's representations. These are natural for sequence-to-sequence tasks like translation, summarization, and text-to-SQL.
Decoder-only models have become the dominant paradigm because a single next-token-prediction architecture can handle both understanding and generation, and because their performance has improved predictably as parameters, data, and compute are scaled up.
Computational Cost and Optimization
The main computational bottleneck of transformers is the attention mechanism. Computing attention scores between all pairs of positions takes O(n^2) time and memory, where n is the sequence length. For a sequence of 100,000 tokens, this means 10 billion pairwise computations per attention layer, which is why long-context processing is expensive.
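The quadratic growth is easy to check with a few lines of arithmetic. The helper names below are hypothetical, and the 2 bytes per score (16-bit floats) is an assumption; the figures are for one attention head in one layer if the full score matrix were stored.

```python
def attention_score_entries(seq_len):
    # One score for every (query position, key position) pair.
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_value=2):
    # Assumes the full matrix is materialized at 2 bytes per score.
    return attention_score_entries(seq_len) * bytes_per_value

for n in (1_000, 10_000, 100_000):
    entries = attention_score_entries(n)
    gib = score_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}  scores={entries:>14,}  size={gib:,.2f} GiB")
```

Multiplying the sequence length by 10 multiplies the number of scores by 100. At 100,000 tokens the 10 billion scores would take roughly 18.6 GiB under these assumptions.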
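FlashAttention (2022) reorganizes the attention computation to be more memory-efficient by tiling the computation and keeping intermediate results in fast on-chip GPU SRAM rather than slower off-chip GPU memory (HBM). This does not change the O(n^2) amount of computation, but it avoids storing the full n-by-n score matrix, so memory use grows linearly with sequence length, and the reduced memory traffic makes attention roughly 2-4x faster in practice, making long-context processing practical.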
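The core trick that makes tiling possible is an "online" softmax: keys and values are processed one block at a time while a running maximum and running sum are maintained, so the full score matrix never has to exist at once. The NumPy sketch below illustrates only that idea and checks it against the naive computation. It is not the actual FlashAttention kernel, which is a fused GPU kernel that also tiles over queries. The function names and block size are assumptions.

```python
import numpy as np

def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])          # full (n, n) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block_size=4):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n, 1), -np.inf)   # largest score seen so far per query
    running_sum = np.zeros((n, 1))           # softmax denominator so far
    acc = np.zeros((n, v.shape[-1]))         # unnormalized weighted sum of values
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale            # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
matches = np.allclose(tiled_attention(q, k, v, block_size=4), naive_attention(q, k, v))
print(matches)
```

The tiled version performs the same number of multiplications as the naive one; what changes is that only one block of scores is live at a time.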
KV caching avoids redundant computation during autoregressive generation. When generating the 500th token, the keys and values for the first 499 tokens have not changed since they were computed. KV caching stores them, so each step only computes the query, key, and value for the newest token and attends over the cached keys and values. The cost is memory: the cache grows linearly with sequence length.
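A minimal sketch of the idea for a single attention layer and a single head follows. The dictionary-based cache, the function names, and the random inputs are assumptions for illustration; real systems keep one cache per layer and per head in preallocated tensors.

```python
import numpy as np

def attend(q, keys, values):
    """Attention output for one query over all available keys and values."""
    scores = keys @ q / np.sqrt(q.shape[-1])         # (t,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ values                          # (d,)

def decode_step(x_t, w_q, w_k, w_v, cache):
    # Only the newest token is projected; earlier keys and values are reused.
    cache["k"].append(x_t @ w_k)
    cache["v"].append(x_t @ w_v)
    q = x_t @ w_q
    return attend(q, np.stack(cache["k"]), np.stack(cache["v"]))

rng = np.random.default_rng(0)
d_model = 8
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
tokens = rng.normal(size=(5, d_model))               # stand-in token vectors

cache = {"k": [], "v": []}
for x_t in tokens:
    cached_out = decode_step(x_t, w_q, w_k, w_v, cache)

# Recomputing every key and value from scratch gives the same result
# for the last position.
full_out = attend(tokens[-1] @ w_q, tokens @ w_k, tokens @ w_v)
print(np.allclose(cached_out, full_out))
```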
Sparse attention methods (Longformer, BigBird) restrict each position to attending to only a subset of positions (local neighbors plus some global positions), reducing complexity to O(n) for a fixed window size, at the cost of not being able to attend to every position. For very long sequences, this tradeoff is often worthwhile.
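The pattern of local neighbors plus a few global positions can be pictured as a boolean mask over position pairs. The sketch below is a generic illustration rather than the exact pattern used by Longformer or BigBird, and the function name and parameters are assumptions. It builds a dense mask only to make the pattern visible; efficient implementations avoid materializing the full n-by-n matrix.

```python
import numpy as np

def sparse_attention_mask(seq_len, window, global_positions=()):
    """True where a query position (row) may attend to a key position (column)."""
    idx = np.arange(seq_len)
    # Local band: each position sees neighbors within `window` steps.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global positions see everything and are seen by everything.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

small = sparse_attention_mask(seq_len=8, window=1, global_positions=(0,))
print(small.astype(int))

# With a fixed window, the number of allowed pairs grows linearly with n.
for n in (1_000, 2_000, 4_000):
    allowed = int(sparse_attention_mask(n, window=2).sum())
    print(n, allowed, n * n)
```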
Why Transformers Dominate
Transformers dominate not because of any single property but because of a combination that no prior architecture matched. They parallelize efficiently on GPUs (unlike RNNs). They capture long-range dependencies directly (unlike RNNs and CNNs). They scale predictably with more parameters and data (scaling laws). They transfer across tasks (a pre-trained transformer can be fine-tuned for dozens of different tasks). And they are flexible enough to process text, images, audio, video, and structured data with minimal architectural changes.
The transformer is the foundation of the current AI revolution. Understanding it is essential for understanding anything that modern AI systems do.
The transformer processes sequences using self-attention, where every position computes weighted connections to every other position simultaneously. Multi-head attention captures multiple relationship types in parallel, and stacked transformer blocks build increasingly abstract representations. The architecture's parallelizability, long-range modeling, and predictable scaling have made it the universal foundation for language models, and it is rapidly expanding to vision, audio, and scientific computing.