How Transformers Work: The Architecture Behind Modern AI
Why Transformers Were Needed
Before transformers, sequence processing was dominated by recurrent neural networks, which read input one element at a time and accumulated context in a hidden state. This sequential processing had two major problems. First, it was slow: because each step depended on the previous step's output, the computation could not be parallelized across time steps. Training an RNN on a long sequence required processing each element in order, which underutilized modern GPUs designed for massive parallelism. Second, despite innovations like LSTM and GRU gates, RNNs still struggled with very long-range dependencies because information had to propagate through every intermediate time step to travel from one position to a distant one.
Attention mechanisms, first introduced as an addition to RNNs for machine translation in 2014, provided a partial solution. Instead of relying solely on the hidden state to carry information between positions, attention allowed the decoder to look directly at any position in the input sequence when generating each output word. This dramatically improved translation quality, especially for long sentences. The transformer architecture took this idea to its logical extreme: remove the recurrence entirely and use attention as the only mechanism for communicating between positions.
Self-Attention: The Core Mechanism
Self-attention is the operation that lets each position in a sequence attend to every other position. For a sentence of 10 words, self-attention computes a 10x10 matrix of attention weights that describes how strongly each word relates to every other word. The word "it" in "The cat sat on the mat because it was tired" should attend strongly to "cat" to resolve the pronoun reference, and a trained attention head can learn to do this from data alone.
The computation uses three learned linear projections of each input position. For each input vector (representing a word or token), the network computes a Query (Q), a Key (K), and a Value (V) by multiplying the input by three separate weight matrices. The Query represents "what am I looking for?" The Key represents "what do I contain?" The Value represents "what information do I provide when attended to?" The raw attention score between position i and position j is the dot product of position i's Query with position j's Key, divided by the square root of the key dimension. Without this scaling, dot products grow with the key dimension and push the softmax into a regime where almost all the weight falls on a single position.
The formula is: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V. The softmax is applied to each row of scores, converting the raw dot products into a probability distribution so that the attention weights for each position sum to 1. The result is a weighted average of the Value vectors, where the weights are determined by how well each Key matches the Query. Positions with highly matching query-key pairs contribute more to the output. Apart from the row-wise softmax, the computation consists of two matrix multiplications (Q with K^T, then the weights with V), which makes it highly parallelizable on GPUs.
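The formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the sequence length, dimensions, and random weight matrices (W_q, W_k, W_v) are assumptions chosen only to make the example runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) raw scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted average of the Values

# Hypothetical sizes: 10 tokens, 16-dimensional inputs, 8-dimensional Q/K/V.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 10, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.shape)  # (10, 10): one row of weights per query position
print(output.shape)   # (10, 8)
```

Row i of weights is the distribution that position i uses to average the Value vectors, which is the 10x10 matrix described earlier.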
Multi-Head Attention
A single attention computation captures one type of relationship between positions. But language has many simultaneous types of relationships: syntactic dependencies, semantic similarity, coreference, and positional proximity all matter at the same time. Multi-head attention runs several attention computations in parallel, each with its own set of Q, K, and V weight matrices. Transformers typically use from 8 to several dozen attention heads per layer, and the largest models use more.
Each head operates on a lower-dimensional projection of the input. If the model dimension is 768 and there are 12 heads, each head works with 64-dimensional queries, keys, and values (768 / 12 = 64). The outputs of all heads are concatenated and projected back to the full model dimension. During training, different heads tend to specialize in different types of relationships. Researchers have found heads that track syntactic structure, heads that attend to nearby positions, heads that connect pronouns to their antecedents, and heads that focus on specific semantic roles.
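The split, attend, concatenate, and project steps can be sketched as follows, using the 768-dimension, 12-head example from the text. The weight matrices are random placeholders and the output projection name W_o is an assumption of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # 768 // 12 = 64 in this example

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Each head computes its own (seq_len, seq_len) attention matrix.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (num_heads, seq_len, d_head)
    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 10, 768, 12
X = rng.normal(size=(seq_len, d_model))
# Random placeholder weights; a trained model would have learned these.
W_q, W_k, W_v, W_o = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)  # (10, 768)
```

Computing one large projection and reshaping it into heads is equivalent to giving each head its own smaller weight matrices, and it keeps the work in a few large matrix multiplications.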
The Complete Transformer Block
A transformer block consists of two sub-layers: a multi-head self-attention layer and a position-wise feedforward network. Each sub-layer is wrapped in a residual connection and layer normalization. The residual connection adds the input of the sub-layer to its output, which gives gradient signals a direct path through the block and keeps them from shrinking as they pass through many layers. Layer normalization stabilizes the activations by normalizing each position's vector across the feature dimension.
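A minimal sketch of the residual-plus-normalization wrapper is shown below. It assumes the ordering LayerNorm(x + Sublayer(x)) used in the original transformer; many later models normalize before the sub-layer instead. The names residual_sublayer and toy_sublayer are hypothetical, and the toy sub-layer stands in for attention or the feedforward network.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer, gamma, beta):
    # Residual connection: add the sub-layer's input to its output,
    # then apply layer normalization (post-norm ordering, an assumption).
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16
x = rng.normal(size=(seq_len, d_model))
W = rng.normal(scale=0.1, size=(d_model, d_model))

def toy_sublayer(h):
    # Placeholder for self-attention or the feedforward network.
    return h @ W

gamma, beta = np.ones(d_model), np.zeros(d_model)
normed = residual_sublayer(x, toy_sublayer, gamma, beta)
print(normed.shape)  # (10, 16)
```

With gamma set to ones and beta to zeros, each output row has approximately zero mean and unit variance; in a trained model both are learned parameters.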
The feedforward network is a simple two-layer fully connected network applied independently to each position. It typically expands the dimension by a factor of 4 (from 768 to 3072 in a base model), applies a non-linear activation like GELU or SiLU, and projects back to the original dimension. This feedforward network is widely thought to be where much of the "knowledge" in a transformer is stored, as it provides per-position non-linear transformation capacity that the attention sub-layer does not supply. With a 4x expansion, the feedforward layers contain roughly two-thirds of the parameters in each block: for model dimension d, the feedforward network holds about 8d^2 weights and the four attention projections about 4d^2. Embedding tables are not included in this count.
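The sketch below applies a position-wise feedforward network with the 768-to-3072 expansion described above and then counts the parameters in one block. It assumes the tanh approximation of GELU, bias terms on every linear layer, and random placeholder weights; layer normalization parameters are left out of the count.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # Applied to every position independently: expand, activate, project back.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 768, 3072, 10
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

ffn_out = feed_forward(x, W1, b1, W2, b2)
print(ffn_out.shape)  # (10, 768)

# Parameter count for one block (weights plus biases).
ffn_params = W1.size + b1.size + W2.size + b2.size
attn_params = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
ffn_fraction = ffn_params / (ffn_params + attn_params)
print(ffn_params, attn_params, round(ffn_fraction, 3))
```

The fraction comes out at about 0.667, which matches the two-thirds figure for a block with a 4x feedforward expansion.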
A complete transformer stacks many of these blocks sequentially. BERT-base has 12 blocks, GPT-3 has 96 blocks, and larger models have even more. Each block refines the representation, with early blocks tending to capture local and syntactic patterns and later blocks capturing increasingly abstract semantic relationships. The total parameter count is determined mainly by the model dimension, the feedforward dimension, the number of blocks, and the vocabulary size. When each head's dimension is the model dimension divided by the number of heads, changing the head count does not change the parameter count.
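A rough back-of-the-envelope estimate of the parameters in the stacked blocks can be sketched as below. The function name is hypothetical, and the estimate ignores biases, layer normalization, and embedding tables. The model dimension of 12288 for GPT-3 is an assumption not stated in the text above.

```python
def estimate_block_params(d_model, num_blocks, d_ff=None):
    # Assumes a 4x feedforward expansion unless d_ff is given.
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * d_model * d_model  # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff  # expansion and projection matrices
    return num_blocks * (attention + feed_forward)

bert_base_blocks = estimate_block_params(d_model=768, num_blocks=12)
gpt3_blocks = estimate_block_params(d_model=12288, num_blocks=96)
print(bert_base_blocks)  # 84934656, roughly 85 million
print(gpt3_blocks)       # 173946175488, roughly 174 billion
```

The estimate for the BERT-base blocks falls below the model's published total because embeddings are excluded, and the head count does not appear in the formula at all.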
Positional Encoding
Because self-attention treats the input as an unordered set (reordering the input tokens simply reorders the outputs in the same way, with no other change), the transformer needs an explicit mechanism to encode position information. Without this, the model would see "the cat sat on the mat" and "the mat sat on the cat" as the same collection of words and could not tell them apart. In the original design, positional encodings are added to the input embeddings before they enter the transformer blocks.
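The order-blindness of self-attention can be checked directly: shuffling the input rows shuffles the output rows in exactly the same way. The sketch below uses random placeholder weights and hypothetical sizes and includes no positional encoding.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)  # a random reordering of the tokens
out_original = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Each token's output is unchanged by the shuffle; only the row order differs.
same_up_to_order = np.allclose(out_shuffled, out_original[perm])
print(same_up_to_order)  # True
```

Because each token's output vector is the same wherever the token appears, any downstream computation has no way to recover word order unless position is injected separately.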
The original transformer used sinusoidal positional encodings: a fixed pattern of sine and cosine functions at different frequencies for each position. This has the nice property that the encoding for position p+k can be expressed as a linear function of the encoding for position p, which theoretically allows the model to learn relative position easily. Most modern transformers use learned positional encodings, where the encoding for each position is a trainable parameter, or rotary positional embeddings (RoPE), which encode relative position directly in the attention computation.
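A sketch of the sinusoidal scheme follows, assuming an even model dimension and the base of 10000 from the original transformer paper. The helper shift_encoding is a hypothetical name; it illustrates the linear relationship mentioned above by rotating each sine and cosine pair.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Even feature indices hold sines, odd indices hold cosines.
    positions = np.arange(max_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_encoding(pe_row, k, d_model):
    # Encoding of position p + k as a linear function (a rotation of each
    # sine/cosine pair) of the encoding of position p.
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    sin_p, cos_p = pe_row[0::2], pe_row[1::2]
    sin_k, cos_k = np.sin(k * freqs), np.cos(k * freqs)
    shifted = np.empty_like(pe_row)
    shifted[0::2] = sin_p * cos_k + cos_p * sin_k  # sin(a + b)
    shifted[1::2] = cos_p * cos_k - sin_p * sin_k  # cos(a + b)
    return shifted

d_model = 16
pe = sinusoidal_positional_encoding(max_len=50, d_model=d_model)
print(pe.shape)  # (50, 16)
print(np.allclose(shift_encoding(pe[5], 3, d_model), pe[8]))  # True
```

The rotation depends only on the offset k and not on the starting position, which is why relative position is easy to express with these encodings.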
Encoder-Decoder vs Decoder-Only
The original transformer had two components: an encoder that processed the input sequence with bidirectional attention (each position attends to all positions) and a decoder that generated the output sequence with causal attention (each position only attends to itself and earlier positions). The decoder also attended to the encoder's output, which is how information from the input reached the generated text. This encoder-decoder architecture is natural for tasks like translation, where the entire input is available before generation begins.
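The difference between bidirectional and causal attention comes down to a mask applied to the scores before the softmax. The sketch below assumes the common convention that a position may attend to itself as well as to earlier positions; the function names are hypothetical.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))

bidirectional = masked_softmax(scores, np.ones((seq_len, seq_len), dtype=bool))
causal = masked_softmax(scores, causal_mask(seq_len))
print(np.round(causal, 2))  # upper triangle is all zeros
```

In the causal case the first position can only attend to itself, so its single weight is 1, while the last position spreads its weight over the whole sequence.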
BERT (2018) demonstrated that an encoder-only transformer, trained to predict masked words in a sentence, could produce powerful general-purpose text representations. Fine-tuning BERT on specific tasks set new records across a wide range of NLP benchmarks. The bidirectional attention was key: understanding a masked word benefits from context on both sides.
GPT (2018) took the opposite approach: a decoder-only transformer trained to predict the next token in a sequence. This autoregressive design generates text one token at a time, with each new token conditioned on all previously generated tokens. The decoder-only architecture became the dominant choice for large generative models, and most major LLMs since GPT-2 have been decoder-only. GPT-3, LLaMA, and Mistral follow this pattern, and GPT-4, Claude, and Gemini are generally understood to do so as well.
Why Transformers Scale So Well
Transformers have a favorable scaling property: they improve consistently with more data, more parameters, and more compute. Scaling laws, first documented systematically by Kaplan et al. in 2020, show that the loss of a transformer language model follows a power-law relationship with each of these three factors, provided the other two are not the bottleneck. Doubling the parameter count gives a predictable improvement. Doubling the training data gives a predictable improvement. Within the ranges studied, these improvements showed no sign of plateauing.
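The sketch below shows what "predictable improvement" means under a power law of the form loss = (n_c / n) ** alpha. The constants alpha and n_c are illustrative values assumed for this example; they are not fitted here and should not be read as measurements.

```python
import math

def power_law_loss(n_params, n_c, alpha):
    # Loss as a power law in parameter count, assuming data and compute
    # are not the limiting factors.
    return (n_c / n_params) ** alpha

# Illustrative constants (assumptions for this sketch).
alpha = 0.076
n_c = 8.8e13

loss_1b = power_law_loss(1e9, n_c, alpha)
loss_2b = power_law_loss(2e9, n_c, alpha)
ratio = loss_2b / loss_1b

# Under a power law, doubling the parameter count always multiplies the
# loss by the same factor, 2 ** -alpha, whatever the starting size.
print(round(ratio, 4), round(2 ** -alpha, 4))
```

With this exponent each doubling lowers the loss by about 5 percent, and the same factor applies when going from 1 billion to 2 billion parameters as from 100 billion to 200 billion.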
The parallelism of the architecture is critical to this scaling. Because self-attention processes all positions simultaneously during training, transformers make good use of modern GPU hardware that is designed for massive parallel computation. Training an RNN on a sequence of 1000 tokens requires 1000 sequential steps. Training a transformer on the same sequence requires a number of sequential steps that depends only on the depth of the network (typically 12 to 96 layers), regardless of sequence length. The total amount of computation still grows with model size and sequence length, but it can be spread across many processors instead of being forced through a long sequential chain, so larger models translate into higher hardware utilization rather than proportionally longer wall-clock training times.
The downside is the quadratic cost of self-attention. Computing attention weights for a sequence of length N requires an N x N attention matrix, so both computation and, in a straightforward implementation, memory usage grow quadratically with sequence length. For a sequence of 100,000 tokens, this matrix would have 10 billion entries for each head in each layer. Various approaches address this: sparse attention patterns, linear attention approximations, and techniques like FlashAttention, which computes exact attention in tiles without ever storing the full N x N matrix in GPU memory. This reduces the practical cost substantially even though the amount of computation remains quadratic.
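The arithmetic behind the 10 billion figure can be checked with a short calculation. The choice of 2 bytes per entry (16-bit floats) is an assumption of this sketch, and the numbers are for a single attention matrix, that is, one head in one layer.

```python
def attention_matrix_entries(seq_len):
    # One weight for every (query position, key position) pair.
    return seq_len * seq_len

def attention_matrix_bytes(seq_len, bytes_per_entry=2):
    # bytes_per_entry=2 assumes 16-bit floating point storage.
    return attention_matrix_entries(seq_len) * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    entries = attention_matrix_entries(n)
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(n, entries, gigabytes)

entries_100k = attention_matrix_entries(100_000)  # 10_000_000_000
bytes_100k = attention_matrix_bytes(100_000)      # 20_000_000_000, about 20 GB
```

Multiplying the sequence length by 10 multiplies the matrix size by 100, which is why storing the full matrix becomes impractical long before the model weights themselves do.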
Transformers process sequences using self-attention, which computes relationships between all pairs of positions simultaneously. This parallelizable design, combined with predictable scaling behavior, has made transformers the dominant architecture of modern AI, powering language models, vision systems, and scientific applications across a wide range of domains.