How Large Language Models Work: The Technology Behind GPT, Claude, and LLaMA
The Core Mechanism: Next-Token Prediction
Every large language model is fundamentally a next-token predictor. Given a sequence of tokens (words or word pieces), the model outputs a probability distribution over all possible next tokens. If the input is "The capital of France is," the model assigns high probability to "Paris" and low probability to most other tokens. Training maximizes the probability of the actual next token across trillions of examples drawn from the internet, books, code repositories, and other text sources.
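To make the objective concrete, here is a minimal sketch of how raw scores become a probability distribution and how the loss for a single position is computed. The four-token vocabulary and the score values are invented for illustration; a real model has a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical toy vocabulary and scores for the prompt "The capital of France is"
vocab = ["Paris", "London", "banana", "the"]
logits = [6.0, 2.0, -1.0, 0.5]

probs = softmax(logits)
loss_if_paris = next_token_loss(logits, vocab.index("Paris"))
loss_if_banana = next_token_loss(logits, vocab.index("banana"))
```

The loss is small when the model already puts most of its probability on the true next token and large when it does not, so minimizing it pushes probability toward the tokens that actually occur.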
This training objective is deceptively simple, but its implications at scale are profound. To predict the next word accurately, the model must learn grammar (which word forms are syntactically valid), semantics (what words mean and how they relate), world knowledge (that Paris is the capital of France), reasoning patterns (if A implies B and B implies C, then A implies C), and even social conventions (how formal versus informal text differs). All of these capabilities emerge from the single objective of predicting the next token, because accurate prediction requires understanding the text at every level.
Generation works by repeatedly sampling from the predicted distribution. The model predicts a distribution over next tokens, one token is sampled from that distribution (with temperature and top-p controlling randomness), that token is appended to the input, and the process repeats. At roughly 1.4 tokens per English word, a 500-word essay requires approximately 700 sequential prediction steps. Each step involves a forward pass through every layer of the model, and the steps cannot run in parallel because each one depends on the token chosen before it. This is why generation is computationally expensive and why inference optimization is a major engineering challenge.
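The loop described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not any particular library's implementation: `toy_model` is a hypothetical stand-in that returns one score per token in a five-token vocabulary, and tokens are plain integers.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def generate(model, prompt_tokens, max_new_tokens, eos_token=None, **sampling):
    """Autoregressive loop: predict, sample, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = sample_next(model(tokens), **sampling)
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for an LLM: strongly prefers (last token + 1) mod 5."""
    preferred = (tokens[-1] + 1) % 5
    return [5.0 if i == preferred else 0.0 for i in range(5)]
```

Lowering the temperature sharpens the distribution and lowering top-p discards the unlikely tail, so both make the output more predictable. With this toy model, `generate(toy_model, [0], 4, top_p=0.5)` keeps only the single most likely token at each step.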
Architecture and Scale
Nearly all major LLMs are built on the decoder-only transformer architecture: a stack of transformer blocks, each containing causal self-attention (each token attends only to itself and previous tokens, not future ones) and a feedforward network, wrapped in residual connections and layer normalization. The model takes token embeddings as input and produces logits (unnormalized scores that a softmax turns into probabilities) over the vocabulary as output.
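The causal masking at the heart of this architecture can be shown with a single attention head. The sketch below is deliberately stripped down: it omits the learned projection matrices, multiple heads, residual connections, layer normalization, and the feedforward network, and it takes the query, key, and value vectors as given.

```python
import math

def causal_self_attention(queries, keys, values):
    """Single-head causal attention over lists of vectors (one vector per token)."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: position t may only look at positions 0..t.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights))
                        for j in range(len(values[0]))])
    return outputs
```

Because position t never reads positions after t, changing a later token cannot alter the outputs for earlier tokens. That property is what lets the model be trained on every position of a sequence at once while still generating one token at a time.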
The scale of modern LLMs is staggering. GPT-3 (2020) had 175 billion parameters and was trained on 300 billion tokens. GPT-4 (2023) is estimated to have over a trillion parameters, though OpenAI has not published the figure. LLaMA 3 (2024) ranges from 8 to 405 billion parameters. Current frontier models are trained on trillions of tokens, requiring thousands of GPUs running for months. The total compute for training a frontier model in 2026 exceeds 10^25 FLOPs (floating-point operations, not operations per second). For scale, a laptop sustaining about one teraFLOP per second performs roughly 3 x 10^19 operations in a year, so 10^25 operations is on the order of 300,000 such laptops running continuously for a year.
The parameter count determines the model's capacity: how much knowledge and how many patterns it can store. The amount of training data determines how much of that capacity is utilized. Scaling laws, first documented by Kaplan et al. (2020) at OpenAI, show that loss decreases as a power law with both parameter count and training data. The Chinchilla scaling law (Hoffmann et al., 2022) showed that previous models were undertrained: for a fixed compute budget, the best performance comes from training a smaller model on more data rather than a larger model on less data. Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens, outperformed GPT-3, a 175-billion-parameter model trained on 300 billion tokens.
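A rough way to compare training runs is the widely used approximation that training compute is about 6 x N x D floating-point operations, where N is the parameter count and D is the number of training tokens. That rule of thumb is not stated in the text above and is only an estimate; the sketch below applies it to the two configurations mentioned.

```python
def training_flops(params, tokens):
    """Approximate training compute with the rule of thumb C ~ 6 * N * D."""
    return 6 * params * tokens

larger_model = training_flops(175e9, 300e9)   # about 3.2e23 operations
smaller_model = training_flops(70e9, 1.4e12)  # about 5.9e23 operations
tokens_per_param = 1.4e12 / 70e9              # about 20 tokens per parameter
```

Under this estimate the 70B run actually uses somewhat more total compute than the 175B run, so the comparison is best read as a statement about how to split a budget between model size and data. The ratio of about 20 training tokens per parameter is often quoted as the Chinchilla rule of thumb.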
The Training Pipeline
Pre-Training
Pre-training teaches the model general language understanding through next-token prediction on a massive text corpus. The training data is curated from web crawls (Common Crawl, filtered for quality), books, Wikipedia, academic papers, code repositories (GitHub, Stack Overflow), and sometimes proprietary datasets. Data quality matters enormously: deduplication, filtering for quality and safety, balancing domains, and removing personal information are substantial engineering efforts that directly affect model capability.
Pre-training typically uses the AdamW optimizer with linear warmup followed by cosine learning rate decay. Smaller training runs may make several passes (epochs) over the dataset, but most frontier models see each token about once or less, because the corpus is large enough that a single pass, or part of one, supplies all the training tokens the run needs. Mixed-precision training (BF16) is standard, and distributed training across hundreds or thousands of GPUs uses combinations of data parallelism, tensor parallelism, and pipeline parallelism.
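The warmup-then-cosine schedule mentioned above can be written as a small function of the step number. This is a generic sketch; the parameter names and the optional `min_lr` floor are illustrative choices, not settings from any specific training run.

```python
import math

def learning_rate(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs past total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

Warmup avoids large, unstable updates while the optimizer's statistics are still settling, and the cosine tail lowers the step size gradually as training approaches its end.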
Supervised Fine-Tuning (SFT)
The pre-trained model generates plausible text but does not follow instructions, answer questions helpfully, or avoid generating harmful content. Supervised fine-tuning trains the model on a curated dataset of (instruction, response) pairs, where human writers have crafted high-quality responses to a diverse set of prompts. This dataset is much smaller than the pre-training corpus (thousands to millions of examples versus trillions of tokens) but shapes the model's behavior from a text predictor into a helpful assistant.
The quality of the SFT data is more important than its quantity. A small dataset of carefully written, thoughtful responses produces a better assistant than a large dataset of mediocre responses. The instruction distribution matters too: the SFT data should cover the range of tasks users will actually request, from creative writing to coding to analysis to conversation, in proportions that reflect expected usage.
Reinforcement Learning from Human Feedback (RLHF)
RLHF further aligns the model's behavior with human preferences. The process has two stages. First, a reward model is trained on human comparisons: annotators are shown pairs of model responses and indicate which one is better. The reward model learns to predict which response a human would prefer. Second, the language model is fine-tuned using reinforcement learning (typically PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking, where the model finds responses that score highly with the reward model but are not actually good).
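Both stages can be summarized with two small formulas. The sketch below assumes a pairwise logistic (Bradley-Terry style) loss for the reward model and a log-probability penalty for staying close to the SFT model. These are common formulations, but the text above does not specify them, and the coefficient `beta` is a hypothetical value.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for the reward model: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

def kl_shaped_reward(reward, logprob_policy, logprob_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logprob_policy and logprob_sft are the log-probabilities that the current
    policy and the frozen SFT model assign to the same sampled response.
    """
    return reward - beta * (logprob_policy - logprob_sft)
```

The first function is small when the reward model already scores the human-preferred response higher. The second shows why reward hacking is discouraged: a response that the reward model loves but the SFT model considers very unlikely has its reward reduced.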
RLHF is a large part of what makes LLMs feel genuinely helpful rather than merely generating plausible text. It steers the model toward acknowledging uncertainty, providing balanced perspectives, following complex instructions precisely, and avoiding harmful or misleading content. Constitutional AI, Direct Preference Optimization (DPO), and other variants of the RLHF concept are active areas of research, each trying to align model behavior with human values more effectively and efficiently.
Emergent Capabilities
Large language models exhibit capabilities that were not explicitly trained and that are absent in smaller models. Chain-of-thought reasoning, where the model solves problems step by step, appeared when models reached roughly 100 billion parameters. Code generation emerged from models trained on mixed text and code data, even though the training objective was simply next-token prediction. In-context learning, the ability to perform new tasks from a few examples provided in the prompt without any weight updates, is another emergent behavior that scales with model size.
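In-context learning is easiest to see in the shape of the prompt itself. The helper below is a hypothetical formatter: the "Review" and "Sentiment" labels are invented for this example, and any consistent pattern would serve the same purpose.

```python
def few_shot_prompt(examples, query):
    """Format labeled examples followed by an unlabeled query for the model to complete."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

prompt = few_shot_prompt(
    [("Loved every minute of it.", "positive"),
     ("A complete waste of time.", "negative")],
    "Surprisingly good for a sequel.",
)
```

No weights change when this prompt is sent. The model infers the task from the two worked examples and continues the pattern for the final, unlabeled item.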
These emergent capabilities are both the promise and the challenge of LLMs. The promise is that larger models may develop even more powerful capabilities. The challenge is that we cannot predict what capabilities will emerge at the next scale, making safety evaluation difficult. A model that passes all safety tests at one size might develop new behaviors at a larger size that the tests did not anticipate.
Context Windows and Memory
The context window is the maximum number of tokens the model can process in a single forward pass. GPT-3 had a 2,048-token context. GPT-4 launched with 8,192 tokens (plus a 32,768-token variant), and GPT-4 Turbo later expanded this to 128,000 tokens. Claude 3 supports 200,000 tokens. Some models now handle 1,000,000+ tokens. Longer contexts allow the model to process entire books, codebases, or conversation histories in a single call, reducing the need to summarize or truncate long inputs.
Longer context windows come at a cost. Self-attention is quadratic in context length: doubling the context requires four times the computation for the attention layers, and four times the memory if the full matrix of attention scores is stored. Techniques like FlashAttention compute exactly the same result without storing that matrix, which cuts memory use without changing the mathematics, and sparse attention patterns reduce the theoretical complexity. But the fundamental tradeoff remains: longer contexts are more capable but slower and more expensive. Models also tend to use information at the beginning and end of long contexts more effectively than information in the middle, a phenomenon called the "lost in the middle" effect.
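The quadratic growth is simple to quantify. The sketch below counts the query-key scores a naive implementation would compute and store for one layer; the head count and 2-byte score size in the usage note are illustrative assumptions rather than the configuration of any specific model.

```python
def attention_score_entries(context_length, num_heads=1):
    """Number of query-key scores computed per layer by full self-attention."""
    return num_heads * context_length * context_length

def naive_score_memory_gib(context_length, num_heads, bytes_per_score=2):
    """Memory needed to store the full score matrix for one layer, in GiB."""
    return attention_score_entries(context_length, num_heads) * bytes_per_score / 2**30
```

Going from 4,096 to 8,192 tokens quadruples the number of scores. At 131,072 tokens with 64 heads, storing every score for a single layer at 2 bytes each would take 2,048 GiB, which is why practical implementations such as FlashAttention avoid holding the whole matrix in memory.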
LLMs do not have persistent memory between conversations. Each interaction starts with a fresh context window. Retrieval-Augmented Generation (RAG) partially addresses this by retrieving relevant information from an external database and including it in the context. Fine-tuning on specific data bakes knowledge into the model's weights. But true persistent, updatable memory that allows a model to learn from interactions over time remains an open research problem.
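The retrieval step of RAG can be illustrated with a toy example. Real systems usually rank passages with learned embeddings and a vector index; the sketch below substitutes simple word overlap so that it stays self-contained, and the documents and prompt template are invented for illustration.

```python
def tokenize(text):
    """Lowercase and split into a set of words, ignoring basic punctuation."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a stand-in for embedding search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Place the retrieved passages in the context ahead of the user's question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

documents = [
    "The billing API rate limit is 100 requests per minute.",
    "Office plants are watered on Fridays.",
]
prompt = build_prompt("What is the billing API rate limit?", documents)
```

The model never learns the retrieved fact. It only reads it from the context for the duration of one call, which is why RAG complements rather than replaces persistent memory.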
Inference and Deployment
Serving LLMs to users requires substantial infrastructure. A 70-billion-parameter model in FP16 requires 140 GB of GPU memory just for the weights (2 bytes per parameter), more than a single 80 GB H100 can hold, plus additional memory for the KV cache (stored attention keys and values from previous tokens). During generation, the model performs a forward pass for each token, and the speed is measured in tokens per second. A well-optimized H100 deployment can generate roughly 100 to 300 tokens per second for a 70B model, depending on batch size and precision.
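The memory arithmetic can be checked with a short calculation. The weight figure follows directly from the parameter count; the KV cache figure depends on architectural details that the text does not give, so the layer count, number of key-value heads, and head dimension below are assumed values for a hypothetical 70B-class model.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len,
                batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values, in GB."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9

fp16_weights = weight_memory_gb(70e9, 2)  # 140.0 GB

# Assumed shape for a hypothetical 70B-class model; not taken from the text.
cache = kv_cache_gb(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)
```

With these assumed values the cache for one 8,192-token sequence comes to roughly 2.7 GB. It grows linearly with both sequence length and batch size, so serving many long conversations at once can make the cache rival the weights in size.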
Quantization reduces inference costs by lowering the precision of the weights. 4-bit quantization (using methods such as GPTQ and AWQ, or the quantized formats distributed as GGUF files) shrinks a 70B model from 140 GB to roughly 35 GB, fitting it on a single 80 GB GPU with memory to spare. The quality degradation from 4-bit quantization is typically small: perplexity increases by 1 to 3%, and human evaluators often cannot distinguish between the full-precision and quantized models' outputs. 8-bit quantization loses even less quality, at the cost of twice the memory of 4-bit.
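The basic idea of storing weights as small integers plus a shared scale can be shown with plain round-to-nearest quantization. This is a simplified sketch, not the GPTQ or AWQ algorithm: those methods use calibration data and per-group scales to reduce the error further. The weight values are made up.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to integer codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.70, 0.01, -0.29]  # illustrative values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

int4_size_gb = 70e9 * 4 / 8 / 1e9  # 4 bits per weight for 70B parameters: 35.0 GB
```

Each restored weight differs from the original by at most half a quantization step. The stored model needs only 4 bits per weight plus a small overhead for the scales, which is where the reduction from 140 GB to about 35 GB comes from.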
Speculative decoding accelerates generation by using a small, fast draft model to propose several tokens, which the large model then verifies in parallel. If the draft model's proposals match what the large model would have generated (which happens frequently for common text patterns), several tokens are accepted in a single forward pass of the large model. This can provide a 2 to 3x speedup for typical text generation without changing the result: with greedy decoding the output is identical to standard autoregressive decoding, and with sampling the acceptance rule preserves the large model's output distribution.
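The draft-and-verify loop can be sketched for the greedy case, where each model simply returns its single most likely next token. The two toy models below are hypothetical stand-ins, and the verification is written as a loop for clarity; a real system scores all proposed positions in one batched forward pass of the large model, and sampling-based variants use a probabilistic acceptance rule instead of an equality check.

```python
def speculative_step(target_next, draft_next, tokens, num_draft=4):
    """One round of greedy speculative decoding; returns the tokens accepted this round."""
    # 1. The cheap draft model proposes num_draft tokens one at a time.
    proposal = []
    for _ in range(num_draft):
        proposal.append(draft_next(tokens + proposal))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for guess in proposal:
        expected = target_next(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target's own next token is added as a bonus.
    accepted.append(target_next(tokens + accepted))
    return accepted

def speculative_decode(target_next, draft_next, prompt, max_new_tokens, num_draft=4):
    """Repeat speculative rounds until max_new_tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(target_next, draft_next, tokens, num_draft)
    return tokens[:len(prompt) + max_new_tokens]

def toy_target(tokens):
    """Hypothetical large model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10

def toy_draft(tokens):
    """Hypothetical draft model: agrees with the target except after a 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10
```

Every round yields at least one token chosen by the target model, so the result matches what the target would produce on its own. The speedup comes from rounds in which several draft tokens are accepted at once.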
The LLM Landscape in 2026
The field has consolidated around a few model families at the frontier. GPT-4 and its successors from OpenAI remain among the most capable models, accessible through APIs. Anthropic's Claude models emphasize safety and long-context understanding. Google's Gemini family spans from lightweight mobile models to frontier-scale systems integrated with Google services. Meta's LLaMA family provides the most capable open-weight models, enabling community fine-tuning and deployment on private infrastructure. Mistral, Qwen, and other organizations provide competitive open models at various scales.
The distinction between open and closed models is a defining tension. Closed models (GPT-4, Claude, Gemini) offer the highest capabilities and are accessed through paid APIs. Open models (LLaMA, Mistral, Qwen) publish weights that anyone can download, run, fine-tune, and modify, subject to each model's license. Open models lag closed models by roughly 6 to 12 months in capability but are advancing rapidly. For applications requiring data privacy, low latency, or independence from API providers, open models are the practical choice.
Large language models learn to predict the next token at massive scale, developing capabilities from grammar to reasoning to code generation as emergent properties of this simple objective. The training pipeline of pre-training, supervised fine-tuning, and RLHF transforms a text predictor into a helpful assistant. Scaling laws predict that larger models will continue to improve, while quantization and speculative decoding make deployment increasingly practical.