What Are Language Models?
The Core Idea: Predicting the Next Word
At its heart, a language model learns the statistical patterns of language. Given the text "The cat sat on the", a language model assigns a probability to every possible next word: "mat" might get 15%, "floor" might get 12%, "chair" might get 8%, "couch" might get 5%, and thousands of other words share the remaining 60%. These probabilities reflect how often each continuation follows similar contexts in the data the model was trained on. The model has learned that cats sit on mats, floors, and furniture, not on abstractions or verbs, because that is what the training text demonstrates.
This simple objective, predicting the next word, turns out to require extraordinary depth of knowledge. To predict the next word after "The CEO of Tesla announced that the company would," the model must know what Tesla is (an electric vehicle company), what CEOs do (make corporate announcements), what "would" implies (a future action or plan), and what kinds of announcements companies make (product launches, financial results, strategy changes). The prediction task forces the model to encode facts, relationships, grammar, logic, and world knowledge into its parameters, because all of these contribute to predicting what comes next.
N-gram Models: The Statistical Foundation
The earliest language models counted word sequences in training data. A bigram model (n=2) estimates the probability of each word given only the previous word. From a training corpus, it counts how often "the" is followed by "cat" versus "dog" versus "end" and uses these counts to estimate probabilities. A trigram model (n=3) conditions on the previous two words: given "the cat," what is the next word? N-gram models are simple and fast, and they formed the basis of speech recognition and machine translation systems through the 2000s.
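The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the toy corpus and the function names `train_bigram` and `next_word_probs` are made up for the example, and the estimate is the plain maximum-likelihood ratio of counts with no smoothing.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev) from raw counts."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # the context word was never seen in training
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus (an assumption for illustration only).
corpus = ("the cat sat on the mat . the dog sat on the floor . "
          "the cat ran to the end").split()

bigram_counts = train_bigram(corpus)
probs = next_word_probs(bigram_counts, "the")
print(probs)  # "cat" follows "the" in 2 of 6 cases, so it gets 1/3
```

In this toy corpus "the" occurs six times as a context, followed twice by "cat" and once each by "mat", "dog", "floor", and "end", so the estimated probabilities are 1/3 and 1/6 respectively. A context that never occurred returns an empty distribution.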
The fundamental limitation of n-gram models is sparsity. As n increases, most possible n-grams never appear in the training data. The 5-gram "the quantum physicist explained that" might appear zero times in a billion-word corpus, giving the model no information about what follows. Smoothing techniques (redistributing probability mass to unseen n-grams) and backoff (falling back to shorter n-grams when long ones are unseen) partially address this, but n-gram models fundamentally cannot capture long-range dependencies. The word choice at the end of a 20-word sentence often depends on information from the beginning, which a 5-gram model cannot access because it conditions on only the previous four words.
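One of the simplest smoothing schemes is add-k (Laplace) smoothing, sketched below for bigrams. The text above does not name a specific method, so this choice is an assumption made for illustration. The counts, the four-word vocabulary, and the function name `laplace_prob` are hypothetical.

```python
from collections import Counter

def laplace_prob(counts, vocab_size, context, word, k=1.0):
    """Add-k estimate of P(word | context).

    Every possible next word receives k pseudo-counts, so pairs that
    never appeared in training still get a small nonzero probability.
    """
    context_counts = counts.get(context, Counter())
    total = sum(context_counts.values())
    return (context_counts[word] + k) / (total + k * vocab_size)

# Hypothetical counts: "the" was followed by "cat" twice and "dog" once.
counts = {"the": Counter({"cat": 2, "dog": 1})}
vocab = ["the", "cat", "dog", "physicist"]

seen = laplace_prob(counts, len(vocab), "the", "cat")          # (2 + 1) / (3 + 4)
unseen = laplace_prob(counts, len(vocab), "the", "physicist")  # (0 + 1) / (3 + 4)
print(seen, unseen)
```

Without smoothing, "the physicist" would get probability zero. With add-one smoothing it gets 1/7, taken from the mass that the seen pairs would otherwise hold, and the distribution over the vocabulary still sums to one.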
Neural Language Models
Neural language models, introduced by Yoshua Bengio and colleagues in 2003, replaced discrete word counts with continuous vector representations. Each word is represented as a dense vector (embedding), and a neural network predicts the next word based on the embeddings of recent context words. Because similar words have similar embeddings, the model can generalize: if it has seen "the physicist explained" in training, it can make reasonable predictions after "the scientist explained" even if that exact phrase never appeared, because "physicist" and "scientist" have similar embeddings.
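A common way to make "similar embeddings" concrete is cosine similarity between word vectors. The sketch below uses hand-picked three-dimensional vectors, which are purely illustrative. Real embeddings are learned during training and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand for the example.
embeddings = {
    "physicist": [0.9, 0.8, 0.1],
    "scientist": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(embeddings["physicist"], embeddings["scientist"])
sim_far = cosine_similarity(embeddings["physicist"], embeddings["banana"])
print(round(sim_close, 3), round(sim_far, 3))
```

Because the network's prediction depends on the embedding rather than on the word's identity, two words whose vectors point in nearly the same direction lead to nearly the same prediction. That is the source of the generalization described above.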
Recurrent neural network (RNN) language models, particularly LSTM variants, extended the context beyond fixed n-gram windows. An RNN processes text one word at a time, maintaining a hidden state vector that accumulates information about the entire sequence so far. This allows the model to, in principle, consider the entire preceding text when predicting the next word. In practice, RNNs struggle with very long contexts because the fixed-size hidden state is updated at every step, so information from early words fades as later words are processed. But for contexts of up to a few hundred words, RNN language models were dramatically better than n-gram models, and they powered improvements in speech recognition and machine translation from roughly 2010 to 2017.
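The hidden-state update can be sketched with a plain (Elman-style) recurrent step, h_t = tanh(W_x x_t + W_h h_{t-1} + b). The weights below are hand-picked assumptions rather than trained values, and an LSTM would add gating on top of this. The example feeds two sequences that differ only in their first input and measures how much of that difference survives in the final hidden state.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_x, W_h, b):
    """Process a sequence one input at a time, starting from a zero state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Hypothetical weights: 1-dimensional input, 2-dimensional hidden state.
W_x = [[1.0], [0.5]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

# Two sequences that differ only in the first input.
seq_a = [[1.0]] + [[0.0]] * 19
seq_b = [[-1.0]] + [[0.0]] * 19

gap_after_1 = max(abs(p - q) for p, q in
                  zip(run_rnn(seq_a[:1], W_x, W_h, b), run_rnn(seq_b[:1], W_x, W_h, b)))
gap_after_20 = max(abs(p - q) for p, q in
                   zip(run_rnn(seq_a, W_x, W_h, b), run_rnn(seq_b, W_x, W_h, b)))
print(gap_after_1, gap_after_20)
```

With these weights the two hidden states differ substantially after one step and are almost indistinguishable after twenty. This is the fading of early information described above.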
Transformer Language Models
The transformer architecture, introduced in 2017, replaced recurrence with self-attention. Instead of processing words sequentially, a transformer processes all words in parallel, computing attention weights that determine how much each word influences every other word's representation. This parallelism makes training much faster on modern GPUs, and the direct attention connections between distant words largely remove the long-range dependency bottleneck that plagued RNNs, at least within the model's context window.
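The attention computation can be sketched as scaled dot-product attention: each word's output is a weighted average of all value vectors, with weights given by a softmax over query-key dot products. This is a simplified single-head sketch. It uses the token vectors directly as queries, keys, and values, whereas a real transformer applies learned projection matrices, multiple heads, and (in GPT-style models) a causal mask. The token vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
print(weights[0])
```

Each row of `weights` sums to one and relates one position to every other position directly, whatever their distance in the sequence. The loop over queries is written sequentially for readability. The rows are independent of each other, which is what lets real implementations compute them in parallel as matrix products.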
GPT (Generative Pre-trained Transformer), introduced by OpenAI in 2018, applied the transformer to language modeling at scale. GPT-1 had 117 million parameters and was trained on BookCorpus (about 7,000 books). It demonstrated that a single pre-trained language model could be fine-tuned for diverse NLP tasks, outperforming specialized models on several benchmarks. The key insight was that the language modeling objective, despite being "just" next-word prediction, forces the model to learn representations useful for understanding and generating language in general.
BERT (2018), from Google, introduced a different pre-training approach: instead of predicting the next word left-to-right, BERT masks a random subset of the words in the input and predicts them from the surrounding bidirectional context. This bidirectional approach produces representations better suited for understanding tasks (classification, question answering, NER) because each word's representation incorporates context from both directions. BERT set new state-of-the-art results on 11 NLP tasks at once, establishing pre-trained language models as the dominant paradigm in NLP.
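The masking step can be sketched as follows. This is a simplified illustration: it replaces each selected word with a `[MASK]` placeholder and records the original word as the prediction target. The 15% default rate, the function name `mask_tokens`, and the example sentence are assumptions for the sketch, and BERT's actual procedure includes further details (such as sometimes leaving a selected word unchanged) that are omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly hide tokens; return the masked sequence and the targets.

    targets maps each masked position to the original token the model
    would be trained to predict from the words on both sides.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence, mask_rate=0.3, seed=1)
print(masked)
print(targets)
```

Unlike next-word prediction, the model sees the words both before and after each masked position, which is what makes the resulting representations bidirectional.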
The Scaling Revolution
GPT-2 (2019) scaled to 1.5 billion parameters and demonstrated that larger language models gain qualitatively different capabilities. GPT-2 could generate coherent multi-paragraph text and complete stories, and it showed early, less reliable signs of writing simple code and performing basic arithmetic, none of which GPT-1 could do reliably. OpenAI initially withheld the full model out of concern about misuse potential, particularly for generating fake news and spam. This was one of the first major public discussions about the dual-use nature of language models.
GPT-3 (2020) scaled to 175 billion parameters and 300 billion training tokens, producing a model that could perform tasks it was never explicitly trained on when those tasks were simply described in natural language. Given the prompt "Translate English to French: cheese =>", GPT-3 output "fromage" without any translation-specific fine-tuning. This capability is called in-context learning: the prompt alone specifies the task, either as an instruction with no examples (zero-shot, as here) or with a handful of worked examples included (few-shot). It was an emergent behavior that appeared when the model reached sufficient scale. GPT-3 demonstrated that a sufficiently large language model becomes a general-purpose language processing engine, capable of translation, summarization, question answering, code generation, and creative writing through prompting alone.
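The difference between a prompt with and without worked examples is easy to see when the prompt is built programmatically. The helper `build_prompt` below is hypothetical, and the example pairs are illustrative. No model is called: the sketch only shows the text that would be sent to one.

```python
def build_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt.

    examples is a list of (source, target) pairs; an empty list gives a
    zero-shot prompt, a non-empty list gives a few-shot prompt.
    """
    lines = [instruction]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

instruction = "Translate English to French:"

zero_shot = build_prompt(instruction, [], "cheese")
few_shot = build_prompt(
    instruction,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(zero_shot)
print(few_shot)
```

In both cases the model's weights are unchanged. The task is conveyed entirely through the text the model is asked to continue.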
Scaling laws, formalized by Kaplan et al. at OpenAI in 2020 and refined by Hoffmann et al. at DeepMind in 2022 (the Chinchilla paper), describe the predictable relationship between model size, training data size, compute budget, and model performance. Loss (the model's prediction error) decreases as a smooth power law with increases in parameters and training tokens. The Chinchilla finding was particularly influential: optimal performance at a given compute budget requires scaling model parameters and training tokens in roughly equal proportion. This meant that many existing models were "over-parameterized and under-trained," that is, trained on too little data relative to their size. The finding redirected the field toward training smaller models on more data, producing models like LLaMA (65 billion parameters trained on 1.4 trillion tokens) that matched GPT-3's performance with far fewer parameters.
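The arithmetic behind "over-parameterized and under-trained" can be checked with the figures quoted in this section. The sketch below relies on two widely used rules of thumb that are not stated in the text above and should be read as assumptions: training compute is approximately 6 × parameters × tokens floating-point operations, and the Chinchilla result is often summarized as roughly 20 training tokens per parameter.

```python
def training_flops(params, tokens):
    """Approximate training compute, using the common C ≈ 6·N·D rule of thumb."""
    return 6 * params * tokens

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

# Parameter and token counts as quoted in the text.
gpt3 = {"params": 175e9, "tokens": 300e9}
llama = {"params": 65e9, "tokens": 1.4e12}

for name, model in [("GPT-3", gpt3), ("LLaMA-65B", llama)]:
    ratio = tokens_per_param(**model)
    flops = training_flops(**model)
    print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

Under these assumptions GPT-3 saw about 1.7 tokens per parameter, far below the roughly 20 suggested by the Chinchilla analysis, while LLaMA-65B saw about 21.5. The two training runs are of the same order of magnitude in estimated compute (about 3.2e23 versus 5.5e23 FLOPs), but LLaMA spends it on a much smaller model trained on much more data.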
What Large Language Models Can Do
By 2026, the capabilities of frontier language models are extensive. Code generation: models write functioning code in dozens of programming languages, debug existing code, and explain code behavior. Mathematical reasoning: models solve algebra, calculus, and logic problems, often showing their work step by step. Creative writing: models produce poetry, fiction, marketing copy, and technical documentation with control over style, tone, and audience. Analysis: models summarize documents, compare viewpoints, identify patterns in data, and evaluate arguments. Instruction following: models execute complex, multi-step instructions that specify format, content, style, and constraints simultaneously.
Emergent abilities are capabilities that appear abruptly as models scale past certain size thresholds. Chain-of-thought reasoning, where the model solves problems step by step rather than jumping to the answer, is dramatically more effective in models with 100+ billion parameters than in smaller models. Theory of mind, the ability to reason about what other agents know and believe, shows improvement at scale. These emergent abilities are not programmed or explicitly trained; they arise from the increasing capacity of larger models to represent complex patterns in their training data.
The boundary between language models and general AI systems is blurring. Models that combine language understanding with vision (processing images and text together), tool use (calling APIs, executing code, searching the web), and planning (breaking complex tasks into steps and executing them sequentially) are moving beyond text prediction toward general-purpose AI assistants. Whether these capabilities constitute genuine understanding or sophisticated pattern matching remains one of the most debated questions in AI research.
Limitations and Open Questions
Hallucination remains the most practically important limitation. Language models generate text that is fluent and confident but sometimes factually wrong. A model might cite a paper that does not exist, attribute a quote to the wrong person, or state a statistic with precision that is entirely fabricated. The model generates what is statistically plausible, not what is verified as true. Retrieval-augmented generation, fact-checking pipelines, and improved training techniques have reduced hallucination rates, but they have not eliminated the problem.
Reasoning limitations become apparent on problems that require precise logical deduction, multi-step planning, or systematic search. Language models can solve many reasoning problems by recognizing patterns similar to those in their training data, but they struggle with novel problems that require genuine algorithmic thinking. A model might solve a Sudoku puzzle by pattern-matching against solved puzzles it has seen, but it is unreliable on harder puzzles that can only be solved by systematic backtracking search. The extent to which language models perform genuine reasoning versus sophisticated retrieval and pattern matching is an active research question.
The computational and environmental cost of large language models is substantial. Training a frontier model requires thousands of GPUs running for months, consuming energy equivalent to hundreds of households' annual consumption. Inference costs, while lower per query, aggregate to enormous sums at the scale of billions of daily queries. Research into more efficient architectures, quantization, distillation, and sparse models aims to reduce these costs, but the trend toward ever-larger models continues to push compute requirements upward.
Language models learn to predict the next word from massive text corpora, and at sufficient scale, this simple objective produces systems capable of translation, reasoning, code generation, and conversation, making them the foundation of modern NLP.