How Does AI Generate Text?
Tokens, Not Words
Language models do not operate on words directly. They operate on tokens, which are subword units created by a tokenizer. Common words like "the" or "and" are single tokens. Less common words are split into pieces: "unforgettable" might become "un," "for," "get," "table." Very rare words, technical jargon, and non-English text are split into even smaller pieces, sometimes individual characters.
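GPT-4 uses a vocabulary of roughly 100,000 tokens. This vocabulary was constructed by analyzing the training data to find an efficient encoding: a set of subword units that keeps the total number of tokens needed to represent the training corpus small. The byte-pair encoding (BPE) algorithm is the standard method for building these vocabularies. It starts from individual bytes or characters and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches its target size. Because the procedure is greedy, the result is a good encoding rather than a provably minimal one.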
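The merge procedure can be illustrated with a toy sketch. This is a character-level simplification written for this article, not the tokenizer any production model uses: real tokenizers work on bytes, handle whitespace and punctuation specially, and learn tens of thousands of merges. The function names and the tiny corpus are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = train_bpe("low low low lower lowest", 2)
print(merges)  # the two learned merges
print(words)   # corpus words re-expressed with the merged symbols
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into "low" plus leftover characters. That is the same pattern described above: common strings get one token, rare ones get several.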
Tokenization matters for generation because the model's computational cost scales with token count, not word count. A 500-word English response might be approximately 650 to 700 tokens. The same content in a language that is less well represented in the vocabulary (like Japanese or Arabic) might require more tokens, making generation slower and more expensive.
The Prediction Step
At each step of generation, the model takes the entire sequence so far (the user's prompt plus any tokens already generated) and processes it through dozens of transformer layers. Each layer applies self-attention (letting every position attend to every previous position) and a feedforward network (applying learned transformations). The final layer outputs a vector of numbers, one for each token in the vocabulary.
These numbers are called logits, and they represent the model's raw confidence in each possible next token. A softmax function converts logits into probabilities that sum to 1. After processing "The capital of France is," the model might assign probability 0.85 to "Paris," 0.03 to "a," 0.02 to "the," and tiny fractions to the remaining 99,997 tokens.
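To make the logits-to-probabilities step concrete, here is a minimal softmax sketch. The four-token vocabulary and the logit values are invented for illustration; a real model would produce one logit for every token in its vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the token after "The capital of France is".
logits = {"Paris": 6.0, "a": 2.7, "the": 2.3, "Lyon": 1.0}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in probs.items():
    print(f"{token:>6}: {p:.3f}")
```

Because softmax exponentiates the logits, a gap of a few units between the top logit and the rest is enough for one token to take most of the probability mass.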
The probabilities reflect the patterns the model learned during training. It has seen "The capital of France is Paris" thousands of times in its training data, so "Paris" gets a high probability. But the model does not look up this fact in a database. The probability emerges from the interaction of billions of learned parameters, distributed across the network's layers.
Sampling Strategies
Once the model produces a probability distribution, a sampling strategy decides which token to actually select. This choice has a major impact on the quality and character of the generated text.
Greedy decoding always picks the highest-probability token. This produces deterministic, highly predictable text. It tends to be repetitive because the model keeps choosing the "safest" continuation. Ask it to write a story and it will produce the most generic, cliched version possible because cliches are, by definition, the most common patterns in the training data.
Temperature sampling introduces controlled randomness. The temperature parameter (typically a number between 0 and 2) adjusts the probability distribution before sampling: each logit is divided by the temperature before the softmax is applied. Temperature below 1 makes the distribution sharper, concentrating probability on the top tokens and reducing randomness. Temperature above 1 flattens the distribution, giving lower-probability tokens a better chance of being selected. Temperature of exactly 1 samples directly from the model's learned distribution.
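A small sketch of the usual formulation, in which each logit is divided by the temperature before the softmax. The logit values are made up, and the function name apply_temperature is an assumption for this example rather than any particular library's API.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature <= 0:
        # Temperature 0 is usually treated as greedy decoding instead.
        raise ValueError("temperature must be positive")
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = apply_temperature(logits, 0.5)
plain = apply_temperature(logits, 1.0)
flat = apply_temperature(logits, 2.0)

for name, dist in [("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

The top token's probability rises as the temperature drops and falls as the temperature increases, while the ranking of the tokens never changes.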
In practice, most deployed systems use temperatures between 0.5 and 0.9. Lower temperatures produce more focused, factual text. Higher temperatures produce more creative, varied text but with a higher risk of incoherent or factually incorrect outputs. The optimal temperature depends on the use case: code generation benefits from low temperature (you want the most likely correct syntax), while creative writing benefits from higher temperature (you want surprising word choices).
Top-k sampling restricts the selection to the k most probable tokens. If k is 40, the model only considers the top 40 tokens at each step, redistributing their probabilities to sum to 1 and then sampling. This prevents the model from ever selecting very low-probability tokens that would produce nonsensical text, while still allowing diversity among the plausible options.
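The filtering and renormalizing step might look like the following sketch. The five-token distribution is invented, and top_k_filter and sample are hypothetical helper names.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution.
probs = {"Paris": 0.5, "Lyon": 0.2, "a": 0.15, "the": 0.1, "zebra": 0.05}
filtered = top_k_filter(probs, k=3)

rng = random.Random(0)  # seeded so repeated runs draw the same token
token = sample(filtered, rng)
print(filtered)
print(token)
```

With k=3, "the" and "zebra" can never be selected, and the three surviving tokens keep their relative proportions.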
Top-p (nucleus) sampling is a more adaptive version. Instead of a fixed number of tokens, it includes the smallest set of tokens whose cumulative probability exceeds p (typically 0.9 or 0.95). When the model is confident, this might include only 3 or 4 tokens. When the model is uncertain, it might include hundreds. This adaptive behavior makes top-p generally more reliable than top-k across different contexts within the same generation.
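The adaptive behavior is easy to see in a sketch. Both distributions below are invented, and top_p_filter is a hypothetical helper. This version stops as soon as the cumulative probability reaches or exceeds p, which is one common convention; implementations differ slightly on the boundary case.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {tok: prob / total for tok, prob in nucleus}

# A confident distribution: one token dominates.
confident = {"Paris": 0.95, "a": 0.02, "the": 0.01, "Lyon": 0.01, "zebra": 0.01}
# An uncertain distribution: probability is spread out.
uncertain = {"red": 0.30, "blue": 0.25, "green": 0.20, "gold": 0.13, "grey": 0.12}

small = top_p_filter(confident, p=0.9)
large = top_p_filter(uncertain, p=0.9)
print(len(small), list(small))
print(len(large), list(large))
```

The same threshold keeps a single candidate in the confident case and all five in the uncertain one, which a fixed k cannot do.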
Most production systems combine temperature with top-p sampling. The temperature controls the overall randomness, while top-p prevents catastrophic selections from the extreme tail of the distribution.
The Autoregressive Loop
Text generation is autoregressive, meaning each generated token becomes input for the next prediction. The model generates "Paris," appends it to the sequence, then processes the full sequence "The capital of France is Paris" to predict the next token (perhaps a period). Then "The capital of France is Paris." becomes the input for the next token, and so on.
This autoregressive property means errors compound. If the model generates an incorrect word early in a response, all subsequent tokens are conditioned on that error. The model cannot go back and fix previous tokens; it can only continue forward. This is why language models sometimes commit to an incorrect claim and then elaborate on it with internally consistent but factually wrong details. Once "The capital of France is Lyon" is generated, the model will cheerfully produce supporting text about Lyon because it generates based on what it has already written.
Generation stops when the model produces a special end-of-sequence token (indicating it considers the response complete) or when it reaches a maximum token limit. The model learns when to stop during training by observing where responses end in its training data. Well-trained models learn appropriate stopping points for different types of content: a short factual answer stops after a sentence, while a detailed explanation continues for several paragraphs.
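The loop and its two stopping conditions can be sketched with a stand-in model. TOY_MODEL is a hand-written lookup table invented for this example, and it conditions only on the last token; a real model runs the entire sequence through its transformer layers at every step. Greedy decoding is used here to keep the output deterministic.

```python
EOS = "<eos>"

# Hypothetical stand-in for a language model: last token -> next-token probabilities.
TOY_MODEL = {
    "is": {"Paris": 0.85, "a": 0.10, "the": 0.05},
    "Paris": {".": 0.9, ",": 0.1},
    ".": {EOS: 1.0},
}

def next_token_probs(sequence):
    """Return a probability distribution over the next token."""
    return TOY_MODEL.get(sequence[-1], {EOS: 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):       # stop condition 2: token limit
        probs = next_token_probs(sequence)
        token = max(probs, key=probs.get)  # greedy: pick the most probable token
        if token == EOS:                   # stop condition 1: end-of-sequence
            break
        sequence.append(token)             # the new token becomes part of the input
    return sequence

prompt = ["The", "capital", "of", "France", "is"]
full = generate(prompt)
truncated = generate(prompt, max_new_tokens=1)
print(" ".join(full))
print(" ".join(truncated))
```

The first call ends because the toy model emits the end-of-sequence token after the period. The second ends because the token limit is reached, which is why truncated responses can stop mid-thought.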
Context Windows and Attention
The context window is the maximum number of tokens the model can consider at once. The original GPT-3 had a context window of 2,048 tokens. GPT-4 launched with 8,192- and 32,768-token variants, and GPT-4 Turbo raised the limit to 128,000 tokens. Claude supports up to 200,000 tokens. The context window limits how much text the model can "see" when making each prediction.
Self-attention is what allows the model to use its full context window effectively. At each prediction step, the attention mechanism lets the model focus on the most relevant parts of the input, regardless of distance. When generating the end of a long document, the model can attend to a relevant sentence from the beginning of the document, thousands of tokens earlier. In earlier recurrent architectures, which lacked this direct access, information from the beginning of long sequences tended to fade as it was passed step by step through the network.
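A stripped-down sketch of scaled dot-product attention for a single query position shows why distance does not matter. The vectors are invented, and the learned query, key, and value projections and the multiple heads of a real transformer are omitted.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all visible positions."""
    d = len(query)
    # Similarity between the query and each position's key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Position 0 is the most distant, but its key matches the query best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [0.0, 20.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The weight on each position depends only on how well its key matches the query, not on how far back it is, so the earliest position receives most of the attention here.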
Longer context windows are computationally expensive because the attention computation scales quadratically with sequence length. Processing a 100,000-token sequence requires roughly 625 times more attention computation than processing a 4,000-token sequence. Various optimizations (FlashAttention, sparse attention, sliding window attention) reduce this cost, making long-context generation practical.
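The 625 figure follows directly from the quadratic scaling, as this short check shows. It counts only the pairwise attention comparisons and ignores the other costs of a forward pass; the function name is an assumption for this example.

```python
def attention_cost_ratio(long_len, short_len):
    """Ratio of pairwise attention work between two sequence lengths (n^2 scaling)."""
    return (long_len / short_len) ** 2

ratio = attention_cost_ratio(100_000, 4_000)
print(ratio)  # 625.0: the sequence is 25 times longer, and 25 squared is 625
```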
Why Generated Text Sounds Coherent
The remarkable coherence of AI-generated text emerges from two factors: the depth of patterns learned during training and the autoregressive conditioning on previously generated tokens.
During training on trillions of tokens, the model learns not just which words follow which, but the deeper structure of language. It learns that paragraphs should develop a single idea, that arguments should have premises and conclusions, that stories need characters with consistent motivations, and that technical explanations should build from simpler to more complex concepts. These structural patterns are encoded in the model's parameters just as firmly as vocabulary and grammar.
The autoregressive loop reinforces coherence because each new token is generated in the context of everything already written. If the model has been writing about quantum physics for three paragraphs, the probability distribution for the next token is heavily weighted toward quantum physics terminology and concepts. The model does not decide to write a coherent paragraph; coherence emerges because each token is statistically likely to continue the patterns established by previous tokens.
AI generates text through repeated next-token prediction: the model produces a probability distribution over its vocabulary, a sampling strategy selects one token, and that token feeds back as input for the next prediction. Sampling parameters like temperature and top-p control the balance between creativity and coherence. The coherence of long outputs emerges from the autoregressive loop, where each token is conditioned on the full sequence generated so far, combined with the deep structural patterns the model learned during training.