How AI Generates Text

Updated May 2026

AI generates text by predicting one token at a time using a language model. At each step, the model computes a probability distribution over its entire vocabulary, and a decoding strategy selects which token to output. The choice of decoding strategy, whether greedy, beam search, or stochastic sampling with temperature and top-k controls, determines whether the output is predictable and safe or creative and diverse. This autoregressive process, repeated hundreds or thousands of times, produces everything from code and emails to poetry and scientific analysis.

Autoregressive Generation

Modern text generation is autoregressive: the model generates one token at a time, feeding each generated token back as input for the next prediction. Given the prompt "The weather today is," the model computes probabilities for every token in its vocabulary: "sunny" might get 15%, "cloudy" might get 8%, "beautiful" might get 6%, and so on. After selecting a token (say "sunny"), the model processes the extended sequence "The weather today is sunny" and predicts the next token. This process repeats until the model generates a stop token or reaches a maximum length.
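The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `TOY_MODEL` and `toy_next_token_probs` are hypothetical stand-ins that look up a hand-written table keyed on the last token only, and the `<eos>` stop token is an assumed name.

```python
import random

# Hypothetical stand-in for a language model: maps the last token to a
# probability distribution over next tokens. A real model conditions on
# the whole sequence and scores tens of thousands of tokens.
TOY_MODEL = {
    "is": {"sunny": 0.6, "cloudy": 0.4},
    "sunny": {"and": 0.5, "<eos>": 0.5},
    "cloudy": {"and": 0.3, "<eos>": 0.7},
    "and": {"warm": 0.5, "cool": 0.5},
    "warm": {"<eos>": 1.0},
    "cool": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)       # one "forward pass"
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "<eos>":                  # stop token ends generation
            break
        tokens.append(next_token)                  # feed the token back in
    return tokens

print(generate(["The", "weather", "today", "is"]))
```

The important part is the last line of the loop body: each selected token is appended to the sequence and becomes part of the input for the next prediction.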

Each generation step requires a forward pass through the transformer model. A dense model costs roughly 2 floating-point operations per parameter for every token it processes, so a model with 70 billion parameters needs on the order of 140 billion operations per generated token, and trillions to process a prompt of even a few dozen tokens. Generating a 500-token response requires 500 sequential forward passes, which is why text generation is slower than classification or other single-pass tasks. The KV cache stores the attention keys and values computed for previous tokens so they do not need to be recomputed at each step; each subsequent step then runs only the new token through the model's layers, attending to the cached entries. Without KV caching, generating long text would be prohibitively slow.
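To get a feel for how much work caching saves, the sketch below counts how many token positions have to be run through the model with and without a KV cache. It is a deliberately simplified counting model: the 100-token prompt is an assumed figure, and the count ignores the fact that attention over the cached entries still grows with sequence length.

```python
def positions_processed(prompt_len, new_tokens, use_kv_cache):
    """Count token positions run through the model while generating."""
    total = 0
    for step in range(new_tokens):
        if use_kv_cache and step > 0:
            total += 1                   # only the newest token; the rest is cached
        else:
            total += prompt_len + step   # the whole sequence so far
    return total

print(positions_processed(100, 500, use_kv_cache=False))  # 174750
print(positions_processed(100, 500, use_kv_cache=True))   # 599
```

Under these assumptions, a 500-token response reprocesses the growing sequence almost 300 times more often without the cache than with it.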

The raw output of the model at each step is a vector of logits, one unnormalized score per vocabulary token. For a vocabulary of 50,000 tokens, the logits form a 50,000-dimensional vector, where a higher value means the model considers that token more likely to come next. The softmax function converts these logits into probabilities that sum to 1. Everything that follows in text generation is about how to select a token from this distribution.
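As a concrete illustration, here is a minimal softmax in plain Python. The three-element `logits` list is a made-up example; a real model would produce one logit per vocabulary token.

```python
import math

def softmax(logits):
    # Subtracting the maximum first avoids overflow in exp() and does not
    # change the result, because softmax is invariant to constant shifts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores for three tokens
probs = softmax(logits)
print([round(p, 3) for p in probs], sum(probs))
```

The ordering of the tokens is preserved: the largest logit always receives the largest probability.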

Deterministic Decoding Strategies

Greedy Decoding

Greedy decoding always selects the token with the highest probability. It is fast, deterministic (the same input always produces the same output), and produces the locally most likely token at each step. The problem is that choosing the most likely token at every step does not always yield the most likely sequence overall: a slightly less probable token now can lead to much more probable continuations later. Greedy decoding can get trapped in repetitive loops ("The weather today is nice. The weather today is nice. The weather today is...") because each repetition is locally plausible given the pattern established by previous repetitions. It tends to produce generic, predictable text because it always chooses the safest option, never exploring less probable but potentially better continuations.
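The gap between locally and globally most likely choices is easy to reproduce on a tiny example. The two-step `TOY` table below is entirely hypothetical and small enough that the best sequence can be found by exhaustive search.

```python
# Hypothetical two-step toy model: next-token probabilities keyed by prefix.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy_decode(steps=2):
    seq, prob = (), 1.0
    for _ in range(steps):
        dist = TOY[seq]
        token = max(dist, key=dist.get)   # highest-probability token
        prob *= dist[token]
        seq += (token,)
    return seq, prob

def best_sequence():
    # Exhaustive search, feasible only because the toy model is tiny.
    candidates = [
        ((a, b), pa * pb)
        for a, pa in TOY[()].items()
        for b, pb in TOY[(a,)].items()
    ]
    return max(candidates, key=lambda c: c[1])

print(greedy_decode())   # greedy commits to "A" first
print(best_sequence())   # the most probable sequence starts with "B"
```

Greedy decoding commits to "A" because 0.6 beats 0.4, but every sequence starting with "A" ends up less probable (about 0.30) than "B" followed by "z" (about 0.36).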

Beam Search

Beam search addresses greedy decoding's local optimality by maintaining multiple candidate sequences (beams) simultaneously. With a beam width of 5, the algorithm tracks the 5 most probable partial sequences at each step. Each beam is extended with the top-k possible next tokens, producing up to 5 * k candidate sequences, which are pruned back to 5 based on cumulative probability (in practice, the sum of token log-probabilities). Once the sequences reach a stop token or the maximum length, the highest-scoring complete sequence is returned.
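A stripped-down version of this procedure might look like the following. It is a sketch under several assumptions: the `TOY` table is a hypothetical two-step model, every possible next token is expanded rather than only the top-k, and stop-token handling is left out because all sequences have the same length.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the prefix so far.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(beam_width, steps=2):
    beams = [((), 0.0)]   # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in TOY[seq].items():
                candidates.append((seq + (token,), score + math.log(p)))
        # Prune: keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]       # best sequence and its log-probability

seq1, _ = beam_search(beam_width=1)   # width 1 behaves like greedy decoding
seq2, _ = beam_search(beam_width=2)
print(seq1, seq2)
```

With a width of 1 the search commits to "A" just as greedy decoding would; with a width of 2 it keeps "B" alive long enough to discover that "B" followed by "z" is the more probable sequence.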

Beam search usually finds higher-probability sequences than greedy decoding and is standard for machine translation, where finding the most probable translation is the goal. However, beam search has its own problems for open-ended generation. It tends to produce text that is bland and repetitive because high-probability sequences are often generic. A beam search continuation of a story prompt might produce "He went to the store. He bought some food. He went home." because each sentence is individually probable, even though the resulting text is tedious. For creative, conversational, and open-ended generation, stochastic sampling methods are preferred.

Stochastic Sampling Strategies

Temperature Sampling

Temperature is a parameter that controls the randomness of the probability distribution before sampling. The logits are divided by the temperature value before applying softmax. A temperature of 1.0 leaves the distribution unchanged. A temperature below 1.0 (say 0.3) sharpens the distribution, making high-probability tokens even more likely and low-probability tokens nearly impossible. A temperature above 1.0 (say 1.5) flattens the distribution, making the probabilities more uniform and allowing lower-probability tokens a greater chance of being selected.
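The effect of dividing the logits by the temperature can be seen directly. In the sketch below, `softmax_with_temperature` is an illustrative helper name and the three logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]   # divide logits by temperature
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # hypothetical scores for three tokens
for t in (0.3, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At 0.3 almost all of the probability mass lands on the first token; at 1.5 the three probabilities move closer together.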

Low temperature (0.1 to 0.5) produces conservative, focused, predictable text. It is appropriate for factual tasks, code generation, and situations where accuracy is more important than creativity. The output is nearly deterministic, with the model almost always selecting the highest-probability token. Higher temperature (0.8 to 1.5) produces more diverse, creative, surprising text. It is appropriate for creative writing, brainstorming, and situations where variety matters. Very high temperature (above 1.5) produces increasingly incoherent text as the distribution flattens toward uniform and the model selects tokens almost at random.

Top-k Sampling

Top-k sampling restricts the candidate pool to the k tokens with the highest probabilities, then samples from this truncated distribution. With k=50, only the 50 most probable tokens are considered, and all others receive zero probability. This prevents the model from selecting extremely unlikely tokens that would produce nonsensical output while still allowing diversity among the plausible options. Top-k was used in the original GPT-2 release and remains a widely available parameter.
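A minimal sketch of the truncate-then-sample step, using a made-up four-token distribution; the helper names `top_k_filter` and `sample` are illustrative, not taken from any particular library.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

def sample(probs, rng):
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.15, "purple": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)
print(sample(filtered, random.Random(0)))
```

After filtering with k=2, "rainy" and "purple" can never be sampled, and the remaining two probabilities are rescaled to sum to 1.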

The limitation of top-k is that the appropriate value of k varies depending on the distribution. When the model is highly confident (the top token has 95% probability), even k=10 might include many unlikely tokens. When the model is uncertain (the distribution is flat), k=50 might exclude perfectly reasonable options. A fixed k does not adapt to the model's confidence at each step.

Nucleus Sampling (Top-p)

Nucleus sampling, introduced by Holtzman et al. in 2019, addresses top-k's inflexibility by dynamically adjusting the candidate pool based on cumulative probability. Instead of a fixed number of tokens, top-p keeps the smallest set of tokens whose cumulative probability reaches or exceeds the threshold p. With p=0.9, the algorithm sorts tokens by probability, adds up probabilities from highest to lowest, and stops as soon as the running sum reaches 0.9. If the model is very confident, this might include only 2 or 3 tokens. If the model is uncertain, it might include 200 tokens. The size of the pool adapts to the distribution automatically.
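The adaptive behavior is easiest to see by running the same threshold against a confident and an uncertain distribution. Both distributions below are invented for illustration, and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:   # threshold reached: stop adding tokens
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

confident = {"sunny": 0.92, "cloudy": 0.05, "rainy": 0.02, "purple": 0.01}
uncertain = {"sunny": 0.4, "cloudy": 0.3, "rainy": 0.25, "purple": 0.05}

print(top_p_filter(confident, 0.9))   # a single token survives
print(top_p_filter(uncertain, 0.9))   # three tokens survive
```

The same p=0.9 keeps one candidate in the first case and three in the second, which a fixed k cannot do.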

Top-p=0.9 has become a widely used default for conversational AI because it balances coherence and diversity well. It allows enough variation to avoid repetition and generic text while preventing the selection of tokens so unlikely that they would break coherence. In practice, temperature and top-p are often combined: temperature controls the overall sharpness of the distribution, and top-p truncates the tail. A common setting is temperature=0.7 with top-p=0.95.

Controlling Generated Output

Repetition Penalties

Neural text generation has a persistent tendency toward repetition. Without intervention, models frequently repeat phrases, sentences, or even entire paragraphs. A repetition penalty reduces the probability of tokens that have already appeared in the generated text, typically by scaling down their logits. A penalty of 1.0 means no change, values above 1.0 penalize repetition, and higher values penalize more aggressively. A value of 1.2 is a common default. Frequency penalty and presence penalty are additive variants: frequency penalty subtracts an amount proportional to how many times a token has appeared, while presence penalty subtracts a flat amount for any token that has appeared at least once. These penalties encourage the model to use diverse vocabulary and avoid getting stuck in loops.
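The sketch below shows one common formulation of each penalty, applied to logits before softmax. Exact formulas vary between libraries and APIs, so treat the function names, the sign handling in the multiplicative version, and the example values as assumptions rather than any specific product's behavior.

```python
from collections import Counter

def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Multiplicative penalty on tokens that have already been generated."""
    out = dict(logits)
    for token in set(generated):
        if token in out:
            value = out[token]
            # Divide positive logits and multiply negative ones, so the
            # token always becomes less likely.
            out[token] = value / penalty if value > 0 else value * penalty
    return out

def apply_frequency_presence(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Additive penalties: per-occurrence (frequency) and flat (presence)."""
    counts = Counter(generated)
    out = {}
    for token, logit in logits.items():
        n = counts[token]
        flat = presence_penalty if n > 0 else 0.0
        out[token] = logit - n * frequency_penalty - flat
    return out

logits = {"nice": 3.0, "warm": 2.0, "cold": -1.0}   # hypothetical scores
generated = ["nice", "nice", "cold"]

print(apply_repetition_penalty(logits, generated))
print(apply_frequency_presence(logits, generated, frequency_penalty=0.5, presence_penalty=0.2))
```

In both cases "warm", which has not been used yet, keeps its original score, while "nice" and "cold" are pushed down.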

Structured Output

Many applications require generated text to follow a specific format: JSON objects, XML documents, SQL queries, or function calls with defined schemas. Constrained decoding forces the model to generate only tokens that are valid according to a grammar or schema. At each step, tokens that would violate the structure receive zero probability. This guarantees that a completed output is syntactically valid JSON, for example, without relying on the model's learned tendency to produce well-formed structures. Libraries like Outlines and Guidance implement grammar-constrained decoding for popular language models.
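The masking idea can be shown with a toy example. Everything here is hypothetical: the five-step "grammar" is a hand-written list of allowed tokens per position, and `fake_logits` stands in for a model that would rather emit an invalid token. Real libraries derive the allowed set from a grammar or schema instead of a fixed list.

```python
import json
import math

# Hypothetical five-step grammar: the output must be {"ok": true} or {"ok": false}.
ALLOWED_AT_STEP = [{"{"}, {'"ok"'}, {":"}, {"true", "false"}, {"}"}]

def fake_logits(step):
    # Stand-in for a model that prefers the invalid token "hello" at every step.
    return {"hello": 5.0, "{": 1.0, '"ok"': 1.0, ":": 1.0,
            "true": 2.0, "false": 1.5, "}": 1.0}

def constrained_greedy():
    output = []
    for step, allowed in enumerate(ALLOWED_AT_STEP):
        logits = fake_logits(step)
        # Mask: forbidden tokens get -inf, which is zero probability after softmax.
        masked = {t: (v if t in allowed else -math.inf) for t, v in logits.items()}
        output.append(max(masked, key=masked.get))
    return "".join(output)

text = constrained_greedy()
print(text, json.loads(text))
```

Even though the unconstrained model would pick "hello" every time, the masked output parses as JSON because invalid tokens are never eligible.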

System Prompts and Instructions

The most common method of controlling output is the system prompt: a set of instructions prepended to the model's input that specifies the desired format, style, content, and behavior. A system prompt might specify "Respond in exactly three bullet points," "Write in the style of a news reporter," or "Only discuss topics related to cooking." Instruction-tuned models (those fine-tuned on instruction-following examples, often followed by RLHF or similar alignment techniques) follow system prompts much more reliably than base language models. Because the system prompt is part of the context the model conditions on, it shapes the probability distributions at every generation step, biasing the model toward outputs that comply with the instructions.

Generation Quality and Evaluation

Evaluating the quality of generated text is one of the hardest problems in NLP. Unlike classification (where accuracy is unambiguous) or translation (where BLEU provides a rough metric), open-ended generation has no single correct output. A well-generated paragraph might differ completely from what a human would write and still be excellent. Human evaluation, where raters judge fluency, coherence, relevance, and factual accuracy, remains the gold standard but is expensive, slow, and subjective.

Perplexity measures how surprised the model is by a test dataset; formally, it is the exponential of the average negative log-probability the model assigns to each token. Lower perplexity means the model assigns higher probabilities to the actual text, indicating better language modeling. But perplexity correlates imperfectly with generation quality. A model with low perplexity can still produce repetitive, boring, or incoherent text depending on the decoding strategy. Perplexity measures how well the model predicts existing text, not the quality of its generative output.
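As a worked example, perplexity can be computed from the probabilities a model assigned to each token of a reference text. The probability lists below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that spreads its guess evenly over 4 options at every step
# has a perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A model that assigns higher probability to the actual tokens scores lower.
print(perplexity([0.9, 0.8, 0.95, 0.7]))
```

A useful intuition is that a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens.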

LLM-as-judge evaluation uses a separate language model to rate generated text on specified criteria. This approach scales better than human evaluation and correlates well with human judgments for many quality dimensions. The judge model is given the generated text, the original prompt, and a rubric (for example, rate on a 1-5 scale for accuracy, helpfulness, and safety), and it produces numerical scores and explanations. The risk is that model judges share biases with the models they evaluate, potentially creating a feedback loop that reinforces rather than corrects systematic issues.

Key Takeaway

Text generation works by predicting one token at a time from a probability distribution, with decoding strategies like temperature, top-k, and nucleus sampling controlling the tradeoff between predictable, safe output and creative, diverse text.