Tokenization in NLP Explained

Updated May 2026
Tokenization is the process of breaking text into smaller units called tokens that a language model can process. It is one of the first steps in nearly any NLP pipeline, converting raw strings of characters into sequences of integers that feed into neural networks. Modern tokenizers use subword algorithms like Byte Pair Encoding (BPE) that split text into pieces smaller than words but larger than characters, balancing vocabulary size against the ability to represent any possible input.

Why Tokenization Matters

Neural networks process numbers, not text. Before a language model can do anything with a sentence, that sentence must be converted into a sequence of numerical IDs. Tokenization is the mapping that makes this possible. The tokenizer defines a fixed vocabulary of text pieces, assigns each piece an integer ID, and splits any input text into a sequence of those pieces. The quality of this mapping directly affects everything the model can do. A bad tokenizer wastes the model's capacity on representing simple words, struggles with rare vocabulary, or creates inconsistent representations that make learning harder.
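As a minimal sketch of that mapping, the toy vocabulary below assigns an integer ID to each text piece and converts in both directions. The pieces and IDs are invented for illustration; a real vocabulary is learned from data and holds tens of thousands of entries.

```python
# Toy vocabulary: every text piece gets a fixed integer ID.
# The pieces and IDs here are hypothetical, chosen only for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}
id_to_piece = {i: piece for piece, i in vocab.items()}

def encode(pieces):
    """Map a list of text pieces to their integer IDs."""
    return [vocab[p] for p in pieces]

def decode(ids):
    """Map integer IDs back to text pieces."""
    return [id_to_piece[i] for i in ids]

ids = encode(["the", "cat", "sat", "on", "the", "mat", "."])
print(ids)          # [0, 1, 2, 3, 0, 4, 5]
print(decode(ids))  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

The model only ever sees the integer sequence, so anything the vocabulary cannot express is invisible to it.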

Tokenization also determines the effective context length of a model. When a model has a context window of 8,192 tokens, how much text that represents depends entirely on the tokenizer. A tokenizer that represents common English words as single tokens can fit roughly 6,000 words in that window. A character-level tokenizer would fit only about 8,192 characters, roughly 1,300 words. The choice of tokenizer therefore affects cost, speed, and capability: most language model APIs charge per token, processing time scales with token count, and the model's ability to reason across long documents depends on how efficiently those documents are tokenized.
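The arithmetic behind those estimates can be sketched in a few lines. The ratios are assumptions: about 1.37 tokens per word for a subword tokenizer (the ratio implied by 6,000 words in 8,192 tokens) and about 6 characters per word, spaces included, for a character-level one. Different text and different tokenizers will shift the results.

```python
CONTEXT_TOKENS = 8192

def words_that_fit(context_tokens, tokens_per_word):
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / tokens_per_word)

# Assumed ratios, not measurements of any particular tokenizer.
subword_words = words_that_fit(CONTEXT_TOKENS, 1.37)  # subword tokenizer
char_words = words_that_fit(CONTEXT_TOKENS, 6.0)      # one token per character

print(subword_words)  # 5979
print(char_words)     # 1365
```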

Word-Level Tokenization

The simplest approach splits text on whitespace and punctuation. The sentence "The cat sat on the mat." becomes ["The", "cat", "sat", "on", "the", "mat", "."]. This is intuitive and produces tokens that correspond to linguistic units humans recognize. However, word-level tokenization has a fundamental problem: the vocabulary must include every word the model might encounter. English has over 170,000 words in common use, and specialized domains add thousands more. Medical text includes terms like "pharyngolaryngectomy." Technical text includes compound terms like "containerization." Social media includes creative spellings and neologisms that change weekly.
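A word-level tokenizer of this kind can be approximated with one regular expression. This is a minimal sketch, not a production tokenizer: it treats every run of word characters as one token and every punctuation mark as its own token.

```python
import re

def word_tokenize(text):
    """Split text into words and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```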

A word-level tokenizer must either include every possible word in its vocabulary, which is impossible, or designate some words as unknown. The standard approach replaces any word not in the vocabulary with a special [UNK] (unknown) token. This means the model literally cannot see those words. If "pharyngolaryngectomy" is not in the vocabulary, it becomes [UNK], and the model has no information about what the word means, looks like, or relates to. For languages with rich morphology, the problem is even worse. Finnish has over 200 inflected forms of common verbs. Turkish can form words with dozens of suffixes. Word-level tokenization would require enormous vocabularies for these languages, and would still produce [UNK] for productive word formations.
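The information loss is easy to see in a small sketch. The vocabulary below is hypothetical; the point is only that every out-of-vocabulary word collapses to the same ID.

```python
UNK = "[UNK]"

# Hypothetical word-level vocabulary with a reserved unknown token.
word_vocab = {UNK: 0, "the": 1, "patient": 2, "had": 3, "a": 4}

def encode_words(words):
    """Look up each word, falling back to the [UNK] ID when it is missing."""
    return [word_vocab.get(w, word_vocab[UNK]) for w in words]

known = encode_words(["the", "patient", "had", "a", "pharyngolaryngectomy"])
other = encode_words(["the", "patient", "had", "a", "containerization"])

print(known)           # [1, 2, 3, 4, 0]
print(known == other)  # True: two different sentences, identical input to the model
```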

Character-Level Tokenization

The opposite extreme tokenizes text one character at a time. For English text the vocabulary is tiny: 26 lowercase letters, 26 uppercase, 10 digits, punctuation marks, and special characters add up to fewer than 200 tokens. Character-level tokenization never produces unknown tokens as long as every character is in the vocabulary, because any text can be represented as a sequence of characters. It handles misspellings, neologisms, and any language that uses the same character set without any modification.

The drawback is sequence length. A 500-word document contains roughly 3,000 characters. A model with an 8,192-token context window could only process about 8,192 characters at the character level, roughly 1,300 words. More fundamentally, characters carry almost no semantic information individually. The letter "c" means nothing on its own. The model must learn to compose characters into meaningful units, then compose those units into words, phrases, and sentences. This works, but it requires deeper networks and more training data because the model is learning at a lower level of abstraction. In practice, character-level models are rarely used for general NLP because the computational cost per word of text is too high.
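A minimal character-level tokenizer illustrates both the tiny vocabulary and the long sequences. Here the vocabulary is built from the example sentence itself, which is a simplification; a real one would cover a full character set.

```python
text = "The cat sat on the mat."

# Vocabulary: one ID per distinct character seen in the text.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

print(len(char_vocab))    # 12 distinct characters
print(len(text.split()))  # 6 whitespace-separated words
print(len(char_ids))      # 23 tokens, one per character
```

Six words cost 23 tokens here, roughly four times what a word-level split would need.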

Subword Tokenization: The Modern Standard

Subword tokenization finds the sweet spot between word-level and character-level approaches. It maintains a vocabulary of typically 30,000 to 100,000 tokens that includes common words as single tokens, frequent word parts (prefixes, suffixes, stems) as subword tokens, and individual characters as a fallback. The word "unfortunately" might be a single token because it appears frequently enough. The word "uncharacteristically" might be split into "un", "character", "istic", "ally" because the full word is rare but each piece is common. A completely novel word like "bitcoinification" would be split into "bitcoin", "ification" or similar pieces, preserving meaningful subword structure.

Byte Pair Encoding (BPE)

BPE, originally a data compression algorithm from 1994, was adapted for NLP tokenization in 2015. The algorithm starts with a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent tokens into a new token. Starting from characters, the algorithm might first merge "t" and "h" into "th" because that pair appears most frequently in the training corpus. Then it might merge "th" and "e" into "the" because "the" is the most common English word. The process continues for a predetermined number of merge operations, building up a vocabulary of increasingly long, increasingly common token pieces.
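The merge loop can be sketched as follows. This is a simplified illustration on a toy word-frequency table, not the training code of any particular tokenizer; the function names and the corpus are made up for the example, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} table."""
    # Start with every word split into individual characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

toy_freqs = {"the": 10, "then": 4, "this": 3, "cat": 2}
merges, corpus = train_bpe(toy_freqs, 2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this toy corpus the first merge is "t" + "h" (17 occurrences) and the second is "th" + "e" (14 occurrences), mirroring the example in the text.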

GPT-2, GPT-3, GPT-4, and LLaMA all use variants of BPE. GPT-4's tokenizer has roughly 100,000 tokens. Common English words like "the," "and," "computer," and "processing" are single tokens. Less common words are split into familiar pieces. Technical jargon, foreign language text, and unusual strings are split into smaller pieces, down to individual characters or bytes if necessary. The tokenizer is typically trained on a corpus similar to, or sampled from, the language model's training data, so the merge operations reflect the frequency distribution of that data.
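How a trained BPE tokenizer splits new text can be sketched by replaying merge rules in the order they were learned. The merge list below is hypothetical and tiny; production tokenizers hold tens of thousands of rules and usually start from bytes rather than characters.

```python
def apply_bpe(word, merges):
    """Split a word into characters, then apply merge rules in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge rules, earliest learned first.
toy_merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(apply_bpe("the", toy_merges))    # ['the']
print(apply_bpe("thing", toy_merges))  # ['th', 'ing']
print(apply_bpe("xyz", toy_merges))    # ['x', 'y', 'z']
```

The three calls show the three cases described above: a common word becomes one token, a less common word splits into familiar pieces, and an unusual string falls back to single characters.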

WordPiece

WordPiece was developed at Google: it was first described in 2012 for Japanese and Korean voice search, used in Google's neural machine translation system in 2016, and later adopted by BERT. It uses a similar iterative approach to BPE but selects merges based on likelihood rather than raw frequency. Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data under a language model. In practice, this produces similar results to BPE with slightly different merge decisions. WordPiece tokens that begin a word are unmarked, while tokens that continue a word are prefixed with "##". The word "tokenization" might become ["token", "##ization"], making it clear that "##ization" is a continuation, not a standalone word.
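Once a WordPiece vocabulary exists, BERT-style tokenizers apply it with greedy longest-match-first lookup. The sketch below shows that lookup step only, with a hypothetical five-entry vocabulary; it does not implement the likelihood-based training described above.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
wp_vocab = {"token", "##ization", "##ize", "un", "##related"}

print(wordpiece_tokenize("tokenization", wp_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokenize", wp_vocab))      # ['token', '##ize']
print(wordpiece_tokenize("xyz", wp_vocab))           # ['[UNK]']
```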

Unigram and SentencePiece

The Unigram algorithm, developed by Taku Kudo at Google in 2018, takes the opposite approach from BPE. Instead of starting with characters and merging, it starts with a large candidate vocabulary and iteratively removes tokens that contribute least to the overall likelihood of the training data. The result is a vocabulary where every token is justified by its contribution to efficient text representation. SentencePiece is a library that implements both BPE and Unigram as language-independent tokenizers that operate directly on raw text, without assuming spaces between words. This makes it particularly suitable for languages like Japanese, Chinese, and Thai that do not use spaces.
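The two ideas behind Unigram, scoring a segmentation by the product of its token probabilities and pruning the tokens whose removal hurts least, can be sketched as below. The probabilities and the two-word corpus are invented and unnormalized, and the sketch omits the probability re-estimation that real Unigram training performs between pruning rounds, so treat it as an illustration rather than the algorithm as implemented in SentencePiece.

```python
import math

def unigram_segment(text, log_probs):
    """Return the most probable segmentation and its log-probability (Viterbi)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

def corpus_score(word_freqs, log_probs):
    """Frequency-weighted log-likelihood of the corpus under best segmentations."""
    return sum(f * unigram_segment(w, log_probs)[1] for w, f in word_freqs.items())

def removal_losses(word_freqs, log_probs):
    """Likelihood lost by removing each multi-character token on its own."""
    base = corpus_score(word_freqs, log_probs)
    losses = {}
    for tok in log_probs:
        if len(tok) == 1:
            continue  # keep single characters so every word stays segmentable
        reduced = {t: lp for t, lp in log_probs.items() if t != tok}
        losses[tok] = base - corpus_score(word_freqs, reduced)
    return losses

# Hypothetical, unnormalized token probabilities.
probs = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02}
probs.update({ch: 0.01 for ch in set("unrelated")})
log_probs = {t: math.log(p) for t, p in probs.items()}

tokens, score = unigram_segment("unrelated", log_probs)
print(tokens)  # ['un', 'related']

losses = removal_losses({"unrelated": 5, "related": 3}, log_probs)
print(sorted(losses, key=losses.get))  # least useful tokens come first
```

In this toy setup "rel" and "ated" are never part of a best segmentation, so removing either costs nothing and they would be pruned first, while "un" and "related" carry most of the likelihood.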

Vocabulary Size Tradeoffs

Choosing vocabulary size involves a fundamental tradeoff. Larger vocabularies mean more words are represented as single tokens, reducing sequence length and giving the model more direct access to word-level semantics. But larger vocabularies also mean more parameters in the embedding layer (each token needs its own embedding vector), longer training times, and sparser training signal for rare tokens. A vocabulary of 250,000 tokens would represent most English words as single tokens but would contain many tokens that appear too rarely for the model to learn good representations of them.
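The embedding cost is simple to estimate: one vector per token, so the parameter count is the vocabulary size times the embedding width. The width of 4,096 used below is an assumption for illustration, not a figure from any specific model, and models that use a separate output projection pay this cost twice.

```python
def embedding_params(vocab_size, d_model):
    """Number of parameters in a token embedding table."""
    return vocab_size * d_model

D_MODEL = 4096  # assumed embedding width

for vocab_size in (32_000, 100_000, 250_000):
    print(vocab_size, embedding_params(vocab_size, D_MODEL))
# 32000 131072000
# 100000 409600000
# 250000 1024000000
```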

Smaller vocabularies force more splitting, increasing sequence lengths and making the model work harder to compose meaning from pieces. But every token in a small vocabulary appears frequently enough for the model to learn a rich representation of it. The sweet spot for most of the models discussed here falls between roughly 30,000 and 100,000 tokens. BERT uses 30,522 tokens. GPT-2 uses 50,257. The original LLaMA uses 32,000. GPT-4 uses roughly 100,000. The trend toward larger vocabularies reflects the move toward multilingual models that need to efficiently represent text in many languages, each contributing its own set of common words and character sequences.

How Tokenization Affects Model Behavior

Tokenization creates subtle effects that ripple through model performance. Arithmetic is a well-known example. The number "1,234,567" might be tokenized as ["1", ",", "234", ",", "567"], while the same number written without commas, "1234567", might become ["12", "345", "67"], depending on the tokenizer. The model must learn to perform addition and multiplication on numbers that are split at arbitrary positions, which is harder than operating on individual digits or on complete numbers. This is one reason large language models struggle with precise arithmetic despite their otherwise strong general capabilities.
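A small illustration of why split position matters: the two functions below are not real tokenizers, only hypothetical chunking rules that group digits in threes from the left or from the right. Grouping from the right keeps chunks aligned with thousands, while grouping from the left does not.

```python
def chunk_left(digits, size=3):
    """Group digits in fixed-size chunks starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_right(digits, size=3):
    """Group digits in fixed-size chunks aligned to the right (place value)."""
    first = len(digits) % size
    head = [digits[:first]] if first else []
    return head + [digits[i:i + size] for i in range(first, len(digits), size)]

print(chunk_left("1234567"))   # ['123', '456', '7']
print(chunk_right("1234567"))  # ['1', '234', '567']
print(chunk_left("234567"))    # ['234', '567']
```

With left grouping, "1234567" and "234567" share no chunks even though they differ by a single leading digit, so the model sees two unrelated token sequences for closely related numbers.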

Tokenization also affects how models handle different languages. English, which dominates most training corpora, gets efficient tokenization: common English words are single tokens. Text in low-resource languages may be split into many more tokens per word, making the model's effective context window much shorter for those languages. A 100-word English passage might require 120 tokens, while its translation into Telugu or Amharic might require 350. This tokenization imbalance contributes to the performance gap between languages in multilingual models and directly affects the cost of processing non-English text in commercial APIs.
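One contributing factor can be shown without any tokenizer at all. Byte-level BPE tokenizers, such as the one used by GPT-2, start from UTF-8 bytes, and scripts like Telugu need three bytes per code point. When few merges were learned for a script, the token count stays close to the byte count. The sketch below compares the same word in both scripts; it measures bytes only and makes no claim about any specific tokenizer's output.

```python
def utf8_cost(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

english = "Telugu"
# The word "Telugu" written in Telugu script: six code points in the U+0C00 block.
telugu = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"

print(utf8_cost(english))  # (6, 6)
print(utf8_cost(telugu))   # (6, 18)
```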

Code tokenization presents its own challenges. Programming languages have different conventions than natural language: indentation is meaningful in Python, curly braces delimit blocks in C-family languages, and variable names follow camelCase or snake_case conventions. Tokenizers trained primarily on natural language may split variable names and keywords in ways that do not align with their semantic structure. Specialized code tokenizers or tokenizers trained on mixed text-and-code corpora handle these patterns more efficiently.
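For reference, the semantic structure of an identifier is what a naming-convention-aware splitter would recover. The helper below is a hypothetical sketch of such a splitter for camelCase and snake_case names; it is not part of any tokenizer mentioned here, but it shows the boundaries that a tokenizer trained mostly on natural language may fail to respect.

```python
import re

def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = []
    for chunk in name.split("_"):
        # Acronyms (HTTP), capitalized or lowercase words (Http, parse), digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return [p.lower() for p in parts]

print(split_identifier("parseHttpResponse"))  # ['parse', 'http', 'response']
print(split_identifier("max_token_count"))    # ['max', 'token', 'count']
print(split_identifier("HTTPServer"))         # ['http', 'server']
```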

Key Takeaway

Subword tokenization splits text into pieces that balance vocabulary size against sequence length, enabling models to handle any input while keeping common words as single tokens for efficiency.