How to Preprocess Text for NLP

Updated May 2026
Text preprocessing transforms raw, messy text into clean, standardized input that NLP models can work with effectively. It includes cleaning HTML and encoding artifacts, tokenizing text into meaningful units, normalizing case and format, removing noise like stopwords and punctuation, and reducing words to their base forms through stemming or lemmatization. The right preprocessing pipeline depends on your model: classical methods require extensive preprocessing, while modern transformers handle most normalization internally and need only minimal cleaning.

Raw text collected from the real world is messy. Web pages contain HTML tags, JavaScript, and CSS mixed with content. User-generated text includes typos, inconsistent capitalization, emojis, URLs, and non-standard abbreviations. Documents scanned with OCR contain recognition errors. Databases store text with encoding inconsistencies where the same character appears in different Unicode representations. Preprocessing converts all of this into a consistent format that downstream models can process reliably. Skipping preprocessing or doing it carelessly introduces noise that degrades model performance, sometimes dramatically.

Clean and Normalize Raw Text

The first step addresses the source-specific artifacts that contaminate raw text. Web-scraped text needs HTML tag removal (using libraries like BeautifulSoup or lxml rather than regular expressions, which fail on malformed HTML). JavaScript and CSS blocks, navigation menus, cookie banners, and footer boilerplate should be stripped to isolate the actual content. Common Python libraries like trafilatura and readability specialize in extracting clean article text from web pages.
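As a rough illustration of parser-based tag removal, the sketch below uses only Python's standard-library html.parser rather than BeautifulSoup or trafilatura. The names TextExtractor and strip_html are made up for this example, and the code does nothing about navigation menus or other boilerplate, which dedicated extraction libraries handle far better.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(strip_html(sample))  # Hello world
```

Joining chunks with a space is a simplification: a word split across inline tags would come out with a space in the middle.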

Unicode normalization standardizes characters that have multiple representations. The letter "e" with an accent can be encoded as a single character or as "e" followed by a combining accent mark. Unicode NFC normalization converts these to a canonical form so that identical text always has identical byte representation. This prevents situations where the same word appears as two different tokens because of invisible encoding differences. Python's unicodedata.normalize() function handles this.
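The effect is easy to see with two spellings of the same word. This short sketch uses unicodedata.normalize() as described above; the variable names are only for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # "é" stored as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but compare as different.
print(composed == decomposed)  # False

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)

# After NFC normalization both use the single-code-point form.
print(nfc_a == nfc_b)               # True
print(len(decomposed), len(nfc_b))  # 5 4
```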

Encoding errors produce garbled text when UTF-8 bytes are interpreted as Latin-1 or vice versa. The string "café" might appear as "cafÃ©" or "caf\xe9" when encoding is mishandled. The ftfy library ("fixes text for you") automatically detects and repairs common encoding errors. Running ftfy early in the pipeline prevents these artifacts from propagating into tokenization and modeling.
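To see where this kind of garbling comes from, the following standard-library sketch reproduces one common case by hand: UTF-8 bytes decoded as Latin-1. It reverses exactly that one mistake and assumes you already know which mistake was made; a library such as ftfy is the safer choice in practice because it works out which repair applies.

```python
original = "café"

# Simulate the mistake: encode as UTF-8, then decode the bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reverse that specific mistake by undoing the two steps in the opposite order.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```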

Tokenize the Text

Tokenization splits continuous text into discrete units for processing. For classical NLP pipelines, word tokenization splits on whitespace and punctuation: "I can't believe it!" becomes ["I", "ca", "n't", "believe", "it", "!"]. The treatment of contractions, hyphenated words, and punctuation varies by tokenizer. spaCy's tokenizer handles English contractions intelligently, splitting "can't" into "ca" and "n't" (which preserves the negation) rather than "can" and "'t".
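A regular expression can approximate this behaviour for simple English text. The sketch below is a toy, not spaCy's tokenizer: TOKEN_RE and word_tokenize are illustrative names, and the pattern only knows about the "n't" contraction written with a straight apostrophe.

```python
import re

# Alternatives are tried in order:
#   1. a word stem that is immediately followed by "n't" (e.g. "ca" in "can't")
#   2. the "n't" clitic itself
#   3. any other run of word characters
#   4. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def word_tokenize(text):
    """Toy word tokenizer: splits on whitespace and punctuation, keeps "n't" whole."""
    return TOKEN_RE.findall(text)


print(word_tokenize("I can't believe it!"))
# ['I', 'ca', "n't", 'believe', 'it', '!']
```

Other contractions such as "it's" are split into three pieces by this pattern, which is one reason to prefer a maintained tokenizer for real work.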

Sentence tokenization splits text into sentences, which is needed for tasks that operate at the sentence level and for creating training examples. Simple splitting on periods fails because periods appear in abbreviations ("Dr. Smith"), decimal numbers ("3.14"), and ellipses ("wait..."). NLTK's Punkt tokenizer uses an unsupervised algorithm trained on text patterns to identify sentence boundaries, handling these cases with reasonable accuracy. spaCy's sentence detection uses a trained model that considers context rather than relying on punctuation rules alone.
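The failure cases above can be partly handled with a small rule-based splitter. This is a sketch under stated assumptions, not how Punkt or spaCy work: the ABBREVIATIONS set is a tiny hand-written list, and a boundary is only accepted when the punctuation is followed by whitespace and an uppercase letter.

```python
import re

# Hypothetical, deliberately incomplete abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
BOUNDARY_RE = re.compile(r"[.!?]+(?=\s+[A-Z])")


def split_sentences(text):
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." and friends do not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith measured 3.14 cm. Wait... It worked!"
print(split_sentences(text))
# ['Dr. Smith measured 3.14 cm.', 'Wait...', 'It worked!']
```

The decimal point in "3.14" is never a candidate because it is not followed by whitespace, while the abbreviation check is what keeps "Dr." attached to "Smith".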

For transformer models, the tokenizer is part of the model itself and should not be replaced with a custom tokenizer. BERT uses WordPiece, GPT models use BPE, and each tokenizer's vocabulary is fixed before the model is trained on it. A different tokenizer maps text to token IDs that point at the wrong embeddings, so the model produces garbage outputs. When working with transformers, use the model's built-in tokenizer and limit preprocessing to the cleaning and normalization steps that happen before tokenization.

Apply Text Normalization

Case normalization converts all text to lowercase, eliminating the distinction between "Apple," "apple," and "APPLE." For bag-of-words and TF-IDF models, lowercasing reduces vocabulary size and consolidates word counts. The tradeoff is losing information: "Apple" (the company) and "apple" (the fruit) become indistinguishable. Named entity recognition accuracy drops if the model relies on capitalization cues. For transformer models, casing is determined by the checkpoint: cased models expect the original capitalization, while uncased models lowercase inside their own tokenizer, so manual lowercasing should generally not be applied to the input.

Contraction expansion converts "can't" to "cannot," "I'm" to "I am," and "they've" to "they have." This is useful for classical models where "cannot" and "can't" would otherwise be treated as completely different words. A simple dictionary of contraction mappings handles standard English contractions. Informal text includes non-standard contractions ("gonna," "wanna," "gotta") that may also benefit from expansion depending on the application.
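A dictionary-based expander along these lines might look like the sketch below. The CONTRACTIONS mapping is a four-entry sample chosen for illustration, the function name is hypothetical, and only straight apostrophes are matched.

```python
import re

# Small illustrative mapping; a real list would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "i'm": "i am",
    "they've": "they have",
    "gonna": "going to",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital letter, e.g. "I'm" -> "I am".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(replace, text)


print(expand_contractions("I'm sure they've seen it, but I can't."))
# I am sure they have seen it, but I cannot.
```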

Number handling depends on the task. For topic classification, replacing all numbers with a generic token like "NUM" reduces vocabulary without losing much signal. For information extraction, preserving exact numbers is essential. Date and time normalization converts various formats ("January 5," "1/5/2026," "Jan 5th") to a standard representation. Currency normalization handles "$5M," "$5,000,000," and "five million dollars" as equivalent. These normalizations are most important for classical models; transformer models learn to handle format variation during pre-training.
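For the generic-token approach, a single regular expression is often enough. The sketch below assumes numbers are written with digits and optional comma or period separators; NUMBER_RE and mask_numbers are illustrative names, and spelled-out numbers, dates, and currencies are out of scope.

```python
import re

# Digits, optionally followed by groups like ",000" or ".5".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text, token="NUM"):
    """Replace every digit-based number with a placeholder token."""
    return NUMBER_RE.sub(token, text)


print(mask_numbers("Revenue rose 12.5% to $5,000,000 in 2026."))
# Revenue rose NUM% to $NUM in NUM.
```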

Remove Noise and Filter Content

Stopword removal eliminates high-frequency, low-information words like "the," "is," "at," "and," "a." These words constitute roughly 20% to 30% of typical English text but carry little topical information. Removing them reduces feature dimensionality for bag-of-words models and focuses the model on content words. Standard stopword lists cover articles, prepositions, common pronouns, and auxiliary verbs. NLTK's English list has about 180 words and spaCy's about 330, with exact counts varying by library version.

Stopword removal must be applied carefully. The phrase "to be or not to be" becomes meaningless after stopword removal. The word "not" is a stopword in many lists but carries critical sentiment information. For sentiment analysis, negation words should be retained. For question answering, question words ("who," "what," "where") should be retained. For transformer models, stopword removal is generally unnecessary and potentially harmful because the model's attention mechanism learns to assign appropriate weight to each token, including function words that contribute to meaning.
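One way to keep negation while still filtering is to pass an explicit keep-list. The sketch below uses a tiny hand-written STOPWORDS set rather than the NLTK or spaCy lists, and all names are illustrative.

```python
# Tiny illustrative stopword set; real lists are much larger.
STOPWORDS = {"the", "is", "at", "and", "a", "not", "this", "was", "to"}

# Words to retain for sentiment analysis even though they are stopwords.
NEGATIONS = frozenset({"not", "no", "never"})


def remove_stopwords(tokens, stopwords=STOPWORDS, keep=frozenset()):
    """Drop stopwords unless they appear in the keep set."""
    return [
        tok for tok in tokens
        if tok.lower() not in stopwords or tok.lower() in keep
    ]


tokens = "this movie was not good".split()
print(remove_stopwords(tokens))                  # ['movie', 'good']
print(remove_stopwords(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Without the keep-list the review reads as positive, which is exactly the failure described above.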

Domain-specific noise removal addresses patterns unique to the data source. Email preprocessing removes signatures, disclaimers, and forwarded message headers. Social media preprocessing removes @mentions, #hashtags (or converts them to words by removing the # symbol), URLs, and retweet markers. Code preprocessing might remove comments and whitespace or preserve them depending on the task. Each domain requires its own noise patterns, discovered by inspecting representative samples of the data.
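For social media text, the patterns listed above can be sketched with a few regular expressions. The patterns and the clean_post name are assumptions made for this example; real data usually needs adjustments, for instance so that the mention pattern does not clip email addresses.

```python
import re

RETWEET_RE = re.compile(r"^RT\s+@\w+:\s*")  # leading "RT @user: " marker
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_post(text):
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(r"\1", text)  # keep the word, drop the "#"
    return " ".join(text.split())       # collapse leftover whitespace


print(clean_post("RT @nlp_news: New #tokenizer released! https://example.com/post"))
# New tokenizer released!
```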

Apply Morphological Reduction

Stemming reduces words to approximate root forms by stripping suffixes. The Porter stemmer, the most widely used algorithm, reduces "running" to "run," "easily" to "easili," and "studies" to "studi." The Snowball stemmer handles multiple languages. Stemming is fast, requires no dictionary, and effectively consolidates word variants. The tradeoff is imprecision: "university" and "universe" both stem to "univers" despite having unrelated meanings, and stems like "easili" are not real words, which can confuse debugging and interpretation.

Lemmatization reduces words to their dictionary base form (lemma) using morphological analysis and a vocabulary. "Running" becomes "run," "better" becomes "good," "studies" becomes "study," and "was" becomes "be." Unlike stemming, lemmatization produces real words and correctly handles irregular forms. It requires a dictionary or model (spaCy and NLTK's WordNet lemmatizer are common choices) and is slower than stemming, but the accuracy improvement is significant. Lemmatization also requires knowing the word's part of speech: "saw" lemmatizes to "see" as a verb but remains "saw" as a noun.
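The part-of-speech dependence can be illustrated with a toy lookup table keyed on (word, tag) pairs. This is only a sketch: LEMMAS holds the handful of examples from the paragraph above, the tag names are arbitrary, and real lemmatizers such as spaCy's or NLTK's WordNet lemmatizer rely on full morphological resources instead.

```python
# Toy lemma table keyed by (lowercased word, part-of-speech tag).
LEMMAS = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("studies", "NOUN"): "study",
    ("was", "VERB"): "be",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("saw", "VERB"))    # see
print(lemmatize("saw", "NOUN"))    # saw
print(lemmatize("Better", "ADJ"))  # good
```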

For transformer models, neither stemming nor lemmatization should be applied. The model's subword tokenizer handles morphological variation naturally. Related forms such as "running" and "runs" often share subword pieces, and pre-training has already exposed the model to irregular forms like "ran" in context, so it learns these morphological relationships without help. Applying stemming or lemmatization to transformer input tends to degrade performance by destroying information the model can use and by producing text unlike its training data.
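How subword pieces can expose shared stems is easier to see with a toy greedy longest-match splitter in the style of WordPiece. The vocabulary below is invented for this example and is far smaller than any real model's. Which pieces a real tokenizer produces, including whether an irregular form like "ran" shares anything with "running", depends entirely on its trained vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
VOCAB = {"run", "ran", "##ning", "##s"}

for word in ("running", "runs", "ran"):
    print(word, wordpiece(word, VOCAB))
# running ['run', '##ning']
# runs ['run', '##s']
# ran ['ran']
```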

Validate and Inspect the Output

After preprocessing, inspect a random sample of processed texts to verify quality. Check that important content was preserved, that cleaning did not remove meaningful text, and that the output is consistent across documents. Common issues include over-aggressive HTML stripping that removes content inside tags, encoding fixes that introduce new errors, and stopword lists that remove domain-important terms.

Compare token distributions before and after preprocessing. A significant drop in vocabulary size suggests that normalization is consolidating variants, though it can also mean that distinct words are being merged. A significant drop in average document length might indicate that too much content is being removed. Statistics on token frequencies help identify preprocessing artifacts: if a noise token like "[deleted]" or "RT" appears among the most frequent terms, the cleaning step missed a pattern.
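These comparisons need nothing more than a token counter. The sketch below assumes each document is already a list of tokens; corpus_stats and the two tiny corpora are made up for illustration.

```python
from collections import Counter


def corpus_stats(docs, top_n=3):
    """Summarize a tokenized corpus: vocabulary size, average length, top tokens."""
    counts = Counter(tok for doc in docs for tok in doc)
    avg_len = sum(len(doc) for doc in docs) / len(docs) if docs else 0.0
    return {
        "vocab_size": len(counts),
        "avg_doc_len": avg_len,
        "top_tokens": counts.most_common(top_n),
    }


raw = [["RT", "The", "cats", "Run"], ["RT", "the", "cat", "runs"]]
processed = [["the", "cat", "run"], ["the", "cat", "run"]]

raw_stats = corpus_stats(raw)
processed_stats = corpus_stats(processed)

print(raw_stats["vocab_size"], processed_stats["vocab_size"])    # 7 3
print(raw_stats["avg_doc_len"], processed_stats["avg_doc_len"])  # 4.0 3.0
print(raw_stats["top_tokens"][0])                                # ('RT', 2)
```

Here the most frequent raw token is the retweet marker, the kind of signal that points to a missing cleaning rule.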

Version and document your preprocessing pipeline. Different preprocessing choices produce different results, and reproducing results requires reproducing the exact preprocessing. Record which library versions were used, which stopword list was applied, what cleaning patterns were implemented, and any domain-specific rules. This documentation is essential for debugging model performance issues, which frequently trace back to preprocessing rather than model architecture.
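One lightweight way to record these choices is a small manifest stored next to the processed data. The field names in the config below are hypothetical and only show the idea; the hash gives a quick check that two runs used the same settings.

```python
import hashlib
import json
import sys


def build_manifest(config):
    """Bundle a preprocessing config with the Python version and a stable hash."""
    payload = json.dumps(config, sort_keys=True)  # key order must not affect the hash
    return {
        "python": sys.version.split()[0],
        "config": config,
        "config_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


# Hypothetical settings; record whatever your pipeline actually uses.
config = {
    "unicode_form": "NFC",
    "lowercase": True,
    "stopword_list": "custom-v1",
    "cleaning_patterns": ["urls", "mentions", "retweet_markers"],
}

manifest = build_manifest(config)
print(json.dumps(manifest, indent=2))
```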

When to Skip Preprocessing

The rise of transformer models has reduced the need for traditional preprocessing dramatically. BERT, GPT, and similar models were trained on raw text with minimal cleaning. Their subword tokenizers handle out-of-vocabulary words, their attention mechanisms learn to weight stopwords appropriately, and their pre-trained representations already capture morphological relationships. For these models, the only preprocessing that consistently helps is source-specific cleaning: removing HTML tags, fixing encoding errors, and stripping non-content boilerplate. Traditional steps like stopword removal, stemming, and lowercasing typically hurt performance because they destroy information the model can use.

Classical models (Naive Bayes, logistic regression, SVM with TF-IDF features) still benefit substantially from full preprocessing. These models have no internal mechanism for learning that "running" and "runs" are related, so stemming or lemmatization can provide a meaningful accuracy boost by pooling the counts of related forms. Stopword removal reduces dimensionality and focuses the model on discriminative content words. For these models, the full preprocessing pipeline remains essential.

Key Takeaway

Text preprocessing cleans and standardizes raw text for NLP models, with classical models benefiting from extensive preprocessing (stemming, stopwords, normalization) while transformer models need only minimal cleaning since they handle most normalization internally.