How Text Classification Works

Updated May 2026
Text classification is the NLP task of assigning one or more category labels to a piece of text based on its content. It is among the most widely deployed NLP applications: it filters billions of spam emails daily, routes customer support tickets, categorizes news articles, detects toxic content, and identifies user intent in chatbots. Modern text classification typically fine-tunes pre-trained transformer models on labeled examples, often reaching accuracy above 95% on standard benchmarks, in some cases with as few as a thousand training examples.

Text classification is conceptually simple: given a text input, predict which category it belongs to. But the range of applications built on this simple formulation is enormous. Every email spam filter is a text classifier deciding between "spam" and "not spam." Every content moderation system is a text classifier deciding between "safe," "potentially harmful," and "violating." Every chatbot intent recognizer is a text classifier mapping user messages to intents like "check balance," "report problem," or "cancel subscription." The simplicity of the formulation, combined with the practical value of automating these decisions, makes text classification the workhorse of applied NLP.

Define Your Categories and Collect Labeled Data

The first decision is what categories to classify into. This seems obvious but frequently causes problems in practice. For single-label classification, categories must be mutually exclusive, so that every text has exactly one correct label. For multi-label classification, where a text can belong to several categories at once, each category still needs a clear definition so that annotators apply it consistently. An email classification system might use "primary," "social," "promotional," "updates," and "spam." A customer support system might use 20 to 50 intent categories like "billing question," "technical issue," "account access," and "feature request."

Category granularity involves tradeoffs. Fewer categories are easier to classify accurately but less useful. A system that classifies support tickets as just "complaint" or "question" achieves high accuracy but provides little routing information. A system with 200 fine-grained categories provides precise routing but requires much more training data and achieves lower accuracy because the distinctions between closely related categories become subtle. Most production systems use 10 to 50 categories, sometimes organized hierarchically: first classify into broad categories, then into subcategories within each broad category.

Labeled data collection can use existing signals (star ratings for sentiment, folder labels for email classification, department routing for support tickets) or require manual annotation. Manual annotation is expensive: professional annotators cost $15 to $40 per hour, and each example takes 10 to 60 seconds to label depending on complexity. Active learning reduces annotation costs by having the model identify the examples it is most uncertain about and prioritizing those for human labeling, focusing annotation effort where it will improve the model most.
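The selection step in active learning can be sketched in a few lines. This is a minimal illustration that assumes the model already produces a probability distribution per unlabeled example and that uncertainty is measured by entropy; the function names and the sample probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, budget):
    """Return the indices of the `budget` most uncertain unlabeled examples."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical model outputs for four unlabeled texts over three classes.
predictions = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # somewhat uncertain
    [0.34, 0.33, 0.33],  # close to uniform, the least certain
]
to_label = select_for_annotation(predictions, budget=2)
print(to_label)  # indices of the examples to send to annotators first
```

Other uncertainty measures, such as the margin between the top two classes, would slot into the same structure by swapping the sort key.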

Convert Text to Numerical Features

Classical approaches represent text as feature vectors using bag-of-words or TF-IDF representations. Each document becomes a sparse vector where each dimension corresponds to a word in the vocabulary, and the value represents the word's frequency (bag-of-words) or importance-weighted frequency (TF-IDF). These representations discard word order but retain topic information: documents about sports contain different words than documents about finance, and the word frequencies reflect these topical differences clearly enough for many classification tasks.
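To make the weighting concrete, here is a small sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency, with no smoothing). Libraries use slightly different formulas, so the exact numbers below are specific to this simplified version, and the whitespace tokenizer is an assumption made for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    "the team won the match",
    "the bank raised interest rates",
    "the striker scored in the match",
]
vectors = tfidf_vectors(docs)
# "the" appears in every document, so its weight is zero; "team" appears
# in only one document, so it outweighs "match", which appears in two.
print(vectors[0])
```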

N-gram features extend bag-of-words to capture short phrases. Instead of just counting individual words, the system also counts word pairs (bigrams) or triples (trigrams). The bigram "not good" captures negative sentiment that the individual words "not" and "good" do not. The trigram "New York Times" identifies a specific entity that the individual words do not. N-gram features dramatically increase the feature space (a vocabulary of 50,000 words produces 2.5 billion possible bigrams), so feature selection or hashing is necessary to keep the representation manageable.
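The bigram extraction and the hashing trick can be sketched as follows. The bucket count of 2**18 and the use of MD5 as the hash function are illustrative assumptions; any stable hash would work, and distinct features may occasionally collide in the same bucket.

```python
import hashlib

def ngrams(tokens, n):
    """Return all contiguous n-token phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hashed_features(text, num_buckets=2**18):
    """Count unigrams and bigrams in a fixed-size sparse vector via hashing."""
    tokens = text.lower().split()
    vector = {}
    for feature in tokens + ngrams(tokens, 2):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        vector[bucket] = vector.get(bucket, 0) + 1
    return vector

bigrams = ngrams("the food was not good".split(), 2)
print(bigrams)  # includes "not good" as a single feature
features = hashed_features("the food was not good")
print(len(features), "non-zero buckets out of", 2**18)
```

The vector size stays fixed at the bucket count no matter how many distinct n-grams appear in the corpus, which is what keeps the representation manageable.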

Modern approaches skip explicit feature engineering entirely. A pre-trained transformer like BERT processes the raw text and produces a contextualized representation that implicitly encodes syntax, semantics, entity information, sentiment, topic, and style. The classification head on top of the transformer is a simple linear layer that maps this rich representation to class probabilities. The transformer handles all the feature extraction that classical approaches required explicit engineering for, and it does so using representations learned from billions of words of training text.
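The classification head itself is small enough to write out by hand. The sketch below uses a made-up 4-dimensional vector as a stand-in for the transformer's output (BERT-base produces 768 dimensions) and hypothetical weights; in a real model the weights are learned during fine-tuning.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(embedding, weights, biases):
    """Linear layer followed by softmax: one weight row and one bias per class."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

embedding = [0.5, -1.0, 2.0, 0.1]       # stand-in for the transformer output
weights = [[0.2, -0.1, 0.4, 0.0],       # hypothetical weights, class 0
           [-0.3, 0.5, -0.2, 0.1],      # class 1
           [0.0, 0.0, 0.1, -0.4]]       # class 2
biases = [0.0, 0.1, -0.1]
probs = classification_head(embedding, weights, biases)
print(probs)  # three probabilities that sum to 1
```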

Train and Evaluate the Classifier

With classical features, common classifiers include Naive Bayes (fast, works well with small data, assumes feature independence), logistic regression (fast, interpretable, handles large feature spaces well), support vector machines (strong performance, especially with TF-IDF features), and random forests (handles non-linear relationships, resistant to overfitting). For TF-IDF features with moderate-sized datasets, logistic regression and SVMs typically perform best, achieving 85% to 92% accuracy on standard topic classification benchmarks.
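As an illustration of the simplest of these classifiers, here is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written from scratch on a toy dataset. The class name, the whitespace tokenizer, and the four training texts are assumptions for the example; a production system would more likely use an established library implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for word in text.lower().split():
                if word in self.vocab:                      # skip unseen words
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

texts = ["win a free prize now", "free money click now",
         "meeting moved to monday", "lunch on monday with the team"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("free prize inside"))
print(model.predict("team meeting monday"))
```

Summing log probabilities per word is where the feature independence assumption shows up: each word contributes to the score without regard to its neighbors.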

With transformer models, the standard approach is fine-tuning: load a pre-trained model, add a classification layer, and train on labeled examples using cross-entropy loss. The pre-trained weights are adjusted during fine-tuning, but with a very small learning rate (2e-5 to 5e-5) to preserve the pre-trained knowledge. Training typically runs for 2 to 5 epochs. With sufficient labeled data (1,000+ examples per class), fine-tuned transformers achieve 93% to 97% accuracy on most classification tasks, consistently outperforming classical approaches by 3 to 8 percentage points.
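A fine-tuning run along these lines might look like the sketch below. It assumes the Hugging Face transformers library (with PyTorch) is installed, which the text above does not specify, and it assumes the datasets are already tokenized and include a label column. The model name, batch size, and output directory are illustrative choices, not recommendations from the source.

```python
LEARNING_RATE = 2e-5   # small, to preserve pre-trained knowledge
NUM_EPOCHS = 3         # within the typical 2 to 5 range

def fine_tune(train_dataset, eval_dataset, num_labels,
              model_name="bert-base-uncased"):
    """Fine-tune a pre-trained transformer for sequence classification."""
    # Imported here so the module loads even where transformers is absent.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # Loads pre-trained weights and adds a freshly initialized
    # classification layer with `num_labels` outputs.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)

    args = TrainingArguments(
        output_dir="classifier-out",        # hypothetical output path
        learning_rate=LEARNING_RATE,
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=16,     # assumed; tune to your hardware
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset,
                      eval_dataset=eval_dataset)
    trainer.train()  # cross-entropy loss is applied by the model itself
    return trainer
```

Calling fine_tune downloads the pre-trained weights on first use, so it needs network access and, realistically, a GPU for anything beyond a small dataset.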

Few-shot and zero-shot classification have become practical with large language models. Zero-shot classification asks the model to classify text into categories it has never been explicitly trained on, using only the category names as guidance. A large language model can classify a news article as "politics," "sports," "technology," or "entertainment" without any labeled training examples because it understands these category concepts from pre-training. Few-shot classification provides a handful of labeled examples (2 to 10 per class) in the prompt, achieving accuracy that approaches fine-tuned models for many tasks. This eliminates the data collection bottleneck for applications where labeled data is scarce or expensive.
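The difference between zero-shot and few-shot prompting comes down to what goes into the prompt. The sketch below builds both kinds of prompt and parses a reply; the prompt wording and function names are hypothetical, and the call to an actual language model is deliberately left out because it depends on the provider.

```python
def build_prompt(text, categories, examples=()):
    """Build a zero-shot prompt, or a few-shot one if examples are given."""
    lines = ["Classify the text into exactly one of these categories: "
             + ", ".join(categories) + ".", ""]
    for example_text, label in examples:       # labeled demonstrations
        lines += [f"Text: {example_text}", f"Category: {label}", ""]
    lines += [f"Text: {text}", "Category:"]    # the model completes this line
    return "\n".join(lines)

def parse_label(response, categories):
    """Map a free-text model reply back to a known category, else None."""
    cleaned = response.strip().lower()
    for category in categories:
        if cleaned.startswith(category.lower()):
            return category
    return None

categories = ["politics", "sports", "technology", "entertainment"]
zero_shot = build_prompt("The striker scored twice in the final.", categories)
few_shot = build_prompt(
    "The striker scored twice in the final.", categories,
    examples=[("Parliament passed the budget bill.", "politics"),
              ("The new phone ships with a faster chip.", "technology")])
print(few_shot)
```

Parsing the reply against the known category list matters in practice, because a language model can answer with text that is not one of the allowed labels.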

Optimize for Production

Class imbalance is the most common practical problem in text classification. In spam detection, only 2% to 5% of emails might be spam. In fraud detection, fewer than 0.1% of transactions might be fraudulent. A model that simply predicts "not spam" for everything achieves 95%+ accuracy but catches zero spam. Techniques for handling imbalance include oversampling the minority class (duplicating or synthetically generating minority examples), undersampling the majority class (discarding majority examples), adjusting class weights in the loss function (penalizing minority class errors more heavily), and using evaluation metrics like F1 score that account for both precision and recall rather than overall accuracy.
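The gap between accuracy and F1 on imbalanced data is easy to verify with a worked example. The sketch below scores the "always predict not spam" baseline on a hypothetical set of 100 emails with 5% spam, and computes inverse-frequency class weights using a common heuristic (total examples divided by number of classes times class count); other weighting schemes exist.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * n)
            for label, n in counts.items()}

y_true = ["spam"] * 5 + ["ham"] * 95      # 5% spam
always_ham = ["ham"] * 100                # baseline that never flags spam
accuracy = sum(t == p for t, p in zip(y_true, always_ham)) / len(y_true)
_, recall, f1 = precision_recall_f1(y_true, always_ham, positive="spam")
weights = balanced_class_weights(y_true)
print(accuracy, recall, f1)  # high accuracy, but zero recall and zero F1
print(weights)               # spam errors weighted far more than ham errors
```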

Production inference speed often matters more than peak accuracy. A content moderation system processing millions of social media posts needs to classify each post in milliseconds. Model distillation trains a smaller, faster student model to mimic the predictions of a large, accurate teacher model. The student typically retains 95% to 99% of the teacher's accuracy while running 5x to 20x faster. Runtimes such as ONNX Runtime and TensorRT optimize model execution for specific hardware. Caching predictions for common or repeated inputs avoids redundant computation.
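Prediction caching is the easiest of these optimizations to sketch. The example below uses an in-process cache from the Python standard library and a keyword rule as a stand-in for the real model; the normalization step, cache size, and function names are assumptions, and a multi-server deployment would more likely use a shared cache.

```python
from functools import lru_cache

model_calls = 0  # counts how often the expensive model actually runs

def slow_model_predict(text):
    """Stand-in for an expensive model call (hypothetical keyword rule)."""
    global model_calls
    model_calls += 1
    return "violating" if "buy followers" in text else "safe"

def normalize(text):
    """Collapse case and whitespace so trivially different inputs share a key."""
    return " ".join(text.lower().split())

@lru_cache(maxsize=100_000)
def _cached_predict(normalized_text):
    return slow_model_predict(normalized_text)

def classify(text):
    return _cached_predict(normalize(text))

results = [classify("Buy followers NOW"),
           classify("buy   followers now"),   # cache hit after normalization
           classify("Nice photo!")]
print(results, "model calls:", model_calls)
```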

Monitoring and retraining keep the classifier accurate over time. Language evolves, user behavior shifts, and the distribution of input text changes. A classifier trained on 2024 data may perform poorly on 2026 data because new topics emerge, slang changes, and communication patterns shift. Logging model predictions alongside ground truth (obtained through sampling and manual review) enables ongoing accuracy measurement. When accuracy drops below a threshold, retraining on fresh data restores performance. A/B testing new model versions against the current production model ensures that updates actually improve performance before full deployment.
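The accuracy check behind a retraining trigger can be as simple as a rolling window over manually reviewed predictions. The sketch below is one possible shape for it; the class name, window size, and 90% threshold are illustrative assumptions rather than recommended values.

```python
from collections import deque

class AccuracyMonitor:
    def __init__(self, window_size=500, threshold=0.90):
        self.outcomes = deque(maxlen=window_size)  # oldest results drop off
        self.threshold = threshold

    def record(self, predicted, reviewed_label):
        """Log whether a sampled prediction matched the human-reviewed label."""
        self.outcomes.append(predicted == reviewed_label)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a few early errors do not raise an alarm.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window_size=10, threshold=0.9)
for _ in range(10):
    monitor.record("spam", "spam")    # ten correct predictions
healthy = monitor.needs_retraining()
for _ in range(2):
    monitor.record("spam", "ham")     # two errors push out two correct results
degraded = monitor.needs_retraining()
print(healthy, degraded, monitor.accuracy())
```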

Types of Text Classification

Binary classification assigns text to one of two categories: spam or not spam, positive or negative, relevant or irrelevant. Binary tasks are the most common in practice and the easiest to achieve high accuracy on because the model only needs to learn one decision boundary. Multi-class classification assigns text to one of several mutually exclusive categories. Topic classification (sports, politics, technology, science) is a typical multi-class problem. As the number of classes increases, accuracy typically decreases because the distinctions between similar classes become subtler and the training data per class becomes sparser.

Multi-label classification allows text to belong to multiple categories simultaneously. A news article about an AI company's stock price might be labeled "technology," "business," and "artificial intelligence." Multi-label classification requires a different modeling approach: instead of a softmax output that assigns probabilities across mutually exclusive classes, the model uses independent sigmoid outputs for each class, with each output predicting the probability that the text belongs to that class. The threshold for assigning a label (typically 0.5) can be tuned per class to balance precision and recall for each category.
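The independent sigmoid outputs and per-class thresholds can be illustrated with a short sketch. The logits below are made up for the news article example, and the tuned threshold values are hypothetical; in practice they would be chosen on a validation set.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)

def predict_labels(logits, thresholds, default_threshold=0.5):
    """Assign every label whose independent probability clears its threshold."""
    return [label for label, logit in logits.items()
            if sigmoid(logit) >= thresholds.get(label, default_threshold)]

# Hypothetical per-class logits for one news article.
logits = {"technology": 2.0, "business": 0.3,
          "sports": -3.0, "artificial intelligence": -0.2}

default_labels = predict_labels(logits, thresholds={})
tuned_labels = predict_labels(
    logits, thresholds={"artificial intelligence": 0.4, "business": 0.6})
print(default_labels)  # labels assigned at the default 0.5 threshold
print(tuned_labels)    # lowering or raising thresholds changes the label set
```

Unlike softmax, the probabilities here do not need to sum to 1, which is what lets several labels be assigned at once, or none at all.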

Hierarchical classification organizes categories into a tree structure. A product review might first be classified as "electronics," "clothing," or "books"; reviews in the electronics category are then further classified into "phones," "laptops," or "headphones." Hierarchical classification can use a cascade of classifiers (one per level) or a single flat classifier with hierarchical constraints (if the model predicts "phones," it implicitly predicts "electronics"). The hierarchical approach is useful when the category space is large and naturally organized, such as product catalogs, library classification systems, or medical diagnosis codes.
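The cascade variant can be sketched as a top-level classifier plus a lookup of per-category subclassifiers. The keyword rules below are hypothetical stand-ins for trained models, used only so the example runs on its own.

```python
def cascade_classify(text, top_level, sub_level):
    """Classify broadly first, then refine with that category's classifier."""
    category = top_level(text)
    refine = sub_level.get(category)   # not every category has a second level
    subcategory = refine(text) if refine else None
    return category, subcategory

def top_level(text):
    """Stand-in broad classifier (a real system would use a trained model)."""
    keywords = ("battery", "screen", "bluetooth")
    return "electronics" if any(k in text.lower() for k in keywords) else "books"

def electronics_sub(text):
    """Stand-in subclassifier for the electronics branch."""
    text = text.lower()
    if "bluetooth" in text or "earbuds" in text:
        return "headphones"
    if "keyboard" in text:
        return "laptops"
    return "phones"

sub_level = {"electronics": electronics_sub}
review_a = cascade_classify("The Bluetooth earbuds have great battery life",
                            top_level, sub_level)
review_b = cascade_classify("A gripping novel from start to finish",
                            top_level, sub_level)
print(review_a, review_b)
```

One consequence of this design is that a mistake at the top level cannot be corrected further down, since the text is only ever shown to the subclassifier for the predicted branch.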

Key Takeaway

Text classification assigns category labels to documents using patterns learned from training data. Fine-tuned transformer models often exceed 95% accuracy on standard benchmarks, and zero-shot approaches with large language models reduce or remove the need for labeled data in many applications.