NLP Evaluation Metrics
Classification Metrics
Accuracy
Accuracy is the simplest metric: the percentage of predictions that are correct. If a spam detector correctly classifies 950 out of 1,000 emails, its accuracy is 95%. Accuracy is intuitive and easy to compute, but it can be deeply misleading for imbalanced datasets. If only 2% of emails are spam, a model that labels everything as "not spam" achieves 98% accuracy while catching zero spam. This makes accuracy useless for evaluating the model's actual purpose, which is detecting the rare spam messages. For balanced datasets where each class appears with roughly equal frequency, accuracy is a reasonable primary metric. For imbalanced datasets, which are the norm in real-world NLP, it should always be supplemented with precision, recall, and F1.
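The imbalance problem above can be reproduced in a few lines. This is a minimal sketch on made-up data: the helper name `accuracy` and the 1,000-email dataset are assumptions for illustration, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical dataset: 1,000 emails, 2% spam (1 = spam, 0 = not spam).
y_true = [1] * 20 + [0] * 980

# A degenerate "model" that labels every email as not spam.
always_not_spam = [0] * 1000

baseline_accuracy = accuracy(y_true, always_not_spam)
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, always_not_spam))

print(baseline_accuracy)  # 0.98
print(spam_caught)        # 0
```

The baseline scores 98% without detecting a single spam message, which is why accuracy alone says little on imbalanced data.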
Precision, Recall, and F1 Score
Precision measures the accuracy of positive predictions: of all the emails the model labeled as spam, what fraction actually were spam? A precision of 0.90 means that 90% of the model's spam predictions are correct, and 10% are false alarms (legitimate emails incorrectly flagged). High precision means few false positives. Precision matters most when false positives are costly: flagging a legitimate email as spam might cause someone to miss an important message.
Recall measures coverage of the actual positives: of all the emails that actually were spam, what fraction did the model catch? A recall of 0.80 means the model catches 80% of spam and misses 20%. High recall means few false negatives. Recall matters most when false negatives are costly: missing a phishing email might lead to a security breach.
Precision and recall usually trade off against each other. Raising the threshold for labeling something as spam typically increases precision (fewer false alarms) but decreases recall (more spam gets through); lowering the threshold typically does the opposite. F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It provides a single number that balances both concerns. Because the harmonic mean is pulled toward the smaller of the two values, a high F1 such as 0.85 is only possible when both precision and recall are reasonably high. The harmonic mean penalizes extreme imbalances: a model with 1.0 precision and 0.01 recall gets an F1 of only about 0.02, reflecting that it is practically useless despite its perfect precision.
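The definitions and the threshold effect can be checked on a small example. This is a sketch under stated assumptions: the scores and labels are invented, and `harmonic_f1`, `precision_recall_f1`, and `predict` are hypothetical helper names.

```python
def harmonic_f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1, where 1 marks the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_f1(precision, recall)

def predict(scores, threshold):
    """Label an email as spam (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up spam scores and true labels for eight emails.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

low = precision_recall_f1(labels, predict(scores, 0.5))
high = precision_recall_f1(labels, predict(scores, 0.85))
extreme = harmonic_f1(1.0, 0.01)

print([round(x, 3) for x in low])   # [0.8, 0.8, 0.8]
print([round(x, 3) for x in high])  # [1.0, 0.4, 0.571]
print(round(extreme, 4))            # 0.0198
```

On this data, raising the threshold from 0.5 to 0.85 removes the one false alarm but lets three spam messages through, so precision rises while recall and F1 fall.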
For multi-class classification, precision, recall, and F1 can be computed per class (how well does the model handle each category?) and then averaged. Macro-average computes the metric for each class independently and takes the unweighted average, treating all classes equally regardless of frequency. Micro-average pools all predictions across classes before computing the metric, weighting each prediction equally (which means frequent classes dominate). Weighted-average weights each class by its frequency in the dataset. The choice between these averaging methods depends on whether rare classes are as important as frequent ones.
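The three averaging schemes can be compared directly. This is a minimal sketch for single-label multi-class data; the function names and the two-class "news"/"sports" example are assumptions for illustration.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, micro, weighted

# Hypothetical data: 8 "news" and 2 "sports"; one "sports" item is mislabeled.
y_true = ["news"] * 8 + ["sports"] * 2
y_pred = ["news"] * 8 + ["news", "sports"]

per_class, macro, micro, weighted = averaged_f1(y_true, y_pred)
print({c: round(v, 3) for c, v in per_class.items()})  # {'news': 0.941, 'sports': 0.667}
print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.804 0.9 0.886
```

The rare "sports" class drags the macro average down to about 0.80, while the micro average stays at 0.90. In the single-label setting shown here, micro-averaged F1 equals plain accuracy.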
The Confusion Matrix
A confusion matrix displays all prediction outcomes in a grid. For binary classification, it shows true positives (correctly predicted positive), true negatives (correctly predicted negative), false positives (incorrectly predicted positive), and false negatives (incorrectly predicted negative). For multi-class classification, the matrix has one row per true class and one column per predicted class, with cell (i,j) showing how many examples of class i were predicted as class j. The diagonal shows correct predictions; off-diagonal cells show errors and reveal which classes the model confuses with each other.
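A confusion matrix in the row-is-true, column-is-predicted layout described above takes only a few lines to build. The sketch below uses an invented three-class sentiment example; `confusion_matrix` here is a hand-written helper, not a library call.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of examples of class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical sentiment predictions.
labels = ["neg", "neu", "pos"]
y_true = ["pos", "pos", "neg", "neu", "neu", "neg"]
y_pred = ["pos", "neu", "neg", "neu", "pos", "neg"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# neg [2, 0, 0]
# neu [0, 1, 1]
# pos [0, 1, 1]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct)  # 4
```

The off-diagonal cells show that this toy model never confuses "neg" with anything, but mixes up "neu" and "pos" in both directions.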
Sequence Labeling Metrics
Named entity recognition is evaluated at the entity level, not the token level. Part-of-speech tagging, by contrast, is usually scored with per-token accuracy, because every token receives exactly one tag. Under entity-level evaluation, an entity is considered correctly predicted only if both its boundaries and its type match the gold standard exactly. If the gold standard labels "New York City" as a location and the model labels only "New York" as a location, this counts as both a false positive ("New York") and a false negative ("New York City"). Partial matches receive no credit under strict evaluation. The CoNLL evaluation script (conlleval), the long-standing reference for NER evaluation, computes entity-level precision, recall, and F1.
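Strict entity-level matching can be sketched by treating each entity as a (start, end, type) tuple and comparing sets. This is an illustration of the matching rule only, not the CoNLL script itself; the function name, the token offsets (end-exclusive), and the example sentence are assumptions.

```python
def strict_entity_scores(gold_entities, predicted_entities):
    """Entity-level precision, recall, and F1 under exact boundary and type match."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    tp = len(gold & predicted)   # exact span and type match
    fp = len(predicted - gold)   # predicted entities with no exact gold match
    fn = len(gold - predicted)   # gold entities the model did not reproduce
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Tokens: ["Flights", "to", "New", "York", "City", "for", "Ada", "Lovelace"]
gold = [(2, 5, "LOC"), (6, 8, "PER")]       # "New York City", "Ada Lovelace"
predicted = [(2, 4, "LOC"), (6, 8, "PER")]  # "New York" (truncated), "Ada Lovelace"

scores = strict_entity_scores(gold, predicted)
print(scores["tp"], scores["fp"], scores["fn"])  # 1 1 1
print(scores["precision"], scores["recall"], scores["f1"])  # 0.5 0.5 0.5
```

The truncated "New York" span is penalized twice, once as a false positive and once as a missed gold entity, which is exactly the behavior described above.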
Span-level F1 on the CoNLL-2003 English NER benchmark provides a standard point of comparison. Models from 2003 achieved F1 around 88%. BiLSTM-CRF models from 2016 achieved around 91%. Fine-tuned BERT models from 2018 achieved around 93%. Current state-of-the-art exceeds 94%. These numbers represent performance on well-edited news text; performance on informal text, domain-specific text, and low-resource languages is substantially lower.
Generation Metrics
BLEU
BLEU (Bilingual Evaluation Understudy) is the standard metric for machine translation. It measures the overlap of n-grams (sequences of 1, 2, 3, and 4 words) between the machine-generated translation and one or more human reference translations. BLEU is defined on a 0 to 1 scale but is usually reported multiplied by 100, so scores range from 0 to 100, where higher is better. As a rule of thumb, a BLEU score of 40+ indicates fluent, accurate translation for European language pairs, although absolute scores depend heavily on the test set, the number of references, and the tokenization, so they are only comparable within the same setup. Professional human translations are often said to score roughly 50 to 60 when compared to other professional translations, which gives a rough human ceiling.
BLEU has well-known limitations. It only rewards exact n-gram matches, so "automobile" versus "car" receives zero credit despite being semantically equivalent. It cannot evaluate whether the translation preserves meaning, only whether it uses the same words as the reference. Short translations receive a brevity penalty, but the penalty is coarse. Despite these limitations, BLEU correlates reasonably well with human quality judgments at the system level (comparing one MT system to another) and remains the most reported metric in machine translation research due to its simplicity, speed, and standardization.
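The mechanics of BLEU (clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty) can be sketched as follows. This is a simplified sentence-level version on whitespace-tokenized input with no smoothing, written for illustration; published scores are normally computed at the corpus level with a standardized tool such as sacreBLEU, and the function names here are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU on token lists, scaled to 0-100."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count at its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precision_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty, using the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return 100 * bp * math.exp(log_precision_sum / max_n)

reference = "the car is parked outside the house".split()
synonym = "the automobile is parked outside the house".split()

exact_score = bleu(reference, [reference])
synonym_score = bleu(synonym, [reference])
print(round(exact_score, 1))    # 100.0
print(round(synonym_score, 1))  # about 64
```

Swapping "car" for "automobile" changes nothing about the meaning, yet the score drops from 100 to roughly 64, because every n-gram that contains the substituted word stops matching.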
ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard metric for text summarization. ROUGE-1 measures unigram overlap between the generated summary and reference summaries. ROUGE-2 measures bigram overlap. ROUGE-L measures the longest common subsequence. Unlike BLEU, which emphasizes precision (are the generated words present in the reference?), ROUGE emphasizes recall (are the reference words present in the generated text?). This makes sense for summarization: a good summary should cover the key information in the reference, even if it uses different wording for some of it.
ROUGE suffers from the same fundamental limitation as BLEU: it rewards surface-level word overlap rather than semantic similarity. A summary that paraphrases the reference using entirely different vocabulary would receive a low ROUGE score despite being an excellent summary. ROUGE also cannot detect factual errors: a summary that contains hallucinated facts may achieve high ROUGE if it otherwise overlaps well with the reference.
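The recall-oriented variants described above can be sketched in plain Python. This is a simplified illustration on whitespace tokens that reports recall only (ROUGE packages typically also report precision and F-measure); the function names and example sentences are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
close = "the cat lay on the mat".split()
paraphrase = "a feline rested upon a rug".split()

r1 = rouge_n_recall(close, reference, 1)
r2 = rouge_n_recall(close, reference, 2)
rl = rouge_l_recall(close, reference)
print(round(r1, 3), round(r2, 3), round(rl, 3))  # 0.833 0.6 0.833
print(rouge_n_recall(paraphrase, reference, 1))  # 0.0
```

The paraphrase conveys the same event as the reference but shares no words with it, so it scores zero, which is the surface-overlap limitation in its most extreme form.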
BERTScore
BERTScore addresses the vocabulary mismatch problem by comparing generated and reference text in embedding space rather than at the surface level. Each token in the generated text is matched to its most similar token in the reference text using cosine similarity of their contextual BERT embeddings, and averaging these similarities gives a precision score. Matching in the opposite direction, from each reference token to its most similar generated token, gives a recall score, and the two are combined into an F1, which is usually the value reported. "Automobile" and "car" receive a high similarity score because their BERT embeddings are close, even though the words look nothing alike on the surface. BERTScore has been reported to correlate more strongly with human quality judgments than BLEU or ROUGE on many tasks, but it is slower to compute (it requires running BERT on every evaluation example) and less interpretable.
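The greedy matching step can be illustrated without running a neural model. In the sketch below, the three-dimensional vectors in `TOY_EMBEDDINGS` are invented stand-ins for contextual BERT embeddings, and the function names are hypothetical; only the matching and averaging logic is meant to mirror the description above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy best-match cosine similarity."""
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up vectors: "car" and "automobile" are deliberately placed close together.
TOY_EMBEDDINGS = {
    "the": (1.0, 0.0, 0.0),
    "car": (0.0, 1.0, 0.1),
    "automobile": (0.0, 1.0, 0.2),
    "stopped": (0.0, 0.0, 1.0),
}

candidate = ["the", "automobile", "stopped"]
reference = ["the", "car", "stopped"]

cand_vecs = [TOY_EMBEDDINGS[t] for t in candidate]
ref_vecs = [TOY_EMBEDDINGS[t] for t in reference]

precision, recall, f1 = greedy_match_scores(cand_vecs, ref_vecs)
exact_overlap = sum(t in reference for t in candidate) / len(candidate)

print(round(f1, 3))             # close to 1.0
print(round(exact_overlap, 3))  # 0.667
```

With these toy vectors the embedding-based score stays near 1.0, while exact token overlap gives the same sentence pair only two thirds credit.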
Perplexity
Perplexity measures how well a language model predicts a test dataset. It is the exponentiated average negative log-likelihood per token: a model that assigns high probability to the actual test tokens has low perplexity, indicating good language modeling. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options at each step. Lower perplexity is better. The GPT-2 paper reported zero-shot perplexities on WikiText-103 ranging from about 37.5 for its smallest model to about 17.5 for its largest (1.5B parameters). GPT-3 reported about 20.5 on Penn Treebank, a different benchmark, so the two figures are not directly comparable. Later, larger models have pushed these numbers lower still, but perplexity is only comparable between models evaluated on the same test set with the same tokenization.
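The definition translates directly into code. This is a minimal sketch that takes the probabilities a model assigned to each actual test token; the function name and the probability values are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 20 options at every step.
uniform = perplexity([1 / 20] * 5)

# A model that assigns higher probability to the actual tokens.
confident = perplexity([0.5, 0.25, 0.125, 0.5])

print(round(uniform, 2))    # 20.0
print(round(confident, 2))  # 3.36
```

The first case recovers the "choosing among 20 equally likely options" reading exactly; the second shows that higher assigned probabilities give lower perplexity.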
Perplexity is specific to language modeling and does not directly measure downstream task performance. A model with lower perplexity does not necessarily produce better summaries, translations, or classifications. It measures the model's understanding of language distribution, not its ability to use that understanding for specific tasks. However, perplexity correlates with downstream performance strongly enough that it remains the primary metric for comparing language models during pre-training.
Human Evaluation
Human evaluation remains the gold standard for assessing NLP output quality because automated metrics capture only a fraction of what makes text good. Human evaluators can assess fluency (does the text read naturally?), coherence (does it make sense?), relevance (does it address the intended topic?), factual accuracy (are the stated facts correct?), and appropriateness (is the tone and style suitable?) in ways that no automated metric fully captures.
Direct Assessment asks evaluators to rate output on a continuous scale (typically 0 to 100) for specific quality dimensions. Comparative evaluation asks evaluators to rank or compare two or more outputs, determining which is better. A/B testing presents two model outputs side by side and asks which is preferred overall. Likert-scale evaluation asks evaluators to rate on a fixed scale (1 to 5) for specific criteria. Each method has tradeoffs: direct assessment provides more granular signal but is more cognitively demanding, while comparative evaluation is easier for evaluators but provides only relative rather than absolute quality information.
Inter-annotator agreement measures how consistently different evaluators rate the same output. Cohen's kappa and Krippendorff's alpha quantify agreement beyond chance. High agreement (kappa > 0.8) indicates that the evaluation criteria are clear and the task is well-defined. Low agreement (kappa < 0.4) suggests that the criteria are ambiguous, the task is inherently subjective, or the evaluators need better training. For creative text evaluation, agreement is inherently lower because quality judgments are more subjective.
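Cohen's kappa, which applies to two raters, compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below uses invented ratings, and `cohens_kappa` is a hand-written helper rather than a library function.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (observed - expected) / (1 - expected) for two raters."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two evaluators on ten model outputs.
rater_a = ["good"] * 5 + ["bad"] * 5
rater_b = ["good"] * 4 + ["bad"] + ["bad"] * 4 + ["good"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.6
```

The raters agree on 80% of items, but half of that agreement would be expected by chance, so kappa is 0.6 rather than 0.8.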
Choosing the Right Metric
The right metric depends on what matters for your application. For spam detection, precision matters if false positives are costly (legitimate emails lost), recall matters if false negatives are costly (spam getting through), and F1 balances both. For medical text classification, recall for detecting critical conditions should be weighted heavily because missing a diagnosis is far worse than a false alarm. For machine translation, BLEU provides a standardized comparison across systems, but a learned metric such as COMET or human evaluation should supplement it for final quality assessment.
For generation tasks, no single metric is sufficient. ROUGE measures content coverage, BERTScore measures semantic similarity, perplexity measures language modeling quality, and human evaluation measures overall quality. Reporting multiple metrics provides a more complete picture than any single number. The NLP community's increasing emphasis on multi-metric evaluation, and the growing expectation that generation work include human evaluation, reflect the recognition that automated metrics alone cannot capture what makes generated text good.
Different NLP tasks require different metrics: F1 score for classification, BLEU for translation, ROUGE for summarization, perplexity for language models, with human evaluation remaining essential for any task where automated metrics cannot fully capture output quality.