Named Entity Recognition Explained
What NER Does and Why It Matters
Consider the sentence: "Apple announced that Tim Cook would visit the European Commission in Brussels on March 15, 2026 to discuss a $2.5 billion agreement." An NER system identifies "Apple" as an organization, "Tim Cook" as a person, "European Commission" as an organization, "Brussels" as a location, "March 15, 2026" as a date, and "$2.5 billion" as a monetary value. Each entity is both detected (its boundaries are identified) and classified (its type is determined). This extraction converts a free-text sentence into structured data that can be stored in a database, linked to other information about the same entities, and queried programmatically.
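As a rough illustration of what "structured data" can mean here, the sketch below stores each entity from the example sentence as a record with a type and character offsets. The `Entity` class and the hand-written `ANNOTATIONS` list are assumptions made for this example; they stand in for the output of a real NER system, which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface string of the mention
    label: str  # entity type, e.g. "ORG" or "PER"
    start: int  # character offset where the mention begins
    end: int    # character offset one past the last character


SENTENCE = (
    "Apple announced that Tim Cook would visit the European Commission "
    "in Brussels on March 15, 2026 to discuss a $2.5 billion agreement."
)

# Hand-written annotations standing in for the output of an NER system.
ANNOTATIONS = [
    ("Apple", "ORG"),
    ("Tim Cook", "PER"),
    ("European Commission", "ORG"),
    ("Brussels", "LOC"),
    ("March 15, 2026", "DATE"),
    ("$2.5 billion", "MONEY"),
]


def to_entities(sentence, annotations):
    """Attach character offsets to each (text, label) pair, scanning left to right."""
    entities = []
    cursor = 0
    for text, label in annotations:
        start = sentence.index(text, cursor)
        end = start + len(text)
        entities.append(Entity(text, label, start, end))
        cursor = end
    return entities


entities = to_entities(SENTENCE, ANNOTATIONS)
for e in entities:
    print(f"{e.label:<6}{e.start:>4}{e.end:>4}  {e.text}")
```

Records of this shape can be inserted into a database table or joined against other records about the same entity, which is what makes the extracted information queryable.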
NER is essential because the world's information is overwhelmingly stored as unstructured text. Medical records contain patient names, drug names, dosages, and diagnoses embedded in clinical notes. Legal documents mention parties, dates, jurisdictions, and monetary amounts throughout thousands of pages. News articles reference people, organizations, and locations in every paragraph. Without NER, extracting this information requires human readers, which does not scale. With NER, organizations can automatically process millions of documents, extracting structured entities that power downstream analytics, compliance monitoring, and decision-making.
The standard entity types trace back to the MUC (Message Understanding Conference) and CoNLL (Conference on Computational Natural Language Learning) shared tasks. The CoNLL-2003 benchmark uses four types: PER (person), ORG (organization), LOC (location), and MISC (miscellaneous). Many applications extend this set with more specific types: DATE, TIME, MONEY, PERCENT, PRODUCT, EVENT, LAW, LANGUAGE, and domain-specific types like GENE, PROTEIN, DRUG, or DISEASE in biomedical NER. The OntoNotes benchmark defines 18 entity types, and some specialized systems recognize hundreds of fine-grained types.
The BIO Tagging Scheme
NER is formulated as a sequence labeling task: each token in the input receives a label. The most common labeling scheme is BIO (Beginning, Inside, Outside). A token that starts an entity is labeled B-TYPE, where TYPE is the entity class. Tokens that continue an entity are labeled I-TYPE. Tokens that are not part of any entity are labeled O. For the sentence "Tim Cook visited Brussels," the labels would be: Tim=B-PER, Cook=I-PER, visited=O, Brussels=B-LOC.
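A minimal sketch of the encoding direction, assuming entities are given as token-index spans: the helper `spans_to_bio` (a name chosen for this example) writes a B- tag on the first token of each span, I- tags on the rest, and leaves everything else as O.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, type) token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


tokens = ["Tim", "Cook", "visited", "Brussels"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}={tag}")
```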
The BIO scheme handles multi-word entities and adjacent entities of the same type. In "New York Times," the B tag on "New" marks the start of an entity, and the I tags on "York" and "Times" mark its continuation, so the three tokens are read as one entity rather than two ("New York" + "Times"). The B/I distinction matters most when two entities of the same type appear consecutively: without it, a run of same-type tokens could not be split reliably. In "Washington Lincoln," referring to two people, the B tag on "Lincoln" signals the start of a new entity rather than a continuation of "Washington."
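The decoding direction can be sketched in a few lines. The function below, `bio_to_spans` (an illustrative name, not a library function), groups a BIO tag sequence back into entities; treating a stray I- tag with no open entity as the start of one is a lenient assumption of this sketch.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into a list of (entity_text, entity_type) pairs."""
    spans = []
    start, etype = None, None
    # The extra "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and start is not None and tag[2:] == etype
        if continues:
            continue
        if start is not None:
            spans.append((" ".join(tokens[start:i]), etype))
            start, etype = None, None
        if tag.startswith(("B-", "I-")):
            # A B- tag always opens a new entity; a stray I- tag is treated the same way.
            start, etype = i, tag[2:]
    return spans


print(bio_to_spans(["New", "York", "Times"], ["B-ORG", "I-ORG", "I-ORG"]))
print(bio_to_spans(["Washington", "Lincoln"], ["B-PER", "B-PER"]))
print(bio_to_spans(["Washington", "Lincoln"], ["B-PER", "I-PER"]))
```

The second and third calls differ only in the tag on "Lincoln": B-PER yields two people, I-PER yields a single two-token name.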
Extended schemes like BIOES add S (single-token entity) and E (end of entity) tags, providing more explicit boundary information. In practice, BIOES schemes produce slightly better accuracy on some benchmarks because the end tag gives the model an explicit signal about entity boundaries, but the improvement over BIO is typically small (less than 1 F1 point on CoNLL-2003).
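Because BIOES carries the same information as BIO, one can be derived from the other. The sketch below assumes a well-formed BIO sequence as input; `bio_to_bioes` is a name chosen for this example.

```python
def bio_to_bioes(tags):
    """Rewrite a well-formed BIO tag sequence in the BIOES scheme."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = next_tag == f"I-{etype}"
        if prefix == "B":
            # A B token with no continuation is a single-token entity.
            bioes.append(f"B-{etype}" if entity_continues else f"S-{etype}")
        else:
            # An I token with no continuation is the end of its entity.
            bioes.append(f"I-{etype}" if entity_continues else f"E-{etype}")
    return bioes


print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
print(bio_to_bioes(["B-ORG", "I-ORG", "I-ORG"]))
```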
How NER Systems Work
Rule-Based Approaches
The earliest NER systems used hand-crafted rules. Capitalized words following titles like "Mr." or "Dr." were tagged as persons. Words followed by "Inc.," "Corp.," or "Ltd." were tagged as organizations. Patterns like "January 15, 2026" matched date templates. Gazetteers, lists of known entities like country names, city names, and company names, provided lookup-based recognition. These systems could achieve reasonable precision (when they tagged something, it was usually correct) but suffered from low recall (they missed many entities that did not match their patterns). Maintaining and updating the rule sets as language evolved was labor-intensive.
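The precision and recall trade-off can be seen in a toy rule-based tagger. Everything in this sketch is an assumption for illustration: the three regular expressions, the two-entry gazetteer, the test sentence, and the gold annotations are invented here and do not reproduce any historical system.

```python
import re

# Hypothetical toy gazetteer; a real one would hold many thousands of names.
GAZETTEER = {"Brussels": "LOC", "Paris": "LOC"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
RULES = [
    # Capitalized words after a title are tagged as a person.
    (re.compile(r"\b(?:Mr|Dr|Ms)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"), "PER"),
    # Capitalized words ending in a company suffix are tagged as an organization.
    (re.compile(r"\b((?:[A-Z][A-Za-z]+\s)+(?:Inc|Corp|Ltd)\.)"), "ORG"),
    # "Month day, year" date template.
    (re.compile(rf"\b((?:{MONTHS})\s\d{{1,2}},\s\d{{4}})\b"), "DATE"),
]


def rule_based_ner(text):
    """Return a set of (mention, label) pairs found by the rules and the gazetteer."""
    found = set()
    for pattern, label in RULES:
        for match in pattern.finditer(text):
            found.add((match.group(1), label))
    for name, label in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", text):
            found.add((name, label))
    return found


def precision_recall(predicted, gold):
    """Exact-match precision and recall over (mention, label) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


TEXT = ("Dr. Jane Smith of Acme Corp. met Tim Cook in Brussels "
        "and Ghent on January 15, 2026.")
GOLD = {
    ("Jane Smith", "PER"), ("Acme Corp.", "ORG"), ("Tim Cook", "PER"),
    ("Brussels", "LOC"), ("Ghent", "LOC"), ("January 15, 2026", "DATE"),
}

predicted = rule_based_ner(TEXT)
precision, recall = precision_recall(predicted, GOLD)
print(sorted(predicted))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this sentence every prediction is correct, but "Tim Cook" (no title) and "Ghent" (not in the gazetteer) are missed, so precision is perfect while recall is not.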
Statistical Sequence Models
The shift to statistical NER began with Hidden Markov Models (HMMs) and peaked with Conditional Random Fields (CRFs) in the 2000s. CRFs model the probability of an entire label sequence given the input tokens, considering not just each token individually but the dependencies between adjacent labels. A CRF learns that I-PER is much more likely to follow B-PER than to follow B-ORG, and that B-PER is more likely after "Mr." than after "the." Features fed to the CRF include the token itself, its capitalization pattern, prefixes and suffixes, part-of-speech tag, and context words. CRFs achieved F1 scores around 89% on CoNLL-2003, the standard English NER benchmark, and remained the state of the art through the early 2010s.
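To make the role of label dependencies concrete, the sketch below runs Viterbi decoding, the standard way to find the highest-scoring label sequence in a linear-chain model. It covers decoding only, not training. The per-token scores and the transition penalty are hand-set numbers invented for this example; in a trained CRF they would come from learned feature weights.

```python
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]


def transition_score(prev, curr):
    """Hand-set stand-in for learned transition weights between adjacent labels."""
    if curr.startswith("I-"):
        etype = curr[2:]
        # I-X is only plausible directly after B-X or I-X.
        return 0.0 if prev in (f"B-{etype}", f"I-{etype}") else -10.0
    return 0.0


def viterbi(emissions):
    """emissions: one dict per token mapping label -> score. Returns the best label path."""
    # The position before the first token is treated as if it were labeled "O".
    best = [{lab: emissions[0].get(lab, 0.0) + transition_score("O", lab)
             for lab in LABELS}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for curr in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition_score(p, curr))
            scores[curr] = best[-1][prev] + transition_score(prev, curr) + em.get(curr, 0.0)
            pointers[curr] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented per-token scores for the tokens "Tim", "Cook", "spoke".
emissions = [
    {"B-ORG": 1.0, "B-PER": 0.8},
    {"I-PER": 2.0, "O": 0.5},
    {"O": 2.0},
]

greedy = [max(em, key=em.get) for em in emissions]
decoded = viterbi(emissions)
print("per-token argmax:", greedy)
print("viterbi:         ", decoded)
```

Choosing each label independently produces B-ORG followed by I-PER, which is not a valid sequence. Scoring the sequence as a whole lets the strong I-PER evidence on the second token pull the first token to B-PER.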
Neural NER
Neural NER models replaced hand-crafted features with learned representations. The BiLSTM-CRF architecture, which became dominant from 2015 to 2018, uses a bidirectional LSTM to produce contextual representations of each token, followed by a CRF layer that models label dependencies. The LSTM learns to extract relevant features automatically from the raw text, eliminating the need for feature engineering. Character-level representations, computed by a separate small LSTM or CNN over the characters of each token, capture morphological information like capitalization patterns and suffixes without explicit feature design.
Transformer-based NER, using BERT and similar pre-trained models, pushed performance further. The approach is straightforward: pass the input text through a pre-trained transformer, take the contextualized representation of each token from the final layer, and pass it through a classification head that predicts the BIO label. Fine-tuning BERT on the CoNLL-2003 NER dataset achieves F1 scores of roughly 92 to 93%, a clear improvement over BiLSTM-CRF models. The pre-trained transformer already encodes word meaning, syntactic structure, and entity-related patterns from its pre-training on billions of words of text, so relatively little NER-specific training data is needed to achieve strong performance.
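The classification head itself is small: a linear layer followed by a softmax over the label set, applied to each token vector independently. The sketch below shows only that step. The 3-dimensional "hidden states" and the weight matrix are made-up numbers chosen so the result is easy to follow; real token representations come from the transformer and have hundreds of dimensions.

```python
import math

LABELS = ["O", "B-PER", "I-PER", "B-LOC"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(hidden_states, weights, bias):
    """Apply a linear layer plus softmax to each token vector and pick the top label."""
    predictions = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(logits)
        predictions.append(LABELS[probs.index(max(probs))])
    return predictions


# Made-up token vectors for "Tim", "Cook", "visited", "Brussels".
hidden_states = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
# One weight row and one bias per label, in the same order as LABELS (also made up).
weights = [
    [0.0, 0.0, 0.0],  # O
    [2.0, 0.0, 0.0],  # B-PER
    [0.0, 2.0, 0.0],  # I-PER
    [0.0, 0.0, 2.0],  # B-LOC
]
bias = [0.5, 0.0, 0.0, 0.0]

predictions = classify_tokens(hidden_states, weights, bias)
print(predictions)
```

During fine-tuning, the head's weights and the transformer's own parameters are updated together so that the predicted labels match the annotated ones.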
Challenges in Named Entity Recognition
Ambiguity is pervasive in NER. "Washington" can be a person (George Washington), a state, a city, or a sports team. "Apple" can be a company or a fruit. "Jordan" can be a person, a country, or a brand. Context determines the correct interpretation, and modern transformer models handle these cases well for entities that appear frequently in the training data. But rare or novel entities remain challenging: a newly founded startup, a person who shares a name with a common word, or a location mentioned for the first time in the text.
Nested entities occur when one entity contains another. "The University of California, Berkeley" contains the entities "California" (location) and "Berkeley" (location) within the larger entity "University of California, Berkeley" (organization). Standard flat NER with BIO tagging cannot represent nested entities because each token receives only one label. Specialized nested NER approaches use span-based classification (enumerate all possible text spans and classify each one) or multi-layer tagging (separate tags for each level of nesting), but these are more complex and slower than flat NER.
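Span-based classification can be sketched as two steps: enumerate candidate spans, then label each one. In the sketch below, the `SPAN_LABELS` lookup table is a hypothetical stand-in for a trained span classifier, and the function names are chosen for this example.

```python
def enumerate_spans(tokens, max_len):
    """Yield (start, end_exclusive) for every span of at most max_len tokens."""
    for start in range(len(tokens)):
        longest_end = min(start + max_len, len(tokens))
        for end in range(start + 1, longest_end + 1):
            yield start, end


# Hypothetical lookup standing in for a trained span classifier.
SPAN_LABELS = {
    ("University", "of", "California", ",", "Berkeley"): "ORG",
    ("California",): "LOC",
    ("Berkeley",): "LOC",
}


def classify_spans(tokens, max_len=5):
    """Label every candidate span; overlapping and nested results are all kept."""
    found = []
    for start, end in enumerate_spans(tokens, max_len):
        label = SPAN_LABELS.get(tuple(tokens[start:end]))
        if label is not None:
            found.append((start, end, label))
    return found


tokens = ["University", "of", "California", ",", "Berkeley"]
candidates = list(enumerate_spans(tokens, max_len=5))
nested = classify_spans(tokens)
print(len(candidates), "candidate spans")
print(nested)
```

Because each span is labeled independently, the organization and the two locations inside it can coexist. The cost is the number of candidates: with no length cap, a sentence of n tokens has n(n+1)/2 spans, which is one reason these approaches are slower than flat tagging.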
Domain adaptation is a persistent practical challenge. An NER model trained on news text recognizes politicians, countries, and companies well but struggles with gene names in biomedical text or technical terms in legal documents. Biomedical NER is a substantial subfield in its own right, with specialized models, datasets, and evaluation metrics. Gene names are particularly difficult because some overlap with common English words ("hedgehog" is both an animal and a gene), many are short alphanumeric symbols (BRCA1, p53), new names are constantly being created as new genes are discovered, and naming conventions are inconsistent across organisms and research communities.
Low-resource and informal text degrades NER performance significantly. Social media text with inconsistent capitalization, abbreviations, hashtags, and non-standard grammar violates many of the patterns NER models learn from formal text. "omg just saw @elonmusk at the tesla factory in fremont" challenges every standard NER signal: no capitalization on "tesla" or "fremont," informal language, and platform-specific notation like @mentions. Models trained exclusively on news text can lose roughly 10 to 20 F1 points when applied to social media without adaptation.
NER in Practice
Search engines use NER to understand queries and documents. When a user searches for "Einstein Nobel Prize year," the search engine recognizes "Einstein" as a person and "Nobel Prize" as an entity, enabling it to find the specific fact rather than returning generic pages about Nobel Prizes. Knowledge panels, the structured information boxes that appear alongside search results, are populated by NER-extracted entities linked to knowledge graph entries.
Customer support systems use NER to extract product names, order numbers, dates, and account identifiers from customer messages, routing requests to the appropriate team and pre-populating support tickets with relevant information. Legal document review uses NER to identify parties, jurisdictions, dates, monetary amounts, and referenced statutes across thousands of documents during litigation discovery. Intelligence analysis uses NER to build entity networks from large collections of intercepted communications, identifying relationships between people, organizations, and locations.
In scientific research, NER extracts chemical compound names from journal articles, gene and protein names from biomedical literature, and geographical locations from ecological field reports. These extracted entities enable automated knowledge base construction, literature mining, and cross-referencing between studies that mention the same entities using different names or abbreviations.
NER converts unstructured text into structured entity data by classifying each token as part of a person, organization, location, or other entity type, with modern transformer models reaching F1 scores of roughly 93% on standard benchmarks such as CoNLL-2003.