Information Extraction from Text

Updated May 2026

Information extraction (IE) is the NLP task of automatically converting unstructured text into structured data by identifying entities, relationships, and events mentioned in documents. Given a news article stating "Google acquired Fitbit for $2.1 billion in January 2021," an IE system extracts the structured fact: {acquirer: Google, target: Fitbit, price: $2.1B, date: 2021-01}. This capability powers knowledge graph construction, business intelligence, scientific literature mining, and any application that needs to convert human-readable documents into machine-queryable databases.

The Information Extraction Pipeline

Information extraction typically proceeds through a cascade of increasingly complex tasks. Named entity recognition (NER) identifies the entities mentioned in text: people, organizations, locations, dates, monetary values, and other types. Coreference resolution determines which mentions refer to the same entity: "Google," "the company," and "it" all refer to the same organization. Relation extraction identifies the relationships between entities: Google acquired Fitbit, the price was $2.1 billion, the date was January 2021. Event extraction identifies what happened, constructing structured representations of events with their participants, times, locations, and outcomes.

Each stage builds on the outputs of previous stages, and errors propagate forward. If NER fails to recognize "Fitbit" as an organization, the relation extraction system cannot identify the acquisition relationship, and the event extraction system cannot construct a complete event representation. This cascading error problem motivates two approaches: making each stage as accurate as possible through better models and more training data, and developing joint models that perform multiple extraction tasks simultaneously, allowing them to share information and correct each other's errors.
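The cascade and its failure mode can be illustrated with a deliberately tiny two-stage sketch. The gazetteer, the regular expressions, and the function names below are assumptions made for illustration; a real system would use trained models at each stage.

```python
import re

# Hypothetical gazetteer standing in for a trained NER model.
GAZETTEER = {"Google": "ORG", "Fitbit": "ORG"}
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)? (?:billion|million)"),
    "DATE": re.compile(rf"(?:{MONTHS}) \d{{4}}"),
}

def recognize_entities(text, gazetteer=GAZETTEER):
    """Stage 1 (NER): return (surface, type, offset) tuples in text order."""
    found = [(m.group(), etype, m.start())
             for name, etype in gazetteer.items()
             for m in re.finditer(re.escape(name), text)]
    found += [(m.group(), etype, m.start())
              for etype, pattern in PATTERNS.items()
              for m in pattern.finditer(text)]
    return sorted(found, key=lambda e: e[2])

def extract_acquisition(text, entities):
    """Stage 2: requires an ORG on each side of the trigger word 'acquired'."""
    trigger = text.find("acquired")
    orgs = [e for e in entities if e[1] == "ORG"]
    buyers = [e for e in orgs if e[2] < trigger]
    targets = [e for e in orgs if e[2] > trigger]
    if trigger < 0 or not buyers or not targets:
        return None
    record = {"acquirer": buyers[-1][0], "target": targets[0][0]}
    for surface, etype, _ in entities:
        if etype in ("MONEY", "DATE"):
            record[etype.lower()] = surface
    return record

text = "Google acquired Fitbit for $2.1 billion in January 2021."
full = extract_acquisition(text, recognize_entities(text))
# Simulate an NER miss by dropping "Fitbit" from the gazetteer.
degraded = extract_acquisition(text, recognize_entities(text, {"Google": "ORG"}))
```

With the full gazetteer the second stage fills every slot; with "Fitbit" missing it returns None, because the downstream stage has no way to recover an entity the upstream stage never produced.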

Relation Extraction

Relation extraction identifies semantic relationships between entities mentioned in text. Given a sentence and a pair of entities, the system predicts which relationship, if any, holds between them. Common relation types include "works for" (person, organization), "located in" (entity, location), "born in" (person, location), "founded" (person, organization), "married to" (person, person), and "subsidiary of" (organization, organization). The number of relation types ranges from a handful in simple schemas to thousands in comprehensive knowledge bases like Wikidata, which defines over 10,000 properties, many of them external identifiers rather than semantic relations.

Supervised relation extraction trains a classifier on annotated sentences where entity pairs and their relationships are labeled. The model learns to extract features from the text between and around the entities, including the path through the dependency parse tree, the words and their order between the entities, and the broader sentence context. Transformer-based relation extraction encodes the sentence with the entity positions marked by special tokens, then classifies the relationship using the entity representations from the final transformer layer. On the TACRED benchmark (41 relation types plus a "no_relation" class), state-of-the-art models achieve F1 scores around 75%, reflecting both the difficulty of the task and the label noise in the dataset.
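The entity-marking step can be sketched in a few lines. The marker strings "[E1]", "[/E1]", "[E2]", and "[/E2]" are one common convention and are an assumption here, as is the function name; the encoder and classifier that would consume the marked sequence are omitted.

```python
def mark_entities(tokens, subj_span, obj_span):
    """Wrap the subject and object token spans in marker tokens.

    Spans are (start, end) token indices with end exclusive and are
    assumed not to overlap.
    """
    (s0, s1), (o0, o1) = subj_span, obj_span
    marked = []
    for i, tok in enumerate(tokens):
        if i == s0:
            marked.append("[E1]")
        if i == o0:
            marked.append("[E2]")
        marked.append(tok)
        if i == s1 - 1:
            marked.append("[/E1]")
        if i == o1 - 1:
            marked.append("[/E2]")
    return marked

tokens = ["Google", "acquired", "Fitbit", "in", "2021", "."]
marked = mark_entities(tokens, subj_span=(0, 1), obj_span=(2, 3))
# marked == ["[E1]", "Google", "[/E1]", "acquired",
#            "[E2]", "Fitbit", "[/E2]", "in", "2021", "."]
```

In a full model, the hidden states at the marker positions (or at the entity tokens between them) would typically be concatenated and passed to a classification layer over the relation types.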

Distant supervision, introduced by Mintz et al. in 2009, generates training data automatically by aligning a knowledge base with text. If the knowledge base states that Barack Obama was born in Honolulu, then any sentence mentioning both "Barack Obama" and "Honolulu" is assumed to express the "born in" relation. This assumption is noisy, because many such sentences discuss Obama visiting Honolulu rather than being born there, but it generates large training sets without manual annotation. Noise-aware training methods partially address the problem. Multi-instance learning, for example, groups all sentences that mention an entity pair into a bag and assumes only that at least one of them expresses the relation, which yields models that are more robust than those trained on the noisy sentence-level labels directly.
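The labeling heuristic is simple enough to sketch directly. The toy knowledge base, the example sentences, and the function name are assumptions for illustration; string containment stands in for proper entity matching.

```python
# Toy knowledge base: (head, tail) -> relation.
KB = {("Barack Obama", "Honolulu"): "born_in"}

def distant_label(sentences, kb):
    """Label every sentence that mentions both entities of a KB fact."""
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled

sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama visited Honolulu during the holidays.",  # noisy match
    "Honolulu is the capital of Hawaii.",                  # no entity pair
]
training_examples = distant_label(sentences, KB)
```

Both of the first two sentences receive the "born_in" label, even though only the first one expresses it. That second example is exactly the kind of label noise the paragraph above describes.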

Event Extraction

Event extraction identifies structured representations of things that happened. An event has a type (acquisition, earthquake, election, arrest), a trigger (the word or phrase that signals the event, like "acquired" or "arrested"), and arguments (the entities that participate in the event with specific roles). From the sentence "Police arrested John Smith in Chicago on Tuesday for fraud," event extraction produces: {type: Arrest, trigger: "arrested", agent: "Police", defendant: "John Smith", location: "Chicago", time: "Tuesday", charge: "fraud"}.
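One way to hold such a record in code is a small data class. The class name, field names, and the grounding check below are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    type: str                                       # e.g. "Arrest"
    trigger: str                                    # word signalling the event
    arguments: dict = field(default_factory=dict)   # role -> text span

def is_grounded(event, sentence):
    """Check that the trigger and every argument occur verbatim in the text."""
    spans = [event.trigger, *event.arguments.values()]
    return all(span in sentence for span in spans)

sentence = "Police arrested John Smith in Chicago on Tuesday for fraud."
arrest = Event(
    type="Arrest",
    trigger="arrested",
    arguments={
        "agent": "Police",
        "defendant": "John Smith",
        "location": "Chicago",
        "time": "Tuesday",
        "charge": "fraud",
    },
)
```

A grounding check like `is_grounded(arrest, sentence)` is a cheap sanity test for extractor output, since every span an extractive system returns should appear in the source sentence.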

Event extraction is more complex than relation extraction because it involves identifying the event trigger, classifying its type, and assigning argument roles, all of which interact. The ACE (Automatic Content Extraction) 2005 dataset defines 33 event subtypes, grouped into 8 types, and 35 argument roles, and it has been the primary benchmark since its release. State-of-the-art models achieve F1 scores of roughly 70% to 75% on ACE event detection and 55% to 60% on argument role classification, reflecting the difficulty of this deeply structured extraction task.
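For reference, the F1 scores quoted for these benchmarks are the harmonic mean of precision and recall over exact matches. A minimal sketch, assuming each prediction is a (sentence id, token index, event type) tuple and using made-up gold and predicted sets:

```python
def precision_recall_f1(predicted, gold):
    """Micro-averaged precision, recall, and F1 over exact-match items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Illustrative data only: (sentence_id, trigger_token_index, event_type).
gold = {(0, 3, "Arrest"), (1, 5, "Attack"), (2, 1, "Elect"), (3, 4, "Die")}
predicted = {(0, 3, "Arrest"), (1, 5, "Die"), (2, 1, "Elect")}

p, r, f1 = precision_recall_f1(predicted, gold)
# Two of three predictions are correct (p = 2/3) and two of four gold
# triggers are found (r = 1/2), so f1 = 4/7.
```

A trigger found at the right position but assigned the wrong type counts as both a false positive and a false negative, which is one reason scores on this task stay well below those of simpler classification problems.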

Document-level event extraction extends beyond individual sentences to extract events described across multiple sentences or even paragraphs. A financial acquisition might be described across an entire news article: the acquirer is named in the first paragraph, the acquisition price in the third, and the expected completion date in the fifth. Document-level extraction must aggregate information from these scattered mentions into a single coherent event representation. Template filling, where a predefined template (e.g., an acquisition template with slots for buyer, target, price, date, and advisor) is populated from information anywhere in the document, is the classic formulation of this task.
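Template filling can be sketched as a merge over per-sentence slot candidates. The slot names follow the acquisition template mentioned above; the company names, the input format, and the "first mention wins" conflict rule are simplifying assumptions.

```python
SLOTS = ("buyer", "target", "price", "date", "advisor")

def fill_template(paragraph_extractions):
    """Merge slot candidates from each paragraph into a single template.

    The first value seen for a slot is kept; later conflicting values
    are ignored. Slots never mentioned stay None.
    """
    template = {slot: None for slot in SLOTS}
    for extraction in paragraph_extractions:
        for slot, value in extraction.items():
            if slot in template and template[slot] is None:
                template[slot] = value
    return template

# Hypothetical output of a sentence-level extractor, one dict per paragraph.
paragraphs = [
    {"buyer": "Acme Corp"},
    {},
    {"target": "Widget Inc", "price": "$40 million"},
    {},
    {"date": "Q3", "buyer": "the company"},
]
acquisition = fill_template(paragraphs)
```

Real systems replace the first-mention rule with coreference resolution and confidence-weighted aggregation, since a later mention is sometimes the more specific one.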

Open Information Extraction

Open information extraction (OpenIE) abandons predefined relation schemas and instead extracts all relational statements from text as (subject, predicate, object) triples. From "Einstein developed the theory of relativity in 1905," OpenIE extracts (Einstein, developed, the theory of relativity) and (Einstein, developed [in], 1905). OpenIE systems do not need training data specific to any relation type because they extract any relationship expressed in text, using syntactic patterns and dependency parsing to identify the subject, predicate, and object of each clause.
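The shape of OpenIE output can be shown with a toy surface-pattern extractor. This is a crude stand-in: the single regular expression below only handles simple "Subject verb-ed object [in YEAR]." clauses and is an assumption for illustration, whereas practical systems rely on part-of-speech patterns or dependency parses.

```python
import re

# Matches clauses like "Einstein developed the theory of relativity in 1905."
CLAUSE = re.compile(
    r"^(?P<subj>[A-Z]\w*) (?P<verb>\w+ed) (?P<obj>.+?)(?: in (?P<year>\d{4}))?\.$"
)

def extract_triples(sentence):
    """Return (subject, predicate, object) triples, or [] if no match."""
    match = CLAUSE.match(sentence)
    if not match:
        return []
    subj, verb = match["subj"], match["verb"]
    triples = [(subj, verb, match["obj"])]
    if match["year"]:
        triples.append((subj, f"{verb} in", match["year"]))
    return triples

triples = extract_triples("Einstein developed the theory of relativity in 1905.")
```

Note that no relation inventory appears anywhere in the code: the predicate is whatever verb the sentence happens to use.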

The advantage of OpenIE is coverage: it can extract relationships that no predefined schema anticipated. The disadvantage is that the extracted relations are not normalized. "Founded," "established," "started," and "created" might all describe the same relationship but appear as different predicates in OpenIE output. Without normalization, aggregating and querying the extracted information requires handling these synonyms, which partially reintroduces the schema definition problem that OpenIE was designed to avoid. Canonical form mapping, where extracted predicates are grouped into equivalence classes, addresses this but requires additional processing.
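Canonical form mapping amounts to rewriting each predicate to a representative of its equivalence class. In this sketch the mapping table is hand-written and the example triples are made up; in practice the classes might be derived by clustering predicate embeddings or by alignment to an existing schema.

```python
from collections import defaultdict

# Hand-written equivalence classes (an assumption for illustration).
CANONICAL = {
    "founded": "founded",
    "established": "founded",
    "started": "founded",
    "created": "founded",
}

def normalize(triples, mapping=CANONICAL):
    """Rewrite predicates to canonical form; unknown predicates pass through."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]

def group_by_predicate(triples):
    grouped = defaultdict(list)
    for subj, pred, obj in triples:
        grouped[pred].append((subj, obj))
    return dict(grouped)

raw = [
    ("Alice", "founded", "Acme"),
    ("Bob", "established", "Widget Co"),
    ("Carol", "started", "Gadget Ltd"),
    ("Dave", "joined", "Acme"),
]
grouped = group_by_predicate(normalize(raw))
```

After normalization a single lookup on "founded" returns all three founder facts, while the unmapped predicate "joined" is left untouched.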

Knowledge Graph Construction

The ultimate goal of many IE systems is constructing or extending a knowledge graph: a structured database of entities (nodes) and their relationships (edges). Google's Knowledge Graph, Wikidata, DBpedia, and proprietary enterprise knowledge graphs all use information extraction to populate their structures from text. The process involves extracting entities and relations from documents, resolving entities across sources (determining that "Microsoft Corp," "Microsoft Corporation," and "MSFT" all refer to the same entity), and integrating the extracted facts into the graph with appropriate confidence scores.
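The entity resolution step can be approximated with name normalization plus an alias table. The suffix list, the ticker alias, and the function names are assumptions for illustration; production systems typically add fuzzy matching and use attributes beyond the name.

```python
import re
from collections import defaultdict

LEGAL_SUFFIXES = {"corp", "corporation", "inc", "ltd", "llc"}
ALIASES = {"msft": "microsoft"}  # hypothetical ticker-to-name table

def canonical_key(name):
    """Lowercase, strip punctuation and legal suffixes, then apply aliases."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    key = " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)
    return ALIASES.get(key, key)

def resolve(mentions):
    """Cluster mentions that share a canonical key."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[canonical_key(mention)].append(mention)
    return dict(clusters)

clusters = resolve(["Microsoft Corp", "Microsoft Corporation", "MSFT", "Apple Inc."])
```

All three Microsoft variants land in one cluster, so facts extracted under any of those names can be attached to a single node in the graph.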

Entity linking connects extracted entity mentions to their corresponding entries in the knowledge graph. The mention "Apple" in a technology article should link to the Apple Inc. entity, not the fruit. Entity linking uses context (surrounding words), prior probability (how often the surface form "Apple" refers to Apple Inc. rather than to the fruit in a reference corpus), and entity embeddings (learned representations that capture entity properties) to resolve ambiguous mentions. Modern entity linking systems achieve over 90% accuracy on standard benchmarks, though performance drops for rare entities, emerging entities, and highly ambiguous mentions.
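A minimal sketch of how a prior and a context signal can be combined follows. The candidate table, the prior values, the keyword sets, and the scoring formula are all made up for illustration; real systems learn these quantities from data and usually compare dense embeddings rather than keyword overlap.

```python
# Hypothetical candidate table: surface form -> candidate entities.
CANDIDATES = {
    "Apple": [
        {"id": "Apple_Inc", "prior": 0.7,
         "keywords": {"iphone", "company", "technology", "shares"}},
        {"id": "Apple_fruit", "prior": 0.3,
         "keywords": {"fruit", "tree", "orchard", "pie"}},
    ]
}

def link(mention, context, candidates=CANDIDATES):
    """Pick the candidate maximizing prior * (1 + context keyword overlap)."""
    words = set(context.lower().split())

    def score(candidate):
        overlap = len(words & candidate["keywords"])
        return candidate["prior"] * (1 + overlap)

    return max(candidates[mention], key=score)["id"]

tech = link("Apple", "Apple unveiled a new iphone at its technology event")
food = link("Apple", "she baked an Apple pie with fruit from the orchard")
```

In the first context the prior and the context agree. In the second, three matching keywords outweigh the lower prior, so the mention links to the fruit.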

Knowledge graph completion uses the extracted and linked information to infer new facts. If the graph contains (Einstein, born_in, Ulm) and (Ulm, located_in, Germany), link prediction models can infer (Einstein, born_in_country, Germany) even if that fact was never explicitly extracted from text. Graph embedding models like TransE, RotatE, and ComplEx learn vector representations of entities and relations that capture the graph's structure, enabling prediction of missing edges with reasonable accuracy. This automated completion reduces the amount of text that must be processed to build a comprehensive knowledge base.
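TransE, for example, models a relation as a translation in vector space: a triple (h, r, t) is plausible when the embedding of h plus the embedding of r lies close to the embedding of t. The sketch below uses hand-picked two-dimensional vectors purely for illustration; real embeddings are learned from the graph and have many more dimensions.

```python
import math

# Hand-picked vectors (an assumption); in practice these are trained.
ENTITY = {"Ulm": (1.0, 0.0), "Germany": (1.0, 1.0), "France": (3.0, 1.0)}
RELATION = {"located_in": (0.0, 1.0)}

def transe_score(head, relation, tail):
    """Negative L2 distance between head + relation and tail.

    Scores closer to zero indicate a more plausible triple.
    """
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def predict_tail(head, relation):
    """Rank every other entity as a candidate tail and return the best."""
    candidates = [entity for entity in ENTITY if entity != head]
    return max(candidates, key=lambda tail: transe_score(head, relation, tail))

best = predict_tail("Ulm", "located_in")
```

Link prediction is then a ranking problem: score every candidate tail for a given head and relation, and propose the top-ranked ones as missing edges.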

Applications

Business intelligence uses IE to monitor competitors, track industry trends, and extract structured data from financial filings, news articles, and earnings call transcripts. A system might extract all acquisition events from business news, building a database of who acquired whom, for how much, and when. Analysts can then query and visualize that database without reading thousands of articles. SEC filings, patent applications, and regulatory documents are rich sources of structured information that IE systems can process at scale.

Biomedical literature mining extracts drug-gene interactions, protein-protein interactions, disease-symptom relationships, and clinical trial results from millions of published papers. The PubMed database contains over 36 million biomedical citations, far more than any researcher can read. IE systems that extract structured relationships from these papers enable drug repurposing (identifying existing drugs that might treat new diseases), adverse event detection (identifying side effects from case reports), and hypothesis generation (finding unexpected connections between biological entities).

Legal document analysis extracts clauses, obligations, parties, dates, and conditions from contracts. A corporate legal department might review thousands of vendor contracts to identify which ones contain non-standard liability terms, which expire within the next six months, or which reference specific regulatory requirements. Manual review of this volume is prohibitively expensive. IE systems that extract the relevant structured information enable automated compliance checking and contract management at scale.
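Once the fields are extracted, questions like "which contracts expire within the next six months" become simple queries over structured records. The record layout, the field names, and the 183-day approximation of six months below are assumptions for illustration.

```python
from datetime import date, timedelta

def expiring_within(contracts, today, days=183):
    """Return ids of contracts expiring between today and today + days."""
    cutoff = today + timedelta(days=days)
    return [c["id"] for c in contracts if today <= c["expires"] <= cutoff]

# Hypothetical records produced by an extraction system.
contracts = [
    {"id": "A", "expires": date(2026, 8, 15), "nonstandard_liability": True},
    {"id": "B", "expires": date(2027, 3, 1), "nonstandard_liability": False},
    {"id": "C", "expires": date(2026, 4, 1), "nonstandard_liability": True},
]

soon = expiring_within(contracts, today=date(2026, 5, 1))
flagged = [c["id"] for c in contracts if c["nonstandard_liability"]]
```

The expensive part is the extraction itself; after that, compliance checks reduce to filters of this kind.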

Key Takeaway

Information extraction converts unstructured text into structured data by identifying entities, their relationships, and the events they participate in, enabling automated construction of knowledge bases and databases from document collections.