Question Answering Systems

Updated May 2026
Question answering (QA) systems automatically find or generate answers to natural language questions. Extractive QA highlights the answer span within a given passage, while generative QA produces answers in its own words. Modern QA systems combine retrieval, where relevant documents are found in a large corpus, with reading comprehension, where the model extracts or generates answers from those documents. This architecture powers search engine answer boxes, virtual assistants, enterprise knowledge bases, and conversational AI systems.

Types of Question Answering

Extractive QA

Extractive QA takes a question and a passage as input, then identifies the exact text span in the passage that answers the question. Given the passage "Albert Einstein was born in Ulm, Germany on March 14, 1879" and the question "Where was Einstein born?", the system highlights "Ulm, Germany" as the answer. The model does not generate any new text; it selects the start and end positions of the answer within the provided passage.

Technically, the model produces two probability distributions over the tokens in the passage: one for the answer start position and one for the answer end position. The answer span is the highest-scoring start-end pair in which the end position is at or after the start position (a one-token answer has the same start and end). Fine-tuning BERT for extractive QA requires only a small output layer on top of the pre-trained transformer, producing one start score and one end score per token, and training on question-passage-answer triples. This simple architecture performs very well: on the SQuAD 1.1 benchmark, the strongest fine-tuned BERT system in the original paper reached an F1 score of 93.2%, above the estimated human performance of 91.2%.
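The span selection described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function name best_span, the max_answer_len cap, and the per-token scores are all assumptions made up for the example. Scores are added, which corresponds to multiplying probabilities when the scores are log-probabilities.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring span with end >= start."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and the same length")
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ["Albert", "Einstein", "was", "born", "in", "Ulm", ",",
          "Germany", "on", "March", "14", ",", "1879"]
# Hypothetical scores, invented for illustration (not real model output).
start_scores = [0.1, 0.2, 0.0, 0.1, 0.3, 4.0, 0.0, 1.5, 0.0, 0.5, 0.2, 0.0, 0.3]
end_scores = [0.0, 0.3, 0.0, 0.1, 0.0, 1.0, 0.2, 4.5, 0.0, 0.1, 0.4, 0.0, 0.6]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Ulm , Germany
```

The exhaustive double loop is fine for a single passage; production systems usually restrict the search to the top few start and end candidates.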

SQuAD (Stanford Question Answering Dataset) has been the primary benchmark for extractive QA since 2016. SQuAD 1.1 contains 107,785 question-answer pairs based on Wikipedia articles, where every question has an answer in the passage. SQuAD 2.0 added 53,775 unanswerable questions, where the passage does not contain the information needed to answer. This addition was critical because real-world QA systems must distinguish between questions they can answer and questions they cannot, rather than always extracting some span regardless of relevance. Models must learn to output "unanswerable" when the passage lacks the needed information.
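One common way to let an extractive model abstain, used in the original BERT approach to SQuAD 2.0, is to treat position 0 (a special token) as the "no answer" span and compare its score with the best real span. The sketch below assumes that convention; the function name, the threshold default, and the scores are hypothetical.

```python
from typing import List, Optional, Tuple


def predict_span_or_none(start_scores: List[float], end_scores: List[float],
                         threshold: float = 0.0) -> Optional[Tuple[int, int]]:
    """Return the best (start, end) span, or None to signal 'unanswerable'.

    Assumes index 0 is a special token whose scores represent 'no answer'.
    """
    null_score = start_scores[0] + end_scores[0]
    best_score = float("-inf")
    best_span = None
    n = len(start_scores)
    for s in range(1, n):          # skip index 0: it is the null position
        for e in range(s, n):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    # Answer only if the best span beats the null score by more than the threshold.
    if best_span is None or best_score - null_score <= threshold:
        return None
    return best_span


# Hypothetical scores for an answerable and an unanswerable question.
answerable = predict_span_or_none([0.5, 0.1, 3.0, 0.2], [0.4, 0.0, 0.5, 3.5])
unanswerable = predict_span_or_none([5.0, 0.1, 0.3, 0.2], [4.8, 0.2, 0.1, 0.4])
print(answerable, unanswerable)  # (2, 3) None
```

The threshold is typically tuned on a development set to trade off wrong answers against unnecessary abstentions.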

Generative QA

Generative QA produces answers in the model's own words rather than extracting verbatim text. This is necessary when the answer requires synthesis across multiple sentences, paraphrasing, or reasoning that goes beyond what any single text span contains. The question "How does photosynthesis work?" requires a synthesized explanation, not a highlighted span. Generative QA models use encoder-decoder architectures (like T5 or BART) or decoder-only architectures (like GPT) to produce free-text answers conditioned on the question and any provided context.

The boundary between generative QA and conversational AI has blurred substantially since the advent of large language models. When a user asks ChatGPT, Claude, or Gemini a factual question, the system is performing generative QA, producing an answer from its parametric knowledge (information encoded in model weights during pre-training) or from retrieved context (documents fetched by a retrieval system and included in the prompt). The quality of answers depends on the model's training data, the accuracy of its parametric knowledge, and whether retrieval augmentation provides relevant, up-to-date evidence.

Open-Domain vs. Closed-Domain QA

Closed-domain QA, as the term is used here, takes both a question and a relevant passage as input; this task is more widely known as reading comprehension, and "closed-domain" is sometimes used instead for QA restricted to one subject area. The model only needs to find or generate the answer from the provided text. Open-domain QA takes only a question as input, without a pre-selected passage. The system must first identify which documents from a large corpus are relevant, then extract or generate the answer from those documents. This two-stage "retriever-reader" architecture addresses a much harder task, because the retrieval step must find the right information among potentially billions of documents.
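The retriever-reader split can be sketched as two pluggable functions. Everything here is a toy stand-in chosen for illustration: the three-passage corpus, the word-overlap retriever, and the reader that simply returns the top passage are assumptions, where a real system would use a trained retriever and a trained reader model.

```python
import re
from typing import Callable, List

# Hypothetical three-passage corpus, for illustration only.
CORPUS = [
    "Albert Einstein was born in Ulm, Germany on March 14, 1879.",
    "Photosynthesis converts light energy into chemical energy.",
    "Water reaches 100 degrees Celsius at standard pressure.",
]

Retriever = Callable[[str, int], List[str]]
Reader = Callable[[str, List[str]], str]


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_retriever(question: str, k: int) -> List[str]:
    """Toy retriever: rank passages by the number of words shared with the question."""
    q = tokenize(question)
    ranked = sorted(CORPUS, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def first_passage_reader(question: str, passages: List[str]) -> str:
    """Placeholder reader: a real reader would extract or generate an answer."""
    return passages[0] if passages else "no answer found"


def answer_open_domain(question: str, retrieve: Retriever, read: Reader,
                       k: int = 2) -> str:
    passages = retrieve(question, k)   # stage 1: find candidate passages
    return read(question, passages)    # stage 2: answer from those passages


result = answer_open_domain("Where was Einstein born?",
                            overlap_retriever, first_passage_reader)
print(result)
```

Keeping the two stages behind simple function types makes it easy to swap either one without touching the other.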

The retriever component in open-domain QA uses either sparse retrieval (TF-IDF or BM25 keyword matching) or dense retrieval (embedding-based similarity search). Dense retrieval encodes the question and every passage in the corpus as dense vectors using a neural encoder, then finds the passages whose vectors are most similar to the question vector. Dense Passage Retrieval (DPR), introduced by Facebook AI in 2020, showed that dense retrieval can substantially outperform BM25 on most of the open-domain QA benchmarks it was evaluated on, because it can match questions to passages that are semantically relevant even when they share few keywords. For example, the question "What is the boiling point of water?" can match the passage "water reaches 100 degrees Celsius at standard pressure" even though the passage does not contain the word "boiling."
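To make the keyword-matching limitation concrete, here is a small BM25 scorer in plain Python. It is a sketch of one common BM25 variant (with a non-negative IDF term), not the exact formula of any specific search engine; the parameter defaults k1=1.5 and b=0.75 and the three passages are assumptions for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, passages: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each passage against the query with a basic BM25 formula."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many passages each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue  # terms absent from the passage contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


passages = [
    "Water reaches 100 degrees Celsius at standard pressure.",
    "The boiling point of a liquid depends on the surrounding pressure.",
    "Albert Einstein was born in Ulm, Germany.",
]
scores = bm25_scores("What is the boiling point of water?", passages)
for passage, s in zip(passages, scores):
    print(f"{s:.2f}  {passage}")
```

In this toy corpus the second passage outscores the first because it shares more words with the question, even though the first passage is the one that actually contains the answer. A dense retriever is intended to close exactly this gap.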

How QA Models Understand Questions

Understanding a question requires identifying what type of answer is expected. "When" questions expect dates or time expressions. "Where" questions expect locations. "How many" questions expect numbers. "Why" questions expect explanations or causal chains. "Who" questions expect person or organization names. This question type recognition is not explicitly programmed in modern models; transformer-based QA systems learn to distinguish question types implicitly during training. Attention lets the model relate the question words to spans in the passage that fit the expected answer type.
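For contrast, this is roughly what explicitly programmed question typing looks like. Modern transformer models do not work this way; the sketch only makes the mapping in the paragraph above concrete. The function name and the rule table are hypothetical.

```python
# Hand-written rules mapping question openers to expected answer types.
# Longer prefixes come first so "how many" is checked before shorter openers.
RULES = [
    ("how many", "number"),
    ("when", "date or time expression"),
    ("where", "location"),
    ("why", "explanation"),
    ("who", "person or organization"),
]


def expected_answer_type(question: str) -> str:
    """Return the expected answer type for a question, or 'unknown'."""
    q = question.strip().lower()
    for prefix, answer_type in RULES:
        if q.startswith(prefix):
            return answer_type
    return "unknown"


print(expected_answer_type("Where was Einstein born?"))      # location
print(expected_answer_type("How many moons does Mars have?"))  # number
```

Rules like these break easily on rephrasings such as "In which city was Einstein born?", which is one reason learned models replaced them.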

Multi-hop questions require reasoning across multiple pieces of information. "What country is the birthplace of the person who developed the theory of relativity?" requires first identifying the person (Albert Einstein), then finding the birthplace (Ulm, Germany), then determining the country (Germany). Each step depends on the output of the previous step. Multi-hop QA is significantly harder than single-hop QA, with current models achieving 60% to 75% accuracy on benchmarks like HotpotQA compared to 90%+ on single-hop benchmarks. The challenge is not just finding multiple facts but performing the correct reasoning chain that connects them.
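The dependency between hops can be illustrated with a tiny lookup table, where each step uses the previous step's output as its input. The fact table, relation names, and function name are hypothetical; real multi-hop systems retrieve and read text rather than query a hand-built dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical fact table keyed by (entity, relation).
FACTS: Dict[Tuple[str, str], str] = {
    ("theory of relativity", "developed_by"): "Albert Einstein",
    ("Albert Einstein", "born_in"): "Ulm",
    ("Ulm", "located_in_country"): "Germany",
}


def follow_chain(start_entity: str, relations: List[str]) -> List[str]:
    """Follow relations one hop at a time and return every entity visited."""
    entity = start_entity
    trace = [entity]
    for relation in relations:
        key = (entity, relation)
        if key not in FACTS:
            # A single missing hop breaks the whole chain.
            raise KeyError(f"no fact for {key}")
        entity = FACTS[key]
        trace.append(entity)
    return trace


trace = follow_chain("theory of relativity",
                     ["developed_by", "born_in", "located_in_country"])
print(" -> ".join(trace))
# theory of relativity -> Albert Einstein -> Ulm -> Germany
```

The KeyError path shows why multi-hop accuracy is lower than single-hop accuracy: an error at any hop propagates to the final answer.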

Questions requiring numerical reasoning present another challenge. "How many more goals did Ronaldo score than Messi in 2023?" requires finding two numbers in the context, performing subtraction, and formatting the result. Large language models handle simple arithmetic well but struggle with complex calculations, comparisons across large tables, and multi-step numerical reasoning. Augmenting QA systems with code execution capabilities (where the model writes and runs a calculation rather than performing arithmetic in its weights) has shown strong improvements for numerical questions.
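A minimal sketch of the code-execution idea: the model emits an arithmetic expression as text, and a small evaluator computes it instead of the model doing arithmetic itself. The evaluator below only accepts numbers and basic operators, so arbitrary code is rejected. The goal counts are placeholder values invented for the example, not real statistics, and the function name is an assumption.

```python
import ast
import operator

# Whitelisted binary operators.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr: str):
    """Evaluate an expression made only of numbers, + - * / and parentheses."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expr, mode="eval"))


# Placeholder numbers standing in for values found in the retrieved context.
goals_a, goals_b = 40, 30
model_written_expression = f"{goals_a} - {goals_b}"
difference = eval_arithmetic(model_written_expression)
print(difference)  # 10
```

Restricting the evaluator to a whitelist of node types is a deliberate choice here; running model-written code with eval or exec would require proper sandboxing.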

Retrieval-Augmented Generation for QA

Retrieval-augmented generation (RAG) has become the dominant architecture for production QA systems. The approach combines a retrieval component that finds relevant documents with a generative model that produces answers grounded in those documents. The retriever searches a document index (using dense or sparse methods) and returns the top-k most relevant passages (typically 3 to 10), which are concatenated with the question to form the generative model's input. The model then generates an answer that draws on the retrieved evidence.
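The concatenation step can be sketched as a prompt-building function. The prompt wording, the passage numbering scheme, and the function name are assumptions for illustration; real systems vary widely in how they format retrieved context.

```python
from typing import List


def build_rag_prompt(question: str, ranked_passages: List[str], k: int = 3) -> str:
    """Combine the top-k retrieved passages and the question into one prompt."""
    top = ranked_passages[:k]  # passages are assumed to be sorted best-first
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(top, start=1))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieval results, best match first.
passages = [
    "Albert Einstein was born in Ulm, Germany on March 14, 1879.",
    "Einstein developed the theory of relativity.",
    "Ulm is a city on the Danube.",
]
prompt = build_rag_prompt("Where was Einstein born?", passages, k=2)
print(prompt)
```

Numbering the passages gives the generator something to cite, which supports the source attribution that RAG systems often expose to users.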

RAG mitigates several problems with pure parametric QA (answering from model weights alone). Parametric knowledge has a training cutoff date: a model trained on data through 2025 cannot answer questions about 2026 events. Parametric knowledge is difficult to update: correcting a factual error requires retraining or fine-tuning the model. Parametric models hallucinate: they generate plausible-sounding but incorrect answers because the information was absent, distorted, or conflicting in the training data. RAG reduces all three problems by grounding answers in retrieved evidence that can be updated without retraining, verified against source documents, and refreshed with current information, although a model can still misread or ignore the evidence it is given.

The quality of a RAG system depends heavily on the retrieval component. If the retriever fails to find the relevant document, the generative model either falls back on parametric knowledge, which may be wrong, or states that it does not have enough information (if it has been trained or prompted to do so). Chunking strategy (how documents are split into passages for indexing), embedding model quality (how well semantic similarity is captured in the vector space), and re-ranking (scoring retrieved passages for relevance before sending them to the generator) are all critical engineering decisions, and they often affect end-to-end QA accuracy as much as, or more than, the choice of generative model.
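As one example of a chunking strategy, the sketch below splits a document into fixed-size word windows that overlap, so a sentence cut at one chunk boundary still appears whole in a neighbouring chunk. The function name and the default sizes are assumptions; production systems often chunk by tokens, sentences, or document structure instead.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of chunk_size words, sharing overlap words with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Larger chunks carry more context per passage but dilute the embedding; smaller chunks are more precise but may separate a question's answer from the context needed to interpret it.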

QA in Practice

Search engines have incorporated QA features prominently since Google's Featured Snippets launched in 2014. For factual queries, the search engine extracts an answer directly from a web page and displays it above the regular search results. This changed user behavior: for many factual questions, users get their answer without clicking any result. By 2026, AI-powered search features generate synthesized answers that draw on multiple sources, citing the sources alongside the generated response. This represents a shift from search-as-retrieval (find the right page) to search-as-QA (provide the answer directly).

Enterprise QA systems let employees ask questions about internal documentation, policies, and procedures in natural language. Instead of searching through a company wiki, an employee can ask "What is our policy on remote work for employees outside the US?" and receive a direct answer sourced from the relevant HR document. These systems use RAG architectures with the company's document corpus as the knowledge base. Security and access control add complexity: the QA system must only answer from documents the requesting user is authorized to see.
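The access-control requirement is often handled by filtering documents before retrieval, so restricted text never reaches the generator's prompt. The sketch below assumes a simple group-based permission model; the Document fields, group names, and sample documents are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # groups permitted to read this document


def visible_documents(docs: List[Document], user_groups: Set[str]) -> List[Document]:
    """Keep only documents that share at least one group with the user."""
    return [d for d in docs if d.allowed_groups & user_groups]


# Hypothetical corpus and user, for illustration only.
corpus = [
    Document("hr-001", "Remote work policy for all staff.", frozenset({"all-staff"})),
    Document("fin-007", "Unreleased quarterly forecast.", frozenset({"finance"})),
]
employee_groups = {"all-staff", "engineering"}

searchable = visible_documents(corpus, employee_groups)
print([d.doc_id for d in searchable])  # ['hr-001']
```

Filtering before retrieval, rather than after generation, avoids the risk that a restricted passage influences the answer even if it is never shown directly.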

Medical QA systems assist clinicians by answering questions about drug interactions, treatment protocols, and diagnostic criteria. PubMedQA and BioASQ provide benchmarks for biomedical QA, and specialized models fine-tuned on medical text achieve accuracy comparable to medical students on standardized exam questions. These systems must handle the critical distinction between informing clinicians (providing relevant evidence for their decision-making) and recommending treatment (which carries regulatory and liability implications). Most medical QA systems are designed as decision support tools that present evidence rather than make recommendations.

Key Takeaway

Question answering systems find or generate answers to natural language questions, with modern RAG architectures combining document retrieval with generative models to provide accurate, sourced, and up-to-date responses.