Topic Modeling Explained
What Topic Models Discover
Given a collection of 10,000 news articles with no labels or categories, a topic model might discover that the articles contain themes like {election, candidate, voter, poll, campaign}, {stock, market, investor, earnings, trading}, {player, game, season, score, team}, and {temperature, storm, forecast, rainfall, drought}. These topics emerge from the data: the algorithm notices that "election" and "candidate" tend to appear together in the same documents, while "stock" and "investor" appear together in different documents. No one told the algorithm that politics and finance are separate topics. It discovered these themes by analyzing word co-occurrence patterns across thousands of documents.
Each document is represented as a mixture of topics. A single news article might be 60% politics, 30% economics, and 10% foreign policy, reflecting that it discusses the economic implications of a political decision involving international trade. This mixed-membership representation is more nuanced than assigning each document to a single category. Real documents often cover multiple themes, and topic models capture this naturally. The topic proportions for each document provide a compact representation of its content that can be used for search, recommendation, trend analysis, and visualization.
The practical value of topic modeling is making large text collections interpretable. A researcher studying 50,000 academic papers cannot read them all, but a topic model can reveal the major research themes, how those themes have evolved over time, which papers bridge multiple topics, and which topics are growing or declining. Market researchers analyzing 100,000 customer reviews can identify the main themes customers discuss (quality, pricing, shipping, customer service) and track how sentiment differs across these themes. Intelligence analysts can identify emerging topics in large collections of intercepted communications. The common pattern is converting an unmanageable volume of unstructured text into a structured thematic overview.
Latent Dirichlet Allocation (LDA)
LDA, introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, is the foundational topic model and remains the most widely used. LDA assumes a generative process for documents: an author first decides the topic proportions for the document (say 40% topic A and 60% topic B), then for each word in the document, first chooses a topic according to these proportions, then chooses a word according to the chosen topic's word distribution. LDA reverses this generative process: given the observed words, it infers the topic proportions for each document and the word distributions for each topic.
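The generative story above can be sketched in a few lines of Python. This is only an illustration of the sampling steps, not an inference algorithm. The topic names, word probabilities, and the sample_document helper are hypothetical and chosen for the example.

```python
import random

def sample_document(topic_proportions, topic_word_dists, length, rng):
    """Generate one document by following LDA's generative story."""
    topics = list(topic_proportions)
    weights = [topic_proportions[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's topic proportions.
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # Step 2: choose a word according to that topic's word distribution.
        vocab = list(topic_word_dists[topic])
        probs = [topic_word_dists[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words

# Hypothetical topics and proportions, chosen only for illustration.
topic_word_dists = {
    "A": {"election": 0.4, "candidate": 0.3, "voter": 0.3},
    "B": {"stock": 0.5, "market": 0.3, "investor": 0.2},
}
topic_proportions = {"A": 0.4, "B": 0.6}  # 40% topic A, 60% topic B

rng = random.Random(0)
document = sample_document(topic_proportions, topic_word_dists, 20, rng)
print(document)
```

Fitting LDA runs this story in reverse: only the words are observed, and the proportions and word distributions must be inferred.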
Mathematically, LDA places Dirichlet priors on both the document-topic distributions and the topic-word distributions. The Dirichlet distribution is a probability distribution over probability distributions, and its concentration parameter controls how concentrated or dispersed the sampled distributions are. The parameter alpha governs the document-topic mixtures, and a second parameter (usually written beta or eta) plays the same role for the topic-word distributions. A low alpha produces documents dominated by a few topics, while a high alpha produces documents that mix many topics evenly. Inference, the process of estimating the latent topic structure from observed data, typically uses either variational inference (fast, approximate) or Gibbs sampling (slower, but asymptotically exact given enough iterations). Both methods iteratively update word-to-topic assignments (sampled assignments in Gibbs sampling, soft assignments in variational inference) and refine the topic and document distributions until convergence.
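The effect of alpha can be seen by sampling topic mixtures directly. The sketch below draws from a symmetric Dirichlet by normalizing Gamma draws (a standard construction) and reports the average weight of the largest topic. The helper names, the choice of 10 topics, and the two alpha values are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet with concentration alpha."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k, n_samples, rng):
    """Average weight of the single largest topic across sampled mixtures."""
    shares = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)]
    return sum(shares) / n_samples

rng = random.Random(42)
low = mean_largest_share(alpha=0.1, k=10, n_samples=500, rng=rng)
high = mean_largest_share(alpha=10.0, k=10, n_samples=500, rng=rng)
print(f"alpha=0.1:  mean share of largest topic = {low:.2f}")
print(f"alpha=10.0: mean share of largest topic = {high:.2f}")
```

With a low alpha the largest topic takes most of the probability mass in a typical sample, while with a high alpha the mass is spread close to evenly across the 10 topics.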
The key hyperparameter is the number of topics k, which must be specified in advance. Choosing k requires balancing granularity (more topics provide finer distinctions) against interpretability (fewer topics are easier to understand and label). Common approaches to selecting k include running models with different k values and evaluating using coherence metrics (which measure whether the top words in each topic are semantically related), perplexity on held-out data (lower perplexity indicates better fit), and human judgment (domain experts inspect topics at various k values and choose the most interpretable set). For most collections, k between 10 and 100 produces useful results.
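As a rough sketch of the perplexity criterion, the function below computes perplexity as the exponential of the negative average log-likelihood per token. It assumes the topic proportions for the held-out documents have already been inferred by some fitted model. The data, the probability floor, and the function name are hypothetical.

```python
import math

def perplexity(docs, doc_topic, topic_word):
    """Perplexity = exp(-(total log-likelihood) / (number of tokens))."""
    log_likelihood = 0.0
    n_tokens = 0
    for doc_id, words in enumerate(docs):
        for word in words:
            # p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)
            p = sum(
                doc_topic[doc_id][t] * topic_word[t].get(word, 0.0)
                for t in range(len(topic_word))
            )
            # Floor the probability so an unseen word does not produce log(0).
            log_likelihood += math.log(max(p, 1e-12))
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical held-out documents.
held_out = [["election", "voter", "election"], ["stock", "market"]]

# A two-topic model whose topics match the documents.
two_topic = perplexity(
    held_out,
    doc_topic=[[0.9, 0.1], [0.1, 0.9]],
    topic_word=[{"election": 0.5, "voter": 0.5}, {"stock": 0.5, "market": 0.5}],
)

# A one-topic model that spreads probability evenly over the vocabulary.
one_topic = perplexity(
    held_out,
    doc_topic=[[1.0], [1.0]],
    topic_word=[{"election": 0.25, "voter": 0.25, "stock": 0.25, "market": 0.25}],
)
print(f"two topics: {two_topic:.2f}, one topic: {one_topic:.2f}")
```

In a k-selection experiment, the same calculation would be repeated for models fitted at each candidate k, alongside coherence scores and human inspection.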
Non-Negative Matrix Factorization (NMF)
NMF is a linear algebra approach to topic modeling that factorizes the document-term matrix into two non-negative matrices: a document-topic matrix and a topic-term matrix. The document-term matrix has one row per document and one column per word, with cell values representing word frequency or TF-IDF weight. NMF approximates this matrix as the product of a documents-by-topics matrix (how much each document relates to each topic) and a topics-by-terms matrix (how strongly each word is associated with each topic). The non-negativity constraint ensures that all values are zero or positive, so each document is built up additively from topics and each topic additively from words. This makes the factors more interpretable than those of methods that allow negative values, where components can cancel each other out.
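A minimal sketch of the factorization, using the multiplicative update rules of Lee and Seung for the squared-error objective. This is one common NMF algorithm, assumed here for illustration. It is written in plain Python for readability rather than speed, and the small count matrix and helper names are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction error V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H); entries stay non-negative.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T); entries stay non-negative.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical term counts for four documents.
terms = ["election", "candidate", "voter", "stock", "market", "investor"]
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 4, 2, 1],
    [0, 0, 0, 2, 3, 3],
]
W, H = nmf(V, k=2)
for topic_id, row in enumerate(H):
    top = sorted(range(len(terms)), key=lambda j: -row[j])[:3]
    print(topic_id, [terms[j] for j in top])
print(f"reconstruction error: {frobenius_error(V, W, H):.3f}")
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H never go negative, which is what keeps the topic representations additive.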
NMF is faster than LDA for large vocabularies and produces topics that are often more specific and interpretable, particularly for short text. It does not have a probabilistic interpretation like LDA, so it cannot estimate the uncertainty of topic assignments or generate synthetic documents. In practice, both LDA and NMF produce comparable results on most datasets, and the choice between them often comes down to computational constraints (NMF is faster) and whether a probabilistic framework is needed (LDA provides it).
Neural Topic Models
Neural topic models replace LDA's iterative posterior inference with neural networks trained to infer topic representations directly. The Neural Variational Document Model (NVDM) and its successors use variational autoencoders (VAEs) to learn topic representations. The encoder maps a document's bag-of-words representation to a latent topic vector, and the decoder reconstructs the word distribution from the topic vector. The latent vector serves the same role as LDA's topic proportions, but the neural network can learn non-linear relationships between words and topics that LDA's linear model cannot capture.
BERTopic, introduced in 2022, combines transformer embeddings with clustering to produce topics. Documents are encoded using a pre-trained sentence transformer, producing dense embeddings. UMAP (Uniform Manifold Approximation and Projection) reduces the dimensionality of these embeddings. HDBSCAN clustering groups similar document embeddings together. Each cluster is treated as a topic, and a class-based TF-IDF procedure extracts the most representative words for each cluster. BERTopic produces more coherent and interpretable topics than LDA on many benchmarks because it leverages the semantic understanding of pre-trained transformers rather than relying on word co-occurrence statistics alone.
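The final step, extracting representative words per cluster, can be sketched without any embedding or clustering library. The example below assumes cluster labels have already been produced by the earlier steps and uses a simplified class-based TF-IDF weight, tf(t, c) * log(1 + A / tf(t)), where A is the average number of words per cluster. It uses raw counts and whitespace tokenization, so it is an approximation of the idea rather than BERTopic's own implementation. The documents, labels, and function name are hypothetical.

```python
import math
from collections import Counter

def class_tfidf(docs, labels, top_n=2):
    """Return the top_n highest-weighted words for each cluster label."""
    # Treat all documents in a cluster as one concatenated "class document".
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # Frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    top_words = {}
    for label, counts in class_counts.items():
        scores = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        top_words[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return top_words

# Hypothetical documents with cluster labels assumed to come from clustering.
docs = [
    "election candidate voter report",
    "candidate election campaign election",
    "stock market investor report",
    "stock earnings market stock",
]
labels = [0, 0, 1, 1]
top_words = class_tfidf(docs, labels)
print(top_words)
```

Words that are frequent inside one cluster but rare elsewhere score highest, while a word such as "report" that appears in both clusters is weighted down.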
Top2Vec is a similar approach that uses document embeddings, dimensionality reduction, and density-based clustering. As with BERTopic's clustering step, the number of topics is determined automatically rather than specified as a hyperparameter: the number of dense areas in the reduced embedding space determines the number of topics, which can then be merged hierarchically for coarser granularity. This removes the need for the k-selection experiments that LDA requires.
Evaluating Topic Quality
Topic coherence measures how semantically related the top words in a topic are. A coherent topic like {hospital, patient, doctor, treatment, diagnosis} is more interpretable than an incoherent topic like {hospital, market, algorithm, river, Tuesday}. One of the most widely used coherence metrics is C_V, which scores how often the top topic words co-occur within a sliding window over a reference corpus, combining normalized pointwise mutual information (NPMI) with cosine similarity between word context vectors. Higher coherence scores indicate more interpretable topics. Coherence is computed for each topic individually and averaged across all topics.
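A simplified NPMI-based coherence score can be computed from document-level co-occurrence counts. The sketch below averages NPMI over all pairs of top words. It is deliberately simpler than the metrics used in practice, which typically count co-occurrence within sliding windows. The reference corpus, the conventions for the edge cases, and the function name are assumptions.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention: the words never co-occur
        elif p12 == 1:
            scores.append(1.0)   # limiting value: the words always co-occur
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical tokenized reference corpus.
reference = [
    ["hospital", "patient", "doctor", "treatment"],
    ["patient", "doctor", "diagnosis"],
    ["hospital", "doctor", "treatment"],
    ["market", "stock", "investor"],
    ["river", "storm", "rainfall"],
    ["algorithm", "market"],
]
coherent = npmi_coherence(["hospital", "patient", "doctor"], reference)
incoherent = npmi_coherence(["hospital", "market", "river"], reference)
print(f"coherent topic: {coherent:.2f}, incoherent topic: {incoherent:.2f}")
```

NPMI ranges from -1 (the words never appear together) to 1 (they always appear together), so the medical topic scores well above the mixed one on this corpus.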
Topic diversity measures how different the topics are from each other. A model that produces 20 topics where 10 of them overlap heavily is less useful than a model with 15 distinct topics. Topic diversity is typically measured as the proportion of unique words across all topics' top-N word lists. A diversity of 1.0 means no word appears in more than one topic's top-N list, while low diversity indicates redundant topics.
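The diversity calculation is short enough to write out directly. The topic word lists below are hypothetical.

```python
def topic_diversity(topics, top_n=5):
    """Proportion of unique words among all topics' top_n word lists."""
    top_words = [word for topic in topics for word in topic[:top_n]]
    return len(set(top_words)) / len(top_words)

distinct_topics = [
    ["election", "candidate", "voter", "poll", "campaign"],
    ["stock", "market", "investor", "earnings", "trading"],
    ["player", "game", "season", "score", "team"],
]
redundant_topics = [
    ["election", "candidate", "voter", "poll", "campaign"],
    ["election", "candidate", "voter", "ballot", "campaign"],
]
print(topic_diversity(distinct_topics))   # 1.0
print(topic_diversity(redundant_topics))  # 0.6
```

The second model has only 6 unique words among its 10 top-word slots, which signals that its two topics largely describe the same theme.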
Human evaluation remains important despite automated metrics. Domain experts inspect topic word lists and representative documents, assign labels to each topic (a topic is useful if a human can label it concisely), and assess whether the topics align with known thematic structure in the data. The gap between coherence scores and human usefulness judgments can be substantial: a statistically coherent topic might not correspond to any meaningful concept, while a meaningful concept might be split across multiple topics. Automated metrics guide hyperparameter selection, but human evaluation determines whether the topics are actually useful for the intended application.
Applications
Academic research uses topic modeling extensively for literature analysis. Researchers studying the evolution of a scientific field apply topic models to all published papers in relevant journals over several decades, revealing how research themes have emerged, grown, merged, and declined. Citation networks combined with topic models show how ideas flow between research communities. Grant agencies use topic models to identify emerging research areas and ensure balanced funding across themes.
Social media analysis applies topic models to tweets, posts, and comments to understand public discourse. During elections, topic models reveal the issues that voters discuss most frequently and how these issues shift over time. Brand managers track the topics associated with their products, identifying emerging concerns before they become crises. Public health researchers analyze social media to detect outbreaks, track vaccine sentiment, and understand health information seeking behavior.
Legal discovery uses topic models to organize large document collections during litigation. A lawsuit might involve millions of emails and documents. Topic modeling identifies the major themes in the collection, enabling attorneys to prioritize review of documents related to relevant themes rather than reviewing every document sequentially. This can reduce review costs by 50% to 80% compared to sequential review.
Content recommendation systems use topic modeling to understand what a user is interested in based on the topics present in content they have previously consumed. If a user frequently reads articles about machine learning and Python programming, the system recommends other articles with high proportions of these topics. This topic-based approach complements collaborative filtering (recommending what similar users liked) by providing content-level understanding of user preferences.
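One simple way to implement this idea, sketched here under assumptions, is to average the topic proportions of the articles a user has read and rank unread articles by cosine similarity to that profile. The article names, the topic proportions, and the use of cosine similarity are illustrative choices, not a prescribed design.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(read_ids, article_topics, top_n=2):
    """Rank unread articles by similarity to the user's average topic mix."""
    read = [article_topics[article_id] for article_id in read_ids]
    # User profile: mean topic proportions over the articles already read.
    profile = [sum(column) / len(read) for column in zip(*read)]
    candidates = [a for a in article_topics if a not in read_ids]
    ranked = sorted(candidates,
                    key=lambda a: cosine(profile, article_topics[a]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical topic proportions, in the order:
# [machine learning, Python programming, politics, sports]
article_topics = {
    "ml_intro":       [0.8, 0.2, 0.0, 0.0],
    "py_tips":        [0.1, 0.9, 0.0, 0.0],
    "ml_in_python":   [0.5, 0.5, 0.0, 0.0],
    "election_recap": [0.0, 0.0, 0.9, 0.1],
    "match_report":   [0.0, 0.1, 0.0, 0.9],
}
recommendations = recommend(["ml_intro", "py_tips"], article_topics)
print(recommendations)
```

A user who has read the machine learning and Python articles is matched first with the article that mixes both topics, and the politics article ranks last because it shares no topic mass with the profile.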
Topic modeling discovers hidden themes in document collections without labeled data, representing each document as a mixture of topics and each topic as a cluster of related words, making large text collections interpretable and searchable by theme.