Word Embeddings Explained

Updated May 2026

Word embeddings are dense numerical vectors that represent words in a continuous mathematical space where similar words are positioned near each other. They solve the fundamental problem of converting human language into numbers that preserve meaning, enabling neural networks to reason about words based on their semantic relationships rather than treating them as arbitrary symbols. Word2Vec, GloVe, and contextual models like BERT each represent a different generation of this technology.

The Problem Embeddings Solve

Before embeddings, the standard way to represent words numerically was one-hot encoding. Each word in the vocabulary gets a vector with a single 1 at its assigned position and 0s everywhere else. With a vocabulary of 50,000 words, each vector has 50,000 dimensions, with only one non-zero entry. This representation is catastrophically wasteful: it uses 50,000 numbers to encode something that could be expressed as a single integer. Worse, it contains no information about relationships between words. The vectors for "cat" and "kitten" are just as different as the vectors for "cat" and "helicopter": every pair of distinct words is the same distance apart and has zero cosine similarity in the one-hot space.

This means a model trained with one-hot vectors must learn from scratch that "cat" and "kitten" are related. Every fact the model learns about "cat" provides zero information about "kitten" because their representations share nothing. With 50,000 words in the vocabulary, there are roughly 1.25 billion distinct word pairs, and the model gets no head start on any of them. Embeddings solve this by representing words as dense vectors of a few hundred dimensions, where semantic similarity corresponds to geometric proximity. A model that learns something about "cat" automatically generalizes to "kitten" because their vectors point in similar directions.
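The contrast can be made concrete with a small sketch. The three-word vocabulary and the dense vectors below are hand-picked for illustration and are not learned embeddings; the point is only that one-hot vectors always have zero cosine similarity, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "kitten", "helicopter"]

def one_hot(word):
    """A vector with a single 1.0 at the word's vocabulary position."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-picked toy vectors (an assumption for illustration, not trained values).
dense = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "helicopter": [0.1, 0.0, 0.9],
}

one_hot_cat_kitten = cosine(one_hot("cat"), one_hot("kitten"))
one_hot_cat_helicopter = cosine(one_hot("cat"), one_hot("helicopter"))
dense_cat_kitten = cosine(dense["cat"], dense["kitten"])
dense_cat_helicopter = cosine(dense["cat"], dense["helicopter"])
```

In the one-hot space both similarities are exactly zero, so the representation cannot tell a related pair from an unrelated one. In the dense space the "cat"/"kitten" similarity is much higher than the "cat"/"helicopter" similarity.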

Word2Vec: Learning Meaning from Context

Word2Vec, published by Tomas Mikolov and colleagues at Google in 2013, was the breakthrough that made word embeddings a standard tool in NLP. The core insight is deceptively simple: words that appear in similar contexts tend to have similar meanings. The words "dog" and "cat" both appear near words like "pet," "food," "veterinarian," and "cute." The words "dog" and "quantum" rarely share context words. By training a neural network to predict either a word from its context (Continuous Bag of Words, or CBOW) or context words from a target word (Skip-gram), Word2Vec forces the network to represent words with similar contexts using similar vectors.
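As a rough illustration of what "predicting context words from a target word" means in terms of training data, the sketch below turns a sentence into (target, context) pairs. The function name `skipgram_pairs`, the example sentence, and the window size of 2 are assumptions made for this example; real implementations also apply subsampling and dynamic window sizes, which are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "i drink black coffee every morning".split()
pairs = skipgram_pairs(tokens, window=2)
```

Here "coffee" is paired with "drink," "black," "every," and "morning," but not with "i," which lies three positions away. CBOW uses the same windows but predicts the target from its combined context instead.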

The Skip-gram model works as follows. Given a target word like "coffee," the model tries to predict the words that appear within a window around it, say five positions on either side: "morning," "drink," "cup," "black," "espresso." The model has two parameter matrices: an input embedding matrix and an output embedding matrix. The input embedding for "coffee" is multiplied by the output embeddings for all vocabulary words, and the resulting scores are passed through a softmax function to produce a probability for each word being a neighbor. Training adjusts both matrices so that the predicted probabilities match the actual co-occurrence patterns in the training corpus. Because a softmax over the full vocabulary is expensive, practical implementations approximate it with negative sampling or hierarchical softmax.
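The forward pass described above can be sketched in a few lines. The four-word vocabulary and the two-dimensional parameter values are made up for illustration, and the full softmax is used for clarity even though it does not scale to large vocabularies.

```python
import math

vocab = ["coffee", "morning", "cup", "quantum"]

# Hypothetical toy parameters: one input vector and one output vector per word.
input_emb = {
    "coffee": [1.0, 0.5],
    "morning": [0.9, 0.3],
    "cup": [0.8, 0.6],
    "quantum": [-1.0, -0.4],
}
output_emb = {
    "coffee": [0.1, 0.1],
    "morning": [1.0, 0.2],
    "cup": [0.8, 0.6],
    "quantum": [-1.0, -0.5],
}

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_probs(target):
    """Probability of each vocabulary word appearing near `target`."""
    v = input_emb[target]
    scores = [sum(a * b for a, b in zip(v, output_emb[w])) for w in vocab]
    return dict(zip(vocab, softmax(scores)))

probs = context_probs("coffee")
```

With these toy values, "morning" and "cup" receive far more probability than "quantum." Training would nudge both matrices so that words actually observed in the window get higher probability.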

The resulting embeddings display remarkable algebraic properties. The most famous example is the analogy king - man + woman ≈ queen. Subtracting the "man" vector from "king" and adding the "woman" vector produces a point whose nearest word vector, once the three input words are excluded, is "queen." This works because the difference between "king" and "man" captures the concept of royalty, and adding that concept to "woman" reaches "queen." Similar relationships work for country-capital pairs (France - Paris + Tokyo ≈ Japan), verb tenses (walking - walked + swam ≈ swimming), and many other semantic relationships. These properties emerge from the training process without any explicit instruction, although not every analogy resolves this cleanly.
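The analogy lookup is a nearest-neighbor search around the point a - b + c. The sketch below uses hand-built three-dimensional vectors in which the first coordinate loosely stands for royalty and the second for maleness; these values and the function name `analogy` are assumptions chosen to make the arithmetic visible, not trained embeddings.

```python
import math

# Hand-built toy vectors (assumed values, not learned).
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "apple": [0.1, 0.5, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
```

Excluding the input words matters in practice: with real embeddings the unfiltered nearest neighbor of the target point is often one of the inputs themselves.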

Word2Vec embeddings are typically 100 to 300 dimensions. Training on a large corpus (billions of words) produces stable, useful embeddings in a matter of hours on a single machine. Pre-trained 300-dimensional Word2Vec vectors trained on Google News (about 100 billion words, with 3 million vocabulary entries covering words and phrases) were downloaded and used by thousands of researchers and practitioners, establishing the paradigm of pre-trained representations that would eventually lead to BERT and GPT.

GloVe: Global Vectors from Co-occurrence Statistics

GloVe (Global Vectors for Word Representation), published by Jeffrey Pennington and colleagues at Stanford in 2014, achieves similar results through a different mechanism. Instead of training a predictive model, GloVe constructs a word-word co-occurrence matrix from the training corpus, counting how often each pair of words appears near each other. It then factorizes this matrix into low-dimensional vectors, optimizing them so that the dot product of two word vectors, plus a bias term for each word, approximates the logarithm of their co-occurrence count. A weighting function keeps very rare and very frequent pairs from dominating the fit. This ties the geometry of the vectors directly to co-occurrence statistics, so distributional similarity is encoded explicitly.
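A minimal sketch of the two ingredients described above: counting co-occurrences and scoring how well a pair of vectors fits one count. The function names are made up for this example. The weighting function uses the x_max = 100 and alpha = 0.75 values from the GloVe paper, while the counting is simplified (the reference implementation also down-weights distant pairs by 1/distance, which is omitted here).

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function that damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between (dot product + biases) and log count."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

tokens = "the cat sat on the mat the cat slept".split()
counts = cooccurrence_counts(tokens)

# A pair of vectors whose dot product already equals log(2) has zero loss
# for a co-occurrence count of 2.
perfect_fit = glove_pair_loss([1.0, 0.0], [math.log(2), 0.0], 0.0, 0.0, 2)
```

Training sums this loss over every pair with a non-zero count and minimizes it by gradient descent over the vectors and biases.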

GloVe's stated advantage over Word2Vec is that it trains on co-occurrence counts aggregated over the entire corpus rather than on one local context window at a time, so every update reflects global statistics. In practice, the resulting embeddings are very similar in quality to Word2Vec embeddings, and both methods are considered essentially interchangeable for most applications. GloVe pre-trained vectors trained on Common Crawl (840 billion tokens) and Wikipedia are freely available and widely used.

FastText: Embeddings for Word Pieces

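FastText, published by Facebook's AI Research lab in 2016, extends the Word2Vec approach to subword units. Each word is represented not just by its own vector but by the vectors of all its character n-grams. The word is first wrapped in boundary markers, so "where" becomes "<where>." With n-grams of size 3, this yields the character sequences "<wh," "whe," "her," "ere," and "re>," plus the full sequence "<where>" as a special entry for the whole word. The word's embedding combines all these n-gram vectors: the original paper sums them, and implementations may average them instead, which changes only the scale. In practice a range of n-gram sizes (3 to 6 by default) is used rather than a single size.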
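The n-gram extraction step can be sketched as follows. The function name `char_ngrams` and its defaults are assumptions for this example; it covers only the extraction, not the hashing of n-grams into buckets that the real library performs.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers,
    plus the whole wrapped word as a special sequence."""
    wrapped = f"<{word}>"
    grams = [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    if wrapped not in grams:  # avoid a duplicate when n equals the full length
        grams.append(wrapped)
    return grams

where_grams = char_ngrams("where")
shared = set(where_grams) & set(char_ngrams("wherefrom"))
```

The set `shared` holds the n-grams that an unseen word like "wherefrom" has in common with "where," which is what lets a subword model build a vector for a word it never saw in training.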

This design has two major advantages. First, it produces meaningful embeddings for words the model has never seen during training, because those words share n-grams with familiar words. The unseen word "wherefrom" shares n-grams with "where" and "from," so its embedding is a sensible combination of those words' meanings. Second, it handles morphologically rich languages better than word-level methods. In Turkish, a single verb root can generate hundreds of inflected forms. FastText represents all of them effectively because they share character n-grams corresponding to the root and common suffixes. FastText pre-trained vectors are available for 157 languages, which has made it a common choice for non-English NLP.

The Limitation of Static Embeddings

Word2Vec, GloVe, and FastText all produce static embeddings: each word gets a single vector regardless of context. The word "bank" has the same embedding whether it appears in "river bank" or "bank account." The word "play" has the same vector in "play music," "play sports," and "a theatrical play." This is a fundamental limitation because word meaning is deeply context-dependent. Humans effortlessly distinguish these senses using surrounding words, but static embeddings collapse all senses into a single averaged representation.

In practice, this averaging works reasonably well because one sense of a word usually accounts for most of its occurrences in the training data and therefore dominates the embedding. The vector for "bank" tends to sit closer to financial terms because the financial sense is typically more frequent in text than the river sense. But for genuinely ambiguous words and for tasks that require fine-grained semantic discrimination, static embeddings lose critical information.

Contextual Embeddings: BERT, GPT, and Beyond

Contextual embeddings, introduced by ELMo in 2018 and refined by BERT and GPT, generate different vectors for the same word in different contexts. In BERT, the word "bank" in "I deposited money at the bank" receives a different vector than "bank" in "We walked along the river bank." The model processes the entire sentence through multiple transformer layers, and each layer produces a new set of vectors that incorporate increasingly complex contextual information. By the final layer, each token's vector represents not the word in isolation but the word's specific meaning in this particular sentence.

BERT generates contextual embeddings by processing text through 12 transformer layers (base model) or 24 layers (large model). Each layer applies self-attention, allowing every token to attend to every other token, followed by a feed-forward network. The input embedding (a static per-token vector, analogous to a Word2Vec lookup, combined with position information) is progressively transformed through these layers into a rich, context-sensitive representation. Probing studies suggest that earlier layers tend to capture syntactic information (part of speech, grammatical relations) while later layers tend to capture semantic information (word sense, entity type, sentiment).
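To see how attention turns one static vector into different contextual vectors, consider the heavily simplified sketch below. It applies a single attention step with no learned query, key, or value projections, no multiple heads, and no feed-forward network, so it is an illustration of the mixing idea rather than of BERT itself. The two-dimensional static vectors are made-up values.

```python
import math

# Made-up static vectors: first coordinate ~ finance, second ~ nature.
static = {
    "river": [0.0, 1.0],
    "money": [1.0, 0.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """One unparameterized attention step: each output vector is a
    softmax-weighted average of all input vectors."""
    outputs = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs

def contextual(tokens):
    return self_attention([static[t] for t in tokens])

bank_near_river = contextual(["river", "bank"])[1]
bank_near_money = contextual(["money", "bank"])[1]
```

The static vector for "bank" is identical in both inputs, yet the output leans toward the nature coordinate next to "river" and toward the finance coordinate next to "money." A real transformer layer does the same kind of mixing with learned projections and stacks many such layers.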

GPT models also produce contextual embeddings, but they differ from BERT in a crucial way: GPT uses causal (left-to-right) attention, meaning each token can only attend to itself and the tokens that precede it. BERT uses bidirectional attention, meaning each token attends to the full context in both directions. For understanding tasks like classification and question answering, BERT's bidirectional context is generally an advantage at comparable model size. For generation tasks where the model must predict the next word, GPT's causal architecture is necessary because the model cannot look at tokens it has not yet generated.
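The difference between the two attention patterns comes down to a mask over which positions each token may look at. The sketch below builds both masks and applies one to a row of attention scores; the function names are assumptions for this example.

```python
import math

def attention_mask(n, causal):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)

# With equal scores, the first token under a causal mask can only attend to itself.
first_token_weights = masked_softmax([0.0, 0.0, 0.0], causal[0])
```

Under the causal mask the first token places all of its attention on itself, while under the bidirectional mask the same scores would be spread evenly across all three positions.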

Modern embedding models like those used in semantic search and retrieval-augmented generation systems produce sentence-level or passage-level embeddings by pooling the token-level contextual representations into a single vector per text segment. These models are trained specifically so that semantically similar texts have similar embeddings, enabling efficient similarity search across millions of documents using approximate nearest neighbor algorithms.
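A minimal sketch of pooling and similarity search follows. The token vectors are invented stand-ins for what a contextual model would output, mean pooling is only one of several pooling strategies, and the search is exhaustive rather than an approximate nearest neighbor index, which is what a system at the scale of millions of documents would use.

```python
import math

def mean_pool(token_vectors):
    """Average token-level vectors into one vector for the whole text."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, index, k=1):
    """Exhaustive search: rank every document by cosine similarity to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Invented token vectors standing in for model output.
index = {
    "doc_cars": mean_pool([[0.9, 0.1], [0.7, 0.3]]),
    "doc_cooking": mean_pool([[0.1, 0.9], [0.2, 0.8]]),
}
query = mean_pool([[0.8, 0.2], [0.9, 0.2]])
best = top_k(query, index, k=1)
```

The document identifiers and vectors are hypothetical. In a real pipeline the same embedding model encodes both the documents and the query so that their vectors live in the same space.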

Practical Applications of Embeddings

Search and information retrieval is one of the largest commercial applications. Traditional keyword search matches exact terms: a search for "automobile repair" would not find a page about "car fixing." Semantic search using embeddings matches by meaning: the embedding for "automobile repair" is close to the embedding for "car fixing," so both results appear. Major search engines now combine embedding-based semantic matching with traditional keyword matching.

Recommendation systems use embeddings to suggest content similar to what a user has engaged with. If a user reads articles with embeddings clustered in the "machine learning" region of the embedding space, the system recommends other articles in that region. Collaborative filtering embeddings represent both users and items in the same space, so a user's position implicitly encodes their preferences. Product embeddings power "similar items" recommendations on e-commerce sites, where products with nearby vectors are shown as alternatives or complements.
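The shared user-item space can be sketched with a dot-product scorer. All names and vectors below are hypothetical; in a collaborative filtering system they would be learned from interaction data rather than written by hand.

```python
# Hypothetical embeddings: users and items live in the same 2-D space.
user_vecs = {"alice": [0.9, 0.1]}
item_vecs = {
    "intro_to_ml": [0.8, 0.2],
    "deep_learning": [0.9, 0.0],
    "sourdough_basics": [0.0, 1.0],
}
seen = {"alice": {"intro_to_ml"}}

def score(user, item):
    """Dot product between a user vector and an item vector."""
    return sum(a * b for a, b in zip(user_vecs[user], item_vecs[item]))

def recommend(user, k=1):
    """Rank items the user has not seen by their score, highest first."""
    candidates = [item for item in item_vecs if item not in seen[user]]
    candidates.sort(key=lambda item: score(user, item), reverse=True)
    return candidates[:k]

suggestions = recommend("alice")
```

Because the user's vector sits near the machine learning items, the unseen item in that region outranks the unrelated one.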

Clustering and visualization use embeddings to organize large text collections. Projecting high-dimensional embeddings into 2D or 3D using techniques like t-SNE or UMAP produces maps where semantically related documents cluster together. Clustering algorithms like k-means operate on document embeddings to automatically identify themes in large corpora without predefined categories. These techniques are used in academic literature review, market research, social media analysis, and intelligence analysis.
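A bare-bones k-means over document embeddings might look like the sketch below. The four two-dimensional "document embeddings" are invented, and the centroids are initialized from the first k points to keep the example deterministic; production code would normally use a library implementation with better initialization.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
    return labels, centroids

# Invented document embeddings: two about one theme, two about another.
docs = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(docs, k=2)
```

Documents 0 and 2 end up in one cluster and documents 1 and 3 in the other, without any category labels being supplied.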

Key Takeaway

Word embeddings convert language into numerical vectors where semantic similarity maps to geometric proximity, with modern contextual models producing different vectors for the same word depending on its surrounding context.

