Word Embeddings Explained
The Problem Embeddings Solve
Before embeddings, the standard way to represent words numerically was one-hot encoding. Each word in the vocabulary gets a vector with a single 1 at its assigned position and 0s everywhere else. With a vocabulary of 50,000 words, each vector has 50,000 dimensions, with only one non-zero entry. This representation is extremely wasteful: it uses 50,000 numbers to encode something that could be expressed as a single integer. Worse, it contains no information about relationships between words. The vectors for "cat" and "kitten" are just as different as the vectors for "cat" and "helicopter": any two distinct one-hot vectors are orthogonal, so both pairs have the same Euclidean distance (√2) and a cosine similarity of zero.
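The point about one-hot vectors can be checked directly. The following is a minimal sketch using a hypothetical four-word vocabulary (a real one would have tens of thousands of entries); the helper names `one_hot`, `cosine`, and `euclidean` are illustrative, not from any library.

```python
import math

def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = {"cat": 0, "kitten": 1, "helicopter": 2, "coffee": 3}
vectors = {word: one_hot(i, len(vocab)) for word, i in vocab.items()}

# Related and unrelated pairs are indistinguishable in one-hot space.
cat_kitten = cosine(vectors["cat"], vectors["kitten"])
cat_helicopter = cosine(vectors["cat"], vectors["helicopter"])
dist_kitten = euclidean(vectors["cat"], vectors["kitten"])
dist_helicopter = euclidean(vectors["cat"], vectors["helicopter"])
```

Both similarities come out as zero and both distances are equal, whatever the words mean.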
This means a model trained with one-hot vectors must learn from scratch that "cat" and "kitten" are related. Every fact the model learns about "cat" provides zero information about "kitten" because their representations share nothing. With 50,000 words in the vocabulary, there are roughly 1.25 billion distinct word pairs, and nothing in the representation tells the model which of them are related. Embeddings solve this by representing words as dense vectors of a few hundred dimensions, where semantic similarity corresponds to geometric proximity. A model that learns something about "cat" tends to generalize to "kitten" because their vectors point in similar directions.
Word2Vec: Learning Meaning from Context
Word2Vec, published by Tomas Mikolov and colleagues at Google in 2013, was the breakthrough that made word embeddings a standard tool in NLP. The core insight is deceptively simple: words that appear in similar contexts tend to have similar meanings. The words "dog" and "cat" both appear near words like "pet," "food," "veterinarian," and "cute." The words "dog" and "quantum" rarely share context words. By training a neural network to predict either a word from its context (Continuous Bag of Words, or CBOW) or context words from a target word (Skip-gram), Word2Vec forces the network to represent words with similar contexts using similar vectors.
The Skip-gram model works as follows. Given a target word like "coffee," the model tries to predict the words that appear within a window of, say, five positions around it: "morning," "drink," "cup," "black," "espresso." The model has two parameter matrices: an input embedding matrix and an output embedding matrix. The input embedding for "coffee" is multiplied by the output embeddings for all vocabulary words, and the result is passed through a softmax function to produce probability estimates for which words are likely neighbors. Training adjusts both matrices to make the predicted probabilities match the actual co-occurrence patterns in the training corpus. Because a softmax over the full vocabulary is expensive to compute, the original Word2Vec implementation approximates it with hierarchical softmax or negative sampling.
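The forward pass and one training step described above can be sketched in a few lines. This is a toy illustration with a full softmax and plain gradient descent, not the optimized Word2Vec implementation; the five-word vocabulary, the dimension of 8, the learning rate, and the function names are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_probs(target_idx, W_in, W_out):
    """Estimated P(context word | target word) for every vocabulary word."""
    h = W_in[target_idx]  # input embedding of the target word
    scores = [sum(a * b for a, b in zip(h, out_vec)) for out_vec in W_out]
    return softmax(scores)

def sgd_step(target_idx, context_idx, W_in, W_out, lr=0.1):
    """One gradient step on the cross-entropy loss for a (target, context) pair."""
    h = list(W_in[target_idx])
    probs = skipgram_probs(target_idx, W_in, W_out)
    # Gradient of the loss with respect to each score: predicted minus observed.
    errors = [p - (1.0 if j == context_idx else 0.0) for j, p in enumerate(probs)]
    # Compute the input-side gradient before the output matrix is modified.
    grad_h = [sum(errors[j] * W_out[j][k] for j in range(len(W_out)))
              for k in range(len(h))]
    for j, err in enumerate(errors):
        W_out[j] = [w - lr * err * hk for w, hk in zip(W_out[j], h)]
    W_in[target_idx] = [hk - lr * g for hk, g in zip(h, grad_h)]

random.seed(0)
vocab = ["coffee", "morning", "drink", "cup", "quantum"]  # hypothetical vocabulary
dim = 8
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

target, context = vocab.index("coffee"), vocab.index("cup")
probs = skipgram_probs(target, W_in, W_out)
loss_before = -math.log(probs[context])
sgd_step(target, context, W_in, W_out)
loss_after = -math.log(skipgram_probs(target, W_in, W_out)[context])
```

After the step, the model assigns a higher probability to "cup" as a neighbor of "coffee," so the loss for that pair goes down. Repeating this over every (target, context) pair in a corpus is what shapes the embedding matrices.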
The resulting embeddings display remarkable algebraic properties. The most famous example is the analogy king - man + woman = queen. Subtracting the "man" vector from "king" and adding the "woman" vector produces a point in the embedding space whose nearest neighbor, once the three query words themselves are excluded, is typically "queen." This works because the difference between "king" and "man" captures the concept of royalty, and adding that concept to "woman" reaches "queen." Similar relationships hold for country-capital pairs (France - Paris + Tokyo = Japan), verb tenses (walking - walked + swam = swimming), and many other semantic relationships, although not for every word pair. These properties emerge from the training process without any explicit instruction.
Word2Vec embeddings are typically 100 to 300 dimensions. Training on a large corpus (billions of words) produces stable, useful embeddings in a matter of hours on a single machine. The pre-trained Word2Vec vectors released by Google, trained on about 100 billion words of Google News text and covering 3 million words and phrases in 300 dimensions, were downloaded and used by thousands of researchers and practitioners, establishing the paradigm of pre-trained representations that would eventually lead to BERT and GPT.
GloVe: Global Vectors from Co-occurrence Statistics
GloVe (Global Vectors for Word Representation), published by Jeffrey Pennington and colleagues at Stanford in 2014, achieves similar results through a different mechanism. Instead of training a predictive model, GloVe constructs a word-word co-occurrence matrix from the training corpus, counting how often each pair of words appears near each other. It then learns low-dimensional vectors by weighted least-squares regression, so that the dot product of two word vectors (plus a bias term for each word) approximates the logarithm of their co-occurrence count. A weighting function keeps very frequent pairs from dominating the objective. Because words with similar meanings tend to co-occur with the same neighbors, they end up with similar vectors, which directly encodes distributional similarity.
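The published GloVe objective is a weighted least-squares loss: each observed pair contributes the squared difference between its predicted score (dot product plus two bias terms) and the log of its co-occurrence count, scaled by a weighting function. The sketch below evaluates that loss on a toy input. The cutoff of 100 and exponent of 0.75 are the values reported in the GloVe paper; the data layout and function names are assumptions for illustration, and no optimizer is included.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: grows with the count, capped at 1.0 above x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Weighted squared error between predicted scores and log co-occurrence.

    cooc maps (i, j) index pairs to positive co-occurrence counts.
    W and W_ctx hold word and context vectors; b and b_ctx hold their biases.
    """
    total = 0.0
    for (i, j), count in cooc.items():
        dot = sum(a * c for a, c in zip(W[i], W_ctx[j]))
        predicted = dot + b[i] + b_ctx[j]
        total += weight(count) * (predicted - math.log(count)) ** 2
    return total

# Hypothetical toy input: two words, two dimensions, two observed pairs.
cooc = {(0, 1): 10.0, (1, 0): 10.0}
W = [[0.5, 0.1], [0.2, 0.4]]
W_ctx = [[0.3, 0.3], [0.1, 0.6]]
b = [0.0, 0.0]
b_ctx = [0.0, 0.0]

loss = glove_loss(cooc, W, W_ctx, b, b_ctx)
```

Training GloVe amounts to adjusting the vectors and biases to drive this loss down. Pairs that never co-occur are simply absent from the sum, which avoids taking the logarithm of zero.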
GloVe's claimed advantage over Word2Vec is that it trains on co-occurrence statistics aggregated over the whole corpus rather than processing one local context window at a time, which makes more direct use of global statistical information. In practice, the resulting embeddings are very similar in quality to Word2Vec embeddings, and the two methods are largely interchangeable for most applications. GloVe pre-trained vectors trained on Common Crawl (840 billion tokens) and Wikipedia are freely available and widely used.
FastText: Embeddings for Word Pieces
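FastText, published by Facebook's AI Research lab in 2016, extends the Word2Vec approach to subword units. Each word is represented not just by its own vector but by the vectors for all its character n-grams. Boundary markers "<" and ">" are added to the word before the n-grams are extracted, so the word "where" with n-grams of size 3 yields the character sequences "<wh," "whe," "her," "ere," and "re>," plus the special whole-word sequence "<where>." In practice FastText uses n-grams of several lengths (3 to 6 characters by default) rather than a single size. The word's embedding is the sum of all these n-gram vectors (some implementations normalize by the number of n-grams, which changes the vector's length but not its direction).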
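The n-gram extraction is simple enough to sketch directly. The function below is an illustration of the scheme, not the FastText library's code; it wraps the word in boundary markers, collects character n-grams for a configurable range of sizes, and appends the whole-word sequence. The name `char_ngrams` and its parameters are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the character n-grams of `word`, including boundary markers,
    followed by the special whole-word sequence."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    grams.append(wrapped)  # the whole word gets its own entry
    return grams

where_grams = char_ngrams("where")

# An unseen word shares some of its n-grams with a known word,
# which is what lets its embedding borrow from that word.
shared = set(where_grams) & set(char_ngrams("wherefrom"))
```

For "where" this yields "<wh", "whe", "her", "ere", "re>", and "<where>". The boundary markers matter: "re>" marks a word ending, so it is not shared with "wherefrom," where "re" occurs mid-word.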
This design has two major advantages. First, it produces meaningful embeddings for words the model has never seen during training, because those words share n-grams with familiar words. The unseen word "wherefrom" shares n-grams with "where" and "from," so its embedding is a sensible combination of those words' meanings. Second, it handles morphologically rich languages better than word-level methods. In Turkish, a single verb root can generate hundreds of inflected forms. FastText represents all of them effectively because they share character n-grams corresponding to the root and common suffixes. FastText pre-trained vectors are available for 157 languages, making it a common choice for non-English NLP.
The Limitation of Static Embeddings
Word2Vec, GloVe, and FastText all produce static embeddings: each word gets a single vector regardless of context. The word "bank" has the same embedding whether it appears in "river bank" or "bank account." The word "play" has the same vector in "play music," "play sports," and "a theatrical play." This is a fundamental limitation because word meaning is deeply context-dependent. Humans effortlessly distinguish these senses using surrounding words, but static embeddings collapse all senses into a single averaged representation.
In practice, this averaging works reasonably well because most words have one sense that accounts for the bulk of their occurrences in the training data, and that sense ends up dominating the embedding. In most corpora, the vector for "bank" is closer to financial terms because the financial sense appears more frequently in text than the river bank sense. But for genuinely ambiguous words and for tasks that require fine-grained semantic discrimination, static embeddings lose critical information.
Contextual Embeddings: BERT, GPT, and Beyond
Contextual embeddings, popularized by ELMo in 2018 and refined by BERT and GPT, generate different vectors for the same word in different contexts. In BERT, the word "bank" in "I deposited money at the bank" receives a different vector than "bank" in "We walked along the river bank." The model processes the entire sentence through multiple transformer layers, and each layer produces a new set of vectors that incorporate increasingly complex contextual information. By the final layer, each token's vector represents not the word in isolation but the word's specific meaning in this particular sentence.
BERT generates contextual embeddings by processing text through 12 transformer layers (base model) or 24 layers (large model). Each layer applies self-attention, allowing every token to attend to every other token, followed by a feed-forward network. The input embedding (a context-independent vector for each WordPiece token, combined with position and segment embeddings) is progressively transformed through these layers into a rich, context-sensitive representation. Probing studies suggest that earlier layers tend to capture syntactic information (part of speech, grammatical relations) while later layers tend to capture semantic information (word sense, entity type, sentiment).
GPT models also produce contextual embeddings, but they differ from BERT in a crucial way: GPT uses causal (left-to-right) attention, meaning each token can only attend to itself and the tokens that precede it. BERT uses bidirectional attention, meaning each token attends to the full context in both directions. For understanding tasks like classification and question answering, BERT's bidirectional context is generally an advantage, since the representation of each token can draw on the words that follow it as well as those before it. For generation tasks where the model must predict the next word, GPT's causal architecture is necessary because the model cannot look at tokens it has not yet generated.
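The difference between causal and bidirectional attention comes down to which positions each token is allowed to see. The sketch below builds both kinds of mask and applies one to a row of attention scores. It illustrates only the masking idea, with assumed helper names, and leaves out everything else in a transformer layer.

```python
import math

def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i may attend to token j.

    Causal (GPT-style): only positions j <= i are visible.
    Bidirectional (BERT-style): every position is visible.
    """
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over the visible positions; hidden positions get weight 0."""
    visible = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

# With equal raw scores, the second token splits its attention over
# two visible positions under a causal mask and over all four otherwise.
scores = [0.0, 0.0, 0.0, 0.0]
causal_weights = masked_softmax(scores, causal[1])
bidirectional_weights = masked_softmax(scores, bidirectional[1])
```

The causal mask is lower-triangular, so the attention weights for any token place zero mass on later positions.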
Modern embedding models like those used in semantic search and retrieval-augmented generation systems produce sentence-level or passage-level embeddings by pooling the token-level contextual representations into a single vector per text segment. These models are trained specifically so that semantically similar texts have similar embeddings, enabling efficient similarity search across millions of documents using approximate nearest neighbor algorithms.
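Pooling and similarity search can be sketched as follows. Mean pooling is one common pooling choice, assumed here for illustration; the token vectors are invented two-dimensional stand-ins for real model outputs, and the search is an exact brute-force scan, which approximate nearest neighbor indexes replace at larger scales.

```python
import math

def mean_pool(token_vectors):
    """Average token-level vectors into a single text-level vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query (exact scan)."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Invented token vectors standing in for a model's contextual outputs.
doc_tokens = {
    "car fixing":   [[0.9, 0.1], [0.8, 0.2]],
    "pasta recipe": [[0.1, 0.9], [0.2, 0.8]],
    "bread baking": [[0.2, 0.9], [0.1, 0.7]],
}
doc_vecs = {name: mean_pool(tokens) for name, tokens in doc_tokens.items()}

query_vec = mean_pool([[0.95, 0.05], [0.85, 0.25]])  # "automobile repair"
results = search(query_vec, doc_vecs)
```

The query shares no words with "car fixing," yet that document ranks first because its pooled vector points in nearly the same direction.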
Practical Applications of Embeddings
Search and information retrieval is one of the largest commercial applications. Traditional keyword search matches exact terms: a search for "automobile repair" would not find a page about "car fixing." Semantic search using embeddings matches by meaning: the embedding for "automobile repair" is close to the embedding for "car fixing," so both results appear. Major search engines now combine embedding-based semantic matching with traditional keyword matching.
Recommendation systems use embeddings to suggest content similar to what a user has engaged with. If a user reads articles with embeddings clustered in the "machine learning" region of the embedding space, the system recommends other articles in that region. Collaborative filtering embeddings represent both users and items in the same space, so a user's position implicitly encodes their preferences. Product embeddings power "similar items" recommendations on e-commerce sites, where products with nearby vectors are shown as alternatives or complements.
Clustering and visualization use embeddings to organize large text collections. Projecting high-dimensional embeddings into 2D or 3D using techniques like t-SNE or UMAP produces maps where semantically related documents cluster together. Clustering algorithms like k-means operate on document embeddings to automatically identify themes in large corpora without predefined categories. These techniques are used in academic literature review, market research, social media analysis, and intelligence analysis.
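Clustering document embeddings with k-means can be sketched in plain Python. This is a bare-bones version for illustration: it initializes centroids from the first k points, which is an assumption made to keep the example deterministic (library implementations use better initialization), and the two-dimensional document embeddings are invented.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=10):
    """Basic k-means: returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive initialization
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Invented document embeddings: two about one theme, two about another.
doc_embeddings = [
    [0.9, 0.1],  # "training neural networks"
    [0.1, 0.9],  # "sourdough starter tips"
    [0.8, 0.2],  # "gradient descent explained"
    [0.2, 0.8],  # "baking bread at home"
]

assignments, centroids = kmeans(doc_embeddings, k=2)
```

The first and third documents land in one cluster and the second and fourth in the other, without any labels being supplied.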
Word embeddings convert language into numerical vectors where semantic similarity maps to geometric proximity, with modern contextual models producing different vectors for the same word depending on its surrounding context.