Dimensionality Reduction Explained
Why High Dimensions Are Problematic
The curse of dimensionality describes a set of phenomena that make high-dimensional data increasingly difficult to work with. As the number of features grows, the volume of the feature space increases exponentially. Data points that seemed close in low dimensions become sparse and equidistant in high dimensions. A dataset of 10,000 points that densely fills a 2D square barely scratches the surface of a 100-dimensional space.
This sparsity has practical consequences. Distance-based algorithms like KNN and K-means lose their effectiveness because "nearest" stops being meaningful. Classification boundaries become harder to learn because the model needs exponentially more data to cover the space. Noise features dilute the signal from informative features, degrading model performance. Training time increases with the number of features for most algorithms.
Dimensionality reduction addresses all of these problems. By projecting data onto a lower-dimensional subspace that captures the important structure, you get denser data, more meaningful distances, faster training, and often better model performance because the noise has been removed.
PCA: Principal Component Analysis
PCA is the most common dimensionality reduction technique. It finds the directions of maximum variance in the data and projects the data onto those directions. The first principal component is the direction along which the data varies most. The second is the direction of maximum variance perpendicular to the first. The third is perpendicular to both, and so on.
Mathematically, PCA computes the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors define the directions (principal components), and the eigenvalues indicate how much variance each direction captures. Sorting by eigenvalue in descending order gives the components in order of importance.
The practical workflow is: standardize the features (PCA is sensitive to scale), compute principal components, examine the explained variance ratio (how much variance each component captures), and choose enough components to retain a target level of variance (commonly 95% or 99%). If the first 20 components explain 95% of the variance in a 500-feature dataset, you have compressed 25x with only 5% information loss.
Strengths: PCA is fast (O(n * d^2) for n samples and d features), deterministic (same data always produces the same result), well-understood mathematically, and the transformed features are uncorrelated, which benefits many downstream algorithms. It is the default preprocessing step before applying algorithms sensitive to dimensionality.
Limitations: PCA only captures linear relationships. If the important structure in the data lies along curves or manifolds, PCA will miss it. The principal components are linear combinations of all original features, making them hard to interpret. You know the first component explains 40% of variance, but it is a weighted combination of all 500 features, not a single meaningful quantity.
t-SNE: Visualization of High-Dimensional Data
t-SNE (t-distributed Stochastic Neighbor Embedding) is designed specifically for creating 2D or 3D visualizations of high-dimensional data. It models pairwise similarities in the original space as probability distributions and finds a low-dimensional arrangement of points that preserves those similarities as closely as possible.
The algorithm converts distances between points into probabilities: nearby points get high probabilities, distant points get low probabilities. It then arranges points in 2D and adjusts their positions to minimize the divergence between the high-dimensional and low-dimensional probability distributions. The result is a map where clusters of similar points in the original space appear as clusters in the 2D plot.
t-SNE excels at revealing cluster structure that is invisible in linear projections like PCA. It has produced iconic visualizations of MNIST digits forming distinct clusters, of single-cell RNA-seq data revealing cell types, and of word embeddings showing semantic neighborhoods.
Limitations are significant. t-SNE is non-deterministic (different runs produce different layouts). It does not preserve global structure well (distances between clusters are not meaningful, only relative positions within clusters are). It is slow, scaling as O(n^2) or O(n*log(n)) with the Barnes-Hut approximation. It cannot transform new data points without rerunning on the entire dataset. And the perplexity parameter strongly affects the result, requiring experimentation to find a good value (typically between 5 and 50).
t-SNE is strictly a visualization tool, not a preprocessing step for modeling. Using t-SNE features as input to a classifier is almost always a bad idea because the features are unstable and distort distances.
UMAP: The Modern Alternative
UMAP (Uniform Manifold Approximation and Projection) is a newer technique that has largely replaced t-SNE for high-dimensional visualization and has broader applicability. It assumes the data lies on a manifold (a smooth, possibly curved surface) embedded in the high-dimensional space and finds a low-dimensional representation that preserves the manifold structure.
UMAP is faster than t-SNE, especially on large datasets. It better preserves global structure (clusters that are far apart in the original space tend to be far apart in the UMAP projection). It can transform new data points using the learned mapping. And it is not restricted to 2-3 dimensions; reducing to 10 or 50 dimensions for use as preprocessing before modeling is practical and often effective.
The key parameters are n_neighbors (how many nearby points influence the embedding, controlling the balance between local and global structure) and min_dist (how tightly points are packed together, controlling visual density). Typical values are n_neighbors between 15 and 200, and min_dist between 0.0 and 0.5.
Other Techniques
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique. Unlike PCA, which maximizes variance, LDA maximizes class separation. It finds directions that best discriminate between labeled classes, making it ideal as a preprocessing step before classification. LDA can reduce dimensions to at most (number of classes - 1), so binary classification reduces to 1 dimension.
Autoencoders are neural networks that learn compressed representations. The network has a bottleneck layer with fewer neurons than the input. Data passes through the bottleneck and is reconstructed at the output. The bottleneck representation is a nonlinear dimensionality reduction. Variational autoencoders (VAEs) add probabilistic structure, enabling generation of new data points from the compressed space.
Feature selection is a different approach: instead of transforming features, it selects a subset of original features to keep. Methods include correlation-based filtering (remove features correlated with others), mutual information (keep features most informative about the target), and recursive feature elimination (repeatedly train a model and remove the least important features).
Choosing the Right Method
For preprocessing before modeling: start with PCA. It is fast, deterministic, and works well as a general-purpose reducer. If the data has nonlinear structure, try UMAP with a moderate number of output dimensions.
For visualization: use UMAP as the default. It is faster than t-SNE, preserves global structure better, and produces cleaner visualizations. Use t-SNE when you specifically need to examine fine-grained local structure or when comparing results with published t-SNE visualizations.
For supervised problems: consider LDA if you want maximum class separation, or use feature importance from a random forest to select the most informative features directly.
Dimensionality reduction compresses high-dimensional data by removing noise and redundancy. PCA is the fast, linear default for preprocessing. UMAP is the modern standard for visualization and nonlinear reduction. The right choice depends on whether you need a preprocessing step (PCA), a visualization (UMAP or t-SNE), or supervised feature extraction (LDA).