K-Nearest Neighbors (KNN) Algorithm Explained

Updated May 2026
K-nearest neighbors is a machine learning algorithm that predicts by finding the K training examples most similar to a new input and using their labels to make the prediction. For classification, it takes a majority vote among the K neighbors. For regression, it averages their values. KNN has no training phase at all; it stores the entire dataset and computes all predictions at query time, making it conceptually the simplest ML algorithm while being surprisingly effective on many problems.

How KNN Works

The algorithm is intuitive enough to explain in one paragraph. When a new data point arrives, KNN computes the distance between that point and every point in the training set. It sorts these distances, picks the K closest neighbors, and uses their labels. For classification with K=5, if three neighbors are "yes" and two are "no," the prediction is "yes." For regression with K=5, if the five nearest houses sold for $300K, $320K, $310K, $340K, and $290K, the prediction is their average: $312K.

The algorithm's name contains its only parameter. K determines how many neighbors influence the prediction. A small K (like 1 or 3) makes the model sensitive to local patterns, including noise. A large K (like 50 or 100) smooths the predictions by averaging over more neighbors, which can wash out local structure. The optimal K depends on the dataset and is found through cross-validation.

K=1 is an interesting special case. The prediction is simply the label of the single closest training point. This creates a Voronoi tessellation of the feature space, where each training point "owns" the region of space closest to it. KNN with K=1 has zero training error (every training point's nearest neighbor is itself) but high variance, making it prone to overfitting.

Distance Metrics

The choice of distance metric defines what "nearest" means, and it matters more than most beginners expect.

Euclidean distance is the straight-line distance between two points, calculated as the square root of the sum of squared differences across all features. It is the default in most implementations and works well when features are on similar scales and all equally important. In two dimensions, Euclidean distance is the familiar distance formula from geometry.

Manhattan distance sums the absolute differences across features. It measures distance as if you could only move along axes (like navigating city blocks). Manhattan distance is less sensitive to outlier features because it does not square the differences, making large differences in one feature less dominant.

Minkowski distance generalizes both. With parameter p=2 it is Euclidean, with p=1 it is Manhattan. Other values of p create different distance geometries that may suit specific data shapes.

Cosine similarity measures the angle between two vectors rather than the distance. It is standard for text data, where documents are represented as high-dimensional word frequency vectors. Two documents about the same topic will point in similar directions regardless of their length, so cosine similarity captures topical similarity better than Euclidean distance.

Feature scaling is critical for distance-based algorithms. If one feature is measured in thousands (income in dollars) and another in single digits (number of children), the high-magnitude feature will dominate the distance calculation. Standardizing features to zero mean and unit variance, or normalizing them to the 0-1 range, ensures all features contribute proportionally.

Choosing K

K is the primary hyperparameter, and its optimal value varies by dataset. Small K values (1-5) capture fine-grained local patterns but are noisy. Large K values (20-100) produce smoother predictions but may blur boundaries between classes.

A useful rule of thumb is to start with K equal to the square root of the number of training samples. For 10,000 samples, try K=100. Then use cross-validation to test values around that starting point, plotting accuracy against K to find the elbow where performance stabilizes.

For binary classification, odd values of K prevent ties. With K=4, a 2-2 split provides no clear prediction. With K=5, there is always a majority. Some implementations break ties by distance-weighting, giving closer neighbors more influence, but using odd K is simpler.

Weighted KNN assigns each neighbor a weight inversely proportional to its distance. A neighbor that is 0.1 units away contributes more to the prediction than one that is 2.0 units away. This reduces the sensitivity to K because distant neighbors have minimal influence regardless of whether they are included.

Strengths

Zero training time. KNN stores the data and does nothing else during the "training" phase. All computation happens at prediction time. This is advantageous when you need to update the model frequently with new data, because adding new points requires no retraining.

No assumptions about data distribution. Unlike linear regression (assumes linearity) or naive Bayes (assumes feature independence), KNN makes no assumptions about the underlying data distribution. It can model arbitrarily complex decision boundaries simply by having enough neighbors.

Naturally handles multi-class problems. Unlike SVMs, which require one-vs-one or one-vs-rest workarounds, KNN handles any number of classes natively. The majority vote works identically with 2 or 200 classes.

Easy to understand and implement. The algorithm can be explained to anyone who understands the concept of similarity. It requires no matrix algebra, no gradient descent, no backpropagation. This makes it an excellent starting point for learning ML concepts and a useful baseline for benchmarking.

Weaknesses

Slow prediction time. Every prediction requires computing the distance to every training point. With 1 million training samples and 100 features, that is 100 million multiplications per prediction. KD-trees and ball trees can reduce this to O(log n) for low-dimensional data, but in high dimensions (more than about 20 features), these spatial data structures degrade to near-linear performance.

Curse of dimensionality. In high-dimensional spaces, the concept of "nearest" breaks down. As dimensions increase, all points become roughly equidistant from each other, and the notion of a meaningful neighborhood disappears. With 1,000 features, the 5 "nearest" neighbors might not be meaningfully closer than the 500th nearest. This limits KNN's effectiveness on high-dimensional data without dimensionality reduction.

Sensitive to irrelevant features. Every feature contributes equally to the distance calculation (unless you engineer custom weights). If half your features are irrelevant noise, they add random distance that obscures the signal from the useful features. Feature selection is important preprocessing for KNN.

Memory-intensive. The entire training dataset must be stored and available at prediction time. For very large datasets, this can be impractical. Techniques like instance selection (removing redundant training points) can help but add complexity.

Key Takeaway

KNN is the simplest ML algorithm conceptually: find the most similar training examples and copy their labels. It requires no training, makes no distributional assumptions, and handles complex boundaries naturally. Its main limitations are slow prediction on large datasets, poor performance in high dimensions, and sensitivity to irrelevant features and feature scaling.