Clustering Algorithms Explained

Updated May 2026

Clustering algorithms group data points into clusters based on similarity, without any labeled examples to guide them. Points within a cluster are more similar to each other than to points in other clusters. This is the most common form of unsupervised learning, used for customer segmentation, anomaly detection, image compression, document organization, and exploratory data analysis. The main algorithms, K-means, DBSCAN, and hierarchical clustering, each make different assumptions about what a "cluster" looks like.

K-Means Clustering

K-means is the most widely used clustering algorithm because of its simplicity and speed. The algorithm partitions data into exactly K clusters, where K is specified in advance by the user. Each cluster is defined by its centroid, the mean position of all points in the cluster.

The algorithm works iteratively. Step 1: Initialize K centroids, either randomly or using the smarter K-means++ method that spreads initial centroids apart. Step 2: Assign each data point to the nearest centroid based on Euclidean distance. Step 3: Recalculate each centroid as the mean of all points assigned to it. Step 4: Repeat steps 2 and 3 until assignments stop changing or a maximum iteration count is reached.

K-means converges quickly, typically in 10-50 iterations on most datasets. The time complexity is O(n * K * d * i) where n is the number of points, K is the number of clusters, d is the number of dimensions, and i is the number of iterations. For a dataset of 1 million points with 10 features and K=5, K-means finishes in seconds.

The main weakness is that K-means assumes clusters are spherical and roughly equal in size. It partitions the space into Voronoi cells around the centroids, which are always convex regions. If the true clusters are elongated, curved, or wildly different in size, K-means will produce misleading results. It also requires specifying K upfront, which is often unknown.

The elbow method helps choose K. Run K-means for K=1 through K=20 and plot the total within-cluster sum of squares (inertia) against K. The inertia always decreases as K increases, but the rate of decrease typically slows sharply at the "elbow," suggesting the optimal number of clusters. The silhouette score provides a more rigorous approach: it measures how similar each point is to its own cluster compared to the nearest neighboring cluster, with scores ranging from -1 (wrong cluster) to 1 (well-matched).

DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids. A cluster is a dense region of points separated from other dense regions by sparse areas. This allows DBSCAN to discover clusters of arbitrary shape, including rings, curves, and irregular blobs that K-means cannot handle.

DBSCAN has two parameters. Epsilon (eps) is the maximum distance between two points for them to be considered neighbors. MinPts is the minimum number of neighbors a point must have within epsilon to be considered a core point. A core point and all points reachable from it through chains of core points form a cluster.

Points that are within epsilon of a core point but have too few neighbors themselves are border points, assigned to the cluster of the nearest core point. Points that are not within epsilon of any core point are classified as noise, outliers that do not belong to any cluster. This noise detection is a unique advantage: DBSCAN is the only common clustering algorithm that naturally identifies outliers.

DBSCAN does not require specifying the number of clusters. It discovers however many clusters exist in the data. This is valuable when you have no prior knowledge of how many groups to expect. The algorithm also handles non-spherical clusters naturally, because clusters are defined by density connectivity rather than distance to a center.

The weakness is sensitivity to the epsilon parameter. Too small and every point is noise. Too large and everything merges into one cluster. The k-distance plot helps: compute the distance to the kth nearest neighbor for every point (where k = MinPts), sort these distances, and look for an elbow. The epsilon value at the elbow separates dense clusters from sparse noise.

Hierarchical Clustering

Hierarchical clustering builds a tree of clusters (called a dendrogram) that shows how clusters merge or split at every level of granularity. You do not need to specify the number of clusters in advance; instead, you cut the dendrogram at the desired level to produce any number of clusters from 1 to n.

Agglomerative (bottom-up) is the most common approach. Start with each point as its own cluster. At each step, merge the two closest clusters. Repeat until all points are in one cluster. The "closest" clusters are determined by a linkage criterion: single linkage uses the minimum distance between any two points in the clusters, complete linkage uses the maximum, average linkage uses the mean, and Ward linkage minimizes the total within-cluster variance.

The dendrogram visualization is the primary advantage. You can see the entire clustering structure at once and choose how many clusters to use based on where natural gaps appear. If there is a large jump in merge distance at a certain level, that suggests a natural cluster boundary.

The main weakness is computational cost. Agglomerative clustering is O(n^3) in the naive implementation and O(n^2 log n) with optimized data structures. This makes it impractical for datasets with more than about 50,000 points. For larger datasets, use K-means or DBSCAN and reserve hierarchical clustering for smaller datasets where the dendrogram visualization provides value.

Other Clustering Approaches

Gaussian Mixture Models (GMM) assume the data is generated from a mixture of several Gaussian distributions. Each cluster is modeled as a Gaussian with its own mean and covariance matrix. GMMs are more flexible than K-means because they allow elliptical clusters rather than only spherical ones, and they provide soft assignments (probabilities of belonging to each cluster) rather than hard assignments.

Mean Shift is a non-parametric algorithm that finds cluster centers by iteratively shifting points toward regions of higher density. It automatically determines the number of clusters based on a bandwidth parameter. It works well for irregularly shaped clusters but is slow on large datasets.

Spectral Clustering uses the eigenvalues of a similarity matrix to reduce dimensionality before clustering in the lower-dimensional space. It can find clusters that are not convex and handles data where the clusters are defined by connectivity rather than distance. It is effective on graph-structured data but requires computing eigenvectors of an n-by-n matrix, limiting scalability.

Choosing the Right Algorithm

Start with K-means for its speed and simplicity. If the clusters are spherical and roughly equal-sized, K-means will work well. If you do not know K, use the elbow method or silhouette score to estimate it.

Switch to DBSCAN if you suspect non-spherical clusters, need outlier detection, or do not want to specify the number of clusters. Be prepared to tune epsilon carefully.

Use hierarchical clustering when you want to visualize the clustering structure at multiple levels, or when the dataset is small enough (under 50,000 points) to make computation tractable.

Use GMMs when you need soft cluster assignments or when clusters are elliptical rather than spherical.

Key Takeaway

K-means is fastest and best for spherical clusters. DBSCAN handles arbitrary shapes and detects outliers without requiring the number of clusters. Hierarchical clustering provides a full tree of cluster relationships. The right choice depends on cluster shape, dataset size, and whether you know how many clusters to expect.

K-Means Clustering

DBSCAN: Density-Based Clustering

Hierarchical Clustering

Other Clustering Approaches

Choosing the Right Algorithm

Related Articles

Types of Machine Learning

Dimensionality Reduction

K-Nearest Neighbors

What Is an Algorithm?

Scatter Plots