Clustering is a way to automatically discover natural groupings in a collection of observations based on their similarities and differences. One of its common uses is to classify new observations into previously learned groups.
In clustering, the basic measure of 'similarity' and 'difference' between observations is the distance between them in the coordinate system of their measurements. Observations that are very similar to each other have small distances between them, whereas observations that are very different from each other have large distances. The coordinate system that contains all of the observations has as many axes or dimensions as there are measured variables in each observation. As I've alluded to in previous posts, observations may have too many dimensions for a human to easily visualize and make sense of.
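To make the distance idea concrete, here is a minimal sketch of the usual Euclidean distance between two observations. The function name and the example measurements are mine, chosen just for illustration:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two observations, where each
    coordinate is one measured variable (one axis of the data space)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical observations measured on two variables,
# e.g. (height in meters, weight in kilograms).
p = (1.70, 65.0)
q = (1.75, 70.0)
d = euclidean_distance(p, q)  # small distance: these observations are similar
```

With more measured variables, the tuples simply get longer; the distance formula itself does not change.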
A 'cluster' is a set of observations that bunch up together in the coordinate system of measurements. To give a concrete example, let's say that we have a population of objects, and we measure their heights and weights (Figure A). This distribution of observations has clearly visible structure. What can we make of that structure? This data set has six natural clusters, which can also be reasonably grouped into two larger clusters, each containing three smaller clusters.
I've used the k-means clustering algorithm to analyze this data set. k-means starts with a chosen number k of clusters and drops k random cluster centers into the data space. It then assigns each data point to its nearest center, calculates the center of mass for each resulting set (k averages, or means), relocates the cluster centers to those means, and repeats until the centers converge on final locations. It's rather crude. Figure B shows how we can rationally choose the number of clusters using a goodness-of-fit measure whose inflection points (marked with asterisks) indicate 'good' numbers of clusters. Plots of two and six clusters discovered by k-means are shown in Figures C and D, respectively. Colors indicate cluster membership.
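The loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the procedure, not the implementation behind the figures; the function names and the within-cluster sum of squares used as the goodness-of-fit measure are my assumptions:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: drop k random centers, assign each point to its
    nearest center, move each center to the mean of its points, repeat
    until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random initial cluster centers
    for _ in range(iters):
        # Assign each data point to the closest center (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        # Relocate each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged on final locations
            break
        centers = new_centers
    return centers, clusters

def inertia(centers, clusters):
    """Within-cluster sum of squares: one common goodness-of-fit measure.
    Plotting it against k and looking for inflection points ('elbows')
    is one way to pick a good number of clusters, as in Figure B."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for c, cl in zip(centers, clusters)
        for p in cl
    )
```

Because the initial centers are random, a practical run would repeat the whole procedure for each candidate k with several seeds and keep the lowest-inertia result.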
The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author.