In this paper, I describe the silhouette value, a statistic that can help identify the appropriate number of groups in a segmentation (or clustering) and assess the quality of that segmentation.
Three data sets were constructed using the genRandomClust function in the clusterGeneration package of R. Each data set consists of 200 cases and six variables and comprises three “true” clusters with approximate sizes of 100, 60, and 40.
The three data sets are labeled as follows:
- D1: Very strong separation of segments
- D2: Strong separation of segments
- D3: Moderate separation of segments
The terms “very strong,” “strong,” and “moderate” are relative and have no absolute meaning. How strong a given separation is judged to be depends on the particular context.
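A fourth “unstructured” data set, D4, was constructed by randomly permuting the values of each of the six variables of D2, independently of the other variables. Thus, the univariate distributions of the six variables are identical for D2 and D4, but the associations among the variables present in D2 are broken in D4.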
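The permutation step can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code used to build D4. The function name `permute_columns`, the seed, and the toy data are assumptions made for the example.

```python
import random

def permute_columns(rows, seed=0):
    """Return a copy of `rows` with each column shuffled independently."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]   # transpose: one list per variable
    for col in columns:
        rng.shuffle(col)                          # a separate permutation per variable
    return [list(row) for row in zip(*columns)]   # transpose back to cases x variables

# Toy stand-in for D2 (hypothetical values): two variables that rise together.
d2_toy = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
d4_toy = permute_columns(d2_toy, seed=1)
```

Each column of `d4_toy` holds exactly the same values as the matching column of `d2_toy`, so the univariate distributions are unchanged. Only the pairing of values within a case is scrambled.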
Segmentation was carried out using k-means clustering, but the silhouettes described below may be used with a segmentation derived by any method (e.g., hierarchical clustering, partitioning around medoids), as long as one can compute distances (or dissimilarities) between all pairs of cases.
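To make the "distances between all pairs of cases" requirement concrete, the sketch below computes silhouette values from nothing more than a distance matrix and a vector of cluster labels. It assumes the standard definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from case i to the other members of its own cluster and b(i) is the smallest mean distance from case i to the members of any other cluster. The function names, the use of Euclidean distance, and the toy points are assumptions made for illustration, and Python is used only as a convenient notation.

```python
import math

def pairwise_distances(rows):
    """Euclidean distance between every pair of cases, as a full square matrix."""
    return [[math.dist(a, b) for b in rows] for a in rows]

def silhouette_values(dist, labels):
    """Silhouette value for each case, from a distance matrix and cluster labels."""
    clusters = sorted(set(labels))
    values = []
    for i, own in enumerate(labels):
        # Mean distance from case i to the members of each cluster, excluding itself.
        mean_to = {}
        for c in clusters:
            members = [j for j, lab in enumerate(labels) if lab == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        if own not in mean_to or len(mean_to) < 2:
            # Singleton cluster, or only one cluster overall: use 0 by convention.
            values.append(0.0)
            continue
        a = mean_to[own]                                     # cohesion
        b = min(v for c, v in mean_to.items() if c != own)   # nearest other cluster
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
dist = pairwise_distances(points)
good = silhouette_values(dist, [0, 0, 1, 1])   # labels match the two pairs
bad = silhouette_values(dist, [0, 1, 0, 1])    # labels split each pair
```

Nothing in `silhouette_values` depends on how the labels were produced, so labels from k-means, hierarchical clustering, or any other method can be passed in unchanged. In the toy example the well-matched labels give values close to 1, and the mismatched labels give negative values.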