Silhouette Values for Segmentation Evaluation

In this paper, I describe the silhouette value, a statistic that may be used both to help identify the appropriate number of groups in a segmentation (or clustering) and to assess the quality of a segmentation.

Three data sets were constructed using the genRandomClust function in the clusterGeneration package of R. Each data set consists of 200 cases and six variables and comprises three “true” clusters with approximate sizes of 100, 60, and 40.
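The article generates its data in R with clusterGeneration::genRandomClust; as a rough Python analogue of the setup, one might draw three Gaussian clusters of the stated sizes (the centers and spread below are illustrative choices, not the article's actual parameters, and separation is controlled only loosely here):

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 cases, six variables, three "true" clusters of sizes 100, 60, 40.
# Center scale vs. within-cluster scale loosely governs separation;
# genRandomClust controls separation directly via its sepVal argument.
sizes = [100, 60, 40]
centers = rng.normal(scale=5.0, size=(3, 6))   # well-separated centers
X = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(n, 6))
    for c, n in zip(centers, sizes)
])
labels = np.repeat([0, 1, 2], sizes)           # "true" cluster labels

print(X.shape)   # (200, 6)
```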

The three data sets are labeled as follows:
- D1: Very strong separation of segments
- D2: Strong separation of segments
- D3: Moderate separation of segments

The terms “very strong,” “strong,” and “moderate” are relative and have no absolute meaning; judgments of the strength of separation depend on the particular context.

A fourth “unstructured” data set, D4, was constructed by randomly permuting the values for each of the six variables of D2. Thus, the univariate distributions of the six variables are identical for D2 and D4.
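The construction of D4 amounts to independently shuffling the rows within each column, which destroys the joint (cluster) structure while leaving every univariate distribution untouched. A small sketch, with a random matrix standing in for D2:

```python
import numpy as np

rng = np.random.default_rng(1)

def permute_columns(X, rng):
    """Independently permute the values within each column of X."""
    X_perm = X.copy()
    for j in range(X.shape[1]):
        X_perm[:, j] = rng.permutation(X_perm[:, j])
    return X_perm

X = rng.normal(size=(200, 6))      # stand-in for D2
X4 = permute_columns(X, rng)       # analogue of D4

# Each column still holds exactly the same values, just reordered,
# so the univariate distributions of X and X4 are identical.
for j in range(X.shape[1]):
    assert np.allclose(np.sort(X[:, j]), np.sort(X4[:, j]))
```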

Segmentation was carried out using k-means clustering, but the silhouettes described below may be used with a segmentation derived by any method (e.g., hierarchical clustering, partitioning around medoids), as long as one can compute distances (or dissimilarities) between all pairs of cases.
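Because silhouettes require only a matrix of pairwise distances and a vector of cluster labels, a minimal implementation is short. The sketch below (equivalent in spirit to R's cluster::silhouette or scikit-learn's silhouette_samples) assumes every cluster has at least two members:

```python
import numpy as np

def silhouette_values(D, labels):
    """Silhouette value for each case, given a full distance matrix D
    and cluster labels. s(i) = (b - a) / max(a, b), where a is the mean
    distance to i's own cluster and b the smallest mean distance to
    another cluster."""
    labels = np.asarray(labels)
    n = D.shape[0]
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        not_self = np.arange(n) != i
        a = D[i, own & not_self].mean()
        b = min(D[i, labels == k].mean()
                for k in set(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Tiny illustration: two well-separated pairs of points on a line
# give silhouette values close to 1 for every case.
pts = np.array([0.0, 0.1, 10.0, 10.1])
D = np.abs(pts[:, None] - pts[None, :])
print(silhouette_values(D, [0, 0, 1, 1]).round(3))
```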


LMG Weights in Key Driver Analysis

When attempting to assess the relative importance of predictors (or drivers, in the context of key driver analysis) of an outcome variable, the use of regression coefficients as importance measures becomes problematic when predictors are (sometimes substantially) correlated with one another. Here we consider one alternative to (or really an extension of) standard regression analysis to assess the relative importance of predictors. Following Grömping (2007), we refer to the resulting coefficients/weights as LMG weights. These are also equivalently known as Shapley value coefficients (Lipovetsky & Conklin, 2001; Shapley, 1953) and general dominance weights (Budescu, 1993).

To understand the LMG approach, let’s consider a situation where we want to predict a variable Y from three potential drivers, X1, X2, and X3. For example, Y might be a measure of overall customer satisfaction with a product and the drivers might be satisfaction with price, ease of use, and customer service.

The LMG weights approach considers every possible ordering of the three predictor variables as they are entered into a multiple regression model with Y as the dependent variable. With our three predictors – X1, X2, X3 – there are six possible orderings: 123, 132, 213, 231, 312, and 321.

For each ordering, we look at the change in the R² of the regression model as each variable is entered into the model. For example, for the ordering 312, we first consider the R² value when X3 is entered into the model. Since this is a one-predictor model, the change in R² is simply the square of the simple bivariate correlation of X3 with Y. Next we add X1 to the model so that we are now regressing Y onto both X3 and X1. We associate with X1 the change in R² moving from the one-predictor model (with X3) to the two-predictor model (with X3 and X1). Finally, we add X2 to the model and associate with X2 the change in R² when moving from the two-predictor model to the full three-predictor model.
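The LMG weight for each predictor is then the average of its R² increments across all orderings (and the weights sum to the full model's R²). A compact numpy sketch of this procedure, with illustrative variable names, might look like:

```python
import numpy as np
from itertools import permutations

def r_squared(X, y):
    """R² of an OLS regression of y on the columns of X (with intercept)."""
    if X.shape[1] == 0:
        return 0.0
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def lmg_weights(X, y):
    """Average each predictor's incremental R² over every ordering
    in which the predictors can enter the model."""
    p = X.shape[1]
    weights = np.zeros(p)
    orders = list(permutations(range(p)))   # e.g. (2, 0, 1) is "312"
    for order in orders:
        entered = []
        r2_prev = 0.0
        for j in order:
            entered.append(j)
            r2 = r_squared(X[:, entered], y)
            weights[j] += r2 - r2_prev      # credit j with the R² gain
            r2_prev = r2
    return weights / len(orders)

# Simulated example: three drivers with decreasing true effects.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 0.5, 0.2]) + rng.normal(size=500)
w = lmg_weights(X, y)
print(w.round(3), w.sum().round(3))
```

With three predictors this loops over the six orderings listed above; for p predictors the number of orderings is p!, which is why production implementations (e.g. R's relaimpo package) use more efficient formulas for larger p.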