23 K-means clustering
PCA and MDS are both ways of exploring “structure” in data with many variables. Both methods arrange observations across a plane as an approximation of the underlying structure in the data. K-means is another method for illustrating structure, but the goal is quite different: each observation is assigned to one of \(k\) clusters so that observations in the same cluster are close to each other. You, as the analyst, must specify the number of clusters in advance. Sometimes there is an obviously correct number of clusters; other times you construct many different clusterings and select the one which best represents the data. Cluster assignments can also be combined with PCA or MDS and used as labels on those ordination plots. There are many methods of clustering; we will only look at one in this course.
23.1 Make clusters from points
Let’s start by making 3 clusters from the quantitative variables in the Palmer penguins data. We omit missing data and scale each variable to have mean 0 and standard deviation 1. We don’t want the scale each variable happens to be measured on to give one variable more weight than another. (We might choose to give one variable more weight, but we don’t want that to be simply a consequence of the units used to report the variables.)
penguin_q <- penguins %>%
  select(species, flipper_length_mm, bill_length_mm, bill_depth_mm, body_mass_g) %>%
  na.omit()
kclust1 <- kmeans(penguin_q %>% select(-species) %>% scale(),
                  centers = 3)
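Note that kmeans starts from randomly chosen centres, so repeated runs can give slightly different clusterings and the cluster labels themselves are arbitrary. A minimal sketch of making a run reproducible, using a fixed seed and the nstart argument to keep the best of several random starts (the name kclust1_repro and the seed value are just illustrative):
set.seed(1234)  # arbitrary seed; any fixed value makes the run repeatable
kclust1_repro <- kmeans(penguin_q %>% select(-species) %>% scale(),
                        centers = 3,
                        nstart = 25)  # keep the best of 25 random starts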
The kclust1 object reports the mean (centroid) of each cluster and the cluster membership (1-3) of each observation. We also get the sum of squared deviations between each observation and its cluster mean (the within-cluster sum of squares). This is compared to the total sum of squares, which is the sum of squared deviations from the overall mean if there were only one cluster. In our example, about 72% of the total sum of squares is between clusters and the remainder (about 28%) is within clusters, so the three clusters account for almost three-quarters of the squared deviation in the data. The same kind of measure is used in linear regression, which minimizes the sum of squared deviations between the regression line and the observations.
kclust1
## K-means clustering with 3 clusters of sizes 123, 87, 132
##
## Cluster means:
## flipper_length_mm bill_length_mm bill_depth_mm body_mass_g
## 1 1.1571696 0.6562677 -1.0983711 1.0901639
## 2 -0.2857869 0.6600059 0.8157307 -0.3737654
## 3 -0.8899121 -1.0465260 0.4858415 -0.7694891
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 2 3
## [75] 2 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 2 3 2 3 3 3 3 3 3 3 2 3 2
## [112] 3 2 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [297] 3 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
## [334] 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 143.1502 112.9852 122.1477
## (between_SS / total_SS = 72.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
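The between/total ratio reported in the printout can be recomputed directly from the components listed above; a quick check:
# Proportion of the total sum of squares that is between clusters,
# matching the "between_SS / total_SS" line in the printout
kclust1$betweenss / kclust1$totss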
These data can be presented in an easy-to-use table with tidy:
tidy(kclust1) %>% kable(digits = 2)
flipper_length_mm | bill_length_mm | bill_depth_mm | body_mass_g | size | withinss | cluster
---|---|---|---|---|---|---
1.16 | 0.66 | -1.10 | 1.09 | 123 | 143.15 | 1
-0.29 | 0.66 | 0.82 | -0.37 | 87 | 112.99 | 2
-0.89 | -1.05 | 0.49 | -0.77 | 132 | 122.15 | 3
Categorical variables (species) can be added to the results using augment, as we have done with other models. We can then pick two variables and plot them along with the cluster identity and species. Here the colours show the cluster number and the shapes show the species. We can see that the clusters do not perfectly separate the observations by species, but the correspondence is strong. This means that the variables we selected are mostly, but not completely, sufficient to distinguish the three species. Remember that we used 4 variables for the clustering, not just the two shown here.
kc1 <- augment(kclust1, penguin_q)
ggplot(kc1, aes(x=bill_length_mm, y = bill_depth_mm, color=.cluster, shape=species)) +
geom_point() + theme_bw()
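As mentioned at the start of the chapter, cluster assignments can also be used as labels on an ordination. A minimal sketch, colouring the first two principal components of the same scaled variables by cluster (the names pca1 and pca_scores are just illustrative):
# Sketch: PCA on the scaled variables, points coloured by k-means cluster
pca1 <- prcomp(penguin_q %>% select(-species) %>% scale())
pca_scores <- as_tibble(pca1$x[, 1:2]) %>%
  mutate(cluster = factor(kclust1$cluster),
         species = penguin_q$species)
ggplot(pca_scores, aes(x = PC1, y = PC2, color = cluster, shape = species)) +
  geom_point() + theme_bw()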
You can add the center of each cluster to the plot, but since the clustering was done on data scaled to have mean 0 and standard deviation 1, you must be careful to scale the data here as well.
kc1 <- augment(kclust1, penguin_q)
ggplot(kc1, aes(x = scale(bill_length_mm),
y = scale(bill_depth_mm))) +
geom_point(aes(color = .cluster, shape = species)) +
geom_point(data = tidy(kclust1), aes(color = cluster),
size = 10, shape = "o", show.legend = FALSE) + theme_bw()
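An alternative, sketched below under the assumption that we keep the scaled matrix used for clustering, is to convert the centroids back to the original units with the centering and scaling values stored by scale(), and then plot everything on the unscaled axes (the names scaled_q and centres_raw are just illustrative):
# Sketch: un-scale the cluster centres so they can be drawn on the raw axes
scaled_q <- penguin_q %>% select(-species) %>% scale()
centres_raw <- tidy(kclust1) %>%
  mutate(bill_length_mm = bill_length_mm * attr(scaled_q, "scaled:scale")["bill_length_mm"] +
           attr(scaled_q, "scaled:center")["bill_length_mm"],
         bill_depth_mm = bill_depth_mm * attr(scaled_q, "scaled:scale")["bill_depth_mm"] +
           attr(scaled_q, "scaled:center")["bill_depth_mm"])
ggplot(kc1, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = .cluster, shape = species)) +
  geom_point(data = centres_raw, aes(color = cluster),
             size = 10, shape = "o", show.legend = FALSE) + theme_bw()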
We can make a table of the number of each species in each cluster using count. Adélie penguins make up most of cluster 3, with 24 observations appearing in cluster 2. Chinstrap penguins make up most of cluster 2, with 5 observations in cluster 3. All the Gentoo penguins are in cluster 1 and there are no other species of penguin in that cluster. (Remember that the cluster labels themselves are arbitrary, so a different run of kmeans could number the clusters differently.) This result should seem reasonable given all the previous exploration we have done on this data.
augment(kclust1, data = penguin_q) %>%
count(.cluster, species)
## # A tibble: 5 x 3
## .cluster species n
## <fct> <fct> <int>
## 1 1 Gentoo 123
## 2 2 Adelie 24
## 3 2 Chinstrap 63
## 4 3 Adelie 127
## 5 3 Chinstrap 5
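The same cross-tabulation is sometimes easier to read in a wide layout; a sketch using pivot_wider from tidyr, with absent cluster-species combinations filled with 0:
# Sketch: one row per cluster, one column per species
augment(kclust1, data = penguin_q) %>%
  count(.cluster, species) %>%
  pivot_wider(names_from = species, values_from = n, values_fill = 0)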
23.2 Example 2
Let’s take a look at a dataset with a lot more groups. The mpg dataset has several quantitative variables that might be useful for classifying cars (displ, cyl, cty, hwy). We can make a series of clusterings of these data and see how well they correspond to the class of the car (compact, subcompact, etc.).
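The code that created the four-cluster object kclust2 summarized below is not shown; a minimal sketch consistent with the calls that follow (the seed is arbitrary, and because kmeans uses random starting centres a fresh run may label the clusters differently than the table below):
# Sketch: 4 clusters on the scaled quantitative variables from mpg
set.seed(1234)  # arbitrary seed
kclust2 <- kmeans(mpg %>% select(displ, cyl, cty, hwy) %>% scale(),
                  centers = 4)
tidy(kclust2)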
## # A tibble: 4 x 7
## displ cyl cty hwy size withinss cluster
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct>
## 1 1.29 1.31 -1.01 -0.976 70 47.5 1
## 2 -0.974 -1.14 0.756 0.736 67 21.5 2
## 3 -1.32 -1.17 2.37 2.11 14 14.9 3
## 4 -0.0752 0.00914 -0.159 -0.127 83 61.8 4
Compare these clusters to the class of each car. Some clusters contain cars from many different classes.
augment(kclust2, mpg) %>% count(.cluster, class)
## # A tibble: 19 x 3
## .cluster class n
## <fct> <chr> <int>
## 1 1 2seater 5
## 2 1 midsize 2
## 3 1 pickup 20
## 4 1 subcompact 5
## 5 1 suv 38
## 6 2 compact 29
## 7 2 midsize 16
## 8 2 minivan 1
## 9 2 pickup 1
## 10 2 subcompact 14
## 11 2 suv 6
## 12 3 compact 5
## 13 3 subcompact 9
## 14 4 compact 13
## 15 4 midsize 23
## 16 4 minivan 10
## 17 4 pickup 12
## 18 4 subcompact 7
## 19 4 suv 18
If we use glance we get to see how the sums of squares are partitioned. I’ll add a fifth column that measures the proportion of the within-cluster sum of squares relative to the total.
glance(kclust2) %>% mutate(proportion_within = tot.withinss / totss)
## # A tibble: 1 x 5
## totss tot.withinss betweenss iter proportion_within
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 932 146. 786. 2 0.156
How can we decide how many clusters to make? Let’s compute the proportion of sums of squares accounted for by the clustering for 2, 3, 4, etc., clusters. We will use map from purrr to fit a clustering for each value of k, together with the broom functions tidy, glance, and augment.
kclusts2 <-
tibble(k = 1:9) %>%
mutate(
kclust = map(k, ~kmeans(mpg %>% select(displ, cyl, cty, hwy) %>% scale,
centers = .x)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, mpg)
)
That one object has a lot of information – the tidy and glance results on 9 clusterings plus the augmented data for each. Let’s extract them into 3 different tables to make the results easier to work with.
clusters2 <- kclusts2 %>% unnest(cols = c(tidied))
assignments2 <- kclusts2 %>% unnest(cols = c(augmented))
clusterings2 <- kclusts2 %>% unnest(cols = c(glanced))
We can score each clustering according to the sums of squares as follows:
clusterings2 %>% mutate(proportion_within = tot.withinss / totss) %>%
ggplot(aes(x = k, y = proportion_within)) +
geom_line() + geom_point() + theme_bw()
Seven clusters is slightly worse than 6, so that’s probably too many. Most of the decrease happens with 4 or 5 clusters, so that may be enough.
Let’s plot the clusters. I’ll show the number of cylinders using the point shape; there are only four distinct values, so that won’t be too many to look at.
p1 <- ggplot(assignments2, aes(x = cty, y = displ)) +
geom_point(aes(color = .cluster, shape=factor(cyl)), alpha = 0.8) +
facet_wrap(~ k) + theme_bw()
p1
We can try to make a faceted plot, showing car class and cluster together. There are 7 classes and 9 clusterings, so that would make 63 facets, which is a lot to look at. I’ll select just the clusterings with 4, 5, and 6 clusters.
p2 <- assignments2 %>% filter(k >= 4, k <= 6) %>%
ggplot(aes(x = cty, y = displ)) +
geom_point(aes(color = .cluster, shape = factor(cyl)), alpha = 0.8) +
facet_grid(k ~ class, scales = "free") + theme_bw()
p2
We can see that most classes of cars (except 2-seaters) get divided up into multiple clusters. So a clustering of these 4 quantitative variables does not separate out cars according to their class.
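We can check this numerically as well; a sketch cross-tabulating cluster and class for one of the clusterings (here k = 5):
# Sketch: counts of each car class within each cluster for the k = 5 clustering
assignments2 %>%
  filter(k == 5) %>%
  count(.cluster, class)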