k-means clustering

Andrew Irwin, a.irwin@dal.ca

2024-03-07

Plan

  • What is k-means clustering and why do we use it?

  • Demonstration

What is k-means?

  • A tool to place observations in discrete groups

  • Based on a distance calculated on several variables

  • You decide how many groups to use (comparing sums of squares across different choices gives guidance)

  • There are variations on the algorithm and other methods (not discussed in this course)
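The idea of grouping observations by distance can be sketched in a few lines of base R. This is a minimal illustration of Lloyd's algorithm, one of the simpler k-means variants; R's built-in kmeans() uses the Hartigan-Wong variant by default. The function name simple_kmeans() and the toy data are assumptions made up for this example.

```r
# A small sketch of Lloyd's algorithm, one variant of k-means.
# simple_kmeans() is a hypothetical helper written for illustration;
# it does not handle a cluster that loses all of its points.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each observation to its nearest center
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stopped changing
    cluster <- new_cluster
    # Move each center to the mean of the observations assigned to it
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data: two well-separated groups of 20 points in two variables
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)

# Cross-tabulate the clusters found against the groups used to simulate the data
table(cluster = fit$cluster, group = rep(c("A", "B"), each = 20))
```

The two steps inside the loop (assign each observation to the nearest center, then recompute the centers) are repeated until the assignments stop changing. For real analyses, kmeans() is the better choice.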

Sample data

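library(tidyverse)   # filter(), select(), arrange(), ggplot()
library(gapminder)   # the gapminder data set
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
gapminder_ss <- gapminder |>
  filter(year == 2007) |> select(-year)
gapminder_ss |> select(-country, -continent) |> arrange(desc(lifeExp)) |>
  kable() |> kable_styling(full_width = FALSE)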
lifeExp pop gdpPercap
82.603 127467972 31656.0681
82.208 6980412 39724.9787
81.757 301931 36180.7892
81.701 7554661 37506.4191
81.235 20434176 34435.3674
80.941 40448191 28821.0637
80.884 9031088 33859.7484
80.745 6426679 25523.2771
80.657 61083916 30470.0167
80.653 33390141 36319.2350
80.546 58147733 28569.7197
80.204 4115771 25185.0091
80.196 4627926 49357.1902
79.972 4553009 47143.1796
79.829 8199783 36126.4927
79.762 16570613 36797.9333
79.483 10706290 27538.4119
79.441 10392226 33692.6051
79.425 60776238 33203.2613
79.406 82400996 32170.3744
79.313 5238460 33207.0844
78.885 4109086 40675.9964
78.782 4133884 9645.0614
78.746 3942491 19328.7090
78.623 49044790 23348.1397
78.553 16284741 13171.6388
78.400 23174294 28718.2768
78.332 5468120 35278.4187
78.273 11416987 8948.1029
78.242 301139947 42951.6531
78.098 10642836 20509.6478
77.926 2009245 25768.2576
77.588 2505559 47306.9898
76.486 10228744 22833.3085
76.442 798094 7670.1226
76.423 3600523 5937.0295
76.384 3447496 10611.4630
76.195 108700891 11977.5750
75.748 4493312 14619.2227
75.640 3204897 22316.1929
75.635 708573 29796.0483
75.563 38518241 15389.9247
75.537 3242173 9809.1856
75.320 40301927 12779.3796
74.994 13755680 6873.2623
74.852 4552198 7446.2988
74.663 5447502 18678.3144
74.543 684736 9253.8961
74.249 85262356 2441.5764
74.241 24821286 12451.6558
74.143 19314747 4184.5481
74.002 10150265 9786.5347
73.952 6036914 12057.4993
73.923 10276158 7092.9230
73.747 26084662 11415.8057
73.422 4018332 3025.3498
73.338 9956108 18008.9444
73.005 7322858 10680.7928
72.961 1318683096 4959.1149
72.899 5675356 2749.3210
72.889 44227550 7006.5804
72.801 1250882 10956.9911
72.777 27601038 21654.8319
72.567 2780132 7320.8803
72.535 6053193 4519.4612
72.476 22276056 10808.4756
72.396 20378239 3970.0954
72.390 190010647 9065.8008
72.301 33333216 6223.3675
72.235 9319622 6025.3748
71.993 3921278 10461.0587
71.878 6939688 5728.3535
71.777 71158647 8458.2764
71.752 6667147 4172.8385
71.688 91077287 3190.4810
71.421 28674757 7408.9056
71.338 80264543 5581.1810
71.164 33757175 3820.1752
70.964 69453570 11605.7145
70.650 223547000 3540.6516
70.616 65068149 7458.3963
70.259 12572928 5186.0500
70.198 7483763 3548.3308
69.819 1056608 18008.5092
67.297 23301725 1593.0655
66.803 2874127 3095.7723
65.554 9119152 3822.1371
65.528 199579 1598.4351
65.483 169270617 2605.9476
65.152 710960 986.1479
64.698 1110396331 2452.2104
64.164 3270065 1803.1515
64.062 150448339 1391.2538
63.785 28901790 1091.3598
63.062 12267493 1712.4721
62.698 22211743 2280.7699
62.069 47761980 944.0000
60.916 8502814 1201.6372
60.022 22873338 1327.6089
59.723 14131858 1713.7787
59.545 27499638 4471.0619
59.448 1688359 752.7497
59.443 19167654 1044.7701
58.556 42292929 2602.3950
58.420 5701579 882.9699
58.040 4906585 641.3695
56.867 12894865 619.6769
56.735 1454867 13206.4845
56.728 8078314 1441.2849
56.007 9947814 942.6542
55.322 3800610 3632.5578
54.791 496374 2082.4816
54.467 12031795 1042.5816
54.110 35610177 1463.2493
52.947 76511887 690.8056
52.906 2055080 4811.0604
52.517 38139640 1107.4822
52.295 14326203 1217.0330
51.579 551201 12154.0897
51.542 29170398 1056.3801
50.728 1639131 12569.8518
50.651 10238807 1704.0637
50.430 17696293 2042.0952
49.580 8390505 430.0707
49.339 43997828 9269.6578
48.328 18013409 1544.7501
48.303 13327079 759.3499
48.159 9118773 926.1411
46.859 135031164 2013.9773
46.462 64606759 277.5519
46.388 1472041 579.2317
46.242 8860588 863.0885
45.678 3193942 414.5073
44.741 4369038 706.0165
43.828 31889923 974.5803
43.487 12311143 469.7093
42.731 12420476 4797.2313
42.592 2012649 1569.3314
42.568 6144562 862.5408
42.384 11746035 1271.2116
42.082 19951656 823.6856
39.613 1133066 4513.4806

Distribution of data

Data preprocessing

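  • Goal: each variable approximately normally distributed, with the same variance (log-transform the skewed variables, then scale all of them)
gapminder_scaled <- gapminder_ss |> 
  # scale() returns a one-column matrix; as.numeric() keeps plain numeric columns
  mutate(s_logPop = as.numeric(scale(log10(pop))), 
         s_logGDP = as.numeric(scale(log10(gdpPercap))),
         s_lifeExp = as.numeric(scale(lifeExp)))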
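To see why the transformation and scaling matter, it can help to ask how much each variable contributes to the distance between two observations. The sketch below uses three rows copied from the sample data table; the helper sq_share() is a hypothetical name introduced here. Scaling only three rows is purely illustrative, since in practice the scaling uses all observations.

```r
# Three rows copied from the sample data table (lifeExp, pop, gdpPercap)
ex <- data.frame(
  lifeExp   = c(82.603, 82.208, 72.961),
  pop       = c(127467972, 6980412, 1318683096),
  gdpPercap = c(31656.0681, 39724.9787, 4959.1149)
)

# Hypothetical helper: share of the squared Euclidean distance between
# rows i and j of matrix m that comes from each variable
sq_share <- function(m, i, j) {
  d2 <- (m[i, ] - m[j, ])^2
  d2 / sum(d2)
}

m_raw    <- as.matrix(ex)
m_log    <- cbind(lifeExp = ex$lifeExp,
                  logPop  = log10(ex$pop),
                  logGDP  = log10(ex$gdpPercap))
m_scaled <- scale(m_log)  # mean 0, standard deviation 1 in each column

round(sq_share(m_raw, 1, 3), 3)     # raw units: population dominates
round(sq_share(m_log, 1, 3), 3)     # after log10: life expectancy dominates
round(sq_share(m_scaled, 1, 3), 3)  # after scaling: no single variable dominates
```

Without preprocessing, the clusters would be determined almost entirely by whichever variable has the largest numeric range.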

Perform k-means

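library(broom)  # tidy(), augment()
# kmeans() starts from random centers, so cluster numbers and membership
# can change between runs; call set.seed() first for a reproducible result.
kclust1 <- kmeans(gapminder_scaled |>
                    select(s_lifeExp, s_logGDP, s_logPop),
                  centers = 5)
tidy(kclust1) |> kable()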
s_lifeExp s_logGDP s_logPop size withinss cluster
0.5116874 0.1286984 -0.5512334 29 14.49961 1
0.5939644 0.4625594 1.2265903 30 28.33482 2
0.9774023 1.2253082 -0.4343692 29 14.26527 3
-1.1522049 -1.1373776 0.2695103 41 36.88434 4
-1.0586255 -0.5008071 -1.4819352 13 16.45012 5

Display clustering

kclust1 |> augment(gapminder_scaled) |>
  ggplot(aes(y = lifeExp, x = log10(pop), color = .cluster, shape = continent)) + 
  geom_point(size=2) + theme(aspect.ratio = .8)

How good is the clustering?
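One common way to approach this question is to fit k-means for a range of cluster counts and compare the total within-cluster sum of squares. The sketch below is not necessarily how the original slide's figure was made; it assumes the gapminder package is installed, repeats the preprocessing in base R, and uses an arbitrary seed and an arbitrary range of k.

```r
library(gapminder)

# Same preprocessing as in the slides: 2007 only, log10 of pop and
# gdpPercap, then scale each variable to mean 0 and standard deviation 1
g2007 <- subset(gapminder, year == 2007)
X <- scale(cbind(lifeExp = g2007$lifeExp,
                 logGDP  = log10(g2007$gdpPercap),
                 logPop  = log10(g2007$pop)))

set.seed(1)  # arbitrary seed so the random starts are repeatable
ks <- 1:10
# nstart = 20 reruns kmeans() from 20 random starts and keeps the best fit
tot_withinss <- sapply(ks, function(k) {
  kmeans(X, centers = k, nstart = 20)$tot.withinss
})

plot(ks, tot_withinss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The total within-cluster sum of squares generally falls as k increases. A value of k beyond which the curve flattens is a reasonable candidate for the number of clusters.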

Second example: penguins

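library(palmerpenguins)  # the penguins data set
# Drop rows with missing values, then scale each measurement.
# scale() returns a one-column matrix; as.numeric() keeps plain numeric columns.
penguins_scaled <- penguins |> na.omit() |>
  mutate(s_flipper_length = as.numeric(scale(flipper_length_mm)), 
         s_bill_length = as.numeric(scale(bill_length_mm)),
         s_bill_depth = as.numeric(scale(bill_depth_mm)),
         s_body_mass = as.numeric(scale(body_mass_g)))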

Perform k-means

set.seed(1)
kclust1 <- kmeans(penguins_scaled |>
                    select(s_flipper_length, s_body_mass, s_bill_length, s_bill_depth),
                  centers = 3)
tidy(kclust1) |> kable(digits = 3)
s_flipper_length s_body_mass s_bill_length s_bill_depth size withinss cluster
-0.289 -0.384 0.671 0.804 85 109.481 1
1.161 1.100 0.654 -1.101 119 139.468 2
-0.880 -0.762 -1.045 0.486 129 120.703 3

Display clustering

kclust1 |> augment(penguins_scaled) |>
  ggplot(aes(y = s_bill_depth, x = s_bill_length, color = .cluster, shape = species)) + 
  geom_point() + theme(aspect.ratio = .8)

Alternate display

kclust1 |> augment(penguins_scaled) |>
  unite("spp_clust", species, .cluster) |>
  ggplot(aes(y = s_bill_depth, x = s_bill_length, color = spp_clust)) + 
  geom_point() + theme(aspect.ratio = .8)

How good is the clustering?
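Because the penguins come with a known species label, one simple check is to cross-tabulate species against cluster. This is a sketch in base R, assuming the palmerpenguins package is installed; it repeats the preprocessing so that it runs on its own, and the exact counts depend on the random starting centers.

```r
library(palmerpenguins)

# Drop rows with missing values and scale the four measurements
peng <- na.omit(penguins)
X <- scale(peng[, c("flipper_length_mm", "body_mass_g",
                    "bill_length_mm", "bill_depth_mm")])

set.seed(1)
fit <- kmeans(X, centers = 3)

# Rows are species, columns are cluster numbers; a good clustering puts
# most of each species in a single column
tab <- table(species = peng$species, cluster = fit$cluster)
tab
```

Cluster numbers are arbitrary labels, so what matters is whether each species falls mostly in one column, not which column it is.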

Four species of penguins?

When to use k-means?

  • When the goal is to make categories from continuous variables

  • Each observation is placed in a cluster

  • Based on distance between observations, calculated across all variables

    • Usually best to scale (and maybe transform in other ways) each variable

  • Select the number of clusters by looking at how the total “sum of squares” is partitioned into within-cluster and between-cluster parts
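The partitioning of sums of squares mentioned in the last point can be read directly from the object returned by kmeans(). The sketch below uses simulated toy data (an assumption for illustration, not data from this course) and the totss, tot.withinss, and betweenss components of the fitted object.

```r
# Toy data: two simulated groups of 30 points in two variables, then scaled
set.seed(1)
toy <- scale(rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                   matrix(rnorm(60, mean = 4), ncol = 2)))
fit <- kmeans(toy, centers = 2, nstart = 10)

# The total sum of squares splits into a within-cluster part
# and a between-cluster part
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Fraction of the total sum of squares accounted for by the grouping
fit$betweenss / fit$totss
```

A larger between-cluster fraction means the clusters account for more of the variation, but the fraction always grows as clusters are added, so it is most useful for comparing the gain from one value of k to the next.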

Further reading

Task

  • Practice k-means