class: center, middle, inverse, title-slide

# Data Visualization
## k-means clustering
### Andrew Irwin, a.irwin@dal.ca
### Math & Stats, Dalhousie University ### 2021-03-05 (updated: 2021-02-19) --- class: middle # Plan * What is k-means clustering and why do we use it? * Demonstration --- class: middle ### What is k-means? * A tool to place observations in discrete groups * Based on a distance calculated on several variables * You decide how many groups to use (but there is guidance) * There are variations on the algorithm and other methods (not discussed in this course) --- ## Sample data ```r gapminder_ss <- gapminder %>% filter(year == 2007) %>% select(-year) gapminder_ss %>% select(-country, -continent) %>% arrange(-lifeExp) %>% kable() %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 82.603 </td> <td style="text-align:right;"> 127467972 </td> <td style="text-align:right;"> 31656.0681 </td> </tr> <tr> <td style="text-align:right;"> 82.208 </td> <td style="text-align:right;"> 6980412 </td> <td style="text-align:right;"> 39724.9787 </td> </tr> <tr> <td style="text-align:right;"> 81.757 </td> <td style="text-align:right;"> 301931 </td> <td style="text-align:right;"> 36180.7892 </td> </tr> <tr> <td style="text-align:right;"> 81.701 </td> <td style="text-align:right;"> 7554661 </td> <td style="text-align:right;"> 37506.4191 </td> </tr> <tr> <td style="text-align:right;"> 81.235 </td> <td style="text-align:right;"> 20434176 </td> <td style="text-align:right;"> 34435.3674 </td> </tr> <tr> <td style="text-align:right;"> 80.941 </td> <td style="text-align:right;"> 40448191 </td> <td style="text-align:right;"> 28821.0637 </td> </tr> <tr> <td style="text-align:right;"> 80.884 </td> <td style="text-align:right;"> 9031088 </td> <td style="text-align:right;"> 33859.7484 </td> </tr> <tr> <td 
style="text-align:right;"> 80.745 </td> <td style="text-align:right;"> 6426679 </td> <td style="text-align:right;"> 25523.2771 </td> </tr> <tr> <td style="text-align:right;"> 80.657 </td> <td style="text-align:right;"> 61083916 </td> <td style="text-align:right;"> 30470.0167 </td> </tr> <tr> <td style="text-align:right;"> 80.653 </td> <td style="text-align:right;"> 33390141 </td> <td style="text-align:right;"> 36319.2350 </td> </tr> <tr> <td style="text-align:right;"> 80.546 </td> <td style="text-align:right;"> 58147733 </td> <td style="text-align:right;"> 28569.7197 </td> </tr> <tr> <td style="text-align:right;"> 80.204 </td> <td style="text-align:right;"> 4115771 </td> <td style="text-align:right;"> 25185.0091 </td> </tr> <tr> <td style="text-align:right;"> 80.196 </td> <td style="text-align:right;"> 4627926 </td> <td style="text-align:right;"> 49357.1902 </td> </tr> <tr> <td style="text-align:right;"> 79.972 </td> <td style="text-align:right;"> 4553009 </td> <td style="text-align:right;"> 47143.1796 </td> </tr> <tr> <td style="text-align:right;"> 79.829 </td> <td style="text-align:right;"> 8199783 </td> <td style="text-align:right;"> 36126.4927 </td> </tr> <tr> <td style="text-align:right;"> 79.762 </td> <td style="text-align:right;"> 16570613 </td> <td style="text-align:right;"> 36797.9333 </td> </tr> <tr> <td style="text-align:right;"> 79.483 </td> <td style="text-align:right;"> 10706290 </td> <td style="text-align:right;"> 27538.4119 </td> </tr> <tr> <td style="text-align:right;"> 79.441 </td> <td style="text-align:right;"> 10392226 </td> <td style="text-align:right;"> 33692.6051 </td> </tr> <tr> <td style="text-align:right;"> 79.425 </td> <td style="text-align:right;"> 60776238 </td> <td style="text-align:right;"> 33203.2613 </td> </tr> <tr> <td style="text-align:right;"> 79.406 </td> <td style="text-align:right;"> 82400996 </td> <td style="text-align:right;"> 32170.3744 </td> </tr> <tr> <td style="text-align:right;"> 79.313 </td> <td 
style="text-align:right;"> 5238460 </td> <td style="text-align:right;"> 33207.0844 </td> </tr> <tr> <td style="text-align:right;"> 78.885 </td> <td style="text-align:right;"> 4109086 </td> <td style="text-align:right;"> 40675.9964 </td> </tr> <tr> <td style="text-align:right;"> 78.782 </td> <td style="text-align:right;"> 4133884 </td> <td style="text-align:right;"> 9645.0614 </td> </tr> <tr> <td style="text-align:right;"> 78.746 </td> <td style="text-align:right;"> 3942491 </td> <td style="text-align:right;"> 19328.7090 </td> </tr> <tr> <td style="text-align:right;"> 78.623 </td> <td style="text-align:right;"> 49044790 </td> <td style="text-align:right;"> 23348.1397 </td> </tr> <tr> <td style="text-align:right;"> 78.553 </td> <td style="text-align:right;"> 16284741 </td> <td style="text-align:right;"> 13171.6388 </td> </tr> <tr> <td style="text-align:right;"> 78.400 </td> <td style="text-align:right;"> 23174294 </td> <td style="text-align:right;"> 28718.2768 </td> </tr> <tr> <td style="text-align:right;"> 78.332 </td> <td style="text-align:right;"> 5468120 </td> <td style="text-align:right;"> 35278.4187 </td> </tr> <tr> <td style="text-align:right;"> 78.273 </td> <td style="text-align:right;"> 11416987 </td> <td style="text-align:right;"> 8948.1029 </td> </tr> <tr> <td style="text-align:right;"> 78.242 </td> <td style="text-align:right;"> 301139947 </td> <td style="text-align:right;"> 42951.6531 </td> </tr> <tr> <td style="text-align:right;"> 78.098 </td> <td style="text-align:right;"> 10642836 </td> <td style="text-align:right;"> 20509.6478 </td> </tr> <tr> <td style="text-align:right;"> 77.926 </td> <td style="text-align:right;"> 2009245 </td> <td style="text-align:right;"> 25768.2576 </td> </tr> <tr> <td style="text-align:right;"> 77.588 </td> <td style="text-align:right;"> 2505559 </td> <td style="text-align:right;"> 47306.9898 </td> </tr> <tr> <td style="text-align:right;"> 76.486 </td> <td style="text-align:right;"> 10228744 </td> <td 
style="text-align:right;"> 22833.3085 </td> </tr> <tr> <td style="text-align:right;"> 76.442 </td> <td style="text-align:right;"> 798094 </td> <td style="text-align:right;"> 7670.1226 </td> </tr> <tr> <td style="text-align:right;"> 76.423 </td> <td style="text-align:right;"> 3600523 </td> <td style="text-align:right;"> 5937.0295 </td> </tr> <tr> <td style="text-align:right;"> 76.384 </td> <td style="text-align:right;"> 3447496 </td> <td style="text-align:right;"> 10611.4630 </td> </tr> <tr> <td style="text-align:right;"> 76.195 </td> <td style="text-align:right;"> 108700891 </td> <td style="text-align:right;"> 11977.5750 </td> </tr> <tr> <td style="text-align:right;"> 75.748 </td> <td style="text-align:right;"> 4493312 </td> <td style="text-align:right;"> 14619.2227 </td> </tr> <tr> <td style="text-align:right;"> 75.640 </td> <td style="text-align:right;"> 3204897 </td> <td style="text-align:right;"> 22316.1929 </td> </tr> <tr> <td style="text-align:right;"> 75.635 </td> <td style="text-align:right;"> 708573 </td> <td style="text-align:right;"> 29796.0483 </td> </tr> <tr> <td style="text-align:right;"> 75.563 </td> <td style="text-align:right;"> 38518241 </td> <td style="text-align:right;"> 15389.9247 </td> </tr> <tr> <td style="text-align:right;"> 75.537 </td> <td style="text-align:right;"> 3242173 </td> <td style="text-align:right;"> 9809.1856 </td> </tr> <tr> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.3796 </td> </tr> <tr> <td style="text-align:right;"> 74.994 </td> <td style="text-align:right;"> 13755680 </td> <td style="text-align:right;"> 6873.2623 </td> </tr> <tr> <td style="text-align:right;"> 74.852 </td> <td style="text-align:right;"> 4552198 </td> <td style="text-align:right;"> 7446.2988 </td> </tr> <tr> <td style="text-align:right;"> 74.663 </td> <td style="text-align:right;"> 5447502 </td> <td style="text-align:right;"> 18678.3144 </td> </tr> <tr> <td 
style="text-align:right;"> 74.543 </td> <td style="text-align:right;"> 684736 </td> <td style="text-align:right;"> 9253.8961 </td> </tr> <tr> <td style="text-align:right;"> 74.249 </td> <td style="text-align:right;"> 85262356 </td> <td style="text-align:right;"> 2441.5764 </td> </tr> <tr> <td style="text-align:right;"> 74.241 </td> <td style="text-align:right;"> 24821286 </td> <td style="text-align:right;"> 12451.6558 </td> </tr> <tr> <td style="text-align:right;"> 74.143 </td> <td style="text-align:right;"> 19314747 </td> <td style="text-align:right;"> 4184.5481 </td> </tr> <tr> <td style="text-align:right;"> 74.002 </td> <td style="text-align:right;"> 10150265 </td> <td style="text-align:right;"> 9786.5347 </td> </tr> <tr> <td style="text-align:right;"> 73.952 </td> <td style="text-align:right;"> 6036914 </td> <td style="text-align:right;"> 12057.4993 </td> </tr> <tr> <td style="text-align:right;"> 73.923 </td> <td style="text-align:right;"> 10276158 </td> <td style="text-align:right;"> 7092.9230 </td> </tr> <tr> <td style="text-align:right;"> 73.747 </td> <td style="text-align:right;"> 26084662 </td> <td style="text-align:right;"> 11415.8057 </td> </tr> <tr> <td style="text-align:right;"> 73.422 </td> <td style="text-align:right;"> 4018332 </td> <td style="text-align:right;"> 3025.3498 </td> </tr> <tr> <td style="text-align:right;"> 73.338 </td> <td style="text-align:right;"> 9956108 </td> <td style="text-align:right;"> 18008.9444 </td> </tr> <tr> <td style="text-align:right;"> 73.005 </td> <td style="text-align:right;"> 7322858 </td> <td style="text-align:right;"> 10680.7928 </td> </tr> <tr> <td style="text-align:right;"> 72.961 </td> <td style="text-align:right;"> 1318683096 </td> <td style="text-align:right;"> 4959.1149 </td> </tr> <tr> <td style="text-align:right;"> 72.899 </td> <td style="text-align:right;"> 5675356 </td> <td style="text-align:right;"> 2749.3210 </td> </tr> <tr> <td style="text-align:right;"> 72.889 </td> <td style="text-align:right;"> 
44227550 </td> <td style="text-align:right;"> 7006.5804 </td> </tr> <tr> <td style="text-align:right;"> 72.801 </td> <td style="text-align:right;"> 1250882 </td> <td style="text-align:right;"> 10956.9911 </td> </tr> <tr> <td style="text-align:right;"> 72.777 </td> <td style="text-align:right;"> 27601038 </td> <td style="text-align:right;"> 21654.8319 </td> </tr> <tr> <td style="text-align:right;"> 72.567 </td> <td style="text-align:right;"> 2780132 </td> <td style="text-align:right;"> 7320.8803 </td> </tr> <tr> <td style="text-align:right;"> 72.535 </td> <td style="text-align:right;"> 6053193 </td> <td style="text-align:right;"> 4519.4612 </td> </tr> <tr> <td style="text-align:right;"> 72.476 </td> <td style="text-align:right;"> 22276056 </td> <td style="text-align:right;"> 10808.4756 </td> </tr> <tr> <td style="text-align:right;"> 72.396 </td> <td style="text-align:right;"> 20378239 </td> <td style="text-align:right;"> 3970.0954 </td> </tr> <tr> <td style="text-align:right;"> 72.390 </td> <td style="text-align:right;"> 190010647 </td> <td style="text-align:right;"> 9065.8008 </td> </tr> <tr> <td style="text-align:right;"> 72.301 </td> <td style="text-align:right;"> 33333216 </td> <td style="text-align:right;"> 6223.3675 </td> </tr> <tr> <td style="text-align:right;"> 72.235 </td> <td style="text-align:right;"> 9319622 </td> <td style="text-align:right;"> 6025.3748 </td> </tr> <tr> <td style="text-align:right;"> 71.993 </td> <td style="text-align:right;"> 3921278 </td> <td style="text-align:right;"> 10461.0587 </td> </tr> <tr> <td style="text-align:right;"> 71.878 </td> <td style="text-align:right;"> 6939688 </td> <td style="text-align:right;"> 5728.3535 </td> </tr> <tr> <td style="text-align:right;"> 71.777 </td> <td style="text-align:right;"> 71158647 </td> <td style="text-align:right;"> 8458.2764 </td> </tr> <tr> <td style="text-align:right;"> 71.752 </td> <td style="text-align:right;"> 6667147 </td> <td style="text-align:right;"> 4172.8385 </td> </tr> <tr> <td 
style="text-align:right;"> 71.688 </td> <td style="text-align:right;"> 91077287 </td> <td style="text-align:right;"> 3190.4810 </td> </tr> <tr> <td style="text-align:right;"> 71.421 </td> <td style="text-align:right;"> 28674757 </td> <td style="text-align:right;"> 7408.9056 </td> </tr> <tr> <td style="text-align:right;"> 71.338 </td> <td style="text-align:right;"> 80264543 </td> <td style="text-align:right;"> 5581.1810 </td> </tr> <tr> <td style="text-align:right;"> 71.164 </td> <td style="text-align:right;"> 33757175 </td> <td style="text-align:right;"> 3820.1752 </td> </tr> <tr> <td style="text-align:right;"> 70.964 </td> <td style="text-align:right;"> 69453570 </td> <td style="text-align:right;"> 11605.7145 </td> </tr> <tr> <td style="text-align:right;"> 70.650 </td> <td style="text-align:right;"> 223547000 </td> <td style="text-align:right;"> 3540.6516 </td> </tr> <tr> <td style="text-align:right;"> 70.616 </td> <td style="text-align:right;"> 65068149 </td> <td style="text-align:right;"> 7458.3963 </td> </tr> <tr> <td style="text-align:right;"> 70.259 </td> <td style="text-align:right;"> 12572928 </td> <td style="text-align:right;"> 5186.0500 </td> </tr> <tr> <td style="text-align:right;"> 70.198 </td> <td style="text-align:right;"> 7483763 </td> <td style="text-align:right;"> 3548.3308 </td> </tr> <tr> <td style="text-align:right;"> 69.819 </td> <td style="text-align:right;"> 1056608 </td> <td style="text-align:right;"> 18008.5092 </td> </tr> <tr> <td style="text-align:right;"> 67.297 </td> <td style="text-align:right;"> 23301725 </td> <td style="text-align:right;"> 1593.0655 </td> </tr> <tr> <td style="text-align:right;"> 66.803 </td> <td style="text-align:right;"> 2874127 </td> <td style="text-align:right;"> 3095.7723 </td> </tr> <tr> <td style="text-align:right;"> 65.554 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> 3822.1371 </td> </tr> <tr> <td style="text-align:right;"> 65.528 </td> <td style="text-align:right;"> 
199579 </td> <td style="text-align:right;"> 1598.4351 </td> </tr> <tr> <td style="text-align:right;"> 65.483 </td> <td style="text-align:right;"> 169270617 </td> <td style="text-align:right;"> 2605.9476 </td> </tr> <tr> <td style="text-align:right;"> 65.152 </td> <td style="text-align:right;"> 710960 </td> <td style="text-align:right;"> 986.1479 </td> </tr> <tr> <td style="text-align:right;"> 64.698 </td> <td style="text-align:right;"> 1110396331 </td> <td style="text-align:right;"> 2452.2104 </td> </tr> <tr> <td style="text-align:right;"> 64.164 </td> <td style="text-align:right;"> 3270065 </td> <td style="text-align:right;"> 1803.1515 </td> </tr> <tr> <td style="text-align:right;"> 64.062 </td> <td style="text-align:right;"> 150448339 </td> <td style="text-align:right;"> 1391.2538 </td> </tr> <tr> <td style="text-align:right;"> 63.785 </td> <td style="text-align:right;"> 28901790 </td> <td style="text-align:right;"> 1091.3598 </td> </tr> <tr> <td style="text-align:right;"> 63.062 </td> <td style="text-align:right;"> 12267493 </td> <td style="text-align:right;"> 1712.4721 </td> </tr> <tr> <td style="text-align:right;"> 62.698 </td> <td style="text-align:right;"> 22211743 </td> <td style="text-align:right;"> 2280.7699 </td> </tr> <tr> <td style="text-align:right;"> 62.069 </td> <td style="text-align:right;"> 47761980 </td> <td style="text-align:right;"> 944.0000 </td> </tr> <tr> <td style="text-align:right;"> 60.916 </td> <td style="text-align:right;"> 8502814 </td> <td style="text-align:right;"> 1201.6372 </td> </tr> <tr> <td style="text-align:right;"> 60.022 </td> <td style="text-align:right;"> 22873338 </td> <td style="text-align:right;"> 1327.6089 </td> </tr> <tr> <td style="text-align:right;"> 59.723 </td> <td style="text-align:right;"> 14131858 </td> <td style="text-align:right;"> 1713.7787 </td> </tr> <tr> <td style="text-align:right;"> 59.545 </td> <td style="text-align:right;"> 27499638 </td> <td style="text-align:right;"> 4471.0619 </td> </tr> <tr> <td 
style="text-align:right;"> 59.448 </td> <td style="text-align:right;"> 1688359 </td> <td style="text-align:right;"> 752.7497 </td> </tr> <tr> <td style="text-align:right;"> 59.443 </td> <td style="text-align:right;"> 19167654 </td> <td style="text-align:right;"> 1044.7701 </td> </tr> <tr> <td style="text-align:right;"> 58.556 </td> <td style="text-align:right;"> 42292929 </td> <td style="text-align:right;"> 2602.3950 </td> </tr> <tr> <td style="text-align:right;"> 58.420 </td> <td style="text-align:right;"> 5701579 </td> <td style="text-align:right;"> 882.9699 </td> </tr> <tr> <td style="text-align:right;"> 58.040 </td> <td style="text-align:right;"> 4906585 </td> <td style="text-align:right;"> 641.3695 </td> </tr> <tr> <td style="text-align:right;"> 56.867 </td> <td style="text-align:right;"> 12894865 </td> <td style="text-align:right;"> 619.6769 </td> </tr> <tr> <td style="text-align:right;"> 56.735 </td> <td style="text-align:right;"> 1454867 </td> <td style="text-align:right;"> 13206.4845 </td> </tr> <tr> <td style="text-align:right;"> 56.728 </td> <td style="text-align:right;"> 8078314 </td> <td style="text-align:right;"> 1441.2849 </td> </tr> <tr> <td style="text-align:right;"> 56.007 </td> <td style="text-align:right;"> 9947814 </td> <td style="text-align:right;"> 942.6542 </td> </tr> <tr> <td style="text-align:right;"> 55.322 </td> <td style="text-align:right;"> 3800610 </td> <td style="text-align:right;"> 3632.5578 </td> </tr> <tr> <td style="text-align:right;"> 54.791 </td> <td style="text-align:right;"> 496374 </td> <td style="text-align:right;"> 2082.4816 </td> </tr> <tr> <td style="text-align:right;"> 54.467 </td> <td style="text-align:right;"> 12031795 </td> <td style="text-align:right;"> 1042.5816 </td> </tr> <tr> <td style="text-align:right;"> 54.110 </td> <td style="text-align:right;"> 35610177 </td> <td style="text-align:right;"> 1463.2493 </td> </tr> <tr> <td style="text-align:right;"> 52.947 </td> <td style="text-align:right;"> 76511887 </td> 
<td style="text-align:right;"> 690.8056 </td> </tr> <tr> <td style="text-align:right;"> 52.906 </td> <td style="text-align:right;"> 2055080 </td> <td style="text-align:right;"> 4811.0604 </td> </tr> <tr> <td style="text-align:right;"> 52.517 </td> <td style="text-align:right;"> 38139640 </td> <td style="text-align:right;"> 1107.4822 </td> </tr> <tr> <td style="text-align:right;"> 52.295 </td> <td style="text-align:right;"> 14326203 </td> <td style="text-align:right;"> 1217.0330 </td> </tr> <tr> <td style="text-align:right;"> 51.579 </td> <td style="text-align:right;"> 551201 </td> <td style="text-align:right;"> 12154.0897 </td> </tr> <tr> <td style="text-align:right;"> 51.542 </td> <td style="text-align:right;"> 29170398 </td> <td style="text-align:right;"> 1056.3801 </td> </tr> <tr> <td style="text-align:right;"> 50.728 </td> <td style="text-align:right;"> 1639131 </td> <td style="text-align:right;"> 12569.8518 </td> </tr> <tr> <td style="text-align:right;"> 50.651 </td> <td style="text-align:right;"> 10238807 </td> <td style="text-align:right;"> 1704.0637 </td> </tr> <tr> <td style="text-align:right;"> 50.430 </td> <td style="text-align:right;"> 17696293 </td> <td style="text-align:right;"> 2042.0952 </td> </tr> <tr> <td style="text-align:right;"> 49.580 </td> <td style="text-align:right;"> 8390505 </td> <td style="text-align:right;"> 430.0707 </td> </tr> <tr> <td style="text-align:right;"> 49.339 </td> <td style="text-align:right;"> 43997828 </td> <td style="text-align:right;"> 9269.6578 </td> </tr> <tr> <td style="text-align:right;"> 48.328 </td> <td style="text-align:right;"> 18013409 </td> <td style="text-align:right;"> 1544.7501 </td> </tr> <tr> <td style="text-align:right;"> 48.303 </td> <td style="text-align:right;"> 13327079 </td> <td style="text-align:right;"> 759.3499 </td> </tr> <tr> <td style="text-align:right;"> 48.159 </td> <td style="text-align:right;"> 9118773 </td> <td style="text-align:right;"> 926.1411 </td> </tr> <tr> <td 
style="text-align:right;"> 46.859 </td> <td style="text-align:right;"> 135031164 </td> <td style="text-align:right;"> 2013.9773 </td> </tr> <tr> <td style="text-align:right;"> 46.462 </td> <td style="text-align:right;"> 64606759 </td> <td style="text-align:right;"> 277.5519 </td> </tr> <tr> <td style="text-align:right;"> 46.388 </td> <td style="text-align:right;"> 1472041 </td> <td style="text-align:right;"> 579.2317 </td> </tr> <tr> <td style="text-align:right;"> 46.242 </td> <td style="text-align:right;"> 8860588 </td> <td style="text-align:right;"> 863.0885 </td> </tr> <tr> <td style="text-align:right;"> 45.678 </td> <td style="text-align:right;"> 3193942 </td> <td style="text-align:right;"> 414.5073 </td> </tr> <tr> <td style="text-align:right;"> 44.741 </td> <td style="text-align:right;"> 4369038 </td> <td style="text-align:right;"> 706.0165 </td> </tr> <tr> <td style="text-align:right;"> 43.828 </td> <td style="text-align:right;"> 31889923 </td> <td style="text-align:right;"> 974.5803 </td> </tr> <tr> <td style="text-align:right;"> 43.487 </td> <td style="text-align:right;"> 12311143 </td> <td style="text-align:right;"> 469.7093 </td> </tr> <tr> <td style="text-align:right;"> 42.731 </td> <td style="text-align:right;"> 12420476 </td> <td style="text-align:right;"> 4797.2313 </td> </tr> <tr> <td style="text-align:right;"> 42.592 </td> <td style="text-align:right;"> 2012649 </td> <td style="text-align:right;"> 1569.3314 </td> </tr> <tr> <td style="text-align:right;"> 42.568 </td> <td style="text-align:right;"> 6144562 </td> <td style="text-align:right;"> 862.5408 </td> </tr> <tr> <td style="text-align:right;"> 42.384 </td> <td style="text-align:right;"> 11746035 </td> <td style="text-align:right;"> 1271.2116 </td> </tr> <tr> <td style="text-align:right;"> 42.082 </td> <td style="text-align:right;"> 19951656 </td> <td style="text-align:right;"> 823.6856 </td> </tr> <tr> <td style="text-align:right;"> 39.613 </td> <td style="text-align:right;"> 1133066 </td> <td 
style="text-align:right;"> 4513.4806 </td> </tr>
</tbody>
</table>

---
class: middle

## Data preprocessing

* Goal: each variable approximately normally distributed, all with the same variance

```r
# Log-transform the skewed variables, then centre and scale each to unit variance
gapminder_scaled <- gapminder_ss %>%
  mutate(s_logPop = scale(log10(pop)),
         s_logGDP = scale(log10(gdpPercap)),
         s_lifeExp = scale(lifeExp))
```

<img src="21-k-means_files/figure-html/unnamed-chunk-3-1.png" width="70%" style="display: block; margin: auto;" />

---

## Perform k-means

```r
# Cluster on the three scaled variables; centers = 6 requests six groups.
# Starting centres are chosen at random, so cluster labels can differ between runs.
kclust1 <- kmeans(gapminder_scaled %>% select(s_lifeExp, s_logGDP, s_logPop),
                  centers = 6)
tidy(kclust1) %>% kable()
```

<table>
<thead>
<tr> <th style="text-align:right;"> s_lifeExp </th> <th style="text-align:right;"> s_logGDP </th> <th style="text-align:right;"> s_logPop </th> <th style="text-align:right;"> size </th> <th style="text-align:right;"> withinss </th> <th style="text-align:left;"> cluster </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.8071429 </td> <td style="text-align:right;"> 0.8637524 </td> <td style="text-align:right;"> -0.7117190 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 25.45758 </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 0.7392166 </td> <td style="text-align:right;"> 0.6496379 </td> <td style="text-align:right;"> 0.5756151 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 17.65497 </td> <td style="text-align:left;"> 2 </td> </tr>
<tr> <td style="text-align:right;"> -0.5553538 </td> <td style="text-align:right;"> -0.3687870 </td> <td style="text-align:right;"> -1.2065990 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 21.71864 </td> <td style="text-align:left;"> 3 </td> </tr>
<tr> <td style="text-align:right;"> -0.7254081 </td> <td style="text-align:right;"> -0.8996504 </td> <td style="text-align:right;"> 0.6648119 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 11.35126 </td> <td style="text-align:left;"> 4 </td> </tr>
<tr>
<td style="text-align:right;"> 0.3879030 </td> <td style="text-align:right;"> 0.0361519 </td> <td style="text-align:right;"> 1.7828359 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 10.20109 </td> <td style="text-align:left;"> 5 </td> </tr>
<tr> <td style="text-align:right;"> -1.5165304 </td> <td style="text-align:right;"> -1.3211358 </td> <td style="text-align:right;"> -0.1372148 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 16.19225 </td> <td style="text-align:left;"> 6 </td> </tr>
</tbody>
</table>

---

## Display clustering

```r
kclust1 %>%
  augment(gapminder_scaled) %>%
  ggplot(aes(y = lifeExp, x = log10(pop), color = .cluster, shape = continent)) +
  geom_point() +
  theme(aspect.ratio = .8)
```

<img src="21-k-means_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

## How good is the clustering?

<img src="21-k-means_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

# When to use k-means?

* When the goal is to make categories from continuous variables
* Each observation is placed in a cluster
* Based on distance between observations, calculated across all variables
* Usually best to scale (and maybe transform in other ways) each variable
* Select the number of clusters based on partitioning of "sums of squares"

---
class: middle

# Further reading

* Course notes

---
class: middle, inverse

## Task

* Practice k-means
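
---

## Sketch: sums of squares by number of clusters

The "sums of squares" guidance from the "When to use k-means?" slide can be illustrated with a short base R sketch. This is an illustration under stated assumptions, not the code behind the figure in these slides: `wss_by_k` is a hypothetical helper, and the built-in `USArrests` data stand in for the scaled gapminder variables.

```r
# Hypothetical helper (not from the slides): total within-cluster
# sum of squares for k = 1, ..., max_k clusters.
wss_by_k <- function(x, max_k = 8, nstart = 20) {
  sapply(seq_len(max_k), function(k) {
    # nstart random restarts; kmeans() keeps the best of them
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

set.seed(1)                 # k-means starts from random centres
x <- scale(USArrests)       # stand-in data (assumption), centred and scaled
wss <- wss_by_k(x)

# Columns are centred, so the total sum of squares is just sum(x^2)
total_ss <- sum(x^2)

# Share of the total sum of squares accounted for by the clustering
elbow <- data.frame(k = seq_along(wss),
                    within = wss,
                    explained = 1 - wss / total_ss)
print(elbow)
```

A common heuristic is to look for the value of `k` beyond which `within` stops dropping quickly. With the slide data, the same helper could presumably be applied to the three scaled columns of `gapminder_scaled`.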