class: center, middle, inverse, title-slide

# Data Visualization
## PCA: Principal Component Analysis
### Andrew Irwin, a.irwin@dal.ca
### Math & Stats, Dalhousie University
### 2021-03-01 (updated: 2021-03-09)

---
class: middle

# Plan

* What is PCA and why do we use it?
* Demonstration with gapminder data in 3 dimensions
* More demonstrations

---
class: middle

### What is PCA?

* A tool for simplifying data sets with many variables down to a small number
* Works best with quantitative variables that are partially correlated
* Usually used in combination with one or more categorical variables to identify subgroups
* Main results are
    * Spatial pattern of points
    * Directions of principal components along original axes (loadings)
    * Amount of variation along each PC axis
* Be careful with scaling and units

---
class: middle

### Example data

[Interactive version](plotly-1.html)

<img src="../static/plotly-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

### Pairs plot

```r
my_gapminder %>%
  select(-year) %>%
  ggpairs(aes(color = continent))
```

<img src="19-PCA_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

### PCA: 2007 data only

```r
pca1 <- prcomp(my_gapminder2 %>% select(-continent), scale=TRUE)
autoplot(pca1, data = my_gapminder2, loadings=TRUE, loadings.label = TRUE,
         colour = 'continent')
```

<img src="19-PCA_files/figure-html/unnamed-chunk-5-1.png" width="55%" style="display: block; margin: auto;" />

---
class: middle

### PCA: All years

```r
pca1 <- prcomp(my_gapminder %>% select(-continent, -year), scale=TRUE)
autoplot(pca1, data = my_gapminder, loadings=TRUE, loadings.label = TRUE,
         colour = 'continent')
```

<img src="19-PCA_files/figure-html/unnamed-chunk-6-1.png" width="55%" style="display: block; margin: auto;" />

---
class: middle

### Penguins

<img src="19-PCA_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

```r
pca2 %>% tidy(matrix="loadings") %>% filter(PC < 3) %>%
  pivot_wider(values_from="value", names_from="PC", names_prefix="PC_") %>%
  kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> column </th>
   <th style="text-align:right;"> PC_1 </th>
   <th style="text-align:right;"> PC_2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Culmen Length (mm) </td>
   <td style="text-align:right;"> 0.2871721 </td>
   <td style="text-align:right;"> 0.6602934 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Culmen Depth (mm) </td>
   <td style="text-align:right;"> -0.4102740 </td>
   <td style="text-align:right;"> 0.1879624 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Flipper Length (mm) </td>
   <td style="text-align:right;"> 0.5008652 </td>
   <td style="text-align:right;"> 0.2207240 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Body Mass (g) </td>
   <td style="text-align:right;"> 0.4846484 </td>
   <td style="text-align:right;"> 0.2033560 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Delta 15 N (o/oo) </td>
   <td style="text-align:right;"> -0.4084370 </td>
   <td style="text-align:right;"> 0.3686830 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Delta 13 C (o/oo) </td>
   <td style="text-align:right;"> -0.3108644 </td>
   <td style="text-align:right;"> 0.5501663 </td>
  </tr>
</tbody>
</table>

---
class: middle

# Summary

* PCA to reduce the number of dimensions (variables) in your data
* Interpreting loadings (arrows and numeric vectors)
* Interpreting proportion of variance along each principal component
* Consequences of scaling variables (or not scaling them)

---
class: middle

# Further reading

* Course notes

---
class: middle, inverse

## Task

* Bonus task: Practice the PCA skills in this lesson

---
class: middle

### Penguin PCA code

```r
my_penguins_raw <- penguins_raw %>%
  select(-`Sample Number`) %>%
  select(Species, where(is.numeric) ) %>%
  na.omit() %>%
  mutate(Species = str_remove(Species, "\\(.*\\)"))
pca2 <- my_penguins_raw %>%
  select(-Species) %>%
  prcomp(scale=TRUE)
autoplot(pca2, data = my_penguins_raw, loadings=TRUE, loadings.label = TRUE,
         loadings.label.colour = "black", loadings.colour = "black",
         colour = 'Species') + xlim(-0.15, 0.15) + ylim(-0.15, 0.15)
```
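---
class: middle

### Sketch: proportion of variance and scaling

The summary mentions the proportion of variance along each principal component and the consequences of scaling, but the slides do not show how to compute either. The sketch below is one possible way to do it, not the code used to produce the figures in this lesson. It assumes the built-in `USArrests` data (not one of the lesson's data sets) so that it runs with base R alone; `prop_var` is a helper name made up for this illustration.

```r
# Assumption: USArrests is a stand-in for the lesson's data sets. It is built
# into R and has four quantitative variables measured on very different scales.
pca_scaled   <- prcomp(USArrests, scale. = TRUE)   # each variable rescaled to unit variance
pca_unscaled <- prcomp(USArrests, scale. = FALSE)  # variables keep their original units

# Hypothetical helper: proportion of total variance along each principal component.
# prcomp() stores the standard deviation along each PC in $sdev.
prop_var <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

round(prop_var(pca_scaled), 3)
round(prop_var(pca_unscaled), 3)

# Loadings on PC1 without scaling: the variable with the largest variance dominates
round(pca_unscaled$rotation[, "PC1"], 3)
```

Comparing the two `prop_var()` results shows how much of the apparent structure comes from units alone: without scaling, PC1 mostly follows whichever variable has the largest numeric variance. The same proportions are also reported by `summary(pca_scaled)`.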