PCA: Principal Component Analysis

Andrew Irwin, a.irwin@dal.ca

2024-03-06

Plan

What is PCA and why do we use it?
Demonstration with gapminder data in 3 dimensions
More demonstrations

What is PCA?

A tool for reducing a data set with many variables to a small number of new variables (principal components)
Works best with quantitative variables that are partially correlated
Categorical variables are not used in the computation, but can label the points to help identify subgroups
Main results are
- Spatial pattern of points (with categorical labels)
- Directions of principal components along original axes (loadings)
- Amount of variation along each PC axis
Be careful with scaling and units: without scaling, variables with the largest variance dominate the result
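To see why scaling matters, here is a minimal sketch comparing a PCA with and without scaling. It uses the built-in USArrests data as a stand-in (an assumption; it is not one of the data sets used in these slides), because its variables have very different ranges.

```r
# Illustration only: USArrests ships with R and is not part of this lesson's data
X <- USArrests   # Murder, Assault, UrbanPop, Rape: different units and ranges

pca_raw    <- prcomp(X, scale. = FALSE)  # variables keep their original units
pca_scaled <- prcomp(X, scale. = TRUE)   # each variable rescaled to unit variance

# Proportion of total variance along each principal component
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)
round(prop_var(pca_raw), 3)
round(prop_var(pca_scaled), 3)

# Loadings of PC1: compare which variables carry the weight in each case
round(pca_raw$rotation[, 1], 3)
round(pca_scaled$rotation[, 1], 3)
```

In the unscaled version the variable with the largest variance (Assault) should carry almost all of the weight on PC1, while the scaled version spreads the weight across variables.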

Example data

Interactive version

Pairs plot

my_gapminder |> select(-year) |> ggpairs(aes(color = continent))

PCA: 2007 data only

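# continent is removed before the PCA and used only to colour the points
# autoplot() for prcomp objects is provided by the ggfortify package
pca1 <- prcomp(my_gapminder2 |> select(-continent), scale. = TRUE)
autoplot(pca1, data = my_gapminder2, loadings = TRUE, loadings.label = TRUE,
         colour = 'continent')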

Loadings

The rotation matrix gives the weight of each original variable in the computation of each principal component. Multiplying the centred and scaled data by this matrix rotates the data into PC coordinates.

tidy(pca1, matrix = "rotation") |>
  pivot_wider(names_from = column, values_from = value) |>
  kable() |>
  kable_styling(full_width = FALSE)

PC	lifeExp	logGDPpercap	logPop
1	-0.7075921	-0.7064117	-0.0172002
2	-0.0555353	0.0798613	-0.9952578
3	0.7044354	-0.7032813	-0.0957400
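One way to check this reading of the loadings: the PC scores stored in a prcomp result should equal the centred and scaled data multiplied by the rotation matrix. The sketch below uses the built-in iris measurements as a stand-in (an assumption, since the gapminder subset on this slide is defined elsewhere).

```r
# Illustration only: iris ships with R and stands in for the slide's data
X   <- as.matrix(iris[, 1:4])
pca <- prcomp(X, scale. = TRUE)

Z      <- scale(X)               # centre each column and scale to unit variance
scores <- Z %*% pca$rotation     # rotate the data into PC coordinates

# The hand-computed scores match the scores stored by prcomp()
all.equal(scores, pca$x, check.attributes = FALSE)

# Each column of the rotation matrix is a unit vector, and the columns are
# mutually orthogonal, so this product is (numerically) the identity matrix
round(crossprod(pca$rotation), 10)
```

The sign of each column of the rotation matrix is arbitrary, so a loading vector and its negative describe the same principal component.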

PCA: All years

pca1 <- prcomp(my_gapminder |> select(-continent, -year),
               scale. = TRUE)
autoplot(pca1, data = my_gapminder, loadings = TRUE,
         loadings.label = TRUE, colour = 'continent')

Penguins

pca2 |> tidy(matrix = "loadings") |>  # "loadings" is an alias for "rotation"
  filter(PC < 3) |>                   # keep the first two components
  pivot_wider(values_from = "value",
              names_from = "PC", names_prefix = "PC_") |>
  kable() |>
  kable_styling(full_width = FALSE)

column	PC_1	PC_2
Culmen Length (mm)	0.2871721	0.6602934
Culmen Depth (mm)	-0.4102740	0.1879624
Flipper Length (mm)	0.5008652	0.2207240
Body Mass (g)	0.4846484	0.2033560
Delta 15 N (o/oo)	-0.4084370	0.3686830
Delta 13 C (o/oo)	-0.3108644	0.5501663

Summary

Use PCA to reduce the number of dimensions (variables) in your data
Interpret loadings (arrows and numeric vectors)
Pay attention to the proportion of variance along each principal component
Consider the consequences of scaling variables (or not scaling them)
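The proportion of variance along each principal component can be computed from the standard deviations returned by prcomp(). This is a minimal sketch using the built-in iris measurements as a stand-in (an assumption, not this lesson's data); the 90% threshold is an arbitrary choice for illustration.

```r
# Illustration only: iris ships with R and stands in for the lesson's data
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance along each PC is the squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

data.frame(PC         = seq_along(var_explained),
           proportion = round(var_explained, 3),
           cumulative = round(cum_var, 3))

# summary() reports the same quantities
summary(pca)$importance

# Smallest number of PCs whose cumulative proportion reaches the threshold
n_keep <- which(cum_var >= 0.90)[1]
n_keep
```

If the first two or three components capture most of the variance, a 2D or 3D plot of the scores is a reasonable summary of the data.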

Task

Bonus task: Practice the PCA skills in this lesson

Penguin PCA code

See the course notes for more code examples.

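library(tidyverse)       # dplyr, stringr, ggplot2
library(palmerpenguins)  # assumed source of penguins_raw
library(ggfortify)       # autoplot() method for prcomp objects

# Keep Species plus the numeric measurements, drop incomplete rows,
# and strip the scientific name in parentheses from Species
my_penguins_raw <- penguins_raw |> select(-`Sample Number`) |>
  select(Species, where(is.numeric)) |> na.omit() |>
  mutate(Species = str_remove(Species, " \\(.*\\)"))
pca2 <- my_penguins_raw |> select(-Species) |> prcomp(scale. = TRUE)
autoplot(pca2, data = my_penguins_raw,
         loadings = TRUE, loadings.label = TRUE,
         loadings.label.colour = "black", loadings.colour = "black",
         colour = 'Species') + xlim(-0.15, 0.15) + ylim(-0.15, 0.15)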