class: center, middle, inverse, title-slide # Data Visualization ## Reshaping data ### Andrew Irwin,
a.irwin@dal.ca
### Math & Stats, Dalhousie University ### 2021-02-01 (updated: 2021-01-18) --- class: middle # Wide and Long data * What is tidy data? * What is the difference between wide and long data? * Why would you want data in long format? * Why would you want data in wide format? * Pivoting data --- class: middle, inverse # Tidy data * One observation per row * Each column is a different variable * There are many ways to organize data. A key idea behind tidy data is to standardize the way data are organized, to make working with data simpler --- class: middle ## Example 1 (tidy) <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> cases </th> <th style="text-align:right;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 745 </td> <td style="text-align:right;"> 19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 2666 </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 37737 </td> <td style="text-align:right;"> 172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 80488 </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 212258 </td> <td style="text-align:right;"> 1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 213766 </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle ## Example 2 (long) <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> type </th> <th style="text-align:right;"> count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 745 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 2666 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 37737 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 80488 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 212258 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 213766 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle ## Example 3 <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 745/19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 2666/20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 37737/172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 80488/174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 212258/1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 213766/1280428583 </td> </tr> </tbody> </table> --- class: middle ## Example 4 <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Cases</caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1999 </th> <th style="text-align:right;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 745 </td> <td style="text-align:right;"> 2666 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 37737 </td> <td style="text-align:right;"> 80488 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 212258 </td> <td style="text-align:right;"> 213766 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Population</caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1999 </th> <th style="text-align:right;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 19987071 </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 172006362 </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1272915272 </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle ## Where do data organization patterns come from? * The way data are collected (population from one database, cases from another) * The way data are recorded (one year at a time) * Plans for how data will be interpreted or stored * Concerns about readability (any data missing?) or compactness (storage, display) * Habits --- class: middle ## What is good about tidy data? * Easy to use `dplyr`: filter, group_by, summarize * Easy to use `ggplot`: aesthetic mappings (x, y, color, shape) and facets --- class: middle, inverse ## Pivoting ### Step by step examples of the transformations shown above --- class: middle ## Pivot longer .left-column[.tiny[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> cases </th> <th style="text-align:right;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 745 </td> <td style="text-align:right;"> 19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 2666 </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 37737 </td> <td style="text-align:right;"> 172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 80488 </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 212258 </td> <td style="text-align:right;"> 1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 213766 </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> ]] .right-column[.small[ ```r df1 %>% pivot_longer(cases:population, names_to = "type", values_to = "count") %>% kable() %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> type </th> <th style="text-align:right;"> count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 745 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 2666 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 37737 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 80488 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 212258 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> cases </td> <td style="text-align:right;"> 213766 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> population </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> ]] --- class: middle ## Unite, mutate ```r df1 %>% mutate(rate_fraction = cases/population) %>% unite("rate", cases:population, sep="/") %>% kable() %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> rate </th> <th style="text-align:right;"> rate_fraction </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 745/19987071 </td> <td style="text-align:right;"> 0.0000373 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 2666/20595360 </td> <td style="text-align:right;"> 0.0001294 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 37737/172006362 </td> <td style="text-align:right;"> 0.0002194 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 80488/174504898 </td> <td style="text-align:right;"> 0.0004612 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 212258/1272915272 </td> <td style="text-align:right;"> 0.0001667 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 213766/1280428583 </td> <td style="text-align:right;"> 0.0001669 </td> </tr> </tbody> </table> --- class: middle ## Separate ```r df1 %>% unite("rate", cases:population, sep="/") %>% separate(rate, into = c("cases", "population"), sep="/") %>% kable() %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> cases </th> <th style="text-align:left;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 745 </td> <td style="text-align:left;"> 19987071 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 2666 </td> <td style="text-align:left;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 37737 </td> <td style="text-align:left;"> 172006362 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 80488 </td> <td style="text-align:left;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> 212258 </td> <td style="text-align:left;"> 1272915272 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:left;"> 213766 </td> <td style="text-align:left;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle ## Make two tables ```r df1A <- df1 %>% select(-population) %>% pivot_wider(names_from = "year", values_from = "cases") df1A %>% kable(caption="Cases") %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Cases</caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1999 </th> <th style="text-align:right;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 745 </td> <td style="text-align:right;"> 2666 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 37737 </td> <td style="text-align:right;"> 80488 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 212258 </td> <td style="text-align:right;"> 213766 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Population</caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1999 </th> <th style="text-align:right;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 19987071 </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 172006362 </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1272915272 </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle ## Combine two tables into one ```r bind_rows(bind_cols(type = "cases", df1A), bind_cols(type = "population", df1B)) %>% kable() %>% kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> type </th> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1999 </th> <th style="text-align:right;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> cases </td> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 745 </td> <td style="text-align:right;"> 2666 </td> </tr> <tr> <td style="text-align:left;"> cases </td> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 37737 </td> <td style="text-align:right;"> 80488 </td> </tr> <tr> <td style="text-align:left;"> cases </td> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 212258 </td> <td style="text-align:right;"> 213766 </td> </tr> <tr> <td style="text-align:left;"> population </td> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 19987071 </td> <td style="text-align:right;"> 20595360 </td> </tr> <tr> <td style="text-align:left;"> population </td> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 172006362 </td> <td style="text-align:right;"> 174504898 </td> </tr> <tr> <td style="text-align:left;"> population </td> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 1272915272 </td> <td style="text-align:right;"> 1280428583 </td> </tr> </tbody> </table> --- class: middle # Summary * What and why of tidy data, longer vs wider format * Common reshaping tasks --- class: middle # Further reading * Tidy Data (Chapter 12) in R4DS --- class: middle, inverse ## Task Practice these reshaping skills in Task 7