class: center, middle, inverse, title-slide

# Data Visualization
## Text, Factors, Dates, and Times
### Andrew Irwin, a.irwin@dal.ca
### Math & Stats, Dalhousie University
### 2021-03-22 (updated: 2021-03-12)

---
class: middle

# Plan

* Tools to help you work with data that are not numbers
* Specialized functions for working with strings, factors and dates
* String packages `stringr`, `glue`, `unglue` and functions `str_squish`, `glue` and more
* Factor package `forcats` and functions `fct_*`
* Date and time package `lubridate` and functions `ymd`, `ymd_hms`, `yday`, `decimal_date` and more

---
class: middle

### Challenges that arise with text

* Extra spaces
* Upper and lower case differences
* Locales (non-English text); see [ragg](https://www.tidyverse.org/blog/2021/02/modern-text-features/) for plotting with non-Latin text
* Difference between factors and strings

---
class: middle

### Extra spaces

Data entered by a human in a web form or spreadsheet often contains inconspicuous extra spaces, for example after the last letter or two spaces between words. Humans easily ignore these when reading, but they create havoc for computers.

`str_squish` gets rid of these troublesome spaces.

```r
str_squish("  a crazy   sentence with  too many spaces   in strange places.  ")
```

```
## [1] "a crazy sentence with too many spaces in strange places."
```

---
class: middle

### Letter case

Upper and lower case differences are often not noticed by humans, but matter to a computer. One solution is to convert all text to upper, lower, or title case.
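A minimal sketch of why case matters, assuming a made-up vector `answers` (an illustrative name, not part of the course data). The same word typed four ways counts as four different values until the case is made uniform:

```r
library(stringr)

# Hypothetical survey answers: one word, four spellings
answers <- c("Apple", "apple", "APPLE", "aPpLe")

# As typed, R treats every spelling as a distinct value
n_raw <- length(unique(answers))

# After converting to lower case only one distinct value remains
n_lower <- length(unique(str_to_lower(answers)))

c(raw = n_raw, lower = n_lower)
```

Any of the three conversions would work here; what matters is applying the same one to every value before comparing or counting.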
```r
tibble(original = c("Apple", "apple", "APPLE", "aPpLe"),
       lower = str_to_lower(original),
       upper = str_to_upper(original),
       title = str_to_title(original)) %>%
  kable()
```

<table>
<thead>
<tr> <th style="text-align:left;"> original </th> <th style="text-align:left;"> lower </th> <th style="text-align:left;"> upper </th> <th style="text-align:left;"> title </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Apple </td> <td style="text-align:left;"> apple </td> <td style="text-align:left;"> APPLE </td> <td style="text-align:left;"> Apple </td> </tr>
<tr> <td style="text-align:left;"> apple </td> <td style="text-align:left;"> apple </td> <td style="text-align:left;"> APPLE </td> <td style="text-align:left;"> Apple </td> </tr>
<tr> <td style="text-align:left;"> APPLE </td> <td style="text-align:left;"> apple </td> <td style="text-align:left;"> APPLE </td> <td style="text-align:left;"> Apple </td> </tr>
<tr> <td style="text-align:left;"> aPpLe </td> <td style="text-align:left;"> apple </td> <td style="text-align:left;"> APPLE </td> <td style="text-align:left;"> Apple </td> </tr>
</tbody>
</table>

---

### Getting data in and out of strings

```r
tibble(name = c("Andrew", "Susan", "Yong"),
       age = c(12, 21, 35)) %>%
  mutate(sentence = glue("{name} is {age} years old.")) %>%
  kable()
```

<table>
<thead>
<tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sentence </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Andrew </td> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Andrew is 12 years old. </td> </tr>
<tr> <td style="text-align:left;"> Susan </td> <td style="text-align:right;"> 21 </td> <td style="text-align:left;"> Susan is 21 years old. </td> </tr>
<tr> <td style="text-align:left;"> Yong </td> <td style="text-align:right;"> 35 </td> <td style="text-align:left;"> Yong is 35 years old.
</td> </tr>
</tbody>
</table>

---

### Getting data in and out of strings

```r
# t0 is assumed to hold the table built on the previous slide
# (its definition is not shown on the slides)
t0 %>% select(sentence) %>%
  mutate(unglue_data(sentence, "{name} is {age} years old.")) %>%
  mutate(age = as.numeric(age),   # unglue_data returns text, so convert
         next_year = age + 1) %>%
  kable()
```

<table>
<thead>
<tr> <th style="text-align:left;"> sentence </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> age </th> <th style="text-align:right;"> next_year </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Andrew is 12 years old. </td> <td style="text-align:left;"> Andrew </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> Susan is 21 years old. </td> <td style="text-align:left;"> Susan </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 22 </td> </tr>
<tr> <td style="text-align:left;"> Yong is 35 years old. </td> <td style="text-align:left;"> Yong </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 36 </td> </tr>
</tbody>
</table>

---
class: middle

### Challenges that arise with factors

* Plot order on scales (axes, colour scale)
* Too many levels in a factor
* Mapping colours to specific levels (later lesson on colour)

---
class: middle

### Plot order

Here's a plot of mean penguin body mass by island and sex. What's the order?

```r
penguins %>%
  ggplot(aes(x = body_mass_g, y = island)) +
  stat_summary() +
  facet_wrap(~ sex) +
  theme_bw()
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-6-1.svg" width="90%" style="display: block; margin: auto;" />

---
class: middle

### Plot order

Order species from smallest to largest, top to bottom, with `fct_reorder`. Watch out for NAs.
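A minimal sketch of what the reordering does, using made-up toy vectors (`group` and `size` are illustrative names, not part of the penguins data). Rows with a missing value are dropped by hand first, which is one simple way to keep NAs out of the summary:

```r
library(forcats)

# Toy data (hypothetical): three groups, one missing measurement
group <- factor(c("a", "a", "b", "b", "c", "c"))
size  <- c(5, 7, 1, NA, 3, 3)

# Without reordering, levels are alphabetical
default_order <- levels(group)

# Keep only complete observations before reordering
keep <- !is.na(size)

# Levels sorted by the median of size within each group, smallest first
ascending <- levels(fct_reorder(group[keep], size[keep]))

# .desc = TRUE reverses the order, largest first
descending <- levels(fct_reorder(group[keep], size[keep], .desc = TRUE))

list(default = default_order, ascending = ascending, descending = descending)
```

On a vertical axis ggplot2 draws the first level at the bottom, so `.desc = TRUE` is what places the smallest group at the top.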
```r
penguins %>% na.omit() %>%
  ggplot(aes(x = bill_length_mm,
             y = fct_reorder(species, bill_length_mm, .desc = TRUE))) +
  stat_summary() +
  my_theme
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-7-1.svg" width="50%" style="display: block; margin: auto;" />

---
class: middle

### Custom order

List levels in `fct_relevel` to move them to the front; the remaining levels keep their existing order. Watch out for NAs.

```r
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = fct_relevel(species, "Gentoo", "Adelie"))) +
  stat_summary() +
  my_theme
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-8-1.svg" width="50%" style="display: block; margin: auto;" />

---
class: middle

### Challenges that arise with dates and times

* Date format
* Extracting components of date or time
* Formatting axes on plots
* Arithmetic with dates and times
* Time zones

---

### Converting text to dates

```r
tibble(date = c("01/02/03", "121006", "05/12/08", "11-03-21"),
       ymd = ymd(date),
       dmy = dmy(date),
       mdy = mdy(date)) %>%
  kable()
```

<table>
<thead>
<tr> <th style="text-align:left;"> date </th> <th style="text-align:left;"> ymd </th> <th style="text-align:left;"> dmy </th> <th style="text-align:left;"> mdy </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 01/02/03 </td> <td style="text-align:left;"> 2001-02-03 </td> <td style="text-align:left;"> 2003-02-01 </td> <td style="text-align:left;"> 2003-01-02 </td> </tr>
<tr> <td style="text-align:left;"> 121006 </td> <td style="text-align:left;"> 2012-10-06 </td> <td style="text-align:left;"> 2006-10-12 </td> <td style="text-align:left;"> 2006-12-10 </td> </tr>
<tr> <td style="text-align:left;"> 05/12/08 </td> <td style="text-align:left;"> 2005-12-08 </td> <td style="text-align:left;"> 2008-12-05 </td> <td style="text-align:left;"> 2008-05-12 </td> </tr>
<tr> <td style="text-align:left;"> 11-03-21 </td> <td style="text-align:left;"> 2011-03-21 </td> <td style="text-align:left;"> 2021-03-11 </td> <td style="text-align:left;"> 2021-11-03 </td> </tr>
</tbody>
</table>

---

### Dates and times

```r
tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12",
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) %>%
  ggplot(aes(y = dt, x = value)) +
  geom_point(size = 4) +
  my_theme
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-10-1.svg" width="50%" style="display: block; margin: auto;" />

---

### Dates and times

```r
tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12",
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) %>%
  ggplot(aes(y = dt, x = value)) +
  geom_point(size = 4) +
  scale_y_datetime(date_labels = "%H:%M") +
  my_theme
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-11-1.svg" width="50%" style="display: block; margin: auto;" />

---

### Time arithmetic

```r
tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12",
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) %>%
  mutate(elapsed = dt - min(dt)) %>%
  kable()
```

<table>
<thead>
<tr> <th style="text-align:left;"> date </th> <th style="text-align:left;"> dt </th> <th style="text-align:right;"> value </th> <th style="text-align:left;"> elapsed </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 2021/03/11 10:05 </td> <td style="text-align:left;"> 2021-03-11 10:05:00 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 0 secs </td> </tr>
<tr> <td style="text-align:left;"> 2021/03/12 15:12 </td> <td style="text-align:left;"> 2021-03-12 15:12:00 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> 104820 secs </td> </tr>
<tr> <td style="text-align:left;"> 2021/03/11 15:14 </td> <td style="text-align:left;"> 2021-03-11 15:14:00 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> 18540 secs </td> </tr>
<tr> <td style="text-align:left;"> 2021/03/11 11:50 PM </td> <td style="text-align:left;"> 2021-03-11 23:50:00 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> 49500 secs </td> </tr>
</tbody>
</table>

---

### Time arithmetic

```r
# t1 is assumed to hold the table of dates, times and values from the
# previous slide (its definition is not shown on the slides)
t1 %>% mutate(elapsed = dt - min(dt)) %>%
  # Ask for hours explicitly rather than dividing by 3600, which is only
  # correct when the difference happens to be stored in seconds
  ggplot(aes(x = as.numeric(elapsed, units = "hours"), y = value)) +
  geom_point(size = 4) +
  labs(x = "Time in hours since start") +
  my_theme
```

<img src="30-factors-dates_files/figure-html/unnamed-chunk-14-1.svg" width="50%" style="display: block; margin: auto;" />

---
class: middle

# Further reading

* Course notes
* Healy appendix
* R4DS Chapter 14 [Strings](https://r4ds.had.co.nz/strings.html), Chapter 15 [Factors](https://r4ds.had.co.nz/factors.html), and Chapter 16 [Dates and times](https://r4ds.had.co.nz/dates-and-times.html)

---
class: middle, inverse

## Task

* Task 17 as described in repository