Text & Factors
Dates & Times

Andrew Irwin, a.irwin@dal.ca

2024-03-26

Plan

  • Tools to help you work with data that are not numbers

  • Specialized functions for working with strings, factors and dates

  • String packages stringr, glue and unglue, and functions str_squish, glue and more

  • Factor package forcats and functions fct_*

  • Date and time package lubridate and functions ymd, ymd_hms, yday, decimal_date and more

Challenges that arise with text

  • Extra spaces

  • Upper and lower case differences

  • Locales (non-English text); see ragg for plotting with non-Latin text

  • Difference between factors and strings
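The difference between factors and strings is easiest to see on a small vector. The sketch below uses base R only; the `size_chr` values are made up for illustration and are not from the course data.

```r
# Hypothetical example data: three labels stored as text
size_chr <- c("low", "high", "medium")

# The same labels stored as a factor with an explicit level order
size_fct <- factor(size_chr, levels = c("low", "medium", "high"))

sort(size_chr)        # strings sort alphabetically: high, low, medium
sort(size_fct)        # factors sort by level order: low, medium, high
as.integer(size_fct)  # factors are stored as integer codes: 1 3 2
levels(size_fct)      # the lookup table for those codes
```

A string is just text, while a factor carries a fixed set of levels and an order, which is what plotting scales use.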

Extra spaces

Data entered by a human in a web form or spreadsheet often contains inconspicuous extra spaces, for example after the last letter or doubled between two words. Human readers ignore them, but to a computer "apple" and "apple " are different strings, so matching, grouping and joining all go wrong. str_squish removes leading and trailing spaces and collapses repeated spaces inside the text.

str_squish(" a crazy sentence   with 
           too many spaces in    strange places.     ")
[1] "a crazy sentence with too many spaces in strange places."
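To see why stray spaces matter, the sketch below counts distinct values before and after cleaning. The `city` vector is hypothetical data; `str_trim` is included for comparison because it removes only leading and trailing spaces.

```r
library(stringr)

# Hypothetical survey answers with stray spaces
city <- c("Halifax", " Halifax", "Halifax ", "New  Glasgow", "New Glasgow")

n_raw    <- length(unique(city))              # every spelling counts as different
n_trim   <- length(unique(str_trim(city)))    # inner double space still differs
n_squish <- length(unique(str_squish(city)))  # only the two real cities remain

c(raw = n_raw, trim = n_trim, squish = n_squish)
```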

Letter case

Upper and lower case differences are often not noticed by humans, but matter to a computer. One solution is to convert all text to upper, lower, or title case.

tibble(original = c("Apple", "apple", "APPLE", "aPpLe"),
       lower = str_to_lower(original),
       upper = str_to_upper(original),
       title = str_to_title(original)) |> kable()
| original | lower | upper | title |
|:---------|:------|:------|:------|
| Apple    | apple | APPLE | Apple |
| apple    | apple | APPLE | Apple |
| APPLE    | apple | APPLE | Apple |
| aPpLe    | apple | APPLE | Apple |
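A typical use of case conversion is to make values comparable before counting or grouping. This is a minimal sketch with made-up data (`fruit_raw` is a hypothetical name).

```r
library(stringr)

# Hypothetical raw entries: four spellings of the same fruit plus one other
fruit_raw <- c("Apple", "apple", "APPLE", "aPpLe", "Pear")

table(fruit_raw)                        # five separate categories

fruit_clean <- str_to_lower(fruit_raw)  # normalize before counting
table(fruit_clean)                      # two categories: apple and pear
```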

Getting data in and out of strings

tibble(name = c("Andrew", "Susan", "Yong"),
       age = c(12, 21, 35)) |>
  mutate(sentence = glue("{name} is {age} years old.")) |> kable()
| name   | age | sentence                |
|:-------|----:|:------------------------|
| Andrew |  12 | Andrew is 12 years old. |
| Susan  |  21 | Susan is 21 years old.  |
| Yong   |  35 | Yong is 35 years old.   |

Getting data in and out of strings

# t0 holds the table with the sentence column built on the previous slide
t0 |> select(sentence) |>
  mutate( unglue_data(sentence, "{name} is {age} years old.")) |>
  mutate( age = as.numeric(age),
          next_year = age + 1) |> kable()
| sentence                | name   | age | next_year |
|:------------------------|:-------|----:|----------:|
| Andrew is 12 years old. | Andrew |  12 |        13 |
| Susan is 21 years old.  | Susan  |  21 |        22 |
| Yong is 35 years old.   | Yong   |  35 |        36 |
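The round trip from data to text and back can be sketched without the tidyverse pipeline. The data frame `people` below is a hypothetical stand-in for the table on the slides; `glue_data` and `unglue_data` are the data-frame-oriented functions from the glue and unglue packages.

```r
library(glue)
library(unglue)

# Hypothetical stand-in for the table used on the slides
people <- data.frame(name = c("Andrew", "Susan", "Yong"),
                     age  = c(12, 21, 35))

# Data -> text: fill the template once per row
sentence <- as.character(glue_data(people, "{name} is {age} years old."))
sentence

# Text -> data: the same template pulls the pieces back out
recovered <- unglue_data(sentence, "{name} is {age} years old.")

# Extracted values are text, so convert before doing arithmetic
recovered$age <- as.numeric(recovered$age)
recovered$age + 1
```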

Challenges that arise with factors

  • Plot order on scales (axes, color scale)

  • Too many levels

  • Mapping colours to specific levels (later lesson on colour)
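One possible way to deal with a factor that has too many levels is to lump the rare ones together. The sketch below uses `fct_lump_n` from forcats on a made-up vector; this technique is an illustration and is not shown elsewhere in these slides.

```r
library(forcats)

# Hypothetical factor with two common and two rare levels
colour <- factor(c("red", "red", "red", "blue", "blue", "green", "teal"))

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump_n(colour, n = 2)

table(colour)
table(lumped)
```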

Plot order

Here’s a plot of mean penguin body mass by species and sex. What’s the order?

penguins |> ggplot(aes(x = body_mass_g, y = species)) + 
  stat_summary() +
  facet_wrap(~ sex) + theme_bw() 

Plot order

Order from smallest to largest, top to bottom. Watch out for NAs.

penguins |> na.omit() |>
  ggplot(aes(x = body_mass_g, 
             y = fct_reorder(species, body_mass_g,
                             .desc=TRUE))) + 
  stat_summary() + my_theme

Custom order

penguins |>
  ggplot(aes(x = body_mass_g, 
             y = fct_relevel(species, "Gentoo", "Adelie"))) + 
  stat_summary() + my_theme
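The effect of `fct_reorder` and `fct_relevel` can be checked without drawing a plot by looking at the levels directly. The `group` and `value` vectors below are hypothetical; note that `fct_reorder` orders by the median of the second argument unless told otherwise.

```r
library(forcats)

# Hypothetical measurements for three groups
group <- factor(c("a", "b", "c", "a", "b", "c"))
value <- c(3, 1, 2, 5, 1, 4)   # medians: a = 4, b = 1, c = 3

lv_default <- levels(group)                                    # alphabetical
lv_asc     <- levels(fct_reorder(group, value))                # by median, ascending
lv_desc    <- levels(fct_reorder(group, value, .desc = TRUE))  # by median, descending
lv_manual  <- levels(fct_relevel(group, "c"))                  # move "c" to the front

lv_asc
lv_desc
lv_manual
```

On a ggplot axis the first level is drawn closest to the origin, so on a y axis the first level appears at the bottom.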

Ordering works for other aesthetics too

penguins |> na.omit() |>
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = fct_reorder(species, body_mass_g, .desc=TRUE))) + 
  geom_point() + my_theme +
  labs(color = "Species")

Challenges that arise with dates and times

  • Date format

  • Extracting components of date or time

  • Formatting axes on plots

  • Arithmetic with dates and times

  • Time zones

Converting text to dates

tibble(date = c("01/02/03", "121006", "05/12/08", "11-03-21"),
       ymd = ymd(date),
       dmy = dmy(date),
       mdy = mdy(date)) |> kable()
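| date     | ymd        | dmy        | mdy        |
|:---------|:-----------|:-----------|:-----------|
| 01/02/03 | 2001-02-03 | 2003-02-01 | 2003-01-02 |
| 121006   | 2012-10-06 | 2006-10-12 | 2006-12-10 |
| 05/12/08 | 2005-12-08 | 2008-12-05 | 2008-05-12 |
| 11-03-21 | 2011-03-21 | 2021-03-11 | 2021-11-03 |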
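The table shows that the same text can be read as three different dates, so the parser has to be chosen from knowledge of the data. When the text cannot be a valid date in the chosen order, lubridate returns NA. A small sketch with made-up inputs (`quiet = TRUE` only suppresses the parse warning):

```r
library(lubridate)

d1 <- dmy("13/02/2021", quiet = TRUE)  # 13 February 2021
d2 <- mdy("13/02/2021", quiet = TRUE)  # NA: there is no month 13
d3 <- ymd("2021-02-30", quiet = TRUE)  # NA: February has no day 30

d1
is.na(c(d2, d3))
```

Counting the NAs after parsing is a quick check that the assumed order matches the data.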

Dates and times

tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12", 
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) |>
  ggplot(aes(y = dt, x = value)) + geom_point(size=4) + my_theme

Dates and times

tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12", 
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) |>
  ggplot(aes(y = dt, x = value)) + geom_point(size=4) + 
  scale_y_datetime(date_labels = "%H:%M") + my_theme
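The `date_labels` argument takes the same format codes as base R's `format()`, so a label format can be tried on a single value before it goes into a plot. A minimal sketch with one of the times from the slide:

```r
library(lubridate)

dt <- ymd_hm("2021/03/11 15:14")

lab_time <- format(dt, "%H:%M")     # hours and minutes
lab_date <- format(dt, "%Y-%m-%d")  # ISO-style date
format(dt, "%d %b %Y %H:%M")        # day, abbreviated month name (locale dependent), year, time

lab_time
lab_date
```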

Calendar arithmetic

tibble(date = c("2024/02/29", "2021/01/01", 
                "2021/06/21", "2023/09/01",
                "1900/02/29"),
       dt = ymd(date),
       yday(dt),
       decimal_date(dt)) |> kable()
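| date       | dt         | yday(dt) | decimal_date(dt) |
|:-----------|:-----------|---------:|-----------------:|
| 2024/02/29 | 2024-02-29 |       60 |         2024.161 |
| 2021/01/01 | 2021-01-01 |        1 |         2021.000 |
| 2021/06/21 | 2021-06-21 |      172 |         2021.468 |
| 2023/09/01 | 2023-09-01 |      244 |         2023.666 |
| 1900/02/29 | NA         |       NA |               NA |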
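For a date at midnight, the decimal date is the year plus the fraction of the year already elapsed. The sketch below reproduces the first row of the table by hand, using lubridate's `leap_year`, `year` and `yday`; the variable names are illustrative.

```r
library(lubridate)

d <- ymd("2024-02-29")

# 2024 is a leap year, so the year has 366 days
days_in_year <- if (leap_year(d)) 366 else 365

# Day 60 of the year means 59 full days have already passed
by_hand <- year(d) + (yday(d) - 1) / days_in_year

by_hand
decimal_date(d)
```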

Calendar arithmetic

ymd("2024/01/31") + 
  duration(seq(0, 330, by = 30), units = "day")
 [1] "2024-01-31" "2024-03-01" "2024-03-31" "2024-04-30" "2024-05-30"
 [6] "2024-06-29" "2024-07-29" "2024-08-28" "2024-09-27" "2024-10-27"
[11] "2024-11-26" "2024-12-26"
ymd("2024/01/31") + 
  duration(seq(0, 11, by = 1), units = "month")
 [1] "2024-01-31 00:00:00 UTC" "2024-03-01 10:30:00 UTC"
 [3] "2024-03-31 21:00:00 UTC" "2024-05-01 07:30:00 UTC"
 [5] "2024-05-31 18:00:00 UTC" "2024-07-01 04:30:00 UTC"
 [7] "2024-07-31 15:00:00 UTC" "2024-08-31 01:30:00 UTC"
 [9] "2024-09-30 12:00:00 UTC" "2024-10-30 22:30:00 UTC"
[11] "2024-11-30 09:00:00 UTC" "2024-12-30 19:30:00 UTC"
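A duration is a fixed number of seconds, so a "month" of duration is an average month and the results drift off midnight. If the goal is "the same day next month", lubridate's periods (`months()`) together with the `%m+%` operator may be closer to what is wanted. A sketch for comparison:

```r
library(lubridate)

start <- ymd("2024-01-31")

plain  <- start + months(1)       # NA: 31 February does not exist
rolled <- start %m+% months(1)    # rolls back to the last valid day of February
series <- start %m+% months(0:3)  # month ends where day 31 is missing

plain
rolled
series
```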

More calendar arithmetic

tibble(date = c("2021/06/21", "2024/02/29", "2024/01/01", "2023/09/01", "2027/03/01"),
       dt = ymd(date),
       next_year = dt + duration(1, units = "year"),
       rounded = round(next_year, unit = "day")) |> select(-date) |> kable()
| dt         | next_year           | rounded    |
|:-----------|:--------------------|:-----------|
| 2021-06-21 | 2022-06-21 06:00:00 | 2022-06-21 |
| 2024-02-29 | 2025-02-28 06:00:00 | 2025-02-28 |
| 2024-01-01 | 2024-12-31 06:00:00 | 2024-12-31 |
| 2023-09-01 | 2024-08-31 06:00:00 | 2024-08-31 |
| 2027-03-01 | 2028-02-29 06:00:00 | 2028-02-29 |

Tricky results

duration(1, units = "years")
[1] "31557600s (~1 years)"
60*60*24*365.25
[1] 31557600
ymd("2000/02/29") + duration(100, units = "years")
[1] "2100-03-01"
ymd("2100/02/29") # no leap day in 2100
[1] NA
ymd("2000/02/29") - duration(100, units = "years")
[1] "1900-02-28"
ymd("2000/02/29") + duration(200, units = "years")
[1] "2200-03-02"
ymd("2000/02/29") + duration(300, units = "years")
[1] "2300-03-03"
ymd("2000/02/29") + duration(400, units = "years")
[1] "2400-03-03"
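These results follow from the fixed length of a duration year (365.25 days) compared with the real calendar, which skips the leap day in 2100, 2200 and 2300. The sketch below checks the arithmetic for the 100-year case and shows the period-based alternatives for comparison; variable names are illustrative.

```r
library(lubridate)

leap_day <- ymd("2000-02-29")

# 100 duration years expressed in days: 100 * 365.25
dur_days <- dyears(100) / ddays(1)

# Calendar days from 2000-02-29 to 2100-03-01
cal_days <- as.numeric(ymd("2100-03-01") - leap_day)

c(duration = dur_days, calendar = cal_days)

# Period arithmetic keeps the calendar date instead of counting seconds
per_plain  <- leap_day + years(100)     # NA: 2100-02-29 does not exist
per_rolled <- leap_day %m+% years(100)  # rolls back to 2100-02-28
per_plain
per_rolled
```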

Time arithmetic

tibble(date = c("2021/03/11 10:05", "2021/03/12 15:12", 
                "2021/03/11 15:14", "2021/03/11 11:50 PM"),
       dt = ymd_hm(date),
       value = 1:4) |>
  mutate(elapsed = dt - min(dt)) |> kable()
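| date                | dt                  | value | elapsed     |
|:--------------------|:--------------------|------:|:------------|
| 2021/03/11 10:05    | 2021-03-11 10:05:00 |     1 | 0 secs      |
| 2021/03/12 15:12    | 2021-03-12 15:12:00 |     2 | 104820 secs |
| 2021/03/11 15:14    | 2021-03-11 15:14:00 |     3 | 18540 secs  |
| 2021/03/11 11:50 PM | 2021-03-11 23:50:00 |     4 | 49500 secs  |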

Time arithmetic

# t1 holds the table of dates and values from the previous slide
t1 |> mutate(elapsed = dt - min(dt)) |>
  ggplot(aes(x = as.numeric(elapsed)/3600, 
             y = value)) + geom_point(size=4) +
  labs(x = "Time in hours since start") + my_theme
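Dividing by 3600 assumes the time differences are stored in seconds. Subtracting date-times lets R choose the units automatically, so a more defensive option is to request the units explicitly when converting to a number. A sketch using the same four times (written in 24-hour form here):

```r
library(lubridate)

dt <- ymd_hm(c("2021/03/11 10:05", "2021/03/12 15:12",
               "2021/03/11 15:14", "2021/03/11 23:50"))

elapsed <- dt - min(dt)
units(elapsed)   # whatever units R picked for this difftime

# Ask for hours directly, regardless of the stored units
hours <- as.numeric(elapsed, units = "hours")
hours
```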

Time zones

t1 <- ymd_hms("2024-03-27 10:05:05", tz = "America/Halifax")
t2 <- ymd_hms("2024-03-27 10:05:05", tz = "GMT")
t1
[1] "2024-03-27 10:05:05 ADT"
t2
[1] "2024-03-27 10:05:05 GMT"
t1-t2
Time difference of 3 hours
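Two lubridate functions change the time zone in different ways: `with_tz` keeps the instant and changes the clock reading, while `force_tz` keeps the clock reading and changes the instant. A sketch with the Halifax time from above (`t_hfx` is an illustrative name):

```r
library(lubridate)

t_hfx <- ymd_hms("2024-03-27 10:05:05", tz = "America/Halifax")

same_instant <- with_tz(t_hfx, "UTC")   # 13:05:05 UTC, the same moment
same_clock   <- force_tz(t_hfx, "UTC")  # 10:05:05 UTC, a different moment

same_instant == t_hfx
gap <- as.numeric(difftime(t_hfx, same_clock, units = "hours"))
gap
```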

Time zones

now()
[1] "2024-04-05 11:26:33 ADT"
format_ISO8601(now(), usetz=TRUE)
[1] "2024-04-05T11:26:33-0300"
stamp("January 1, 2024 at 10:00 AM")(now())
[1] "April 05, 2024 at 11:26 AM"
strftime(now(), "%Y-%m-%d %H:%M:%S %Z")
[1] "2024-04-05 11:26:33 ADT"
t2 <- ymd_hms("2024-03-27 10:05:05") # no tz given, so the time is taken as UTC
format_ISO8601(t2, usetz=TRUE)
[1] "2024-03-27T10:05:05+0000"
stamp("January 1, 2024 at 10:00 AM")(t2)
[1] "March 27, 2024 at 10:05 AM"
strftime(t2, "%Y-%m-%d %H:%M:%S %Z")
[1] "2024-03-27 07:05:05 ADT"
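The last line differs from the others because `strftime` converts to the computer's local time zone unless told otherwise. Passing `tz` explicitly makes the output independent of where the code runs. A base R sketch (`t_utc` is an illustrative name):

```r
# The same instant as t2 above, built with base R
t_utc <- as.POSIXct("2024-03-27 10:05:05", tz = "UTC")

# Explicit time zone: the result does not depend on the local machine
in_utc <- strftime(t_utc, "%Y-%m-%d %H:%M:%S %Z", tz = "UTC")
in_hfx <- strftime(t_utc, "%Y-%m-%d %H:%M:%S", tz = "America/Halifax")

in_utc
in_hfx
```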

Further reading