Reshaping data

Andrew Irwin, a.irwin@dal.ca

2024-02-06

Overview

  • What is tidy data?

  • What is the difference between wide and long data?

  • Why would you want data in long format?

  • Why would you want data in wide format?

  • Pivoting data

Tidy data

  • One observation per row

  • Each column is a different variable

  • There are many ways to organize data. A key idea behind tidy data is to standardize the way data are organized, to make working with data simpler

Example 1 (tidy)

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Example 2 (long)

country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Example 3

country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583

Example 4

Cases
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
Population
country 1999 2000
Afghanistan 19987071 20595360
Brazil 172006362 174504898
China 1272915272 1280428583

Where do data organization patterns come from?

  • The way data are collected (population from one database, cases from another)

  • The way data are recorded (one year at a time)

  • Plans for how data will be interpreted or stored

  • Concerns about readability (any data missing?) or compactness (storage, display)

  • Habits

What is good about tidy data?

  • Easy to use dplyr: filter, group_by, summarize

  • Easy to use ggplot: aesthetic mappings (x, y, color, shape) and facets

Pivoting

Step by step examples of the transformations shown above

Pivot longer

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
df1 |> 
  pivot_longer(cases:population, 
               names_to = "type", 
               values_to = "count") |>
  kable() |> kable_styling(full_width = FALSE)
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Unite, mutate

df1 |> mutate(rate_fraction = cases/population) |>
  unite("rate", cases:population, sep="/") |>
  kable() |> kable_styling(full_width = FALSE)
country year rate rate_fraction
Afghanistan 1999 745/19987071 0.0000373
Afghanistan 2000 2666/20595360 0.0001294
Brazil 1999 37737/172006362 0.0002194
Brazil 2000 80488/174504898 0.0004612
China 1999 212258/1272915272 0.0001667
China 2000 213766/1280428583 0.0001669

Separate

df1 |> unite("rate", cases:population, sep="/") |>
  separate(rate, into = c("cases", "population"), sep="/") |>
  kable() |> kable_styling(full_width = FALSE)
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Make two tables

df1A <- df1 |> select(-population) |>
  pivot_wider(names_from = "year", values_from = "cases") 
df1A |> kable(caption="Cases") |> kable_styling(full_width = FALSE)
Cases
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
df1B <- df1 |> select(-cases) |>
  pivot_wider(names_from = "year", values_from = "population") 
df1B |> kable(caption="Population") |> kable_styling(full_width = FALSE)
Population
country 1999 2000
Afghanistan 19987071 20595360
Brazil 172006362 174504898
China 1272915272 1280428583

Combine two tables into one

bind_rows(mutate(df1A, type = "cases"),
          mutate(df1B, type = "population")) |>
  kable() |> kable_styling(full_width = FALSE)
country 1999 2000 type
Afghanistan 745 2666 cases
Brazil 37737 80488 cases
China 212258 213766 cases
Afghanistan 19987071 20595360 population
Brazil 172006362 174504898 population
China 1272915272 1280428583 population

Another way to join tables

First make a long table.

df1A |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases") |> 
  kable() |> kable_styling(full_width = FALSE)
country year cases
Afghanistan 1999 745
Afghanistan 2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999 212258
China 2000 213766

Another way to join tables

Then join them together.

full_join(
  df1A |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases"),
  df1B |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "population"),
  by = c("country", "year")
) |> kable() |> kable_styling(full_width = FALSE)
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Summary

  • What and why of tidy data

  • Longer vs wider format

  • Common reshaping tasks

Further reading

  • Tidy Data (Chapter 5) in R4DS
  • More examples in course notes

Task

Practice these reshaping skills in the assigned Task.