Reshaping data

Andrew Irwin, a.irwin@dal.ca

2024-02-06

Overview

What is tidy data?
What is the difference between wide and long data?
Why would you want data in long format?
Why would you want data in wide format?
Pivoting data

Tidy data

One observation per row
Each column is a different variable
There are many ways to organize data. A key idea behind tidy data is to standardize the way data are organized, to make working with data simpler

Example 1 (tidy)

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Example 2 (long)

country	year	type	count
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

Example 3

country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

Example 4

Cases
country	1999	2000
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

Population
country	1999	2000
Afghanistan	19987071	20595360
Brazil	172006362	174504898
China	1272915272	1280428583

Where do data organization patterns come from?

The way data are collected (population from one database, cases from another)
The way data are recorded (one year at a time)
Plans for how data will be interpreted or stored
Concerns about readability (any data missing?) or compactness (storage, display)
Habits

What is good about tidy data?

Easy to use dplyr: filter, group_by, summarize
Easy to use ggplot: aesthetic mappings (x, y, color, shape) and facets

Pivoting

Step by step examples of the transformations shown above

Pivot longer

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

df1 |> 
  pivot_longer(cases:population, 
               names_to = "type", 
               values_to = "count") |>
  kable() |> kable_styling(full_width = FALSE)

country	year	type	count
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

Unite, mutate

df1 |> mutate(rate_fraction = cases/population) |>
  unite("rate", cases:population, sep="/") |>
  kable() |> kable_styling(full_width = FALSE)

country	year	rate	rate_fraction
Afghanistan	1999	745/19987071	0.0000373
Afghanistan	2000	2666/20595360	0.0001294
Brazil	1999	37737/172006362	0.0002194
Brazil	2000	80488/174504898	0.0004612
China	1999	212258/1272915272	0.0001667
China	2000	213766/1280428583	0.0001669

Separate

df1 |> unite("rate", cases:population, sep="/") |>
  separate(rate, into = c("cases", "population"), sep="/") |>
  kable() |> kable_styling(full_width = FALSE)

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Make two tables

df1A <- df1 |> select(-population) |>
  pivot_wider(names_from = "year", values_from = "cases") 
df1A |> kable(caption="Cases") |> kable_styling(full_width = FALSE)

Cases
country	1999	2000
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

df1B <- df1 |> select(-cases) |>
  pivot_wider(names_from = "year", values_from = "population") 
df1B |> kable(caption="Population") |> kable_styling(full_width = FALSE)

Population
country	1999	2000
Afghanistan	19987071	20595360
Brazil	172006362	174504898
China	1272915272	1280428583

Combine two tables into one

bind_rows(mutate(df1A, type = "cases"),
          mutate(df1B, type = "population")) |>
  kable() |> kable_styling(full_width = FALSE)

country	1999	2000	type
Afghanistan	745	2666	cases
Brazil	37737	80488	cases
China	212258	213766	cases
Afghanistan	19987071	20595360	population
Brazil	172006362	174504898	population
China	1272915272	1280428583	population

Another way to join tables

First make a long table.

df1A |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases") |> 
  kable() |> kable_styling(full_width = FALSE)

country	year	cases
Afghanistan	1999	745
Afghanistan	2000	2666
Brazil	1999	37737
Brazil	2000	80488
China	1999	212258
China	2000	213766

Another way to join tables

Then join them together.

full_join(
  df1A |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases"),
  df1B |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "population"),
  by = c("country", "year")
) |> kable() |> kable_styling(full_width = FALSE)

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Summary

What and why of tidy data
Longer vs wider format
Common reshaping tasks

Task

Practice these reshaping skills in the assigned Task.

Reshaping data

Overview

Tidy data

Example 1 (tidy)

Example 2 (long)

Example 3

Example 4

Where do data organization patterns come from?

What is good about tidy data?

Pivoting

Pivot longer

Unite, mutate

Separate

Make two tables

Combine two tables into one

Another way to join tables

Another way to join tables

Summary

Further reading

Task