country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
2024-02-06
What is tidy data?
What is the difference between wide and long data?
Why would you want data in long format?
Why would you want data in wide format?
Pivoting data
One observation per row
Each column is a different variable
There are many ways to organize data. A key idea behind tidy data is to standardize the way data are organized, to make working with data simpler
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
country | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
country | 1999 | 2000 |
---|---|---|
Afghanistan | 19987071 | 20595360 |
Brazil | 172006362 | 174504898 |
China | 1272915272 | 1280428583 |
The way data are collected (population from one database, cases from another)
The way data are recorded (one year at a time)
Plans for how data will be interpreted or stored
Concerns about readability (any data missing?) or compactness (storage, display)
Habits
Easy to use dplyr
: filter, group_by, summarize
Easy to use ggplot
: aesthetic mappings (x, y, color, shape) and facets
Step by step examples of the transformations shown above
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
df1 |>
pivot_longer(cases:population,
names_to = "type",
values_to = "count") |>
kable() |> kable_styling(full_width = FALSE)
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
df1 |> mutate(rate_fraction = cases/population) |>
unite("rate", cases:population, sep="/") |>
kable() |> kable_styling(full_width = FALSE)
country | year | rate | rate_fraction |
---|---|---|---|
Afghanistan | 1999 | 745/19987071 | 0.0000373 |
Afghanistan | 2000 | 2666/20595360 | 0.0001294 |
Brazil | 1999 | 37737/172006362 | 0.0002194 |
Brazil | 2000 | 80488/174504898 | 0.0004612 |
China | 1999 | 212258/1272915272 | 0.0001667 |
China | 2000 | 213766/1280428583 | 0.0001669 |
df1 |> unite("rate", cases:population, sep="/") |>
separate(rate, into = c("cases", "population"), sep="/") |>
kable() |> kable_styling(full_width = FALSE)
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
bind_rows(mutate(df1A, type = "cases"),
mutate(df1B, type = "population")) |>
kable() |> kable_styling(full_width = FALSE)
country | 1999 | 2000 | type |
---|---|---|---|
Afghanistan | 745 | 2666 | cases |
Brazil | 37737 | 80488 | cases |
China | 212258 | 213766 | cases |
Afghanistan | 19987071 | 20595360 | population |
Brazil | 172006362 | 174504898 | population |
China | 1272915272 | 1280428583 | population |
First make a long table.
Then join them together.
full_join(
df1A |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases"),
df1B |> pivot_longer(`1999`:`2000`, names_to = "year", values_to = "population"),
by = c("country", "year")
) |> kable() |> kable_styling(full_width = FALSE)
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
What and why of tidy data
Longer vs wider format
Common reshaping tasks
Practice these reshaping skills in the assigned Task.