12 Reshaping data

There are many ways to organize data in an array. The key principles we’re following are

  • each row is a different observation, and
  • each column is a different variable.

This organization tends to make for very long tables. For example, the gapminder data has 1704 rows: 12 years of data for 142 countries. We might want to show median life expectancy for each continent (5 rows) in three different years. This would make a nice compact table, but it seems like the data is not in the right shape for this.

12.1 Pivot wider

We’ll start by selecting three years (1987, 1997, 2007) and computing the median life expectancy for each continent in each year. (The median for the continent may not be representative of all countries in the continent, but we’ll worry about that some other time.)

t1 <- gapminder %>% 
  filter(year %in% c(1987, 1997, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_life_expectancy = median(lifeExp), .groups = "drop")
t1
## # A tibble: 15 x 3
##    continent  year median_life_expectancy
##    <fct>     <int>                  <dbl>
##  1 Africa     1987                   51.6
##  2 Africa     1997                   52.8
##  3 Africa     2007                   52.9
##  4 Americas   1987                   69.5
##  5 Americas   1997                   72.1
##  6 Americas   2007                   72.9
##  7 Asia       1987                   66.3
##  8 Asia       1997                   70.3
##  9 Asia       2007                   72.4
## 10 Europe     1987                   74.8
## 11 Europe     1997                   76.1
## 12 Europe     2007                   78.6
## 13 Oceania    1987                   75.3
## 14 Oceania    1997                   78.2
## 15 Oceania    2007                   80.7

Now we can see the data and it’s not too large, the reshaping problem is clearer. I’d like to create a table with one row for each continent, one column for each year and the numbers in the grid should be the median life expectancy. This reshaping is called a pivot, specifically a pivot to make a wider table, and we simply need to specify the current column to use as the new names (year) and the current column to use as the values for the new grid (median_life_expectancy).

t1 %>% pivot_wider(names_from = year, values_from = median_life_expectancy)
## # A tibble: 5 x 4
##   continent `1987` `1997` `2007`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Africa      51.6   52.8   52.9
## 2 Americas    69.5   72.1   72.9
## 3 Asia        66.3   70.3   72.4
## 4 Europe      74.8   76.1   78.6
## 5 Oceania     75.3   78.2   80.7

I’ll give you an advanced peek at table formatting skills from the next lesson to make this look a bit better:

t2 <- t1 %>% pivot_wider(names_from = year, values_from = median_life_expectancy)
t2 %>% rename(Continent = continent) %>%
  arrange(-`2007`) %>%
  kable(digits = 1) %>%
  column_spec(1, bold=TRUE) %>%
  kable_styling(full_width = FALSE)
Continent 1987 1997 2007
Oceania 75.3 78.2 80.7
Europe 74.8 76.1 78.6
Americas 69.5 72.1 72.9
Asia 66.3 70.3 72.4
Africa 51.6 52.8 52.9

12.2 Pivot longer

Sometimes data a provided in a “wide” format, like the summary table above. This is often very convient for data entry and visual inspection. Suppose you wanted to use the years in the table shown above as a aesthetic in a plot, or a grouping variable in a summarise operation. You can’t! So you might want to pivot the table to make it longer.

pivot_longer undoes the pivot_wider operation. You need to give the set of variables to be stacked and a name for the new variable to be filled with those column headings. Here’s how to perform the inverse of the pivot_wider shown above. You can describe a sequence of variables by giving the first one, a colon (:), and the last one. Since our variable names are just numbers (which are not standard column names in data tables), we need to put backticks around them. (The plain expression 1987:2007 would be interpreted quite differently by R. Try it in the console. )

t3 <- t2 %>% pivot_longer(cols = `1987`:`2007`, names_to = "year", values_to = "median_life_expectancy")
t3
## # A tibble: 15 x 3
##    continent year  median_life_expectancy
##    <fct>     <chr>                  <dbl>
##  1 Africa    1987                    51.6
##  2 Africa    1997                    52.8
##  3 Africa    2007                    52.9
##  4 Americas  1987                    69.5
##  5 Americas  1997                    72.1
##  6 Americas  2007                    72.9
##  7 Asia      1987                    66.3
##  8 Asia      1997                    70.3
##  9 Asia      2007                    72.4
## 10 Europe    1987                    74.8
## 11 Europe    1997                    76.1
## 12 Europe    2007                    78.6
## 13 Oceania   1987                    75.3
## 14 Oceania   1997                    78.2
## 15 Oceania   2007                    80.7

There is an important difference between the original table t1 and this recovered table t3: the year variable in t1 was an integer (numeric) but the year variable in t3 is character (text). This will matter if you want to calculate with the new year variable or put it on a quantitative scale. The hablar package makes it really easy to convert variables from one type to another:

t3 %>% convert(int(year))
## # A tibble: 15 x 3
##    continent  year median_life_expectancy
##    <fct>     <int>                  <dbl>
##  1 Africa     1987                   51.6
##  2 Africa     1997                   52.8
##  3 Africa     2007                   52.9
##  4 Americas   1987                   69.5
##  5 Americas   1997                   72.1
##  6 Americas   2007                   72.9
##  7 Asia       1987                   66.3
##  8 Asia       1997                   70.3
##  9 Asia       2007                   72.4
## 10 Europe     1987                   74.8
## 11 Europe     1997                   76.1
## 12 Europe     2007                   78.6
## 13 Oceania    1987                   75.3
## 14 Oceania    1997                   78.2
## 15 Oceania    2007                   80.7

The dataset who in the tidyr package is counts of tuberculosis cases by country and year. It is in wide format with many columns (new…) describing diagnosis method, sex, and age category. There are also a lot of missing data when the data are shown in this wide format.

Let’s pivot this data to make it long.

who_long <- who %>% pivot_longer(new_sp_m014:newrel_f65, names_to = "category", values_to = "counts")
head(who_long)
## # A tibble: 6 x 6
##   country     iso2  iso3   year category     counts
##   <chr>       <chr> <chr> <int> <chr>         <int>
## 1 Afghanistan AF    AFG    1980 new_sp_m014      NA
## 2 Afghanistan AF    AFG    1980 new_sp_m1524     NA
## 3 Afghanistan AF    AFG    1980 new_sp_m2534     NA
## 4 Afghanistan AF    AFG    1980 new_sp_m3544     NA
## 5 Afghanistan AF    AFG    1980 new_sp_m4554     NA
## 6 Afghanistan AF    AFG    1980 new_sp_m5564     NA

the original data had 7240 observations and 60 columns (4 that we’ve kept and 56 that we have pivoted). As a result our new table who_long has 6 columns (4+2) and 56 x 7240 = 405,440 rows. Let’s see how many of the count data are missing.

who_long %>% summarize(count_NA = sum(is.na(counts)))
## # A tibble: 1 x 1
##   count_NA
##      <int>
## 1   329394

A lot of them! In fact 81% of the data in the original matrix are NA. So let’s discard them using na.omit or filter(!is.na(counts)). You can also use values_drop_na = TRUE in the pivot_longer function call. That will make a much smaller table. If you look at the smaller table, you’ll see there are some counts equal to 0. So NA did not simply mean 0. (You should never use NA to mean 0; it should mean missing. But some people do.)

who_long <- who_long %>% filter(!is.na(counts))

12.3 Separate and Unite

The category variable combines three pieces of information together in one label. How can we decode the category into three columns: diagnosis, sex, and age group? The separate function is made for this. The easiest way to use separate is if the variable you are separating is consistently structured with the same character between each column, such as new_rel_f_2534. That’s not the case here: some values start with new_ and some are missing the underscore after the new. Also there is no underscore between sex and age group. So we will do a bit of pre-processing using the stringr package before we use separate.

First, I’ll remove “new” or “new_” (described concisely by a “regular expression” using new_*). Then I’ll change _f to _f_ and _m to _m_. When you first start learning to do these kinds of manipulations, you should check through a lot of the cases to be sure all the transformations worked the way you expect. There are a lot os str_ functions in the stringr package to help with these kinds of manipulations.

who_long %>% mutate(category = str_remove(category, "new_*"),
                    category = str_replace(category, "_f", "_f_"),
                    category = str_replace(category, "_m", "_m_")
                    ) %>% 
  head()
## # A tibble: 6 x 6
##   country     iso2  iso3   year category  counts
##   <chr>       <chr> <chr> <int> <chr>      <int>
## 1 Afghanistan AF    AFG    1997 sp_m_014       0
## 2 Afghanistan AF    AFG    1997 sp_m_1524     10
## 3 Afghanistan AF    AFG    1997 sp_m_2534      6
## 4 Afghanistan AF    AFG    1997 sp_m_3544      3
## 5 Afghanistan AF    AFG    1997 sp_m_4554      5
## 6 Afghanistan AF    AFG    1997 sp_m_5564      2

Now, we will use separate to convert category into three columns using the underscore as a separator.

who_new <- who_long %>% mutate(category = str_remove(category, "new_*"),
                    category = str_replace(category, "_f", "_f_"),
                    category = str_replace(category, "_m", "_m_")
                    ) %>% 
  separate(col = category, into = c("diagnosis", "sex", "age_group"), sep="_")
who_new %>% head()
## # A tibble: 6 x 8
##   country     iso2  iso3   year diagnosis sex   age_group counts
##   <chr>       <chr> <chr> <int> <chr>     <chr> <chr>      <int>
## 1 Afghanistan AF    AFG    1997 sp        m     014            0
## 2 Afghanistan AF    AFG    1997 sp        m     1524          10
## 3 Afghanistan AF    AFG    1997 sp        m     2534           6
## 4 Afghanistan AF    AFG    1997 sp        m     3544           3
## 5 Afghanistan AF    AFG    1997 sp        m     4554           5
## 6 Afghanistan AF    AFG    1997 sp        m     5564           2

Now we can use the dplyr methods group_by and summarize to count cases by sex, age group, diagnosis, country, and year. Or we can use these new variables in data visualizations for facets or colours.

Sometimes you will want to do the reverse operation: combining two or more columns together. For example, if I have some biological data with the variables genus and species, I might want to combine the two, since a species is usually described by both together (Homo is our genus, and sapiens is our species name). That’s a job for unite.

12.4 Suggested reading