There are many ways to organize data in an array. The key principles we’re following are
each row is a different observation, and
each column is a different variable.
This organization tends to make for very long tables. For example, the gapminder data has 1704 rows: 12 years of data for 142 countries. We might want to show median life expectancy for each continent (5 rows) in three different years. This would make a nice compact table, but it seems like the data is not in the right shape for this.
12.1 Pivot wider
We’ll start by selecting three years (1987, 1997, 2007) and computing the median life expectancy for each continent in each year. (The median for the continent may not be representative of all countries in the continent, but we’ll worry about that some other time.)
# A tibble: 15 × 3
continent year median_life_expectancy
<fct> <int> <dbl>
1 Africa 1987 51.6
2 Africa 1997 52.8
3 Africa 2007 52.9
4 Americas 1987 69.5
5 Americas 1997 72.1
6 Americas 2007 72.9
7 Asia 1987 66.3
8 Asia 1997 70.3
9 Asia 2007 72.4
10 Europe 1987 74.8
11 Europe 1997 76.1
12 Europe 2007 78.6
13 Oceania 1987 75.3
14 Oceania 1997 78.2
15 Oceania 2007 80.7
Now we can see the data and it’s not too large, the reshaping problem is clearer. I’d like to create a table with one row for each continent, one column for each year and the numbers in the grid should be the median life expectancy. This reshaping is called a pivot, specifically a pivot to make a wider table, and we simply need to specify the current column to use as the new names (year) and the current column to use as the values for the new grid (median_life_expectancy).
Sometimes data a provided in a “wide” format, like the summary table above. This is often very convient for data entry and visual inspection. Suppose you wanted to use the years in the table shown above as a aesthetic in a plot, or a grouping variable in a summarise operation. You can’t! So you might want to pivot the table to make it longer.
pivot_longer undoes the pivot_wider operation. You need to give the set of variables to be stacked and a name for the new variable to be filled with those column headings. Here’s how to perform the inverse of the pivot_wider shown above. You can describe a sequence of variables by giving the first one, a colon (:), and the last one. Since our variable names are just numbers (which are not standard column names in data tables), we need to put backticks around them. (The plain expression 1987:2007 would be interpreted quite differently by R. Try it in the console. )
# A tibble: 15 × 3
continent year median_life_expectancy
<fct> <chr> <dbl>
1 Africa 1987 51.6
2 Africa 1997 52.8
3 Africa 2007 52.9
4 Americas 1987 69.5
5 Americas 1997 72.1
6 Americas 2007 72.9
7 Asia 1987 66.3
8 Asia 1997 70.3
9 Asia 2007 72.4
10 Europe 1987 74.8
11 Europe 1997 76.1
12 Europe 2007 78.6
13 Oceania 1987 75.3
14 Oceania 1997 78.2
15 Oceania 2007 80.7
There is an important difference between the original table t1 and this recovered table t3: the year variable in t1 was an integer (numeric) but the year variable in t3 is character (text). This will matter if you want to calculate with the new year variable or put it on a quantitative scale. The hablar package makes it really easy to convert variables from one type to another:
t3|>convert(int(year))
# A tibble: 15 × 3
continent year median_life_expectancy
<fct> <int> <dbl>
1 Africa 1987 51.6
2 Africa 1997 52.8
3 Africa 2007 52.9
4 Americas 1987 69.5
5 Americas 1997 72.1
6 Americas 2007 72.9
7 Asia 1987 66.3
8 Asia 1997 70.3
9 Asia 2007 72.4
10 Europe 1987 74.8
11 Europe 1997 76.1
12 Europe 2007 78.6
13 Oceania 1987 75.3
14 Oceania 1997 78.2
15 Oceania 2007 80.7
The dataset who in the tidyr package is counts of tuberculosis cases by country and year. It is in wide format with many columns (new…) describing diagnosis method, sex, and age category. There are also a lot of missing data when the data are shown in this wide format.
# A tibble: 6 × 6
country iso2 iso3 year category counts
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1980 new_sp_m014 NA
2 Afghanistan AF AFG 1980 new_sp_m1524 NA
3 Afghanistan AF AFG 1980 new_sp_m2534 NA
4 Afghanistan AF AFG 1980 new_sp_m3544 NA
5 Afghanistan AF AFG 1980 new_sp_m4554 NA
6 Afghanistan AF AFG 1980 new_sp_m5564 NA
the original data had 7240 observations and 60 columns (4 that we’ve kept and 56 that we have pivoted). As a result our new table who_long has 6 columns (4+2) and 56 x 7240 = 405,440 rows. Let’s see how many of the count data are missing.
A lot of them! In fact 81% of the data in the original matrix are NA. So let’s discard them using na.omit or filter(!is.na(counts)). You can also use values_drop_na = TRUE in the pivot_longer function call. That will make a much smaller table. If you look at the smaller table, you’ll see there are some counts equal to 0. So NA did not simply mean 0. (You should never use NA to mean 0; it should mean missing. But some people do.)
The category variable combines three pieces of information together in one label. How can we decode the category into three columns: diagnosis, sex, and age group? The separate function is made for this. The easiest way to use separate is if the variable you are separating is consistently structured with the same character between each column, such as new_rel_f_2534. That’s not the case here: some values start with new_ and some are missing the underscore after the new. Also there is no underscore between sex and age group. So we will do a bit of pre-processing using the stringr package before we use separate.
First, I’ll remove “new” or “new_” (described concisely by a “regular expression” using new_*). Then I’ll change _f to _f_ and _m to _m_. When you first start learning to do these kinds of manipulations, you should check through a lot of the cases to be sure all the transformations worked the way you expect. There are a lot os str_ functions in the stringr package to help with these kinds of manipulations.
# A tibble: 6 × 8
country iso2 iso3 year diagnosis sex age_group counts
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AF AFG 1997 sp m 014 0
2 Afghanistan AF AFG 1997 sp m 1524 10
3 Afghanistan AF AFG 1997 sp m 2534 6
4 Afghanistan AF AFG 1997 sp m 3544 3
5 Afghanistan AF AFG 1997 sp m 4554 5
6 Afghanistan AF AFG 1997 sp m 5564 2
Now we can use the dplyr methods group_by and summarize to count cases by sex, age group, diagnosis, country, and year. Or we can use these new variables in data visualizations for facets or colours.
Sometimes you will want to do the reverse operation: combining two or more columns together. For example, if I have some biological data with the variables genus and species, I might want to combine the two, since a species is usually described by both together (Homo is our genus, and sapiens is our species name). That’s a job for unite.