The tidyverse packages provide powerful tools for transforming and summarizing data. In this lesson we will explore the methods I use most frequently, including
mutate for defining new variables based on existing variables,
filter for including or excluding observations (rows) according to the values of one or more variables,
group_by for dividing a data table in to a set of groups according to one or more variables, before summarizing the data, and
summarise for reducing a table to a single row or a single row for each value of a grouped variable.
These tools mirror the methods used by database software for querying and summarizing data, such as the widely used SQL data query language. The correspondence is close enough that the tools can be used directly on databases by translating the tidyverse code into SQL with the dbplyr package. In databases, there are often links between two or more tables and information be combined using “joins”; R has all the standard joins for combining data from two or more tables into one.
There are many other data manipulation and reduction tools to be learned, but these four functions are the foundation for a huge number of calculations to help you understand, simplify, summarize and communicate information about a dataset.
We will work with the diamonds dataset; it’s fairly large and has a combination of quantitative and categorical data which can be used to illustrate these methods. R4DS chapter 5 does many similar demonstrations with a large database of flight data. You should practice all these calculations on a data set of your choice (some ideas are here).
9.1 Creating new variables with mutate
The diamonds data has dimensions (x, y, and z) and a mass (carat) of each diamond. It might be interesting to compare a cuboid approximation of the diamond to its mass to get a quantitative measure of how different each diamond is from a box. To start, we’ll create a new variable called box defined by the product of x, y, and z. Then we’ll create the ratio box_ratio as the quotient of its mass (carat) and this volume.
# A tibble: 6 × 12
carat cut color clarity depth table price x y z box box_ratio
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 38.2 0.00602
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 34.5 0.00609
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 38.1 0.00604
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 46.7 0.00621
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 51.9 0.00597
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 38.7 0.00620
This function adds two new columns to the data table, computing values for every row in the table. It doesn’t change any of the other columns. You can also modify an existing column, for example to change units from dollars to thousands of dollars or carats to grams, by using the name of an existing column on the left hand side of the equals sign.
A few things to note:
We use the name of a column to create or modify on the left hand side of the equals sign/
We use an = instead of <- when creating or modifying columns. The arrow is used to assign values to new R objects in your environment (see the Environment tab in R studio); here we are modifying the columns of a data frame so a different notation is appropriate.
The transformation shown above does not modify the diamonds object – it creates a new object, which you can then store in a new R variable in your environment. I often perform transformations simply for the purposes of making a table or a plot without ever storing the the result.
9.2 Filtering rows
Tbe diamonds data has many rows (more than 50,000). You might be interested in just a subset of the rows. We use filter to select rows to keep from a table. (We can specify rows to remove and use the logical not operator ! to reverse the logic and keep all the other rows.)
For example, there are only 5 diamonds of more than 4 carats:
You can combine filtering operations by chaining them together with commas, pipes, or using boolean logic (& for and; | for or). The following three expressions all do the same thing.
The second example a convenient abbreviation of the first, but why would we want the third way? Because we can use a logical “or” instead of an and to make different kinds of selections. For example, we might want all the diamonds that meet any of three different tests.
These logical expressions can be challenging to interpret, so use care and test them carefully to be sure the calculation does what you think it should.
9.3 Getting just some rows and columns
All of the tables above showed all 10 columns. Sometimes you only want some of the columns. Use select to extract and reorder columns:
# A tibble: 6 × 5
price cut color clarity carat
<int> <ord> <ord> <ord> <dbl>
1 326 Ideal E SI2 0.23
2 326 Premium E SI1 0.21
3 327 Good E VS1 0.23
4 334 Premium I VS2 0.29
5 335 Good J SI2 0.31
6 336 Very Good J VVS2 0.24
You’ll notice I’ve used head to get just the first 6 rows of the data table. You can also use tail to get the last few rows. And you can modify the number of rows (see the help page for head and tail.)
A more general method for extracting rows by number is slice.
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
3 0.72 Premium F SI1 61.8 61 2777 5.82 5.71 3.56
4 1.12 Premium J SI2 60.6 59 2898 6.68 6.61 4.03
5 1.05 Very Good I SI2 62.3 59 3742 6.42 6.46 4.01
6 1.12 Premium J SI2 60.6 59 2898 6.68 6.61 4.03
Sometimes with a large dataset I want to look at a random subset of all the rows. Use these variants on slice_ for that task. (Try it yourself: you will get different rows each time.)
diamonds|>slice_sample(n =5)
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.02 Premium D VS2 62.9 60 7389 6.42 6.37 4.02
2 1.51 Good J VS1 61.5 59 8271 7.26 7.31 4.48
3 1.51 Good H SI1 61.5 61 10277 7.29 7.34 4.5
4 0.52 Ideal E SI2 61 57 1412 5.18 5.21 3.17
5 0.78 Good E SI2 64 57 2379 5.84 5.81 3.73
You can use arrange to sort the data by any number of columns before slicing or displaying the data.
Warning: There was 1 warning in `arrange()`.
ℹ In argument: `..2 = -color`.
Caused by warning in `Ops.ordered()`:
! '-' is not meaningful for ordered factors
# A tibble: 10 × 4
carat cut color price
<dbl> <ord> <ord> <int>
1 5.01 Fair J 18018
2 4.5 Fair J 18531
3 4.13 Fair H 17329
4 3.65 Fair H 11668
5 3.4 Fair D 15964
6 3.11 Fair J 9823
7 3.02 Fair I 10577
8 3.01 Fair H 10761
9 3.01 Fair I 18242
10 3.01 Fair I 18242
You’ll notice that there are only a few levels of each of the categorical variables (cut, color, clarity). Suppose you want to see all combinations that exist? Use select to pick out those rows, then distinct to show only the rows that are different from each other. (There are 276 combinations, so I’m showing only the top 5, after sorting.)
# A tibble: 5 × 3
cut clarity color
<ord> <ord> <ord>
1 Fair I1 D
2 Good I1 D
3 Very Good I1 D
4 Premium I1 D
5 Ideal I1 D
9.4 Summarizing data tables
These filtering and transformation functions are powerful and encourage a really useful way of probing a complext dataset. The summarize (or summarise) provides a way to apply a function to one or more variables across all the rows or colletions of rows. To use this effectively you want to think of functions that operate on a set of numbers and return just a single one, such as mean, median, max, min, sd, or the simplest of all n which counts the number of rows.
All the original data is gone, replaced by these four functions applied to all the data in the corresponding column. The name of the new column appears on the left side of the = and the calculation on the right. This is the same way we wrote the mutate function, but instead of creating a new variable and computing a value for each row, we get one value computed using all the rows.
Rows can be grouped to get summaries for each level of the grouped varible. Let’s group on cut but otherwise repeat this calculation:
You can group on as many variables at once as you like; the result will have one row for each combination of levels from each grouped variable. (You can even group on quantitative varibles, but you usually don’t want to!)
# A tibble: 5 × 7
# Groups: cut, clarity [1]
cut clarity color n_rows mean_price median_price min_carat
<ord> <ord> <ord> <int> <dbl> <dbl> <dbl>
1 Fair I1 D 4 7383 5538. 0.91
2 Fair I1 E 9 2095. 2036 0.7
3 Fair I1 F 35 2544. 1570 0.34
4 Fair I1 G 53 3187. 1954 0.5
5 Fair I1 H 52 4213. 3340. 0.7
Grouping doen’t do any calculations, it just makes a notation on a data frame that your grouping should be used for subsequent calculations that pay attention to grouping. After you summarise data, one grouping variable is removed (the last one) but the result is still grouped by the remaining variables. (See the notation: Groups: cut, clarity in the result above.)
summarise has an optional argument .groups to say what grouping you want to retain after the calculation is done. drop_last is the default, described above. Probably the next most frequently used choice is drop which ungroups all the rows. To see the diference, I’ll add another step, count(), which is equivalent to summarise(n=n()) and counts rows.
Using drop_last I get the number of rows for each level of cut and clarity (usually 7 for the levels of color: D, E, F, G, H, I, J) for all 40 combinations of cut and clarity (only a few shown here.)
Change the options to .groups = "drop" and I get a very different result – just the number of combinations of cut, clarity, and color (276 of the maximum 5 x 8 x 7 = 280 which are possible):
You can also use ungroup() following a summarise to discard all grouping if you don’t want to use the .groups argument.
9.5 Example
These methods can be combined together to produce elaborate queries. For example, let’s use mutate to compute a price per carat ratio, then find the median of this ratio over all diamonds grouped by cut, color, and clarity. Also report the sample size used for each median. Finally we will extract the top 2 ratios for each cut type (using a grouped slice) and arrange the resulting table from largest to smallest ratio.
diamonds|>mutate(price_per_carat =price/carat)|>group_by(color, clarity, cut)|>summarise(median_price_per_carat =median(price_per_carat), n =n(), .groups ="drop")|>arrange(-median_price_per_carat)|>group_by(cut)|>slice_head(n=2)|>arrange(-median_price_per_carat)
# A tibble: 10 × 5
# Groups: cut [5]
color clarity cut median_price_per_carat n
<ord> <ord> <ord> <dbl> <int>
1 D IF Good 14932. 9
2 D IF Premium 11057. 10
3 D IF Very Good 10202. 23
4 D IF Ideal 7162. 28
5 J VVS1 Premium 5336. 24
6 J VVS2 Very Good 5227. 29
7 H IF Good 5100. 4
8 E VVS1 Fair 4921. 3
9 G VS2 Fair 4838 45
10 H SI1 Ideal 4469. 763
When I did this calculation the first time, I forgot the first arrange. Why is it needed before the grouped slice_head? (Remove it and see how the result changes. Try removing the slice_head to see the full results.) Why is the second arrange needed?
The top 4 price per carat categories are all the best colour (D) and clarity (IF), but cut seems to be less important in determining this ratio. One of the top 10 categories has a lot of diamonds in it (763), while the other groups have very few diamonds relative to the size of the whole dataset (more than 50,000).
9.6 What does this have to do with data visualization?
These tools are very useful for exploring data; helping you understand the structure of your data. This will help you make better visualizations. In particular, the humble n function which counts rows should be used frequently when you summarize with mean or median to let you know if you have very few observations or too many for a good plot.
Summarizing data, for example by computing means and standard errors can be done as part of the visualization process using stat_summary, but it can be easier to know exactly what you are calculating if you compute the summary yourself using the methods in this lesson and then plot the results.
Tables can be a useful way to visualize data and grouped summaries are often the easiest way to make a table to describe your data.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. "O’Reilly Media". https://r4ds.had.co.nz/.