Summarizing data with dplyr

Andrew Irwin, a.irwin@dal.ca

2024-01-30

Transforming data

With a large dataset we commonly want to

  • Count observations in categories

  • Compute means, standard deviations, and other statistics

  • Compute derived quantities (ratios, change units)

dplyr examples

We will transform and reduce observations using

  • mutate

  • filter

  • group_by

  • summarize

We will use the ChickWeight data in the datasets package.

Data

glimpse(ChickWeight)
Rows: 578
Columns: 4
$ weight <dbl> 42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205, 40, 49, 5…
$ Time   <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21, 0, 2, 4, 6, 8, 10, 1…
$ Chick  <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Diet   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Use summarize to count observations

ChickWeight |> summarize(count = n()) |> kable()
count
578
# ChickWeight |> count() # a handy abbreviation

Count observations in groups

ChickWeight |>
  group_by(Diet) |> 
  summarize(count = n(), .groups="drop") |>
  kable()
Diet count
1 220
2 120
3 120
4 118
# ChickWeight |> count(Diet)

Two grouping factors

ChickWeight |>
  group_by(Diet, Chick) |> 
  summarize(count = n(), .groups = "drop") |>
  paged_table()

Compute any function you want

ChickWeight |>
  group_by(Chick, Diet) |> 
  summarize(count = n(), 
            max_time = max(Time), 
            median_weight = median(weight),
            increase_weight = max(weight) - min(weight),
            .groups="drop") |>
    paged_table()

Use mutate to create new columns for each observation

ChickWeight |> 
  mutate(growth = weight / Time) |>
  paged_table()

Use mutate to create new columns for each observation

ChickWeight |>
  mutate(growth = (weight-min(weight)) / Time) |>
  paged_table()

Exclude observations with filter

ChickWeight |> 
  filter(Time >= 10) |> 
  arrange(Time) |>
  paged_table()

Exclude observations with filter

ChickWeight |> 
  filter(Time == 10) |> 
  group_by(Diet) |> 
  summarize(weight_10 = mean(weight), .groups = "drop") |>
  kable()
Diet weight_10
1 93.05263
2 108.50000
3 117.10000
4 126.00000

Combine all three

  • Compute max weight for each chick and diet over all time.
  • Filter to retain chicks that reach a target weight
  • Count the number of observations per chick and diet treatment
ChickWeight |>
  group_by(Diet, Chick) |>
  mutate(max_weight = max(weight)) |>
  filter(max_weight > 110) |>
  group_by(Diet, Chick) |>
  summarize(n = n()) |>
  summarize(count_chicks = n(), .groups = "drop") |>
  kable()

Combine all three

  • Compute max weight for each chick and diet over all time.
  • Filter out chicks that reach a target weight
  • Count the number of observations per chick and diet treatment
Diet count_chicks
1 15
2 9
3 10
4 10

Summary

  • Use summarize to compute functions (count, mean, …) to reduce many observations to one

  • Use group_by to split, summarize, and combine data

  • Use mutate to define a new variable for all observations

  • Use filter to select observations

Bonus: dplyr is closely linked to SQL

library(dbplyr)
memdb_frame(ChickWeight) |> count(Diet) |> show_query()
<SQL>
SELECT `Diet`, COUNT(*) AS `n`
FROM `dbplyr_001`
GROUP BY `Diet`

dplyr and SQL

memdb_frame(ChickWeight) |>  group_by(Diet, Chick) |>
  mutate(max_weight = max(weight)) |>
  filter(max_weight > 110) |>
  group_by(Diet, Chick) |>
  summarize(n = n()) |>
  summarize(count_chicks = n(), .groups = "drop") |> show_query()
<SQL>
SELECT `Diet`, COUNT(*) AS `count_chicks`
FROM (
  SELECT `Diet`, `Chick`, COUNT(*) AS `n`
  FROM (
    SELECT `q01`.*
    FROM (
      SELECT
        `dbplyr_002`.*,
        MAX(`weight`) OVER (PARTITION BY `Diet`, `Chick`) AS `max_weight`
      FROM `dbplyr_002`
    ) AS `q01`
    WHERE (`max_weight` > 110.0)
  ) AS `q01`
  GROUP BY `Diet`, `Chick`
) AS `q01`
GROUP BY `Diet`

Further reading

Task

Practice these skills by doing the calculations in the associated Task.