Summarizing data with dplyr

Andrew Irwin, a.irwin@dal.ca

2024-01-30

Transforming data

With a large dataset we commonly want to

Count observations in categories
Compute means, standard deviations, and other statistics
Compute derived quantities (ratios, change units)

dplyr examples

We will transform and reduce observations using

mutate
filter
group_by
summarize

We will use the ChickWeight data in the datasets package.

Data

glimpse(ChickWeight)

Rows: 578
Columns: 4
$ weight <dbl> 42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205, 40, 49, 5…
$ Time   <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21, 0, 2, 4, 6, 8, 10, 1…
$ Chick  <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Diet   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Use summarize to count observations

ChickWeight |> summarize(count = n()) |> kable()

count
578

# ChickWeight |> count() # a handy abbreviation

Count observations in groups

ChickWeight |>
  group_by(Diet) |> 
  summarize(count = n(), .groups="drop") |>
  kable()

Diet	count
1	220
2	120
3	120
4	118

# ChickWeight |> count(Diet)

Two grouping factors

ChickWeight |>
  group_by(Diet, Chick) |> 
  summarize(count = n(), .groups = "drop") |>
  paged_table()

Compute any function you want

ChickWeight |>
  group_by(Chick, Diet) |> 
  summarize(count = n(), 
            max_time = max(Time), 
            median_weight = median(weight),
            increase_weight = max(weight) - min(weight),
            .groups="drop") |>
    paged_table()

Use mutate to create new columns for each observation

ChickWeight |> 
  mutate(growth = weight / Time) |>
  paged_table()

Use mutate to create new columns for each observation

ChickWeight |>
  mutate(growth = (weight-min(weight)) / Time) |>
  paged_table()

Exclude observations with filter

ChickWeight |> 
  filter(Time >= 10) |> 
  arrange(Time) |>
  paged_table()

Exclude observations with filter

ChickWeight |> 
  filter(Time == 10) |> 
  group_by(Diet) |> 
  summarize(weight_10 = mean(weight), .groups = "drop") |>
  kable()

Diet	weight_10
1	93.05263
2	108.50000
3	117.10000
4	126.00000

Combine all three

Compute max weight for each chick and diet over all time.
Filter to retain chicks that reach a target weight
Count the number of observations per chick and diet treatment

ChickWeight |>
  group_by(Diet, Chick) |>
  mutate(max_weight = max(weight)) |>
  filter(max_weight > 110) |>
  group_by(Diet, Chick) |>
  summarize(n = n()) |>
  summarize(count_chicks = n(), .groups = "drop") |>
  kable()

Combine all three

Compute max weight for each chick and diet over all time.
Filter out chicks that reach a target weight
Count the number of observations per chick and diet treatment

Diet	count_chicks
1	15
2	9
3	10
4	10

Summary

Use summarize to compute functions (count, mean, …) to reduce many observations to one
Use group_by to split, summarize, and combine data
Use mutate to define a new variable for all observations
Use filter to select observations

Bonus: dplyr is closely linked to SQL

library(dbplyr)
memdb_frame(ChickWeight) |> count(Diet) |> show_query()

<SQL>
SELECT `Diet`, COUNT(*) AS `n`
FROM `dbplyr_001`
GROUP BY `Diet`

dplyr and SQL

memdb_frame(ChickWeight) |>  group_by(Diet, Chick) |>
  mutate(max_weight = max(weight)) |>
  filter(max_weight > 110) |>
  group_by(Diet, Chick) |>
  summarize(n = n()) |>
  summarize(count_chicks = n(), .groups = "drop") |> show_query()

<SQL>
SELECT `Diet`, COUNT(*) AS `count_chicks`
FROM (
  SELECT `Diet`, `Chick`, COUNT(*) AS `n`
  FROM (
    SELECT `q01`.*
    FROM (
      SELECT
        `dbplyr_002`.*,
        MAX(`weight`) OVER (PARTITION BY `Diet`, `Chick`) AS `max_weight`
      FROM `dbplyr_002`
    ) AS `q01`
    WHERE (`max_weight` > 110.0)
  ) AS `q01`
  GROUP BY `Diet`, `Chick`
) AS `q01`
GROUP BY `Diet`

Task

Practice these skills by doing the calculations in the associated Task.

Summarizing data with dplyr

Transforming data

dplyr examples

Data

Use summarize to count observations

Count observations in groups

Two grouping factors

Compute any function you want

Use mutate to create new columns for each observation

Use mutate to create new columns for each observation

Exclude observations with filter

Exclude observations with filter

Combine all three

Combine all three

Summary

Bonus: dplyr is closely linked to SQL

dplyr and SQL

Further reading

Task