8 Using the grammar of graphics
8.1 Goals
In this lesson I will demonstrate how to use R and ggplot2 to make visualizations using the ideas from the previous lesson. The emphasis is on the mechanics of making the visualizations. In time we will integrate the ideas about what features of visualizations work best to convey an idea that were introduced at the start of the course.
Your task for this lesson is to practice these skills by generating a series of graphs using different geometries and aesthetic mappings.
8.2 Introduction
By the end of this lesson you should understand how to make many different plots using ggplot. The mental model developed in the previous lesson will connect directly to the R commands in this lesson.
Incidentally, Hadley Wickham, who originally developed ggplot2, is from New Zealand, and one consequence is that he allows for “British” and “American” spellings of some words. So you can use color or colour. In a future lesson when we summarize data you’ll see we can write summarize or summarise. If I switch back and forth, don’t get confused. Both are OK.
8.3 Data
We will use the diamonds dataset for examples in this lesson. As always, you should use str or View to take a look at the data to familiarize yourself with the variables and the number of rows in the data before you begin to make a plot.
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
This is a large dataset, with over 50,000 rows. There are seven quantitative variables and three categorical variables. Read the help page on the dataset to learn more.
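As a quick check of those counts, here is a minimal sketch that tallies the variable types directly. It assumes ggplot2 is installed (the diamonds data ships with it); the names n_quantitative and n_categorical are mine, chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of rows and columns
dim(diamonds)

# Count numeric (quantitative) and factor (categorical) columns
n_quantitative <- sum(sapply(diamonds, is.numeric))
n_categorical  <- sum(sapply(diamonds, is.factor))
c(quantitative = n_quantitative, categorical = n_categorical)
```

Note that price is stored as an integer, which still counts as numeric here.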
8.4 Aesthetic mappings and geometries
We usually pick the aesthetic mappings once we’ve thought about what geometry we want to use. The goal of this lesson will be to demonstrate some of the basics: histogram, box plot, and scatter plot. For a survey of other common geometries, consult [Wilke, chapter 5](https://clauswilke.com/dataviz/directory-of-visualizations.html). Even these three kinds give us lots of room to show off the power of the grammar of graphics.
8.5 Histogram
Let’s draw a histogram of the price of diamonds in the dataset. We map price to the x axis and request the histogram geometry.
diamonds %>% ggplot(aes(x=price)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
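The message suggests choosing a bin width yourself. Here is a minimal sketch of how that might look; the value 500 is an arbitrary choice of mine for illustration, not a recommendation, and p_binwidth is just a name for the plot object.

```r
library(ggplot2)

# binwidth is in the units of the x variable (dollars here).
# 500 is an arbitrary illustrative value.
p_binwidth <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)
p_binwidth
```

Writing ggplot(diamonds, ...) is equivalent to diamonds %>% ggplot(...). Setting binwidth also makes the message go away.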
Maybe one of the categorical features will help us see features in the data. Let’s break the bars down by cut using colour.
diamonds %>% ggplot(aes(x=price, fill=cut)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For a skewed distribution of positive numbers, a log transform can sometimes help reveal patterns. Let’s change the scale to see if that works.
diamonds %>% ggplot(aes(x=price, fill=cut)) + geom_histogram() +
scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Stacked bar graphs like this are interesting, but they can be hard to read. Is the distribution the same for all the cuts? Or are there more Premium and Very Good cuts for the more expensive diamonds? Let’s try a few different ways to split the histogram.
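We can change the layout by passing arguments to the histogram geom. Setting position="dodge" places the bars for each cut side by side instead of stacking them. It’s helpful to have fewer bars in this histogram, so I’ve set the number of bins to 10 using bins=10.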
diamonds %>% ggplot(aes(x=price, fill=cut)) + scale_x_log10() + geom_histogram(bins = 10, position="dodge")
The peak for Ideal is definitely at a lower price than the peak for Premium or Very Good.
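One way to back up a visual impression like this is a quick numeric summary. The sketch below uses base R's tapply to compute the median price for each cut. The median is only a rough stand-in for where the peak of each distribution sits, and median_price is a name I made up for the result.

```r
library(ggplot2)  # provides the diamonds dataset

# Median price for each level of cut
median_price <- tapply(diamonds$price, diamonds$cut, median)
median_price
```

If the reading of the histogram is right, the value for Ideal should come out lower than the values for Premium and Very Good.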
8.6 Box plots
Box plots are useful for showing distributions too. You can draw a box plot with one quantitative variable, or with a quantitative variable and a categorical variable. You can use either x or y for the quantitative variable. A plot with too many colours is hard to read, but we can interpret lots of side-by-side boxplots. So I’ll switch to clarity for the categorical variable.
diamonds %>% ggplot(aes(x = price, y = clarity)) + geom_boxplot()
diamonds %>% ggplot(aes(x = price, y = clarity)) + geom_boxplot() + scale_x_log10()
If you are willing to read a complex plot, you can fill the boxes using cut. (Try color= instead of fill= to compare the two ways of using colour.) This figure is probably too complicated to show someone else, but might be useful as an exploratory plot to see a lot of information in a small space. Think of it: this is a summary of over 50,000 prices across two categorical variables with 5 x 8 = 40 different combinations!
diamonds %>% ggplot(aes(x = price, y = clarity, fill= cut)) + geom_boxplot() + scale_x_log10()
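For comparison, here is a sketch of the same plot with cut mapped to color rather than fill. Nothing else changes; p_outline is just a name for the plot object.

```r
library(ggplot2)

# cut now controls the outline colour of each box instead of its fill
p_outline <- ggplot(diamonds, aes(x = price, y = clarity, color = cut)) +
  geom_boxplot() +
  scale_x_log10()
p_outline
```

With color the boxes stay white and only the lines and outlier points are coloured, which some people find easier to read when the boxes are narrow.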
For our third geom, we will use geom_point to make a scatter plot. Knowing just that, you can probably create the plot below by modifying the code above. In the code below, I changed geom_boxplot to geom_point, changed fill to color, and changed clarity to carat to have a second quantitative variable on the y axis.
diamonds %>% ggplot(aes(x = price, y = carat, color= cut)) + geom_point() + scale_x_log10()
That’s too many points on a scatter plot! There are a few tricks you can use, like making the points smaller and making them partly transparent – but they don’t really help with this much data.
diamonds %>% ggplot(aes(x = price, y = carat, color= cut)) +
geom_point(alpha = 0.5, size = 0.2) +
scale_x_log10()
8.7 Two dimensional histogram
What to do? Let’s create a histogram with two quantitative variables, and show the height of each bar using color.
diamonds %>% ggplot(aes(x = price, y = carat)) + geom_bin2d() + scale_x_log10()
Accurate quantitative assessment is hard to make (basically impossible) with colour brightness, but you can see the price and carat combinations for most of the diamonds. We had to give up using colour to show cut. We’ll return to this data when we talk about facets in a future lesson to see how we can add in one more categorical variable.
We can do a little better with a contour plot instead of colours. You can even add color=cut back in if you like. Try geom_density_2d_filled for an interesting variant.
diamonds %>% ggplot(aes(x = price, y = carat)) + geom_density_2d() + scale_x_log10()
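Here is a minimal sketch of the filled variant mentioned above. It assumes a reasonably recent ggplot2 (I believe geom_density_2d_filled arrived in version 3.3.0); p_filled is just a name for the plot object.

```r
library(ggplot2)

# Filled contour bands of the estimated 2D density
p_filled <- ggplot(diamonds, aes(x = price, y = carat)) +
  geom_density_2d_filled() +
  scale_x_log10()
p_filled
```

Each band is a range of estimated density, so the legend shows intervals rather than a continuous colour bar.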
8.7.1 Statistics
We said that in addition to connecting variables to aesthetic features, we could use statistical transformations to create new derived variables for our plots. So let’s try that!
Instead of plotting a point for each diamond in the dataset, let’s compute averages and standard errors for all the diamonds grouped by clarity.
diamonds %>% ggplot(aes(x = price, y = clarity)) + stat_summary(fun.data = "mean_se") + scale_x_log10()
Now adding a colour for each cut doesn’t make the plot too complicated.
diamonds %>% ggplot(aes(x = price, y = clarity, color=cut)) + stat_summary(fun.data = "mean_se") + scale_x_log10()
Most of the stat_ functions are directly linked to geom_ functions, but a few like stat_summary or stat_unique are handy on their own.
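To see what stat_summary is doing with "mean_se", you can call ggplot2's mean_se function on a vector yourself. As far as I understand ggplot2's order of operations, scale transformations are applied before statistics are computed, so with scale_x_log10 the summaries in the plots above are means of log10(price). The sketch below follows that assumption for one clarity level; the variable names are mine.

```r
library(ggplot2)

# Prices for one clarity level, on the log10 scale
log_price_if <- log10(diamonds$price[diamonds$clarity == "IF"])

# mean_se() returns a one-row data frame:
# y (the mean), ymin and ymax (mean minus/plus one standard error)
summary_if <- mean_se(log_price_if)
summary_if

# The same quantities computed by hand
m  <- mean(log_price_if)
se <- sd(log_price_if) / sqrt(length(log_price_if))
c(y = m, ymin = m - se, ymax = m + se)
```

The two results should agree, which is a handy way to convince yourself that the points and error bars are nothing more mysterious than a mean and a standard error.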
8.8 Scales
We’ve seen how scales can be used to transform the x axis, but there is a lot more we can do.
First, we can set the limits of the axis anywhere we want, to highlight some values or expand the range. (Maybe we have a very specific price range in mind for our data analysis.)
diamonds %>% ggplot(aes(x = price, y = clarity, color=cut)) + stat_summary(fun.data = "mean_se") + scale_x_log10() +
xlim(2000,4000)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 43582 rows containing non-finite values (stat_summary).
This is an example of using the power of ggplot and accidentally shooting your own (data) foot off. Adding xlim replaced the log scale, as the first message says, and the data outside this x range were discarded before the mean and standard error were computed! We got a warning, but it was hard to understand! So setting limits this way is dangerous with summary statistics. (Another reason we will learn to summarize data on our own in a future lesson.)
It’s perfectly safe with raw unsummarized data. We still get a warning, but all the dots shown are untransformed, so we don’t need to wonder if the axis limits were set before or after transforming the data.
diamonds %>% ggplot(aes(x = price, y = carat, color=cut)) + geom_point(size=0.1) + scale_x_log10() +
xlim(2000,4000) + ylim(0,1.7)
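If you want to zoom in on a summary plot without discarding data, ggplot2 also has coord_cartesian, which changes the visible window and leaves the data alone. This is a sketch of that alternative applied to the summary plot from earlier in this section, not something used elsewhere in the lesson; p_zoom is just a name for the plot object.

```r
library(ggplot2)

# coord_cartesian() zooms the view without dropping any rows,
# so every summary is still computed from all the diamonds.
p_zoom <- ggplot(diamonds, aes(x = price, y = clarity, color = cut)) +
  stat_summary(fun.data = "mean_se") +
  scale_x_log10() +
  coord_cartesian(xlim = c(2000, 4000))
p_zoom
```

Summaries that fall outside the window are simply not visible, but the ones you can see are computed from the full dataset and the log scale is kept.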
The yellow we used before won’t look good printed in a report, so let’s change the range of the colours.
diamonds %>% ggplot(aes(x=price, fill=cut)) + scale_x_log10() + geom_histogram(bins = 10, position="dodge") +
scale_fill_viridis_d(begin = 0.0, end = 0.8)
The viridis colour scale is supposed to be colour-blind friendly and to translate well when printed in gray scale on paper. It’s a range of colours selected between two extremes. Experiment with different values for begin and end between 0 and 1.
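If you want to see which colours a given begin and end produce without drawing a plot, the scales package (installed alongside ggplot2) can list them. This is a small sketch; the second pair of values is an arbitrary example of mine, and cols_default and cols_narrow are illustrative names.

```r
library(scales)

# The five colours for five cuts with begin = 0 and end = 0.8
cols_default <- viridis_pal(begin = 0, end = 0.8)(5)
cols_default

# A narrower range that also drops the darkest purple
cols_narrow <- viridis_pal(begin = 0.2, end = 0.8)(5)
cols_narrow
```

Each result is a character vector of hex colour codes, one per level of the categorical variable.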
8.9 Annotations
The most important annotations are labels for the axes, guides for colours and shapes, and the title, subtitle, and caption. Here’s an example showing how to change each one using the labs (for labels) function.
diamonds %>% ggplot(aes(x=price, fill=cut)) + scale_x_log10() + geom_histogram(bins = 10, position="dodge") +
scale_fill_viridis_d(begin = 0.0, end = 0.8) +
labs(x = "Price ($, log scale)",
y = "Number of diamonds",
fill = "Cut",
title = "Diamond price varies with cut quality",
subtitle = "I don't often use subtitles, but you can",
caption = "For the source of the data or other note")
Another kind of annotation adds text to a figure. It’s called an annotation instead of a geom because the annotation is a custom thing you add that doesn’t come from the data. Sometimes this is a corporate branding graphic. Or a cartoon reminding the reader what the data are about. Here I’ll add a text message.
diamonds %>% ggplot(aes(x=price, fill=cut)) + scale_x_log10() + geom_histogram(bins = 10, position="dodge") +
annotate(geom="text", x = 1300, y = 4500, label = "Compare the peaks for\nIdeal and Good.",
hjust = 0, vjust = 0.5, size = 5)
You can add annotations in the shape of points or arrows too.
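Here is a minimal sketch of an arrow and a point added with annotate. The coordinates are placeholders I picked by eye for illustration, so expect to adjust them; p_arrow is just a name for the plot object.

```r
library(ggplot2)

p_arrow <- ggplot(diamonds, aes(x = price, fill = cut)) +
  scale_x_log10() +
  geom_histogram(bins = 10, position = "dodge") +
  # An arrow is a line segment with an arrowhead at (xend, yend)
  annotate(geom = "segment", x = 1300, y = 4500, xend = 900, yend = 3500,
           arrow = arrow(length = unit(0.3, "cm"))) +
  # A single point at a location chosen by hand
  annotate(geom = "point", x = 1300, y = 4500, size = 3)
p_arrow
```

As with the text annotation, x and y are given in the units of the data, even though the x axis is drawn on a log scale.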
A better way to annotate is to create a data frame with x and y locations and a label. Here I’ll find the average price and carat for each combination of cut and clarity, use colour for cut and add a text label for clarity. We’ll learn more about summarizing data later, so feel free to skip over the calculation and focus on the plotting for now.
## `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
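The calculation that produced s is not shown here. The sketch below is my guess at one way to build a data frame with the columns the plots use (cut, clarity, price, carat), assuming dplyr is installed; it is not necessarily the code used for the figures.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Mean price and mean carat for each combination of cut and clarity.
# This is an assumed reconstruction of the hidden calculation.
s <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(price = mean(price), carat = mean(carat))
s
```

Running this prints the same message about grouped output. You can add .groups = "drop" inside summarise to silence it.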
s %>% ggplot(aes(x= price, y = carat, color = cut)) + scale_x_log10() +
geom_point() +
geom_text(aes(label = clarity ))
There are a few problems with that graph! The labels are coloured too. The color scale for “cut” looks strange. The text labels are on top of the points.
The colour of the labels comes from the inheritance of the aesthetics. It’s easy to fix. Only map cut to colour in geom_point.
s %>% ggplot(aes(x= price, y = carat)) + scale_x_log10() +
geom_point(aes(color=cut)) +
geom_text(aes(label = clarity ))
A simple change makes a huge difference.
We can use geom_text_repel from the ggrepel package to fix the placement of the labels. I’ll shrink the size of the text a bit too.
library(ggrepel)
s %>% ggplot(aes(x= price, y = carat)) + scale_x_log10() +
geom_point(aes(color=cut)) +
geom_text_repel(aes(label = clarity ), size = 3)
There are too many labels on the plot so it’s not a good final visualization, but it demonstrates how to add labels and make a plot that might be very useful for you as you explore a dataset.
8.10 Theme
The theme allows you to set text font and size for labels and numbers on scale, line thicknesses for axes and ticks, position of the guides and many other features. You can also use predefined themes created by others. Here are a few examples that I find useful as an introduction to this topic.
My favourite gets rid of the gray background.
s %>% ggplot(aes(x= price, y = carat, color = cut)) + scale_x_log10() +
geom_point() +
theme_bw()
Second favourite makes the text larger for all elements.
s %>% ggplot(aes(x= price, y = carat, color = cut)) + scale_x_log10() +
geom_point() + theme_bw() +
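theme(text = element_text(size=20))
You can also move the legend inside the plot area by giving its position as relative x and y coordinates between 0 and 1.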
s %>% ggplot(aes(x= price, y = carat, color = cut)) + scale_x_log10() +
geom_point() + theme_bw() +
theme(text = element_text(size=20),
legend.position = c(0.15,0.75))
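A note on versions: as far as I know, ggplot2 3.5.0 and later prefer legend.position = "inside" together with legend.position.inside, and warn when legend.position is given as a pair of numbers. The sketch below picks the syntax based on the installed version. It plots a thinned subset of diamonds so that it runs on its own; inside_legend and p_legend are names I made up.

```r
library(ggplot2)

# Choose the legend placement syntax to match the installed ggplot2
inside_legend <- if (packageVersion("ggplot2") >= "3.5.0") {
  theme(legend.position = "inside",
        legend.position.inside = c(0.15, 0.75))
} else {
  theme(legend.position = c(0.15, 0.75))
}

# Every 100th diamond, just to keep the example quick
small <- diamonds[seq(1, nrow(diamonds), by = 100), ]

p_legend <- ggplot(small, aes(x = price, y = carat, color = cut)) +
  scale_x_log10() +
  geom_point() +
  theme_bw() +
  inside_legend
p_legend
```

The two numbers are fractions of the plot area, so c(0.15, 0.75) places the legend near the upper left.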
8.11 Further reading
- Healy Chapter 3 on making plots
- A chapter on these ggplot concepts from a data science course
- A ggplot cheatsheet summarizing a huge amount of information in two pages
- A guide to themes from the ggplot2 book
- A whole book on ggplot2