Exercises

I have created exercises to give you practice with the main skills and ideas from each lesson. Exercises are new to the course this year, so you should anticipate revisions to this page throughout the term.

After the first week, most exercises have work for you to do in R. Save your completed work as a Quarto document; it will be a reference later in the course when you want to review skills you learned earlier.

Most of the questions on the midterm test and final exam will be based on an exercise on this page.

Exercises are not to be submitted for evaluation and are not graded.

Invitation (Lesson 1)

Describe an example from the lesson: what data are shown on the graph, how are the data presented, what is the most straightforward conclusion to be drawn from the presentation, are there any elements that are difficult to understand?
Find a visualization you like on the internet (provide a copy and the source). Repeat question 1, writing so that a classmate would understand what you see in the visualization.
How does the visualization of atmospheric CO2 concentration on different time axes (a million years, a thousand years, a century, a decade, a year, a month) change what you see on the graphs? Write a sentence about the message of this graph shown on three different time axes.
Describe some data you are interested in and a question you think it could answer. How could you get those data? Describe a visualization you would create. What would a reader easily understand from your visualization?
Reinterpret one of the graphs from the lesson. How could you present it differently? For example, describe what you would learn from a histogram of the historical weather for the current month. What if you plotted the historical CO2 data the way the sea ice volume data are plotted?
In Figure 1.7, projected career earnings, can you conclude that a chemical engineer will earn more than someone working in education? Is it likely? When you consider the probable range of lifetime earnings (middle 50%), how many career areas summarized in the graph have a reasonable chance of similar, mid-level earnings? Explain.

Computer tools (Lesson 2)

Read Healy, Sections 2.1-2.4, which contain excellent advice on the reasons for using quarto (R markdown), R/Rstudio, projects, the basics of R, and being patient as you learn computing tasks.

Make a list of about 10 key takeaway messages from the reading. For example, I would start with

Do all my learning and work in R in Quarto (.qmd, R markdown) documents created with Rstudio. This is a new skill to learn, but will be worthwhile. Don’t use a word processor. Don’t cut and paste code and results from the R console into a document. It’s too easy to make mistakes this way and will take longer.
Take notes about what I learn (for ggplot and other skills). I’ll write these in a private qmd document called data-viz-notes.qmd (or something like that).
Start a new .qmd document for each new skill I’m trying to learn or each new set of exercises I work on. Give the document an informative name and put in a “Stat 2430” folder.

Describe in one or two sentences the capacity or purpose of each of the following computer software: R, Rstudio, ggplot, tidyverse, R markdown (or quarto), git, and github.
Most software gets replaced as ideas and technologies develop. What is a function of each of the software tools used in this course that you would want to have in a technology that replaces it?
Some people prefer python to R for data analysis. What are some generic reasons you might prefer one tool to another? (familiarity, performance, capacity to do a particular task, use by colleagues/associates)

Setting up (Lesson 3)

Follow the instructions in the syllabus or in Lesson 3 notes to set up your computing environment.

Provide your name, email address, and GitHub ID using the form linked here. Your GitHub ID will be your public name for work submitted in the course and shared with your teammate for the term project, so use professional judgement when creating your GitHub ID.

Install R, Rstudio, and the packages listed in Lesson 3.
Install git. What happens when you type “git” in Rstudio’s Terminal window?
Create a github account.
Log in to posit.cloud. This is a complete R, Rstudio, and git setup “in the cloud” that can be used if you have trouble using R on your own computer. I suggest you log in using your GitHub account to link Rstudio to git.
Ask for help with any of these exercises if you need it.
Have you installed the packages listed in the lesson? What happens when you type “library(tidyverse)” in the Console window? There should be a bunch of messages printed on the screen. Some contain warnings, which are informational messages and often don’t mean anything is wrong. Learn to notice the difference between a warning and an error. You can make an error by typing something at the console that can’t be interpreted, like 2 +* 2.
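To make the difference between a warning and an error concrete, here is a minimal sketch in base R. The inputs ("ten" and the text "2 +* 2") are only illustrations, and tryCatch() is used so that the script keeps running after the error.

```r
# A warning: the call finishes and returns a value (here NA),
# but R reports that something unexpected happened.
warn_msg <- tryCatch(as.numeric("ten"),
                     warning = function(w) conditionMessage(w))
warn_msg

# If the warning is ignored, you still get a result to work with.
x <- suppressWarnings(as.numeric("ten"))
is.na(x)

# An error: R cannot interpret the input, so no value is produced.
# parse() reads the text as R code; tryCatch() captures the error message.
err_msg <- tryCatch(eval(parse(text = "2 +* 2")),
                    error = function(e) conditionMessage(e))
err_msg
```

At the Console you would normally just type the expression and read the message that R prints; the handlers here only make the two cases easy to compare side by side.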

Look at data (Lesson 4)

Read Healy, Chapter 1 which contains much more on the topic of this lesson.

Some people feel very strongly about the placement of 0 on the vertical scale of plots. Look again at the carbon dioxide plots in Lesson 1. The vertical scale does not start at 0. Use the ideas in Healy, Chapter 1.6 to describe how you would interpret vertical position on the carbon dioxide plots and how you could interpret this position if 0 was included on the vertical scale.
Hans Rosling’s visualizations (as shown in Lesson 1) use many channels for conveying data: x and y position, color, size, an annotation for year in the plot background. The interactive versions use an animation for change over time, and mouse-over pop-ups to identify the country for each dot. These are very complex visualizations!
1. For Rosling’s plot shown in Lesson 1, what variables are shown for each of x and y position, color, and symbol size?
2. According to Healy, Chapter 1 which of these 4 features is most difficult to make quantitative comparisons with? Why? Do you agree? Could you defend the use of each feature for each variable? How?
3. In your judgment, is this visualization effective or too complex? Watch the TED talk or experiment with the interactive version before answering the question. Does the live explanation provided in Rosling’s oral presentation help you interpret the plot? Does this make you think that visualizations destined for different media (e.g., an article or blog, a live presentation, a recorded video, a zoom presentation) should be different?
Color or shading is sometimes used to indicate quantitative data, for example on a map. Does the checker shadow illusion suggest a way such a representation could be hard to interpret quantitatively?

Making your first plot (Lesson 5)

Repeat the examples from Lesson 5 until you are comfortable with the basics of making a plot. We will learn much more about plotting starting the lesson after next. Refer to Healy, Section 2.6 for a different basic plot.
Explain what the functions ggplot, geom_point, and aes do.
Explain what the functions labs and theme do. (I would expect your explanation to be incomplete, since you just learned about them, but you should be able to say something about what these functions can do.)
How many aesthetics do you know about now? These are the attributes of the plot that can be mapped to a variable using the aes function.
What is the difference between a dataset (e.g., gapminder, penguins) and a variable inside the dataset? How do you use each – conceptually and in terms of how code is written – when you make a graph with ggplot?
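As a reference while you answer these questions, here is a small sketch of a complete plot. It assumes the mpg dataset that ships with ggplot2 rather than the datasets used in the lesson, and the label text is made up for illustration.

```r
library(ggplot2)

# mpg is the dataset (a data frame); displ, hwy, and class are variables inside it.
first_plot <- ggplot(data = mpg,                   # the dataset is named once
                     mapping = aes(x = displ,      # variables are mapped to aesthetics
                                   y = hwy,
                                   colour = class)) +
  geom_point() +                                   # draw one point per row of data
  labs(x = "Engine displacement (L)",
       y = "Highway fuel economy (mpg)",
       colour = "Vehicle class",
       title = "Fuel economy and engine size") +
  theme(legend.position = "bottom")                # adjust a non-data element of the plot

first_plot
```

Notice that the dataset appears as an argument to ggplot, while variable names appear only inside aes.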

Version control software (Lesson 6)

Get the software git working on your computer by following the steps in Lesson 6.

If you gave me your GitHub ID after lesson 3, you will receive an invitation by email to a new GitHub repository. Accept that invitation. The invitation will expire a few days after you receive it, so don’t delay accepting the invitation. Use these instructions to get the repository for homework assignments on your computer.

Grammar of Graphics (Lesson 7)

Review material on ggplot to connect its ideas to the concept of the grammar of graphics and prepare for learning the details in the next lesson.
Browse the R graph gallery to explore the huge variety of visualizations you can make with R. Practice thinking about how the language of aesthetic mappings, geometries, and scales can be used to describe these visualizations.
Read through the ggplot cheatsheet to see how the concepts of the grammar of graphics will be connected to computer code. Don’t worry about the details – you will practice making visualizations using these tools over many future lessons.
In the context of the grammar of graphics, what is an aesthetic?
Have you used Excel or other software to plot data before? Compare the perspective of using that software to the ideas in the grammar of graphics.
What are the 7 different layers in a visualization? (See course notes.) Some of these we have not discussed yet, but you should be able to explain how data, aesthetics, geometries, and theme each describe a separate element of a visualization and can be combined together to make a visualization.

Using the grammar of graphics (Lesson 8)

Practice using ggplot to make graphs by reproducing examples from notes and modifying them by changing variables used in aesthetic mappings. Practice using other data sources described in the slides for this lesson or the “Finding and Accessing Data” chapter in course notes.
What aesthetic mapping is required to make a histogram? Give an example. Should you use a quantitative or categorical variable?
What aesthetic mappings are required to make a boxplot? You don’t need to specify a categorical variable for color or fill, but what happens if you do?
What aesthetic mappings are required to make a scatterplot? Should you use quantitative or categorical variables?
How can you change the text in the following positions of a figure: x and y axis labels, title, subtitle, label on legend? Write an example of each.
A default setting with ggplot is to make the background grey with white grid lines. How can you change this to white, with black grid lines? What is the effect of using theme_classic()?
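The following sketch shows one possible way to set up the mappings, labels, and themes these questions ask about. It assumes the mpg dataset from ggplot2, and the label text is hypothetical; treat it as a starting point for your own experiments, not the only answer.

```r
library(ggplot2)

# A histogram needs only an x aesthetic, mapped to a quantitative variable.
hist_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20) +
  labs(x = "Highway fuel economy (mpg)",
       y = "Number of vehicles",
       title = "Fuel economy",
       subtitle = "mpg dataset from ggplot2") +
  theme_bw()   # white panel background with light grey grid lines

# A boxplot uses a categorical x and a quantitative y; fill is optional.
box_plot <- ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  labs(fill = "Drive train") +   # a legend title is set using the aesthetic's name
  theme_bw() +
  theme(panel.grid = element_line(colour = "black"))  # override the grid line colour

hist_plot
box_plot
```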

Summarizing data (Lesson 9)

Work through the exercises to practice mutate, filter, group_by, and summarize provided in the file task-L09-summarize. Right click the link and select “save as” to get the file on your computer. Then open it with Rstudio. Follow the instructions in the file. When you are done, knit the file to be sure there are no errors. Save your answers for future reference.

Describe what the functions mutate, filter, group_by, and summarize do to a data frame. Write a working example using each function and explain what it does.
Does a line of code like gapminder |> filter(country == "Canada") change the original dataset? Use that code and then follow it up with gapminder |> filter(country == "China"). If the dataset changed, what would you expect from this second line of code?
Write a line of code that creates a new quantitative variable in a data frame from an arithmetic calculation using an existing variable. Explain in English what your line of code does.
What do the symbols &, |, and ! mean inside the filter function?
Is there a difference in the output of diamonds |> filter(color == "J", cut == "Premium") and diamonds |> filter(color == "J") |> filter(cut == "Premium")? Explain what the code does and what the difference is, if any.
What do the functions head, tail, select, and arrange do?
What do slice, slice_sample, and slice_head do?
List as many summary statistic functions as you can, such as mean, that give one number when applied to a set of numbers.
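Here is a short sketch that chains the four main functions together. It assumes the mpg dataset from ggplot2, and the new variable name cty_l100km is made up for this example (235.215 divided by miles per US gallon gives litres per 100 km).

```r
library(dplyr)
library(ggplot2)  # used here only for the mpg dataset

city_summary <- mpg |>
  mutate(cty_l100km = 235.215 / cty) |>             # new variable from arithmetic
  filter(year == 2008 & !(class == "2seater")) |>   # & means "and", ! means "not"
  group_by(drv) |>                                  # one group per drive train
  summarize(mean_l100km = mean(cty_l100km),         # one row per group
            n = n())

city_summary

# The pipeline returned a new table; mpg itself has no new column.
"cty_l100km" %in% names(mpg)
```

Because the result is stored under a new name, you can rerun or change the pipeline as often as you like without altering mpg.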

Facetted plots (Lesson 10)

What is the difference between facet_wrap and facet_grid? Write an example using each.

For the following questions, use the subset of the diamonds dataset defined as diamonds_subset <- diamonds |> filter(clarity == "IF", carat > 0.5), or other data of your choosing.

Make a scatter plot of price as a function of carat from the diamonds_subset data. Create a facet for each level of the cut categorical variable. Describe what you learn by looking at the different facets of the plot.
When I drew the previous plot, I noticed a different pattern for diamonds smaller than 1 carat compared to diamonds 1 carat or larger. Create a new categorical variable to separate diamonds into these two groups. Use this categorical variable to create a two facet scatter plot and colour the points on your plot according to the cut variable. Show any variables you like on the plot.

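diamonds_subset_1ct <- diamonds_subset |>
  # right = FALSE gives intervals [0, 1) and [1, Inf), so a 1.00 ct diamond is "1 ct or more"
  mutate(one_carat = cut(carat, breaks = c(0, 1, Inf), right = FALSE,
                         labels = c("Less than 1 ct", "1 ct or more")))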

Make a scatter plot of the length and width of each diamond (the variables x and y) in the diamonds subset data. Create facets for each combination of color and cut. Describe what you learn by looking at the different facets of the plot.
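To compare the two faceting functions before trying them on the diamonds data, here is a minimal sketch. It assumes the mpg dataset from ggplot2 so that it does not give away the exercises above.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_wrap: panels for the levels of one variable, wrapped into rows and columns.
wrapped <- base_plot + facet_wrap(~ class)

# facet_grid: rows defined by one variable and columns by another,
# giving one panel per combination (some panels may be empty).
gridded <- base_plot + facet_grid(drv ~ cyl)

wrapped
gridded
```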

Using LLM to learn (Lesson 11)

Practice getting an LLM to explain what R code does. Pick examples from the course notes. Type the word “explain” and paste in the code. Try several different pieces of code. Try the same code in different AI assistants to see if there is an answer you find easier to understand or more informative.
Try asking follow-up questions if part of an explanation is not clear (or even if it is clear to you, to see how the AI responds.)
Explain what the following code does:

diamonds |>
  mutate(volume = x * y * z,
         price_per_volume = price / volume) |>
  filter(volume > 0) |>
  group_by(cut, color) |>
  summarise(mean_price_per_volume = mean(price_per_volume),
            mean_carat = mean(carat),
            n = n(),
            .groups = "drop") |>
  filter(n >= 50) |>
  arrange(-mean_price_per_volume) |>
  group_by(color) |>
  slice_head(n = 3) |>
  arrange(color, -mean_price_per_volume)

Paste the code into an AI assistant and ask for an explanation. Check your line by line understanding and your overall summary of the result of the calculation.

Write a sentence that describes a calculation, for example: “Write tidyverse code to compute a table based on the diamonds dataset showing which cut-color combinations command the highest prices relative to their physical size, filtered to statistically significant groups, with up to 3 cuts shown per color grade.” Ask an LLM to write tidyverse code to perform the calculation. Read the code and decide if it is correct. Paste the code into R and determine if it works and if it does the calculation described.
Write lots of English-language queries, such as “Find the smallest penguin of each species in the palmerpenguins dataset.” Write R code to accomplish the task. Ask the LLM to write code. Compare your code and its code. Test both versions in R. Do they give the same answer? If not, ask the LLM to explain what your code does. Many computations can be achieved in multiple ways, so don’t be surprised if the LLM produces code different from yours.

Reading data (Lesson 12)

Download the Excel file from the lesson, practice reading it into R, editing it in Excel (or another spreadsheet), and confirming that you can read the changes with R. What happens if you leave a cell blank? What happens if you put a number in a column that is text in the other rows? What about putting text in a column where the other entries are numbers?
Get the file “test-data.csv” and read the data into R using read_csv.
Find an Excel or csv data file in your own records or on the Internet. Read the data into R and confirm you have read the correct number of variables, the data were interpreted correctly, and you have the correct number of rows of data. If the data are not “tidy” (a grid with variable names in the first row), how does R store and present the data? What if your Excel workbook has multiple sheets in it?
Variable names with spaces or “special” symbols need to be enclosed in back-ticks before they can be used in R. Did your imported data have any variable names that needed to be enclosed in back-ticks? Change a column name in test-data.xlsx to have a space, or a minus sign, or other symbols and re-read the file. How does R present the variable name?
Use the function janitor::clean_names() on your imported dataset. Did this function change the names of the variables? Say how, briefly.
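If you do not have a file handy, the sketch below reads a small made-up table from text so that you can see how blank cells and awkward column names are handled. The column names and values are hypothetical, and the last step runs only if the janitor package is installed.

```r
library(readr)

# I() marks the string as literal data rather than a path to a file.
csv_text <- I("Student Name,Task-1,Task 2\nAda,8,9\nGrace,,7\n")
grades <- read_csv(csv_text, show_col_types = FALSE)

names(grades)            # the names are kept exactly as written
grades$`Task-1`          # back-ticks are needed because of the minus sign
is.na(grades$`Task-1`)   # the blank cell was read as NA

# clean_names() rewrites the names so that back-ticks are no longer needed.
if (requireNamespace("janitor", quietly = TRUE)) {
  names(janitor::clean_names(grades))
}
```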

Reshaping data (Lesson 13)

Reshape the table penguins |> count(species, island) into a wide format table with species in rows and islands in columns.
Read the file “question-2.csv” into R. Reshape the data from wide to long format using pivot_longer to obtain three columns: Student, Task_number, Grade. Make a faceted plot of grade on the y-axis, task on the x-axis with each facet corresponding to a different student.
Make a wide table from the gapminder data. Use filter to include only countries from Oceania and only the years 1997, 2002, and 2007. Then use pivot_wider to make a wide table with countries in rows and population in the entries under the columns named with the year.
Use separate on the results of question 2 to create two new columns: Task and Number, where Task is the word “Task” and Number is the task number as a numeric variable.
Use str_remove to clean up the task number column to be just the number. Convert the number from text format to a numeric format.
Data from the slides came from the tidyr package, specifically the who dataset (which is not tidy!) and the population dataset. Get the data from here and repeat the examples from the slides, with your own variations.
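As a warm-up for these questions, here is a sketch of a round trip between wide and long formats. The small table is made up, and its column names (Task_1, Task_2) are an assumption; check the names in question-2.csv before adapting the code.

```r
library(dplyr)
library(tidyr)
library(stringr)

wide <- tibble(Student = c("Ada", "Grace"),
               Task_1 = c(8, 6),
               Task_2 = c(9, 7))

# Wide to long: one row per student and task.
long <- wide |>
  pivot_longer(cols = starts_with("Task_"),
               names_to = "Task_number",
               values_to = "Grade") |>
  # Strip the text prefix, then convert the remaining digits to a number.
  mutate(Task_number = as.numeric(str_remove(Task_number, "Task_")))

# Long back to wide: one column per task.
back_to_wide <- long |>
  pivot_wider(names_from = Task_number,
              values_from = Grade,
              names_prefix = "Task_")

long
back_to_wide
```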

Formatting tables (Lesson 14)

Display a table of population for countries in Oceania in three years (from gapminder). Round the population to the nearest 1000 (3 places before the decimal place) and add a separator between every three digits to make the numbers easier to read (a comma or space). Add a header row over the three years that says “Year”. Capitalize the heading “country” as “Country”. Add a brief but informative caption. Work through these refinements one at a time, making sure your code works after each improvement.
Compute a total score for tasks 1-4 for each student using the grades from the file question-2.csv and the adorn_totals function from janitor. Assume the maximum scores are all the same for each task, so you don’t need to weight the scores when you add them. Make a kable formatted table with an informative caption.
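The sketch below shows some of these refinements using kable from the knitr package. The countries and populations are made-up placeholders, not gapminder values, and the column names y2002 and y2007 are hypothetical.

```r
library(knitr)

pop_table <- data.frame(
  country = c("Country A", "Country B"),
  y2002 = c(12345678, 2345678),
  y2007 = c(13456789, 2456789)
)

rounded <- pop_table
rounded[, -1] <- round(rounded[, -1], -3)   # negative digits round to the nearest 1000

formatted <- kable(rounded,
                   col.names = c("Country", "2002", "2007"),
                   format.args = list(big.mark = ","),   # separator every three digits
                   caption = "Population rounded to the nearest thousand (made-up values).")
formatted
```

A header spanning several columns is not part of kable itself; one option, assuming you have the kableExtra package, is its add_header_above function.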

Getting help (Lesson 15)

The material in this lesson should be helpful if you run into challenges while working on Assignment 2, which asks you to develop new skills with unfamiliar functions.

Working with models (Lesson 16)

Practice adding linear models and smooths to visualizations. Reproduce some of the examples from the course notes, mini-lecture, or course textbook. Create new visualizations of your own design by changing the model, data, or underlying visualization. Experiment with colors and facets.

Take any plot with two quantitative variables and add a smoothing line using geom_smooth. Experiment with different options: linear model (method = "lm"), generalized additive model (method = "gam"), locally estimated scatterplot smoothing (method = "loess"). Try variations on formula = y ~ x as appropriate.
Develop an example using geom_quantile.
Healy has an example geom_smooth(method = MASS::rlm). What does that do? Compare to method = "lm". Look at some examples you create and read the help page for MASS::rlm. For example, try anscombe |> ggplot(aes(x3, y3)) + geom_point() + geom_smooth(method = MASS::rlm).
Develop some rules for yourself to decide if lm (linear, polynomial), gam or loess is more appropriate to highlight patterns in different scatter plots. What questions could you answer with each?
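To get started, here is a sketch that adds four different smooths to the same scatter plot. It assumes the mpg dataset from ggplot2; the span value and the spline basis are arbitrary choices to experiment with, not recommendations.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Straight line fitted by least squares.
linear <- base_plot + geom_smooth(method = "lm", formula = y ~ x)

# Still a linear model, but with a quadratic polynomial in x.
quadratic <- base_plot + geom_smooth(method = "lm", formula = y ~ poly(x, 2))

# LOESS: local fits; a smaller span follows the data more closely.
local_fit <- base_plot + geom_smooth(method = "loess", formula = y ~ x, span = 0.5)

# GAM: a penalized spline (uses the mgcv package behind the scenes).
additive <- base_plot + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

linear
quadratic
local_fit
additive
```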

Linear models (Lesson 17)

Exercises on linear models. File: task-L16-linear-models.

Work through Healy section 6.4 to generate and draw “confidence ribbons”.
What is the difference between prediction and confidence intervals?
Experiment with the geom_pointrange function to draw a point and error bar. Use the tidy output of a linear model to obtain the estimate and confidence interval for the slope of a linear model. Follow the example in Healy Chapter 6.5, then create your own example. (See Fig. 6.6 in this section.)
Compare a standard linear model, using lm with quantile regression using rq from the quantreg package. Use the diamonds dataset and make a scatter plot of price vs carat. Add a linear model and a quantile regression line for the 25th, 50th, and 75th percentiles. Describe what you learn from this plot about how price changes with carat size.
Advanced example. Use group_by and nest from Healy Chapter 6.6 to make a set of regression lines. First repeat the example from the book. Then try to adapt it to generate coefficients for a linear model of price vs carat for each level of clarity in the diamonds data set.
The economics dataset in the ggplot2 package reports several economic indicators for the USA monthly from 1967-07-01 to 2015-04-01. The data include pop (total population, in thousands), unemploy (the number of unemployed persons, in thousands), uempmed (median duration of unemployment in weeks), and psavert (the personal savings rate). For more information, see the help page.

Plot the personal savings rate as a function of time. Add a regression line (straight line, y ~ x).

I have computed the time since the start of the time series and called it years. Use lm to create a straight line regression line. Display a table showing the intercept and slope of the regression line, with a confidence interval on these parameters.

# decimal_date() comes from the lubridate package
my_econ <- economics |> mutate(dd = decimal_date(date),
                               years = dd - min(dd))

How well does a straight line represent this data? Describe any concerns you have about this representation of the data.

Sometimes economists report a “seasonally adjusted” rate of unemployment, in recognition that there are seasonal patterns of employment in some sectors of the economy that cause regular, predictable fluctuations in employment and unemployment. Here I compute the day of the year for each observation (discarding the actual year) and the unemployment rate. Plot the unemployment rate as a function of the day of the year. Add quantile regression lines for the 25th, 50th, and 75th percentiles.

my_econ2 <- my_econ |> mutate(yday = yday(date),
                              urate = unemploy/pop)
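For the questions above about confidence intervals on coefficients and the difference between confidence and prediction intervals, here is a minimal base R sketch. It assumes the built-in cars dataset rather than the economics data, and the speed value of 15 is arbitrary.

```r
fit <- lm(dist ~ speed, data = cars)

# Confidence intervals for the intercept and slope.
coef_ci <- confint(fit, level = 0.95)
coef_ci

new_speed <- data.frame(speed = 15)

# Confidence interval: uncertainty in the mean response at speed = 15.
conf_int <- predict(fit, new_speed, interval = "confidence")

# Prediction interval: uncertainty for a single new observation at speed = 15.
pred_int <- predict(fit, new_speed, interval = "prediction")

conf_int
pred_int
```

The prediction interval is wider because it includes the scatter of individual observations around the line. If you prefer a tidy table of estimates, broom::tidy(fit, conf.int = TRUE) returns the same coefficient intervals as a data frame.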


GAM and LOESS (Lesson 18)

The economics dataset in the ggplot2 package reports several economic indicators for the USA monthly from 1967-07-01 to 2015-04-01. The data include pop (total population, in thousands), unemploy (the number of unemployed persons, in thousands), uempmed (median duration of unemployment in weeks), and psavert (the personal savings rate). Plot the median duration of unemployment as a function of the date. Add a GAM or LOESS smooth to the data which you think captures the main trend in the data. Explain your preferred choice. (Why is GAM better than LOESS, or the reverse?) What is the main trend highlighted by your smooth?
In the code block below I calculate the increase in population each year, relative to the lowest population in the year. The first and last years in the dataset are outliers because there is less than a full year of data for those years.

my_econ3 <- economics |> mutate(year = year(date),
                                 month = month(date)) |>
              group_by(year) |>
              mutate(pop_incr = (pop - min(pop))/min(pop)) |>
              filter(pop_incr == max(pop_incr))

Plot the population increase (pop_incr) as a function of the year. Use filter to exclude the first and last years. Add a smooth curve to the graph. Summarize the trend in the annual rate of population increase in the USA in a sentence or two.

Collaborating (Lesson 19)

Practice skills associated with collaborating on GitHub using the project repository for your team.

Working with your teammates, make some deliberately independent or conflicting edits. For example, add your names in new lines near the top of the proposal document. Then make different changes to the same line of a file. Stage, commit, and push to GitHub. The first person will have no trouble, but the second should get a merge error. Resolve the error. Repeat the whole process, but switch the order in which you push changes to GitHub. Clean up your file by removing all these experimental edits.

Finding data (Lesson 20)

Look for some data on the internet. Download the data to your computer. Read the data into R. Make a summary table describing some part of the data. Make a visualization using some of the data. Can you find any formatting errors in the data? Did you have any trouble reading the data into R?

You can use any data you like for this exercise. If you want a specific suggestion, get some data from gapminder.org or another source in the lesson.

Think of a question you want to answer. Write a prompt for an LLM to request data to answer your question and for code to read the data in R. Test the code to see if it works! Try to find documentation on the internet, not simply what is provided by the LLM, to confirm the source of the data and associated metadata.

Reproducible Reports (Lesson 21)

We’ve been using Quarto (or R markdown) to make reproducible reports throughout the course. In this exercise, practice using here and chunk options. See the course notes for suggestions.

PCA (Lesson 22) and MDS (Lesson 23)

The data frame iris is similar to the penguins data we have been using, except it is about iris plants.

Here is code for a pairs plot of four quantitative variables from this dataset. The diagonal shows the density of each variable. Below the diagonal is a scatter plot of pairs of variables. Above the diagonal is the Pearson correlation between pairs of variables. Colour is used to distinguish the three species.

library(GGally)
iris |> ggpairs(aes(color=Species), columns=1:4, progress=FALSE)

Create a pairs plot for the Palmer penguins data.

Use prcomp to create a PCA of the four quantitative variables in iris by modifying examples from lecture slides or notes.
Use autoplot to create a “biplot” showing the data projected onto the first two principal components. Colour the points according to the species. (Note you have to spell colour with a “u” in this command, and you need to put the variable name ‘Species’ in quotation marks; this is different from the way most functions we’ve been using work.) Show the “loading” vectors with their labels.
Use augment and ggplot to make your own customized biplot. You don’t need to include the loading arrows, but you can if you like.
Use the function dist to compute the distance between each plant (row) in the iris dataset.

dist1 <- dist(iris |> select(-Species) |> scale() )

Then use metaMDS from the vegan package to perform multidimensional scaling on the iris data using this distance matrix.

Use ordiplot to show the points as separated into the two dimensional ordination space.
If the result of your MDS is stored in a variable called mds1 you can get the MDS coordinates of each plant using mds1$points (change mds1 to the name of the object you stored the MDS analysis in.) Below I also use the as_tibble function to convert the matrix to a tibble in preparation for our next step.

mds1$points |> as_tibble()

Use bind_cols to combine these points with the original iris dataset. Make a scatter plot using the MDS coordinates and colour the points using the Species variable.

bind_cols(iris, mds1$points |> as_tibble())

Make a scatterplot using any two of the original quantitative variables from iris. Do the species form three separate clusters in your plot using the original data? Are the observations separated in the plane according to the species name in the MDS plot?
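If you want to check your setup before working through the steps above, here is a base R sketch. Note that it swaps in classical MDS (cmdscale from the stats package) in place of metaMDS from vegan, so the results will not match the non-metric MDS requested in the exercise.

```r
measurements <- iris[, 1:4]

# PCA on the four measurements, each scaled to unit variance.
pca <- prcomp(measurements, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]
head(pca$x[, 1:2])   # coordinates of each plant on PC1 and PC2

# Euclidean distances between plants, using the same scaling.
dist1 <- dist(scale(measurements))

# Classical MDS in two dimensions (a stand-in for metaMDS).
mds_classical <- cmdscale(dist1, k = 2)
plot(mds_classical, col = iris$Species, xlab = "MDS 1", ylab = "MDS 2")
```

With Euclidean distances, classical MDS gives the same coordinates as PCA, up to the sign of each axis, which is a useful check that both calculations ran as intended.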

K-means (Lesson 24)

Continue working with the iris data. Use k-means clustering on scaled versions of the four quantitative variables to create 3 clusters.
Use tidy to show the centres of the three clusters.
Make a scatterplot of each observation using any two of the original quantitative variables. Colour the points according to the cluster. Use augment to combine the clustering results with the original data.
Use augment to combine the cluster and original data again. Count the number of combinations of each pair of species name and cluster number using group_by and summarize (or just count).
The four quantitative variables are all measured in cm and are all similar in magnitude. Perhaps the absolute values of the lengths should be retained. Repeat the cluster analysis (showing the plot and the table of species and cluster numbers) with data that are not scaled.

Write a sentence or two that summarizes how well k-means clusters divide these three species into three clusters using these four variables. How do the results from scaled and un-scaled data differ? Is one approach clearly better?
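A minimal base R sketch of the clustering step follows. The seed and the nstart value are arbitrary assumptions; the exercise asks you to use tidy and augment from broom for the tables, which are not shown here.

```r
set.seed(2430)   # k-means starts from random centres, so fix the seed

scaled <- scale(iris[, 1:4])
km <- kmeans(scaled, centers = 3, nstart = 25)   # nstart repeats the random start

km$centers   # cluster centres, in scaled units

# Cross-tabulate cluster number against species name.
cluster_table <- table(cluster = km$cluster, species = iris$Species)
cluster_table
```

The cluster numbers are arbitrary labels, so cluster 1 need not correspond to the first species.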

Presentations (Lesson 25)

Create and modify a Quarto presentation as described in the lesson. Ensure you know how to

show graphics output, but not the code that generated it,
make the graphic the right size to fit on the slide,
open the HTML version in your web browser.

File: task-L24-presentation code and the formatted presentation.
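As a sketch of the first two points, the contents of a code chunk might look like the following. The lines starting with #| are Quarto chunk options and must come first in the chunk; in a .qmd file the chunk header is written with braces around r. The figure dimensions (in inches) and the mpg dataset are assumptions for illustration.

```r
#| echo: false
#| fig-width: 6
#| fig-height: 4

library(ggplot2)

# With echo: false, the slide shows the plot but not this code.
slide_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

slide_plot
```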

Checking your work (Lesson 26)

Dynamic graphics (Lesson 27)

Making Maps (Lesson 28)

Practice making a map from the lesson.

More Maps (Lesson 29)

Make an outline (vector) map of any country or region except for Canada or the USA. Fill the country with any colour you like except for the default grey. Experiment with map projections, showing or hiding the grid lines, and showing or hiding axis labels.
Pick a location on the Earth that you find interesting (e.g., a place you visited, where you lived for a long time, or a place you are interested in but have never been to). Find latitude and longitude coordinates for this location. One way to do this is to use google maps to find the location, then look in the address bar for the latitude and longitude. Make a leaflet (interactive, sliding) map centred on this area. Choose a level of zoom to show the detail you think is appropriate. Use the function setView to set the coordinates at the middle of the map and the zoom level. Add a point on the map at the place of interest.

Map alternatives (Lesson 30)

Make a statebins map using some data from the USA. Why do you think this sort of map is not common for Canada? How would you display province- and territory-level data for Canada?
Get some UN data and display it on a map of the world with a square for each country. See Assignment 5 to get started.

Factors and Dates (Lesson 31)

Make a boxplot of the diamonds data. Show the distribution of the quantitative variable carat for each level of the cut categorical variable. Order the levels of cut using fct_reorder according to the median price.
Make a boxplot using the manufacturer variable from mpg. Use fct_lump_min or fct_lump_prop to combine the less frequently recorded manufacturers into a single other category.
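Here is a sketch of both functions applied to a different variable, class from the mpg dataset, so that the exercises remain for you to do. The threshold of 20 is an arbitrary choice.

```r
library(ggplot2)
library(forcats)

# Reorder the levels of class by the median of hwy within each level.
ordered_plot <- ggplot(mpg, aes(x = fct_reorder(class, hwy, .fun = median),
                                y = hwy)) +
  geom_boxplot() +
  labs(x = "Vehicle class (ordered by median highway mpg)")
ordered_plot

# Combine levels that appear fewer than 20 times into a single "Other" level.
lumped <- fct_lump_min(mpg$class, min = 20)
table(lumped)
```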

Colour (Lesson 32)

Make a custom color scale using a web interactive tool and then use those colours on a plot. Select a palette from a list given in the lesson and use it in a visualization.
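A sketch of both approaches follows. The three hex codes are placeholders standing in for colours you might pick with an interactive tool, and viridis is used as an example of a built-in palette; the lesson may list others.

```r
library(ggplot2)

# A named vector matches each colour to a level of the drv variable.
my_colours <- c("4" = "#1b9e77", "f" = "#d95f02", "r" = "#7570b3")

base_plot <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# Custom colours chosen by hand.
custom_plot <- base_plot + scale_colour_manual(values = my_colours)

# A ready-made palette for a discrete (categorical) variable.
palette_plot <- base_plot + scale_colour_viridis_d()

custom_plot
palette_plot
```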

Themes (Lesson 33)

See task-themes.