19  Finding and accessing data

Data are available from a huge number of sites on the internet run by governments, NGOs, companies, industry associations, research projects, and individuals. These data come in a variety of formats and while many will be easy to read with R, some will be quite challenging. In this lesson I will introduce you to some sources of data. Partly this is to help you start to learn how to navigate a variety of data sources, but of course part of the reason is to get you thinking about data you might use in your project.

19.1 Finding data

19.1.1 Data in R packages

Many datasets are available as part of R packages. These are the easiest to use, but they are often small and designed for demonstration purposes. For example, the gapminder package only contains a small portion of the data available on the gapminder website.

These are the go-to datasets that we have used for demonstrating many simple visualizations:

  • mtcars and many other well-known datasets in the datasets package
  • penguins in palmerpenguins package
  • gapminder in the gapminder package (but see the Gapminder website too)
  • diamonds, mpg, economics, midwest, msleep in ggplot2 package
  • flights and related tables in the nycflights13 package
  • gss_sm, gss_sib, and gss_lon in the socviz package (gss_cat is in the forcats package)
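Once a package is installed, its datasets can be used directly, with no reading or parsing step. A minimal sketch, assuming the palmerpenguins package is installed:

```r
library(palmerpenguins)

# Package datasets are lazy-loaded: referring to the name is enough.
dim(penguins)   # 344 rows, 8 columns
head(penguins)
```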

19.1.2 Tidy Tuesday

Tidy Tuesday is a project to encourage people to develop skills at data visualization and analysis using the tidyverse packages in R. Each week a new data set is posted and participants experiment to create new visualizations. Some people share their work on GitHub and social media.

  • The Tidy Tuesday website describes the project and has a catalog of available datasets from previous weeks.
  • The R package to access the data is called tidytuesdayR; its documentation includes examples of use.
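A sketch of how the package is typically used, assuming tidytuesdayR is installed and you have internet access (the date and dataset name here are illustrative):

```r
library(tidytuesdayR)

# Download all datasets posted for a given week, identified by its date.
tt <- tt_load("2021-04-20")

# tt is a list-like object; extract one dataset by name.
netflix <- tt$netflix_titles
```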

19.1.3 R packages for accessing data

There are many packages designed primarily to provide access to large collections of data. Here are a few examples.

  • OECD for data from the OECD
  • cansim for data from Statistics Canada
  • cancensus for data from the Canadian census and National Household Survey. You need to create an account and get an API key to use this package. The package documentation shows you how.
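For packages that require an API key, such as cancensus, the usual pattern is to store the key once so future sessions pick it up automatically. A hedged sketch (the key shown is a placeholder; check the package documentation for the current setup steps):

```r
library(cancensus)

# Store your CensusMapper API key (placeholder shown).
# install = TRUE writes it to your .Renviron so it persists across sessions.
set_api_key("YOUR_API_KEY", install = TRUE)
```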

19.1.4 Websites with data collections

Naturally there are websites that curate lists of data available from other sites. Here are a few I’ve found useful.

19.1.5 Canadian COVID data

Many countries and other organizations have developed collections of COVID-related data. Here are some sources for Canada that worked in 2022. Several of them are “retired” or missing in 2024.

19.2 Examples of accessing data

19.2.1 Gapminder

Gapminder distributes many datasets, some of their own and some collected by other organizations that they redistribute. They have a webpage to help you search, browse, and access the data. Using this webpage I found data on the population of countries by year from 1800 to 2100 (with much missing data, some interpolated data, and of course many years of projections). Here is some R code to read and work with these data. There are data on 195 countries over 302 years. I have selected five countries to make a simple plot.

library(tidyverse)
library(scales)   # for trans_format() and math_format()
pop <- read_csv("static/population_total.csv")
Rows: 195 Columns: 302
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): country
dbl (301): 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop %>% filter(country %in% c("Canada", "China", "Chile", "Germany", "United States")) %>%
  pivot_longer(`1800`:`2100`, names_to = "years", values_to = "population") %>%
  mutate(years = as.numeric(years)) %>%
  ggplot(aes(x = years, 
             y = population, 
             color = fct_reorder(country, population, max, .desc=TRUE))) +
  geom_line(linewidth = 2) +
  scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
  labs(color = "Country")

19.2.2 Consumer price index in Canada

One measure of inflation is its effect on prices. Here is a table from Statistics Canada with over 1 million rows of data on this measure. Statistics Canada tables are an extreme version of “long” data and usually require a lot of filtering to get just the rows you want. The dplyr functions summarize and count are useful for studying the structure of the data.

# Population and projection in a huge table with 2 million rows, 300 MB of data
# cansim::get_cansim("17100057")  
cpi <- cansim::get_cansim("18-10-0004-01")
Accessing CANSIM NDM product 18-10-0004 from Statistics Canada
Parsing data
cpi %>% filter(`Products and product groups` == "All-items") %>%
  count(GEO)
# A tibble: 30 × 2
   GEO                                                    n
   <chr>                                              <int>
 1 Alberta                                              546
 2 British Columbia                                     546
 3 Calgary, Alberta                                     638
 4 Canada                                              1322
 5 Charlottetown and Summerside, Prince Edward Island   602
 6 Edmonton, Alberta                                    638
 7 Halifax, Nova Scotia                                 638
 8 Iqaluit, Nunavut                                     255
 9 Manitoba                                             546
10 Montréal, Quebec                                     638
# ℹ 20 more rows

Let’s look at these data for Canada as a whole.

cpi %>% filter(`Products and product groups` == "All-items",
               GEO == "Canada") %>%
  select(REF_DATE, VALUE) %>%
  mutate(date = lubridate::ym(REF_DATE)) %>%
  ggplot(aes(x = date, y = VALUE)) +
  geom_line() + 
  labs(x = "Date", y = "All items CPI",
       title = "Consumer price index in Canada",
       caption = "Scaled so that 2002 = 100") + 
  scale_y_log10()
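The CPI series can also be turned into an inflation rate. A sketch, continuing from the cpi table above: the year-over-year rate compares each month to the same month one year earlier, so it uses a lag of 12 observations on the monthly series.

```r
cpi %>% filter(`Products and product groups` == "All-items",
               GEO == "Canada") %>%
  mutate(date = lubridate::ym(REF_DATE)) %>%
  arrange(date) %>%                                        # lag() needs rows in time order
  mutate(inflation = 100 * (VALUE / lag(VALUE, 12) - 1)) %>%  # % change vs 12 months ago
  ggplot(aes(x = date, y = inflation)) +
  geom_line() +
  labs(x = "Date", y = "Year-over-year inflation (%)")
```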

19.2.3 Causes of death worldwide

Our World in Data has many compilations of data across many countries. Here is a table listing causes of death over time and across countries. There is a link to a csv file on that page that can be easily read with R. As is often the case, this is a large table (nearly 7000 observations of 37 variables). You will need to explore the data a bit to understand it. Here is one simple plot that can be made.

death <- read_csv("static/annual-number-of-deaths-by-cause.csv", guess_max = 10000)
Rows: 6686 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Entity, Code, Number of executions (Amnesty International)
dbl (34): Year, Deaths - Road injuries - Sex: Both - Age: All Ages (Number),...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
death %>% filter(Entity %in% c("Canada", "China", "Germany", "United States"),
                 Year == 2015) %>%
  select(Entity, `Deaths - Road injuries - Sex: Both - Age: All Ages (Number)`:`Terrorism (deaths)`) %>%
  pivot_longer(`Deaths - Road injuries - Sex: Both - Age: All Ages (Number)`:`Terrorism (deaths)`,
               names_to = "Cause", values_to = "Deaths") %>%
  mutate(Cause = str_remove(Cause, "Deaths - ") %>% 
           str_remove(" - Sex: Both - Age: All Ages \\(Number\\)")) %>%
  ggplot(aes(x = Deaths, y = fct_reorder(Cause, Deaths), 
             color = Entity)) +
  geom_point() +
  scale_x_log10()
Warning: Transformation introduced infinite values in continuous x-axis

Of course, it would probably be more interesting to plot these as per capita deaths; or at least, that would enable a different sort of comparison.
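One way to get per capita rates is to join in the Gapminder population table read earlier. A hedged sketch (it assumes the pop and death tables from above are in memory, and that country names match between the two sources, which in practice needs checking):

```r
# Population in 2015 from the Gapminder table, one row per country.
pop_2015 <- pop %>%
  select(country, `2015`) %>%
  rename(population = `2015`)

# Road injury deaths per 100,000 people, as an example cause.
death %>%
  filter(Year == 2015) %>%
  inner_join(pop_2015, by = c("Entity" = "country")) %>%
  mutate(road_deaths_per_100k =
           1e5 * `Deaths - Road injuries - Sex: Both - Age: All Ages (Number)` / population) %>%
  select(Entity, road_deaths_per_100k) %>%
  arrange(desc(road_deaths_per_100k))
```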

19.4 Describing data

Once you have done the work to find data, you will also want to do some research to learn the “5 Ws” of the data. In addition to what the variables are and what each observation represents, you will want to know who collected the data, and where, when, why, and how they were collected. It’s a good idea to write a “readme” to summarize what you learn. For your term project you will be asked to provide some information on the dataset you analyze. There are online guides that provide advice on documenting data.

19.5 Distribution of data

If you want to distribute an analysis of data or redistribute the original data, please be sure to respect the terms of use of the data. Many people encourage the use of the FAIR data principles (Findable, Accessible, Interoperable, Reusable).

19.6 Further reading