33 Beyond tidy data
Throughout this course we have been using tidy data: every dataset is a table organized so that each column is a variable and each row is an observation. This design is very convenient for data visualizations and data analysis. The format makes it easy to associate a column with an aesthetic on a graph, a predictor or response variable in a regression, or a dimension in a clustering or multivariate analysis.
There are at least two important ways many data sets differ from this tidy format: there may be multiple tables that are related to each other, known as relational data, or the data can be more irregular than a rectangular table. These formats can be reshaped into rectangular tidy data, but at a cost of some repetition or creating missing values, both of which make the data larger to store and often harder to change or update.
In this lesson I will show you an example of each kind of data, explain why the non-tidy formats are useful, and show you how to reshape the data for the tools we have studied in this course, which demand tidy (rectangular) data.
Another reason databases are used is that they provide tools for working with very large datasets, including data which are too large to hold in the working memory of one computer. A useful distinction in computing is between
- long-term storage: the relatively slow, but large capacity to store data “permanently”, meaning after the computer is turned off, and
- working storage: data that can be accessed thousands of times faster than long-term storage, but is relatively scarce; sometimes your computer will move data from working storage to long-term storage to complete the task you have set, slowing its work down while making the task possible.
Long-term storage is commonly called “disk” (floppy disk, hard disk, solid state drive as technologies change) and working storage is commonly called RAM (random access memory). We talk about “saving” data in “files” on disk, and “reading” data from long-term storage into working storage. When you purchase a computer you make decisions about the relative amount of each to get; commonly a consumer-grade computer has 8-16 GB of RAM and 256-1024 GB (32-128 times more) disk. When you use a cell phone or a service accessed through the browser, these distinctions are blurred and you often have no idea how much RAM you can access, you just know the capacity of the service for long-term storage. This simplification is generally a good thing.
We have used comma separated values (CSV) format data throughout the course because the format is very widely used and easy to understand. There are some drawbacks to this format which can be addressed by storing tidy table data in other formats when saving data to files on a disk. Regardless of the long-term storage format, we aim to use the same tools in R to work with the data (data tables, tidyverse functions), but there are some special details when working with very large data, even data that are too large for your computer's working storage.
33.1 Relational data
The classic example of relational data is a business with a list of products, a list of customers, and a combination of both describing sales of products to customers. Another common example is a list of students, a list of courses, and a list of courses taken by students (and their grades or other registration data).
In both of these cases, you can see that there are data sets that are independent of each other (products and customers; students and courses) and very important relationships between the two tables of data (products sold to customers, courses taken by students).
To flesh out the examples a bit more, imagine you are describing a list of courses, a list of students, a timetable, and the registration records. You might create four data tables with the following data
- Courses
- Course ID (STAT 2430)
- Course name (Data visualization)
- Prerequisites, Course description, etc.
- Students
- Student ID
- Student name
- Program of study, registration status, etc.
- Time table
- Course ID
- Year and Term
- Room number
- Maximum enrollment
- Class meeting times
- Registration data
- Course ID
- Student ID
- Year and Term
- Grade
Notice that the variables Student ID, Course ID, and Year and Term each appear in multiple tables and will be used to connect the data from different tables together. All of the data could be in a single table, with every row having a student, course, and time; but you'd need to have a lot of duplicated information. Some tasks like preparing the timetable wouldn't fit into a single table well since there would be no students registered when the course was first created. Having four separate tables is a much more convenient way to represent these data.
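To make the shared keys concrete, here is a minimal sketch with made-up data showing how Course ID connects the course list to the registration records using a dplyr join (the table and column names are hypothetical, chosen to mirror the lists above):

library(dplyr)

courses <- tibble::tribble(
  ~course_id,  ~course_name,
  "STAT 2430", "Data visualization",
  "STAT 1060", "Introductory statistics"
)

registrations <- tibble::tribble(
  ~course_id,  ~student_id, ~term,         ~grade,
  "STAT 2430", "S0001",     "2024 Winter", "A",
  "STAT 2430", "S0002",     "2024 Winter", "B+",
  "STAT 1060", "S0001",     "2023 Fall",   "A-"
)

# The shared key course_id links each registration to the course details
left_join(registrations, courses, by = "course_id")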
Another example is a list of books, authors, publishers, and libraries that bought the book:
- Books
- Book ID number (e.g., ISBN)
- Book title
- Publisher
- Publication date
- Author list (using ID numbers – no practical ID system exists for book authors)
- Authors
- Author name
- Author ID
- Publisher
- Publisher name
- Publisher address, website, other contact information
- Publisher ID (The first few digits of an ISBN identify the publisher; the remaining digits identify a specific book)
- Library
- Library name
- Book ID of purchased books
- Is book in library, or checked out
- Last date the book was checked out
Incidentally, a similar database structure would work for academic research articles. In the last few years, an international standard for author ID numbers (ORCID) has been developed to facilitate this sort of database construction.
Each of these data structures is more complicated than the simple tables that we have used in this course. Generally, managing this sort of data is done with a relational database. Software for relational databases provides ways to combine data from different tables together ("join" the data) and to validate data, to be sure, for example, that a book is only created using valid author and publisher IDs. R has tools for working with many different databases and their software. Fortunately, databases are standardized enough that many of the common functions are available using a common syntax, and this matches the dplyr tools we have been using nearly perfectly.
33.1.1 Example with airline flight data
Here we will work with data on flights in and out of New York City, in a database. There are five separate tables: airlines, airports, planes, flights, and weather.
Take a few minutes to familiarize yourself with each of these tables. You'll notice that they are not like normal tables (or tibbles) in R. The number of rows is unknown (reported as ??). This is because the data are not read into R; you simply have access to them to make queries (using dplyr, which is translated into SQL for you). This is convenient because there are 16 airlines, 3322 airplanes, 1458 airports, and 336,776 flights.
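The chunk that creates the database connection is not shown above; one way to set it up is sketched below. It assumes the dbplyr helper nycflights13_sqlite(), which builds a local SQLite copy of the nycflights13 data; renaming the planes table to airplanes is an assumption chosen to match the code used later in this section.

library(dplyr)
library(dbplyr)

# Build (or reuse) a local SQLite database containing the nycflights13 tables
con <- nycflights13_sqlite()

# Create lazy table objects; no rows are read into R until we ask for them
airlines  <- tbl(con, "airlines")
airports  <- tbl(con, "airports")
airplanes <- tbl(con, "planes")   # the table is called "planes" in the database
flights   <- tbl(con, "flights")
weather   <- tbl(con, "weather")

With the connection in place, counting rows sends a query to the database instead of loading the tables into R: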
count(airports)
# Source: SQL [1 x 1]
# Database: sqlite 3.44.2 [/var/folders/nr/n31pl_p95tq7w0p_6pgv_t4h0000gn/T//RtmpXAuXe4/nycflights13.sqlite]
n
<int>
1 1458
count(flights)
# Source: SQL [1 x 1]
# Database: sqlite 3.44.2 [/var/folders/nr/n31pl_p95tq7w0p_6pgv_t4h0000gn/T//RtmpXAuXe4/nycflights13.sqlite]
n
<int>
1 336776
You should notice that there are common variables between some pairs of tables:
- airplanes and flights have codes to identify each airplane (tailnum)
- airports and flights have three-letter codes for airports (faa, origin, dest)
- airlines and flights have two letter codes for the airline (carrier)
These allow the details of each airport, airline, and airplane to be connected to data about each flight.
We can use these tables to find all the United Airlines (“UA”) flights to Minneapolis (“MSP”) on an Embraer 145 (EMB-145, EMB-145LR, EMB-145XR).
First, let’s get the United Airlines flights to Minneapolis (operated by their regional partner ExpressJet Airlines Inc. under the carrier code "EV"):
flights_ev_msp <- flights |> filter(carrier == "EV", dest == "MSP")
We can get all the Embraer airplanes:
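The chunk that produced the next output is not shown; here is a sketch consistent with the filter used in the join further down (the head() call is an assumption, added because only six rows are printed):

airplanes |>
  filter(substr(model, 1, 3) == "EMB") |>
  head()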
# Source: SQL [6 x 9]
# Database: sqlite 3.44.2 [/var/folders/nr/n31pl_p95tq7w0p_6pgv_t4h0000gn/T//RtmpXAuXe4/nycflights13.sqlite]
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
2 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
3 N11106 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
4 N11107 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
5 N11109 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
6 N11113 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
Then we will use a join to get all the flights on one of these planes.
inner_join(
  airplanes |> filter(substr(model, 1, 3) == "EMB"),
  flights_ev_msp,
  by = "tailnum"
)
# Source: SQL [?? x 27]
# Database: sqlite 3.44.2 [/var/folders/nr/n31pl_p95tq7w0p_6pgv_t4h0000gn/T//RtmpXAuXe4/nycflights13.sqlite]
tailnum year.x type manufacturer model engines seats speed engine year.y
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> <int>
1 N11107 2002 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
2 N18120 2003 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
3 N14977 1999 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
4 N34110 2002 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
5 N31131 2003 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
6 N33182 2005 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
7 N11137 2003 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
8 N14168 2004 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
9 N33182 2005 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
10 N17196 2005 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo… 2013
# ℹ more rows
# ℹ 17 more variables: month <int>, day <int>, dep_time <int>,
# sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
33.2 Irregular data
You can download data from your online accounts for many services. For example, Spotify will prepare a zip file of the music and podcasts you have listened to over the past year (or longer); with the correct authentication, Google will provide data from your account in a structured format. A widely used format for data that is not strictly a table is JavaScript Object Notation (JSON). The R package jsonlite can convert data tables to and from this format.
Here is a relatively short, but complex example (see this page for more examples):
json_example <- '[{
"name": "Chris",
"age": 23,
"address": {
"city": "New York",
"country": "America"
},
"friends": [
{
"name": "Emily",
"hobbies": [ "biking", "music", "gaming" ]
},
{
"name": "John",
"hobbies": [ "soccer", "gaming" ]
}
]
}]'
library(jsonlite)
jsonExample <- fromJSON(json_example)
jsonExample
name age address.city address.country
1 Chris 23 New York America
friends
1 Emily, John, biking, music, gaming, soccer, gaming
flatten(jsonExample)
name age friends address.city
1 Chris 23 Emily, John, biking, music, gaming, soccer, gaming New York
address.country
1 America
jsonExample$friends[[1]]
name hobbies
1 Emily biking, music, gaming
2 John soccer, gaming
Take a look at the list of repositories for a user on GitHub, for example, Hadley Wickham. As I write, this query gives a list of 356 repositories with lots of metadata including short and full names for repositories, a description, and lots more. Some of these data, for example, data about the user associated with each repository are tables in their own right. One way to structure this data would be using a set of tables and a relational database. The approach used here with JSON is to nest one table inside another.
How are nested tables managed in R data tables? The column of a data table can be, as we've seen, text, a number, or a date. It can also be another table. What we do below is read the JSON text (follow the link to see the text format of this data) and then convert it into a table, including nested tables. GitHub just gives the first 30 results; this is common with web or database queries where the number of results could be large. You need to ask for more data using the language of the GitHub API if you want it. When you display the main table, the embedded table appears with a prefix in the name (columns named owner.XXX).
(Code output not shown here; run example in your own R session.)
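One way the data2 object used below might have been created is sketched here (the original chunk is not shown; the object name data2 matches the later code, and the URL is the standard GitHub API endpoint for a user's repositories):

library(jsonlite)

# The GitHub API returns JSON; fromJSON() converts nested objects into
# nested data frames. Only the first 30 repositories are returned by default.
data2 <- fromJSON("https://api.github.com/users/hadley/repos")
dim(data2)        # 30 rows, one column per piece of repository metadata
head(data2$name)  # short repository names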
You can get the type of each column using typeof and sapply; a column of type list in this example means that column is made up of tables.
sapply(data2, typeof)
# data2 |> select(where(function(x) typeof(x) == "list")) # to select list columns
If you don’t want to deal with the nested table, you can flatten it into a regular table. The new columns appear at the right side.
flatten(data2)
33.3 Beyond CSV: other data formats
We have relied heavily on the comma separated value format in this course, both because it is a very commonly used format and because it is relatively easy to use. The CSV format has several drawbacks:
- the data type for each column is not described, so the computer software needs to “guess” the data type or you need to describe it explicitly,
- there are no standard conventions for variable names, beyond the fact that usually – but not always – the names are given on a single line,
- if a text value in a column has a comma in it (for example, an address), then quotation marks are usually used to delineate the content of the column, but then special "escaping" notation is needed to include a quotation mark in text data, and these mechanisms are not always implemented well or correctly (see the small example after this list),
- in some locales, a different separator is used (e.g., a semicolon or tab) and number formatting can differ (the decimal separator can be a period or a comma),
- since the file is stored as text, the way the text is encoded will depend on the (human) language used and the country or "locale" where the computer is used, complicating retrieval of the data for someone from another locale, and
- large files are relatively slow to read compared to some other formats.
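Here is a small example of the quoting and escaping issue (made-up data): the embedded comma is protected by quotation marks, and an embedded quotation mark is written as a doubled quote, which readr understands by default.

library(readr)

csv_text <- 'name,address,note
"Smith, Jane","123 Main St, Apt 4","She said ""hello"""'

read_csv(csv_text, show_col_types = FALSE)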
We will explore two ways of providing metadata for a CSV file, focusing on variable names, data types, and human-readable descriptions for data. We will then look at data storage formats that permit faster work on large datasets.
33.3.1 Frictionless metadata for tables
The frictionless data approach to this problem is to combine a csv file and a metadata file in a “data package”. The metadata includes
- a short name and descriptive title for the package
- a short name and description of each variable in each table
- the data type of each variable
Here is an example using two tables from the palmerpenguins package.
# remotes::install_github("frictionlessdata/frictionless-r") # or the older version from CRAN
library(frictionless)
library(palmerpenguins) # provides the penguins and penguins_raw tables
# dir.create(here::here("lessons/static/L160"), recursive = TRUE, showWarnings = FALSE)
# write_csv(penguins, here::here("lessons/static/L160", "penguins.csv"))
# write_csv(penguins_raw, here::here("lessons/static/L160", "penguins_raw.csv"))
schema1 <- create_schema(penguins)
schema2 <- create_schema(penguins_raw)
package <- create_package()
# filename and column names must contain only lower case letters, -, _, .
package <- add_resource(package, resource_name = "penguins",
# data = "static/L160/penguins.csv",
data = penguins,
schema = schema1)
package <- add_resource(package, resource_name = "penguins_raw",
# data = "static/L160/penguins_raw.csv",
data = penguins_raw,
schema = schema2)
# add metadata to package (recent change in library; commented out for now)
# package$resources[[1]]$schema$missingValues = list('', 'NA')
# package$resources[[2]]$schema$missingValues = list('', 'NA')
# package <- append(package, c(name = "palmerpenguins"), after = 0)
# package <- append(package, c(title = "Data about penguins from Palmer Station, Antarctica"), after = 1)
# package <- append(package, c(citation = "Horst, A. and Gorman, K. Palmer Penguins Data. https://doi.org/10.5281/zenodo.3960218" ), after = 2)
write_package(package, here::here("lessons/static/L160", "penguins-data-tables"))
Then you can give this directory (possibly as a zip file) to someone else who can open it with the following functions. You can be assured that the variable names, data types, and descriptions will all be transmitted with your data.
p <- read_package(here::here("lessons/static/L160/penguins-data-tables", "datapackage.json"))
resources(p)
[1] "penguins" "penguins_raw"
p1 <- read_resource(p, "penguins")
The data are reproduced faithfully, except that three columns were changed from integer to double: flipper_length_mm, body_mass_g, and year.
If your data package is distributed through a website like zenodo, you can read the data directly from that website.
package <- read_package("https://zenodo.org/record/5879096/files/datapackage.json")
resources(package)
[1] "reference-data" "gps" "acceleration"
schema_zenodo <- get_schema(package, "reference-data") # contains: name, title, description, type, format for each variable
schema_zenodo$fields[[1]] # just a taste of the full output
$name
[1] "tag-id"
$title
[1] "tag ID"
$description
[1] "A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown. Example: '2342'; Units: none; Entity described: tag"
$type
[1] "string"
$format
[1] "default"
$`skos:exactMatch`
[1] "http://vocab.nerc.ac.uk/collection/MVB/current/MVB000181/2/"
ref_data <- read_resource(package, "reference-data") # 13 rows, 24 variables
# gps <- read_resource(package, "gps") # 73,047 rows, 21 variables
33.3.2 CSVW
A second solution to the problem of metadata for CSV files was produced by a working group at the World Wide Web Consortium (W3C). The solution is called CSV on the Web (CSVW).
The csvwr library implements parts of this standard in R. The goal is to support the reading and writing of annotated CSV tables, to ensure consistent processing and reduce the amount of manual work needed to parse and prepare data before it can be used.
As with frictionless data, the metadata is stored in an accompanying json document and describes
- the csv dialect (i.e. the choice of field separator or quoting characters),
- short and descriptive variable names,
- data types for each variable.
Here is an example of creating and reading a CSVW file:
library(csvwr)
# Start with a data frame saved as a csv (we did this for the previous example already)
# write_csv(penguins, here::here("lessons/static/L160", "penguins.csv"))
# write_csv(penguins_raw, here::here("lessons/static/L160", "penguins_raw.csv"))
# Derive a schema from the data
schema1 <- derive_table_schema(penguins)
schema1 # Automatically generated; you can improve this information!
$columns
name titles datatype
1 species species string
2 island island string
3 bill_length_mm bill_length_mm number
4 bill_depth_mm bill_depth_mm number
5 flipper_length_mm flipper_length_mm integer
6 body_mass_g body_mass_g integer
7 sex sex string
8 year year integer
schema2 <- derive_table_schema(penguins_raw)
# Alternatively, derive schema from the csv file
# schema1 <- derive_metadata( here::here("lessons/static/L160", "penguins.csv"))
# schema2 <- derive_metadata( here::here("lessons/static/L160", "penguins_raw.csv"))
# Create metadata (as a list)
table1 <- list(url = "penguins.csv", tableSchema = schema1)
table2 <- list(url = "penguins_raw.csv", tableSchema = schema2)
m <- create_metadata(tables=list(table1, table2))
# Serialise the metadata to JSON
j <- jsonlite::toJSON(m)
# Write the json to a file
cat(j, file=here::here("lessons/static/L160", "penguins-csvw-metadata.json"))
Now read the data back from the files.
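The reading chunk is not shown in the original; a minimal sketch using csvwr's read_csvw() is below (the way the parsed tables are pulled out of the returned object is an assumption):

csvw <- read_csvw(here::here("lessons/static/L160", "penguins-csvw-metadata.json"))

# Each annotated table carries its parsed data frame
penguins_csvw     <- csvw$tables[[1]]$dataframe
penguins_raw_csvw <- csvw$tables[[2]]$dataframe

# Compare with the originals; the comparison results are shown below
all.equal(penguins_csvw, as.data.frame(penguins))
all.equal(penguins_raw_csvw, as.data.frame(penguins_raw))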
The data are reproduced faithfully, except factor variables in the original data are read as text from the CSVW and some column names have been changed in penguins_raw (e.g., spaces changed to periods).
[1] "Component \"species\": 'current' is not a factor"
[2] "Component \"island\": 'current' is not a factor"
[3] "Component \"sex\": 'current' is not a factor"
[1] "Names: 10 string mismatches"
[2] "Attributes: < Length mismatch: comparison on first 2 components >"
33.3.3 Speed and metadata solution
A second problem with CSV is that reading the data from disk into memory is slow for large files. There are two reasons: the text data is larger than equivalent information stored in a binary format, and the reading process requires the software to infer the types of the data. Here I will introduce alternative storage formats that address both problems. The first is a binary format called parquet made by the non-profit Apache Software Foundation (where the earliest popular web servers were developed). The second is a database format called SQLite which is freely and widely available. It is the most widely used database software in the world and is an archival format recommended by the US Library of Congress. Both are easy to use with R.
A file with 1 million floating point numbers takes this much storage on disk: csv (18.4 MB), compressed csv (csv.gz, 8.5 MB), SQLite (9.3 MB), parquet (6.9 MB), and feather (5.5 MB). The binary formats parquet and feather are both smaller and faster to work with than a compressed text format (csv.gz).
Here I demonstrate creating and reading these files.
library(arrow) # for parquet and arrow format files
# On M1/2/3 Mac, instead of installing from CRAN do the following
# install.packages('arrow', repos = c('https://apache.r-universe.dev', 'https://cloud.r-project.org'))
library(RSQLite)
library(dplyr)
library(dbplyr)
df <- matrix(runif(1000000), 100000, 10)
df <- df |> as_tibble(.name_repair = "unique")
names(df) <- letters[1:10]
# parquet compresses data on columns, try the nycflights13 data with 6.4m "entries" (336,776 observations of 19 variables)
# 4.8 MB parquet, 9.9 MB feather, 29.6 MB csv, 7.9 MB csv.gz, 21.1 MB SQLite
# df <- nycflights13::flights
write_parquet(df, "static/L160/test.gz.parquet", compression="gzip")
write_feather(df, "static/L160/test.feather", version=2, compression = "zstd") # lz4, zstd, uncompressed
write_csv(df, "static/L160/test.csv")
write_csv(df, "static/L160/test.csv.gz")
con <- dbConnect(SQLite(), "static/L160/test.sqlite")
# dbRemoveTable(con, "df")
dbWriteTable(con, "df", df, overwrite=TRUE)
dbDisconnect(con)
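To check the claim about file sizes, you can compare the files just written (a sketch; exact sizes will vary a little from run to run):

files <- c("static/L160/test.csv", "static/L160/test.csv.gz",
           "static/L160/test.sqlite", "static/L160/test.gz.parquet",
           "static/L160/test.feather")
round(file.size(files) / 1e6, 1)   # file sizes in MB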
The speed of reading each file type, relative to csv (smaller is faster), is: csv (1.0), parquet (0.4), SQLite (0.46).
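One way to make such a timing comparison yourself (a sketch; the bench package is an assumed helper, not used elsewhere in this lesson):

bench::mark(
  csv     = read_csv("static/L160/test.csv", show_col_types = FALSE),
  parquet = read_parquet("static/L160/test.gz.parquet"),
  check = FALSE   # the two reads return equivalent values but slightly different objects
)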
This code block opens each file – uncomment the line you want to use.
# sw <- read_csv("static/L160/test.csv.gz", show_col_types = FALSE)
# sw <- read_csv("static/L160/test.csv", show_col_types = FALSE)
sw <- read_parquet("static/L160/test.gz.parquet")
# sw <- read_feather("static/L160/test.feather")
# con <- dbConnect(SQLite(), "static/L160/test.sqlite"); sw <- tbl(con, "df")
This code block reads data and does a dplyr calculation:
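The chunk itself is not shown; here is a sketch consistent with the output that follows (the column names mean_b and mean_c suggest a summary of columns b and c):

sw |>
  summarize(mean_b = mean(b), mean_c = mean(c)) |>
  collect()   # collect() runs the query when sw is a database table; it is harmless for a local tibble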
# A tibble: 1 × 2
mean_b mean_c
<dbl> <dbl>
1 0.490 0.501
It is good practice to close the connection to your SQLite database when you are done.
dbDisconnect(con)
Warning in connection_release(conn@ptr): Already disconnected
Remove files to clean up your storage, since you don’t need any of these files.
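A sketch of one way to do this, using the paths from the code above:

unlink(c("static/L160/test.csv", "static/L160/test.csv.gz",
         "static/L160/test.gz.parquet", "static/L160/test.feather",
         "static/L160/test.sqlite"))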
33.4 Packages used
In addition to the tidyverse and palmerpenguins, in this lesson I have used
- dbplyr for using dplyr functions on databases ("db")
- nycflights13 for access to a relatively large dataset
- jsonlite for working with JSON data
- frictionless for creating and reading frictionless data packages
- csvwr for CSV on the Web (CSVW) metadata
- arrow for the parquet and feather formats
- RSQLite for SQLite database creation and access
33.5 References
- For more on Frictionless data packages, see its GitHub page or doi 10.5281/zenodo.5815355
- A second, more elaborate package to create frictionless data packages is the datapackage tool
- rOpenSci has a package deposits to help ensure data submitted to repositories are complete and well documented
- Another approach to improving CSV files is the CSV on the Web (CSVW) R package: csvwr
- The R4DS chapters on databases, arrow format, and hierarchical data. In particular the chapter on hierarchical data has advice on working with JSON files.
- For a more detailed approach to metadata, see Dataspice
- Yet another approach to metadata is implemented by the yamlet package