25  Checking your work

25.1 An introduction to testing

You’ve made a data visualization. It looks great. All the necessary calculations are in a single Rmd file – reading the data, organizing the data, and creating the figure. You can revise the data and reproduce the plot any time you want.

But then a thought occurs to you – how do I know the visualization is correct? The computer saves us time by doing calculations for us, but the price is that you can’t keep track of everything it does.

What if there is a problem with the data? Maybe the data you analyzed has a lot of observations – too many to check by hand. Or maybe the data are fine, but something went wrong when you read them into R. There are lots of ways for errors to creep in. Missing values when you thought there were none. Unexpected levels of factors (too many or too few). Detectable errors in the data such as impossible values.

The best idea to counteract all these problems is testing. The secret is to get the computer to perform the tests for you. In this lesson we will discuss two kinds of testing: checking your data and checking your calculations.

A lot of testing is done for you before you even start R: most (we hope all) of the packages you use are carefully tested to be sure they work as intended. Still – you might misunderstand what the packages are supposed to do. Or you might make a typo and use the wrong variable name somewhere. Or you might get the logic of your calculation wrong. So you should test your work in as many ways as you can.

The most important reason to test is that you will catch mistakes. Possibly the second most important reason for doing testing – and being explicit about it – is that it can help the people who use your analysis have more confidence in it. This includes you in the future!

25.2 Checking analysis and visualizations

Once you have a preliminary analysis, develop a testing plan. Methods that can help:

  • Check for errors in your data.
  • Perform your analysis on a small subset of your data that you can understand without computer help. This is the “sample calculation” method of testing.
  • Perform your analysis on simulated or fake “data” designed to test certain cases (independent variables, strong dependence, etc.) that will allow you to check your methods and interpretation.
  • Perform your calculations or visualizations two different ways, to check your understanding. This is especially useful if you have a fast and a slow way of doing a calculation. Use the slow way as a check on the fast method.

25.3 Data errors

What kinds of errors can appear in data?

  • Misspellings, upper/lower case inconsistency, extra spaces
  • Duplicated observations
  • Mis-coded missing data (writing -999 or 0 instead of NA)
  • Inconsistently formatted dates and times
  • Impossible values arising from typographical errors
  • Data in wrong columns
  • All the data can look correct, but the methods for collecting data may have changed, creating “silent” errors
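Several of these error types can be detected mechanically. Here is a base-R sketch on toy data (the column names and the -999 sentinel are made up for illustration):

```r
# Toy data illustrating three of the error types above:
# a duplicated observation, -999 miscoding a missing value,
# and an impossible value (a typo adding an extra digit).
d <- data.frame(id     = c(1, 2, 2, 3),
                temp_c = c(21.5, -999, -999, 3050))

duplicated(d)                       # flags the repeated row
d$temp_c[d$temp_c == -999] <- NA    # recode the sentinel as NA
d$temp_c > 100                      # impossible air temperatures
```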

Why do data sometimes have errors?

  • To answer this, you need to know about the process used to create the data. Were some data output by a particular machine or software package that has errors? Were the data typed in by a single person? Were many different people, who may have had different understandings of the data collection goals, involved? Were the data collected over a long period of time, when machines, people, and goals may have changed?

How can you find errors in data?

  • Make summary tables and cross-check the summaries
  • Look for missing data
  • Check to be sure the correct number of variables (columns) and rows (observations) are present
  • Plot histograms and densities
  • Write tests to check that data belong to a legitimate set of values
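The last point can be as simple as a base-R function that halts with an informative error. A minimal sketch (the 0/1 coding mirrors the accuracy column of the jelly bean data used later in this lesson):

```r
# Stop with an error if any non-missing value falls outside
# the legitimate set of values for this variable.
check_in_set <- function(x, allowed) {
  bad <- !is.na(x) & !(x %in% allowed)
  if (any(bad)) stop(sum(bad), " value(s) outside the allowed set")
  invisible(TRUE)
}

check_in_set(c(0, 1, 1, NA), allowed = c(0, 1))  # passes silently
# check_in_set(c(0, 1, 2), c(0, 1))              # would raise an error
```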

What do you do with errors?

  • Keep the original data, so you can revert changes in case of a misunderstanding
  • Log changes and describe each error
  • Write tests to check for future errors
  • Communicate with the people who collected the data and the people who will receive the analysis

25.3.1 Sample data to test

The following data were collected by a class of students evaluating their ability to identify a jelly bean flavour while blindfolded. Much of our sense of taste comes from smell, so there were two treatments. All participants were blindfolded. In the treatment condition, participants blocked their noses to reduce their sense of smell; in the control condition, participants could smell normally. Groups of students entered their data into a single Google Docs spreadsheet, which was exported to the csv file below.

This dataset has a lot of problems, but they are very typical for data entered by a group of people who are not directly involved in the systematic analysis of the data with a software package like R. (They are not attuned to the problems of data errors.) Take a look at the file and see how many problems you can find before continuing.

jelly <- read_csv(here("lessons/static/jelly-bean-data.csv"),
                  show_col_types = FALSE) 
New names:
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
jelly |> datatable()

You should always look at your data. For a first look, the View, skim (from the skimr package), and glimpse functions are useful and will help you identify some problems.

glimpse(jelly)
Rows: 261
Columns: 27
$ Initials                            <chr> "PKL", "PKL", "PKL", "LAG", "LAG",…
$ `Group (choose from drop down)`     <chr> "control", "control", "control", "…
$ `Trial #`                           <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3…
$ Flavour                             <chr> "orange", "strawberry", "cherry", …
$ `Reaction time (in sec)`            <chr> "7.2", "7.6", "10.1", "10.15", "12…
$ `Accuracy (0=incorrect, 1=correct)` <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1…
$ ...7                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...8                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...9                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...10                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...11                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...12                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...13                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...14                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...15                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...16                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...17                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...18                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...19                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...20                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...21                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...22                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...23                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...24                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...25                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...26                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...27                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
# skimr::skim(jelly)  # try this yourself on some data

There are many columns that have no heading and contain only missing data. The janitor package has functions to clean up column names and to remove rows and columns that are entirely missing. Some of the column names are long, and their punctuation makes them hard to type, so I will also rename them to shorter names.

jelly <- jelly |>
  clean_names() |>
  rename(group = group_choose_from_drop_down,
         reaction_s = reaction_time_in_sec,
         accuracy = accuracy_0_incorrect_1_correct) |>
  remove_empty(c("rows", "cols"))

Check to see if there are any more missing data using the miss_var_summary function from the naniar package.

jelly |> miss_var_summary() |> 
  kable() |> kable_styling(full_width = FALSE)
variable n_miss pct_miss
initials 56 24.4541485
group 2 0.8733624
reaction_s 2 0.8733624
accuracy 2 0.8733624
trial_number 1 0.4366812
flavour 1 0.4366812

Are those missing values a problem? I might choose not to care about the person who collected the data (initials), but a missing flavour or accuracy is a big problem!

The dlookr package provides a similar table, plus it adds a count of the number of distinct values for each variable.

diagnose(jelly) |> kable() |> kable_styling(full_width = FALSE)
variables types missing_count missing_percent unique_count unique_rate
initials character 56 24.4541485 75 0.3275109
group character 2 0.8733624 4 0.0174672
trial_number numeric 1 0.4366812 4 0.0174672
flavour character 1 0.4366812 44 0.1921397
reaction_s character 2 0.8733624 201 0.8777293
accuracy numeric 2 0.8733624 3 0.0131004

We still don’t know what most of these variables mean, but already one thing should stand out. The variable accuracy (which originally also had the notation 0_incorrect_1_correct) presumably should only be 0 or 1, but has three different values. We may not care about the initials of the investigator, but we might be concerned about the fact that there are 56 missing values. The reaction time (measured in seconds) is text data, but should be a number.
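A quick way to expose a problem like this is base R’s table function, which lays out every distinct value. The vector below is a stand-in for jelly$accuracy, with a hypothetical stray code for illustration:

```r
# table() makes an unexpected code (here 11, perhaps a typo
# for 1) and missing values easy to spot among the 0s and 1s.
accuracy <- c(0, 1, 1, 0, 11, NA)
table(accuracy, useNA = "ifany")
```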

25.4 Data quality assurance

You can get a report on your data using the pointblank package. (See the documentation linked in further reading for examples.) Let’s test a few things:

  • are any values of the accuracy variable not 0 or 1?
  • are reaction_s and trial_number numeric? (We know the answer already, but I’ll show how to test this condition.)
  • are there any negative reaction times?
  • are all the rows distinct? (It’s unlikely, but not impossible to get the same result twice.)

First we will write the tests.

my_test <- jelly |> 
  create_agent(actions = action_levels(warn_at = 1)) |>
  col_vals_in_set(accuracy, c(0,1)) |>
  col_is_numeric(vars(trial_number, reaction_s)) |>
  col_vals_gt(reaction_s, 0) |>
  rows_distinct() 

Now you can use the tests to get a report. The report is not shown because it doesn’t fit well into a knitted document. Open up your own R session, read the data, clean it, and produce a report using the following code.

my_test |> interrogate() 

Reaction time (in seconds) is text, not a number. Let’s fix that. There are probably some non-numeric values in the column, because otherwise the data would have been read as numeric. So let’s look for those rows first. When you try to convert text that doesn’t look like a number, you get an NA instead, so we can find the offending rows by filtering for the NAs created during conversion. R will give us a warning to alert us that some NAs were created.

jelly |> filter(is.na(as.numeric(reaction_s))) |>
  kable() |> kable_styling(full_width = FALSE)
Warning: There was 1 warning in `filter()`.
ℹ In argument: `is.na(as.numeric(reaction_s))`.
Caused by warning:
! NAs introduced by coercion
initials group trial_number flavour reaction_s accuracy
KS experimental 3 Orange Lemon 0
BP NA NA NA NA NA
NA control 3 yellow and brown NA 1

One of those is obviously an error. (Lemon is not a number!) In a careful analysis, you would contact the person who recorded the data (that’s what the initials are for) and find out what went wrong. The same goes for the NA, where the time didn’t get recorded. For now, we’ll just discard these rows.

jelly <- jelly |> 
  mutate(reaction_s = as.numeric(reaction_s)) |>
  filter(!is.na(reaction_s))

The pointblank package also has a helpful function to produce a structured report on your whole dataset. Again the results are not shown here; create the report in your own R session to see what it looks like.

jelly |> scan_data()

25.5 Data cleaning

Freshly collected and tabulated data are often “dirty”: there are errors such as typographical mistakes, impossible values, or simply more detail than is appropriate for your analysis. Preparing a dataset for analysis is called data cleaning, and it can be a lengthy task that is at least as complex as the analysis itself. Although there are common tasks performed in data cleaning, every dataset presents its own set of challenges. That’s the reason we usually use “clean” data in courses.

Data cleaning tasks are very individual and can take considerable creativity. Here we’ll just try a few.

Let’s look at the values of flavour. It seems like there are a lot of them.

jelly |> count(flavour) |> arrange(-n) |> 
  kable() |> kable_styling(full_width = FALSE)
flavour n
orange 32
lemon 23
Orange 19
cherry 19
grape 18
coconut 14
Cherry 10
Banana 9
Grape 9
Lemon 9
banana 7
Coconut 5
lime 4
purple 4
Yellow White 3
apple 3
strawberry 3
Purple 2
White 2
Yellow Brown 2
blueberry 2
bubblegum 2
red 2
white 2
yellow 2
yellow and white 2
Apple 1
Bubblegum 1
Coffee 1
Lime 1
Pinapple 1
Plum 1
Red 1
Watermelon 1
cinnamon 1
cocnut 1
lemon/lime 1
licorice 1
marshmallow 1
raspberry 1
watermelon 1
yellow + white 1
yellow and brown 1

We definitely have some duplication. I’ll do one correction – changing all letters to lower case so differences in capitalization don’t duplicate flavours. (This simple change eliminates 13 different values of flavour.) If you look closely at the data, you will find spelling errors, different codings (and, +, or neither), and one missing flavour.

jelly |> mutate(flavour = tolower(flavour)) |>
  count(flavour) |> arrange(-n) |> 
  kable() |> kable_styling(full_width = FALSE)
flavour n
orange 51
lemon 32
cherry 29
grape 27
coconut 19
banana 16
purple 6
lime 5
apple 4
white 4
bubblegum 3
red 3
strawberry 3
yellow white 3
blueberry 2
watermelon 2
yellow 2
yellow and white 2
yellow brown 2
cinnamon 1
cocnut 1
coffee 1
lemon/lime 1
licorice 1
marshmallow 1
pinapple 1
plum 1
raspberry 1
yellow + white 1
yellow and brown 1

Use what you learned to tidy the data. Here are some changes I’ll make before continuing the analysis.

jelly <- jelly |> 
  mutate(flavour = tolower(flavour),
         flavour = case_when(flavour == "cocnut" ~ "coconut", 
                             TRUE ~ flavour), 
         flavour = str_replace(flavour, " and ", " "),
         flavour = str_replace(flavour, "[/+]", " "),
         flavour = str_squish(flavour))
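To see what the string steps do, here is a base-R re-creation (sub, gsub, and trimws standing in for str_replace and str_squish) applied to three sample values from the flavour column:

```r
# Base-R equivalent of the stringr cleaning above.
x <- c("yellow and white", "yellow + white", "lemon/lime")
x <- sub(" and ", " ", x, fixed = TRUE)   # drop the word "and"
x <- sub("[/+]", " ", x)                  # separators become spaces
trimws(gsub(" +", " ", x))                # collapse runs of spaces
#> [1] "yellow white" "yellow white" "lemon lime"
```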

25.6 Analyze a subset of your data

Let’s compute the average reaction time for correct and incorrect answers for each flavour and each treatment. We will count the number of observations for each grouping too.

jelly |> group_by(flavour, group, accuracy) |>
  summarize(count = n(),
            mean_reaction_time = mean(reaction_s),
            .groups = "drop") |> 
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
apple experimental 0 4 14.660000
banana control 1 9 12.616667
banana experimental 0 4 24.222500
banana experimental 1 3 17.166667
blueberry control 0 1 5.980000
blueberry experimental 0 1 20.730000
bubblegum control 0 1 7.300000
bubblegum experimental 0 2 17.620000
cherry control 0 4 17.487500
cherry control 1 12 6.785000
cherry experimental 0 11 15.835455
cherry experimental 1 1 12.320000
cherry experimental NA 1 11.000000
cinnamon control 0 1 5.920000
coconut control 0 3 22.596667
coconut control 1 6 6.821667
coconut experimental 0 9 18.861111
coconut experimental 1 2 15.040000
coffee experimental 0 1 7.650000
grape Control 1 1 13.250000
grape control 0 6 23.903333
grape control 1 5 5.926000
grape experimental 0 12 17.655833
grape experimental 1 3 16.473333
lemon control 0 10 10.756000
lemon control 1 8 7.138750
lemon experimental 0 10 17.110000
lemon experimental 1 3 15.613333
lemon NA 1 1 5.460000
lemon lime experimental 1 1 10.800000
licorice experimental 0 1 8.460000
lime control 0 3 10.640000
lime experimental 0 2 12.760000
marshmallow experimental 0 1 12.050000
orange control 0 9 14.307778
orange control 1 16 6.546250
orange experimental 0 16 18.295000
orange experimental 1 10 10.086000
pinapple experimental 0 1 11.770000
plum control 0 1 15.200000
purple control 1 2 14.445000
purple experimental 0 3 14.756667
purple experimental 1 1 36.000000
raspberry experimental 0 1 16.150000
red control 1 1 12.120000
red experimental 0 2 24.000000
strawberry control 0 3 23.716667
watermelon control 0 2 23.800000
white control 1 2 3.300000
white experimental 0 1 21.500000
white experimental 1 1 7.480000
yellow control 1 2 9.265000
yellow brown control 1 2 15.800000
yellow brown experimental 0 1 9.340000
yellow white control 0 1 13.060000
yellow white control 1 1 15.500000
yellow white experimental 0 3 20.200000
yellow white experimental 1 1 11.000000

This is a fairly simple calculation, but it serves to demonstrate the “manual check” method. Filter to just the incorrect, experimental results for the apple flavour. Check that the number of rows and the average match.

jelly |> filter(flavour == "apple",
                 accuracy == 0,
                 group == "experimental") |> 
  kable() |> kable_styling(full_width = FALSE)
initials group trial_number flavour reaction_s accuracy
FACP experimental 1 apple 14.30 0
VP experimental 1 apple 21.35 0
NA experimental 2 apple 9.99 0
Tw experimental 1 apple 13.00 0

Four rows, and (14.30 + 21.35 + 9.99 + 13.00)/4 = 14.66, matching the summary table. Looks good!

25.7 Use fake data where you already know the answer

I’ll create some simple data and then compute the averages using exactly the same code I wrote above.

my_jelly <- tribble(
  ~group, ~ flavour, ~ reaction_s, ~ accuracy,
  "experimental", "lime", 1, 1,
  "experimental", "lime", 2, 1,
  "experimental", "lemon", 3, 1,
  "control", "lemon", 4, 1,
)
my_jelly |> kable() |> kable_styling(full_width = FALSE)
group flavour reaction_s accuracy
experimental lime 1 1
experimental lime 2 1
experimental lemon 3 1
control lemon 4 1

I know exactly what I’m expecting. Stop and decide for yourself before reading on.

my_jelly |> group_by(flavour, group, accuracy) |>
  summarize(count = n(),
            mean_reaction_time = mean(reaction_s),
            .groups = "drop") |> 
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

25.8 Do your calculations two different ways

Suppose we aren’t confident that we know what mean does. Even simple functions like sd (for standard deviation) and median might not do what you think they should. How can we check? Let’s write our own function based on sum, length, and division.

my_mean <- function(x) sum(x) / length(x)
my_jelly |> group_by(flavour, 
                      group, 
                      accuracy) |>
  summarize(count = length(flavour),
            mean_reaction_time = my_mean(reaction_s),
            .groups = "drop") |>
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

Excellent! We got the same answers.

That’s an introduction to testing calculations. These are good checks to make as you develop and test your analysis. The next step is to write computer code that checks your answers as well, by giving sample data to your calculation code along with the known answers. There are packages (of course!) to help with that: testthat and tinytest (links below). We won’t be using these in the course, but you should take time to learn about them after the course.
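Before reaching for those packages, you can get much of the benefit with base R’s stopifnot: feed your calculation code sample data with known answers and halt if they disagree. A minimal sketch using my_mean from above:

```r
my_mean <- function(x) sum(x) / length(x)

# Known answers computed by hand; any failure stops the script
# with an error naming the assertion that failed.
stopifnot(
  my_mean(c(1, 2, 3)) == 2,
  my_mean(10) == 10,
  abs(my_mean(c(1.5, 2.5)) - 2) < 1e-12
)
```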

25.9 At the end of your analysis

  • Provide a relatively simple dataset, together with analysis results, that can be used to verify your code works the same way in the future, and that someone who develops a new way of analysing the data can use for comparison
  • Document any testing processes you used for your data and for your computer code so that later users will know what problems you looked for
  • Document any problems you found in the data and what steps you took to fix the problems
  • Describe any weaknesses in your method which you anticipate might cause problems for future users

25.10 Lessons for this course

We are not integrating testing into our work in this course. This lesson is here to alert you to these problems, make you aware that many people think carefully about them, and show you the first steps toward developing a good quality assurance plan. Be aware that testing takes time – possibly as much time as “the rest” of the work you do for data analysis.

25.11 Packages used

In addition to tidyverse, here, and kableExtra, we’ve used the following packages:

  • janitor for clean_names and to remove columns of purely missing data
  • naniar to summarize missing data
  • pointblank to make more detailed reports on data tables
  • dlookr to summarize data
  • DT for table formatting

25.12 Further reading