25  Checking your work

25.1 An introduction to testing

You’ve made a data visualization. It looks great. All the necessary calculations are in a single Rmd file – reading the data, organizing the data, and creating the figure. You can revise the data and reproduce the plot any time you want.

But then a thought occurs to you – how do I know the visualization is correct? The computer saves us time by doing calculations for us, but the price is that you can’t keep track of everything it does.

What if there is a problem with the data? Maybe the data you analyzed has a lot of observations – too many to check by hand. Or maybe the data are fine, but something went wrong when you read them into R. There are lots of ways for errors to creep in. Missing values when you thought there were none. Unexpected levels of factors (too many or too few). Detectable errors in the data such as impossible values.

The best idea to counteract all these problems is testing. The secret is to get the computer to perform the tests for you. In this lesson we will discuss two kinds of testing: checking your data and checking your calculations.

A lot of testing is done for you before you even start R: most (we hope all) of the packages you use are carefully tested to be sure they work as intended. Still – you might misunderstand what the packages are supposed to do. Or you might make a typo and use the wrong variable name somewhere. Or you might get the logic of your calculation wrong. So you should test your work in as many ways as you can.

The most important reason to test is that you will catch mistakes. Possibly the second most important reason for doing testing – and being explicit about it – is that it can help the people who use your analysis have more confidence in it. This includes you in the future!

25.2 Checking analysis and visualizations

Once you have a preliminary analysis, develop a testing plan. Methods that can help:

  • Check for errors in your data.
  • Perform your analysis on a small subset of your data that you can understand without computer help. This is the “sample calculation” method of testing.
  • Perform your analysis on simulated or fake “data” designed to test certain cases (independent variables, strong dependence, etc.) that will allow you to check your methods and interpretation.
  • Perform your calculations or visualizations two different ways, to check your understanding. This is especially useful if you have a fast and a slow way of doing a calculation. Use the slow way as a check on the fast method.

25.3 Data errors

What kinds of errors can appear in data?

  • Misspellings, upper/lower case inconsistency, extra spaces
  • Duplicated observations
  • Mis-coded missing data (writing -999 or 0 instead of NA)
  • Inconsistently formatted dates and times
  • Impossible values arising from typographical errors
  • Data in wrong columns
  • All the data can look correct, but the methods for collecting data may have changed, creating “silent” errors
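Several of these error types can be detected mechanically. Here is a base-R sketch on toy data (the column names and the -999 sentinel are made up for illustration):

```r
# Toy data illustrating three of the error types above:
# a duplicated observation, -999 miscoding a missing value,
# and an impossible value (a typo adding an extra digit).
d <- data.frame(id     = c(1, 2, 2, 3),
                temp_c = c(21.5, -999, -999, 3050))

duplicated(d)                       # flags the repeated row
d$temp_c[d$temp_c == -999] <- NA    # recode the sentinel as NA
d$temp_c > 100                      # impossible air temperatures
```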

Why do data sometimes have errors?

  • To answer this, you need to know about the process used to create the data. Were some data output by a particular machine or software package that has errors? Were the data typed in by a single person? Were many different people, who may have had different understandings of the data collection goals, involved? Were the data collected over a long period of time, when machines, people, and goals may have changed?

How can you find errors in data?

  • Make summary tables and cross-check the summaries
  • Look for missing data
  • Check to be sure the correct number of variables (columns) and rows (observations) are present
  • Plot histograms and densities
  • Write tests to check that data belong to a legitimate set of values
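The last point can be as simple as a base-R function that halts with an informative error. A minimal sketch (the 0/1 coding mirrors the accuracy column of the jelly bean data used later in this lesson):

```r
# Stop with an error if any non-missing value falls outside
# the legitimate set of values for this variable.
check_in_set <- function(x, allowed) {
  bad <- !is.na(x) & !(x %in% allowed)
  if (any(bad)) stop(sum(bad), " value(s) outside the allowed set")
  invisible(TRUE)
}

check_in_set(c(0, 1, 1, NA), allowed = c(0, 1))  # passes silently
# check_in_set(c(0, 1, 2), c(0, 1))              # would raise an error
```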

What do you do with errors?

  • Keep the original data, so you can revert changes in case of a misunderstanding
  • Log changes and describe each error
  • Write tests to check for future errors
  • Communicate with the people who collected the data and the people who will receive the analysis

25.3.1 Sample data to test

The following data were collected by a class of students evaluating their ability to identify a jelly bean flavour while blindfolded. Much of our sense of taste comes from smell, so there were two treatments. All participants were blindfolded. In the treatment condition, participants blocked their noses to reduce their sense of smell; in the control condition, participants could smell normally. Groups of students entered their data into a single Google Docs spreadsheet, which was exported to the csv file below.

This dataset has a lot of problems, but they are very typical for data entered by a group of people who are not directly involved in the systematic analysis of the data with a software package like R. (They are not attuned to the problems of data errors.) Take a look at the file and see how many problems you can find before continuing.

jelly <- read_csv(here("lessons/static/jelly-bean-data.csv"),
                  show_col_types = FALSE) 
New names:
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
jelly |> datatable()

You should always look at your data. For a first look, the View, skim (from the skimr package), and glimpse functions are useful and will help you identify some problems.

glimpse(jelly)
Rows: 261
Columns: 27
$ Initials                            <chr> "PKL", "PKL", "PKL", "LAG", "LAG",…
$ `Group (choose from drop down)`     <chr> "control", "control", "control", "…
$ `Trial #`                           <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3…
$ Flavour                             <chr> "orange", "strawberry", "cherry", …
$ `Reaction time (in sec)`            <chr> "7.2", "7.6", "10.1", "10.15", "12…
$ `Accuracy (0=incorrect, 1=correct)` <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1…
$ ...7                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...8                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...9                                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...10                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...11                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...12                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...13                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...14                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...15                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...16                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...17                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...18                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...19                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...20                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...21                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...22                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...23                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...24                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...25                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...26                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ...27                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
# skimr::skim(jelly)  # try this yourself on some data

There are many columns that have no heading and contain only missing data. The janitor package has functions to clean up column names and to remove rows and columns that are entirely missing. Some of the column names are long, and their punctuation makes them hard to type, so I will also rename them to shorter names.

jelly <- jelly |>
  clean_names() |>
  rename(group = group_choose_from_drop_down,
         reaction_s = reaction_time_in_sec,
         accuracy = accuracy_0_incorrect_1_correct) |>
  remove_empty(c("rows", "cols"))

Check to see if there are any more missing data using the miss_var_summary function from the naniar package.

jelly |> miss_var_summary() |> 
  kable() |> kable_styling(full_width = FALSE)
variable n_miss pct_miss
initials 56 24.4541485
group 2 0.8733624
reaction_s 2 0.8733624
accuracy 2 0.8733624
trial_number 1 0.4366812
flavour 1 0.4366812

Are those missing values a problem? I might choose not to care about the person who collected the data (initials), but a missing flavour or accuracy is a big problem!

The dlookr package provides a similar table, plus it adds a count of the number of distinct values for each variable.

diagnose(jelly) |> kable() |> kable_styling(full_width = FALSE)
variables types missing_count missing_percent unique_count unique_rate
initials character 56 24.4541485 75 0.3275109
group character 2 0.8733624 4 0.0174672
trial_number numeric 1 0.4366812 4 0.0174672
flavour character 1 0.4366812 44 0.1921397
reaction_s character 2 0.8733624 201 0.8777293
accuracy numeric 2 0.8733624 3 0.0131004

We still don’t know what most of these variables mean, but already one thing should stand out. The variable accuracy (which originally also had the notation 0_incorrect_1_correct) presumably should only be 0 or 1, but has three different values. We may not care about the initials of the investigator, but we might be concerned about the fact that there are 56 missing values. The reaction time (measured in seconds) is text data, but should be a number.
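A quick way to expose a problem like this is base R’s table function, which lays out every distinct value. The vector below is a stand-in for jelly$accuracy, with a hypothetical stray code for illustration:

```r
# table() makes an unexpected code (here 11, perhaps a typo
# for 1) and missing values easy to spot among the 0s and 1s.
accuracy <- c(0, 1, 1, 0, 11, NA)
table(accuracy, useNA = "ifany")
```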

25.4 Data quality assurance

You can get a report on your data using the pointblank package. (See the documentation linked in further reading for examples.) Let’s test a few things:

  • are any values of the accuracy variable not 0 or 1?
  • are reaction_s and trial_number numeric? (We know the answer already, but I’ll show how to test this condition.)
  • are there any negative reaction times?
  • are all the rows distinct? (It’s unlikely, but not impossible to get the same result twice.)

First we will write the tests.

my_test <- jelly |> 
  create_agent(actions = action_levels(warn_at = 1)) |>
  col_vals_in_set(accuracy, c(0,1)) |>
  col_is_numeric(vars(trial_number, reaction_s)) |>
  col_vals_gt(reaction_s, 0) |>
  rows_distinct() 

Now you can use the tests to get a report. The report is not shown because it doesn’t fit well into a knitted document. Open up your own R session, read the data, clean it, and produce a report using the following code.

my_test |> interrogate() 

Reaction time (in seconds) is text, not a number. Let’s fix that. There are probably some non-numeric values in the column, because otherwise the data would have been read as numeric. So let’s look for those rows first. When you try to convert text that doesn’t look like a number, you get an NA instead, so we can find the offending rows by filtering for the NAs created during conversion. R will give us a warning to alert us that some NAs were created.

jelly |> filter(is.na(as.numeric(reaction_s))) |>
  kable() |> kable_styling(full_width = FALSE)
Warning: There was 1 warning in `filter()`.
ℹ In argument: `is.na(as.numeric(reaction_s))`.
Caused by warning:
! NAs introduced by coercion
initials group trial_number flavour reaction_s accuracy
KS experimental 3 Orange Lemon 0
BP NA NA NA NA NA
NA control 3 yellow and brown NA 1

One of those is obviously an error. (Lemon is not a number!) In a careful analysis, you would contact the person who recorded the data (that’s what the initials are for) and find out what went wrong. The same goes for the NA, where the time didn’t get recorded. For now, we’ll just discard these rows.

jelly <- jelly |> 
  mutate(reaction_s = as.numeric(reaction_s)) |>
  filter(!is.na(reaction_s))

The pointblank package also has a helpful function to produce a structured report on your whole dataset. Again the results are not shown here; create the report in your own R session to see what it looks like.

jelly |> scan_data()

25.5 Data cleaning

Freshly collected and tabulated data are often “dirty”: there are errors such as typographical mistakes, impossible values, or simply more detail than is appropriate for your analysis. Preparing a dataset for analysis is called data cleaning, and it can be a lengthy task that is at least as complex as the analysis itself. Although there are common tasks performed in data cleaning, every dataset presents its own set of challenges. That’s the reason we usually use “clean” data in courses.

Data cleaning tasks are very individual and can take considerable creativity. Here we’ll just try a few.

Let’s look at the values of flavour. It seems like there are a lot of them.

jelly |> count(flavour) |> arrange(-n) |> 
  kable() |> kable_styling(full_width = FALSE)
flavour n
orange 32
lemon 23
Orange 19
cherry 19
grape 18
coconut 14
Cherry 10
Banana 9
Grape 9
Lemon 9
banana 7
Coconut 5
lime 4
purple 4
Yellow White 3
apple 3
strawberry 3
Purple 2
White 2
Yellow Brown 2
blueberry 2
bubblegum 2
red 2
white 2
yellow 2
yellow and white 2
Apple 1
Bubblegum 1
Coffee 1
Lime 1
Pinapple 1
Plum 1
Red 1
Watermelon 1
cinnamon 1
cocnut 1
lemon/lime 1
licorice 1
marshmallow 1
raspberry 1
watermelon 1
yellow + white 1
yellow and brown 1

We definitely have some duplication. I’ll do one correction – changing all letters to lower case so differences in capitalization don’t duplicate flavours. (This simple change eliminates 13 different values of flavour.) If you look closely at the data, you will find spelling errors, different codings (and, +, or neither), and one missing flavour.

jelly |> mutate(flavour = tolower(flavour)) |>
  count(flavour) |> arrange(-n) |> 
  kable() |> kable_styling(full_width = FALSE)
flavour n
orange 51
lemon 32
cherry 29
grape 27
coconut 19
banana 16
purple 6
lime 5
apple 4
white 4
bubblegum 3
red 3
strawberry 3
yellow white 3
blueberry 2
watermelon 2
yellow 2
yellow and white 2
yellow brown 2
cinnamon 1
cocnut 1
coffee 1
lemon/lime 1
licorice 1
marshmallow 1
pinapple 1
plum 1
raspberry 1
yellow + white 1
yellow and brown 1

Use what you learned to tidy the data. Here are some changes I’ll make before continuing the analysis.

jelly <- jelly |> 
  mutate(flavour = tolower(flavour),
         flavour = case_when(flavour == "cocnut" ~ "coconut", 
                             TRUE ~ flavour), 
         flavour = str_replace(flavour, " and ", " "),
         flavour = str_replace(flavour, "[/+]", " "),
         flavour = str_squish(flavour))
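To see what the string steps do, here is a base-R re-creation (sub, gsub, and trimws standing in for str_replace and str_squish) applied to three sample values from the flavour column:

```r
# Base-R equivalent of the stringr cleaning above.
x <- c("yellow and white", "yellow + white", "lemon/lime")
x <- sub(" and ", " ", x, fixed = TRUE)   # drop the word "and"
x <- sub("[/+]", " ", x)                  # separators become spaces
trimws(gsub(" +", " ", x))                # collapse runs of spaces
#> [1] "yellow white" "yellow white" "lemon lime"
```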

25.6 Analyze a subset of your data

Let’s compute the average reaction time for correct and incorrect answers for each flavour and each treatment. We will count the number of observations for each grouping too.

jelly |> group_by(flavour, group, accuracy) |>
  summarize(count = n(),
            mean_reaction_time = mean(reaction_s),
            .groups = "drop") |> 
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
apple experimental 0 4 14.660000
banana control 1 9 12.616667
banana experimental 0 4 24.222500
banana experimental 1 3 17.166667
blueberry control 0 1 5.980000
blueberry experimental 0 1 20.730000
bubblegum control 0 1 7.300000
bubblegum experimental 0 2 17.620000
cherry control 0 4 17.487500
cherry control 1 12 6.785000
cherry experimental 0 11 15.835455
cherry experimental 1 1 12.320000
cherry experimental NA 1 11.000000
cinnamon control 0 1 5.920000
coconut control 0 3 22.596667
coconut control 1 6 6.821667
coconut experimental 0 9 18.861111
coconut experimental 1 2 15.040000
coffee experimental 0 1 7.650000
grape Control 1 1 13.250000
grape control 0 6 23.903333
grape control 1 5 5.926000
grape experimental 0 12 17.655833
grape experimental 1 3 16.473333
lemon control 0 10 10.756000
lemon control 1 8 7.138750
lemon experimental 0 10 17.110000
lemon experimental 1 3 15.613333
lemon NA 1 1 5.460000
lemon lime experimental 1 1 10.800000
licorice experimental 0 1 8.460000
lime control 0 3 10.640000
lime experimental 0 2 12.760000
marshmallow experimental 0 1 12.050000
orange control 0 9 14.307778
orange control 1 16 6.546250
orange experimental 0 16 18.295000
orange experimental 1 10 10.086000
pinapple experimental 0 1 11.770000
plum control 0 1 15.200000
purple control 1 2 14.445000
purple experimental 0 3 14.756667
purple experimental 1 1 36.000000
raspberry experimental 0 1 16.150000
red control 1 1 12.120000
red experimental 0 2 24.000000
strawberry control 0 3 23.716667
watermelon control 0 2 23.800000
white control 1 2 3.300000
white experimental 0 1 21.500000
white experimental 1 1 7.480000
yellow control 1 2 9.265000
yellow brown control 1 2 15.800000
yellow brown experimental 0 1 9.340000
yellow white control 0 1 13.060000
yellow white control 1 1 15.500000
yellow white experimental 0 3 20.200000
yellow white experimental 1 1 11.000000

This is a fairly simple calculation, but it serves to demonstrate the “manual check” method. Filter to just the incorrect, experimental results for the apple flavour. Check that the number of rows and the average match.

jelly |> filter(flavour == "apple",
                 accuracy == 0,
                 group == "experimental") |> 
  kable() |> kable_styling(full_width = FALSE)
initials group trial_number flavour reaction_s accuracy
FACP experimental 1 apple 14.30 0
VP experimental 1 apple 21.35 0
NA experimental 2 apple 9.99 0
Tw experimental 1 apple 13.00 0

Four rows, and (14.30 + 21.35 + 9.99 + 13.00)/4 = 14.66, matching the summary table. Looks good!

25.7 Use fake data where you already know the answer

I’ll create some simple data and then compute the averages using exactly the same code I wrote above.

my_jelly <- tribble(
  ~group, ~ flavour, ~ reaction_s, ~ accuracy,
  "experimental", "lime", 1, 1,
  "experimental", "lime", 2, 1,
  "experimental", "lemon", 3, 1,
  "control", "lemon", 4, 1,
)
my_jelly |> kable() |> kable_styling(full_width = FALSE)
group flavour reaction_s accuracy
experimental lime 1 1
experimental lime 2 1
experimental lemon 3 1
control lemon 4 1

I know exactly what I’m expecting. Stop and decide for yourself before reading on.

my_jelly |> group_by(flavour, group, accuracy) |>
  summarize(count = n(),
            mean_reaction_time = mean(reaction_s),
            .groups = "drop") |> 
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

25.8 Do your calculations two different ways

Suppose we aren’t confident that we know what mean does. Even simple functions like sd (for standard deviation) and median might not do what you think they should. How can we check? Let’s write our own function based on sum, length, and division.

my_mean <- function(x) sum(x) / length(x)
my_jelly |> group_by(flavour, 
                      group, 
                      accuracy) |>
  summarize(count = length(flavour),
            mean_reaction_time = my_mean(reaction_s),
            .groups = "drop") |>
  kable() |> kable_styling(full_width = FALSE)
flavour group accuracy count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

Excellent! We got the same answers.

That’s an introduction to testing calculations. These are good checks to make as you develop and test your analysis. The next step is to write computer code that checks your answers as well, by giving sample data to your calculation code along with the known answers. There are packages (of course!) to help with that: testthat and tinytest (links below). We won’t be using these in the course, but you should take time to learn about them after the course.
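Before reaching for those packages, you can get much of the benefit with base R’s stopifnot: feed your calculation code sample data with known answers and halt if they disagree. A minimal sketch using my_mean from above:

```r
my_mean <- function(x) sum(x) / length(x)

# Known answers computed by hand; any failure stops the script
# with an error naming the assertion that failed.
stopifnot(
  my_mean(c(1, 2, 3)) == 2,
  my_mean(10) == 10,
  abs(my_mean(c(1.5, 2.5)) - 2) < 1e-12
)
```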

25.9 At the end of your analysis

  • Provide a relatively simple dataset, together with analysis results, that can be used to verify your code works the same way in the future, and that someone who develops a new way of analysing the data can use for comparison
  • Document any testing processes you used for your data and for your computer code so that later users will know what problems you looked for
  • Document any problems you found in the data and what steps you took to fix the problems
  • Describe any weaknesses in your method which you anticipate might cause problems for future users

25.10 Lessons for this course

We are not integrating testing into our work in this course. This lesson is here to alert you to these problems, make you aware that many people think carefully about them, and show you the first steps toward developing a good quality assurance plan. Be aware that testing takes time – possibly as much time as “the rest” of the work you do for data analysis.

25.11 Packages used

In addition to tidyverse, here, and kableExtra, we’ve used the following packages:

  • janitor for clean_names and to remove columns of purely missing data
  • naniar to summarize missing data
  • pointblank to make more detailed reports on data tables
  • dlookr to summarize data
  • DT for table formatting

25.12 Further reading