25 Checking your work

25.1 An introduction to testing

You’ve made a data visualization. It looks great. All the calculations necessary are in a single Rmd file – reading the data, organizing the data, and creating the figure. You can revise the data and reproduce the plot any time you want.

But then a thought occurs to you – how do I know the visualization is correct? The computer saves us time by doing calculations for us, but the price is that you can’t keep track of everything it does.

What if there is a problem with the data? Maybe the data you analyzed has a lot of observations – too many to check by hand. Or maybe the data are fine, but something went wrong when you read them into R. There are lots of ways for errors to creep in. Missing values when you thought there were none. Unexpected levels of factors (too many or too few). Detectable errors in the data such as impossible values.

The best idea to counteract all these problems is testing. The secret is to get the computer to perform the tests for you. In this lesson we will discuss two kinds of testing: checking your data and checking your calculations.

A lot of testing is done for you before you even start R: most (we hope all) the packages you use are carefully tested to be sure they work as intended. Still – you might misunderstand what the packages are supposed to do. Or you might make a typo and use the wrong variable name somewhere. Or you might get the logic of your calculation wrong. So you should test your work in as many ways as you can.

The most important reason to test is that you will catch mistakes. Possibly the second most important reason for doing testing – and being explicit about it – is that it can help the people who use your analysis have more confidence in it. This includes you in the future!

25.2 Checking analysis and visualizations

Once you have a preliminary analysis, develop a testing plan. Methods that can help:

  • Check for errors in your data
  • Perform your analysis on a small subset of your data that you can understand without computer help. This is the “sample calculation” method of testing
  • Perform your analysis on simulated or fake “data” designed to test certain cases (independent variables, strong dependence, etc.) that will allow you to check your methods and interpretation
  • Perform your calculations or visualizations two different ways, to check your understanding. This is especially useful if you have a fast and a slow way of doing a calculation. Use the slow way as a check on the fast method.

25.3 Data errors

What kinds of errors can appear in data?

  • Misspellings, upper/lower case inconsistency, extra spaces
  • Duplicated observations
  • Miss-coded missing data (-999, 0)
  • Inconsistently formatted dates and times
  • Impossible values arising from typographical errors
  • Data in wrong columns
  • All the data can look correct, but the methods may have changed, creating “silent” errors

Why do data sometimes have errors?

  • To answer this, you need to know about the process used to create the data. Were some data output by a particular machine or software package that has errors? Were the data typed in by a single person? Were many different people, who may have had different understandings of the data collection goals, involved? Were the data collected over a long period of time, when machines, people, and goals may have changed?

How can you find errors in data?

  • Make summary tables and cross-check the summaries
  • Look for missing data
  • Check to be sure the correct number of variables (columns) and rows (observations) are present
  • Plot histograms and densities
  • Write tests to test data belong to a legitimate set of values

What do you do with errors?

  • Keep original data, so you can revert the change in case of a misunderstanding
  • Log changes and describe error
  • Write tests to check for future errros
  • Communicate with the people who collected the data and the people who will receive the analysis

25.3.1 Sample data to test

The following data were collected by a class of students evaluating their ability to identify a jelly bean flavour by blindfolded testing. Much of our sense of taste comes from smell, so there were two treatments: a control with the participant was blindfolded and a treatment where the participant was blindfolded and blocked their nose to reduce the sense of smell. Groups of students entered their data into a single Google docs spreadsheet which was exported to the csv file below.

This dataset has a lot of problems, but they are very typical for data entered by a group of people who are not directly involved in the systematic analysis of the data with a software package like R. (They are not attuned to the problems of data errors.) Take a look at the file and see how many problems you can find before continuing.

jelly <- read_csv(here("static/jelly-bean-data.csv")) %>% clean_names()
## Warning: Missing column names filled in: 'X7' [7], 'X8' [8], 'X9' [9],
## 'X10' [10], 'X11' [11], 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15],
## 'X16' [16], 'X17' [17], 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21],
## 'X22' [22], 'X23' [23], 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_logical(),
##   Initials = col_character(),
##   `Group (choose from drop down)` = col_character(),
##   `Trial #` = col_double(),
##   Flavour = col_character(),
##   `Reaction time (in sec)` = col_character(),
##   `Accuracy (0=incorrect, 1=correct)` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# jelly %>% kable() %>% scroll_box()

You should always look at data. For a first look, View, skim (in the skimr package), and glimpse functions are useful and will identify some problems.

glimpse(jelly)
## Rows: 261
## Columns: 27
## $ initials                       <chr> "PKL", "PKL", "PKL", "LAG", "LAG", "LAG…
## $ group_choose_from_drop_down    <chr> "control", "control", "control", "exper…
## $ trial_number                   <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, NA,…
## $ flavour                        <chr> "orange", "strawberry", "cherry", "oran…
## $ reaction_time_in_sec           <chr> "7.2", "7.6", "10.1", "10.15", "12.24",…
## $ accuracy_0_incorrect_1_correct <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, NA,…
## $ x7                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x8                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x9                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x10                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x11                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x12                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x13                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x14                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x15                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x16                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x17                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x18                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x19                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x20                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x21                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x22                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x23                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x24                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x25                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x26                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x27                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
# skimr::skim(jelly)

There are many columns that have no heading and are purely missing data. Before getting rid of those columns, let’s be sure they are all missing using the miss_var_summary function from the naniar package.

jelly %>% miss_var_summary() %>% kable() %>% kable_styling(full_width = FALSE)
variable n_miss pct_miss
x7 261 100.00000
x8 261 100.00000
x9 261 100.00000
x10 261 100.00000
x11 261 100.00000
x12 261 100.00000
x13 261 100.00000
x14 261 100.00000
x15 261 100.00000
x16 261 100.00000
x17 261 100.00000
x18 261 100.00000
x19 261 100.00000
x20 261 100.00000
x21 261 100.00000
x22 261 100.00000
x23 261 100.00000
x24 261 100.00000
x25 261 100.00000
x26 261 100.00000
x27 261 100.00000
initials 88 33.71648
group_choose_from_drop_down 34 13.02682
reaction_time_in_sec 34 13.02682
accuracy_0_incorrect_1_correct 34 13.02682
trial_number 33 12.64368
flavour 33 12.64368

Now we will get rid of those data.

jelly <- jelly %>% select(-(x7:x27))

The dlookr package provides a similar table, plus it adds a count of the number of distinct values for each variable.

diagnose(jelly) %>% kable() %>% kable_styling(full_width = FALSE)
variables types missing_count missing_percent unique_count unique_rate
initials character 88 33.71648 75 0.2873563
group_choose_from_drop_down character 34 13.02682 4 0.0153257
trial_number numeric 33 12.64368 4 0.0153257
flavour character 33 12.64368 44 0.1685824
reaction_time_in_sec character 34 13.02682 201 0.7701149
accuracy_0_incorrect_1_correct numeric 34 13.02682 3 0.0114943

There are at least 12% missing values in each variable, so there may be some rows that are completely missing. If you take another look at the data you will see that a lot of rows are all NA. Let’s get rid of them. First find the rows that are all NA. First we count the number of missing data in each row (mutate) then we remove the rows where every variable is missing (filter). Whenever I clean data, I like to preview what will happen before modifying the dataset.

jelly %>% rowwise() %>%
  mutate(n_na = sum(is.na(across()))) %>% 
  filter(n_na == ncol(jelly)) %>%
  head() %>% kable()
initials group_choose_from_drop_down trial_number flavour reaction_time_in_sec accuracy_0_incorrect_1_correct n_na
NA NA NA NA NA NA 6
NA NA NA NA NA NA 6
NA NA NA NA NA NA 6
NA NA NA NA NA NA 6
NA NA NA NA NA NA 6
NA NA NA NA NA NA 6

Now remove them:

jelly <- jelly %>% rowwise() %>%
  mutate(n_na = sum(is.na(across()))) %>% 
  ungroup() %>%
  filter(n_na < ncol(jelly)) %>%
  select(-n_na)

We still don’t know what most of thse variables mean, but already one thing should stand out. The variable accuracy_0_incorrect_1_correct presumably should only be 0 or 1, but has three different values. We may not care about the initials of the investigator, but we might be concerned about the fact that there are 88 missing values (about 50 more than the all blank rows we removed.)

25.4 Data quality assurance

You can get report on your data using the pointblank package. (See the documentation linked in further reading for examples.) Let’s test a few things:

  • are any values of the accuracy variable not 0 or 1?
  • is reaction time and trial_number numeric? (We know the answer already, but I’ll show how to test this condition.)
  • are there any negative reaction times?
  • are all the rows distinct? (It’s unlikely, but not impossible to get the same result twice.)

First we will write the tests.

my_test <- jelly %>% 
  create_agent(actions = action_levels(warn_at = 1)) %>%
  col_vals_in_set(accuracy_0_incorrect_1_correct, c(0,1)) %>%
  col_is_numeric(vars(trial_number, reaction_time_in_sec)) %>%
  col_vals_gt(reaction_time_in_sec, 0) %>%
  rows_distinct() 

Now you can use the tests to get a report. (The report is not shown. Experiment with these functions on your own.)

my_test %>% interrogate() 

Reaction time (in seconds) is text not a number. Let’s fix that. Probably there will be some non-numeric values in there, because otherwise the data would have been read as numeric. So let’s look for those rows first. If you try to convert text to a number and the text doesn’t look like a number, you’ll get an NA instead. So filter out the NAs that get created when you convert from text to number. R will give us a message to alert us to the fact that there are some NAs created.

jelly %>% filter(is.na(as.numeric(reaction_time_in_sec))) %>%
  kable() %>% kable_styling(full_width = FALSE)
## Warning in mask$eval_all_filter(dots, env_filter): NAs introduced by coercion
initials group_choose_from_drop_down trial_number flavour reaction_time_in_sec accuracy_0_incorrect_1_correct
KS experimental 3 Orange Lemon 0
BP NA NA NA NA NA
NA control 3 yellow and brown NA 1

One of those is obviously an error. (Lemon is not a number!) In a careful analysis, you would contact the person who recorded the data (that’s what the initials are for) and find out what went wrong. Same thing with the NA, where the time didn’t get recorded. For now, we’ll just throw away these numbers.

jelly <- jelly %>% 
  mutate(reaction_time_in_sec = as.numeric(reaction_time_in_sec)) %>%
  filter(!is.na(reaction_time_in_sec))

The pointblank package also has a helpful function to produce a structured report on your whole dataset. (Again the results are not shown here.)

jelly %>% scan_data()

25.5 Data cleaning

Freshly collected and tabulated data are often “dirty”: there are errors in the data such as typographical errors, impossible values, or simply more detail than is appropriate for your analysis. Preparing a dataset for analysis is called data cleaning and it can be a complex and lengthy task that is at least as complext as analyzing data. Although there are common tasks performed in data cleaning, every dataset presents its own set of challenges. That’s the reason we usually use “clean” data in courses.

Data cleaning tasks are very individual and can take considerable creativity. Here we’ll just try a few.

Let’s look at the values of flavour. It seems like there are a lot of them.

jelly %>% count(flavour) %>% arrange(-n) %>% 
  kable() %>% kable_styling(full_width = FALSE)
flavour n
orange 32
lemon 23
cherry 19
Orange 19
grape 18
coconut 14
Cherry 10
Banana 9
Grape 9
Lemon 9
banana 7
Coconut 5
lime 4
purple 4
apple 3
strawberry 3
Yellow White 3
blueberry 2
bubblegum 2
Purple 2
red 2
white 2
White 2
yellow 2
yellow and white 2
Yellow Brown 2
Apple 1
Bubblegum 1
cinnamon 1
cocnut 1
Coffee 1
lemon/lime 1
licorice 1
Lime 1
marshmallow 1
Pinapple 1
Plum 1
raspberry 1
Red 1
watermelon 1
Watermelon 1
yellow + white 1
yellow and brown 1

We definitely have some duplication. I’ll do one correction – changing all letters to lower case so differences in capitalization don’t duplicate flavours. (This simple change eliminates 13 different values of flavour.) If you look closely at the data, you will find spelling errors, different codings (and, +, or neither), and one missing flavour.

jelly %>% mutate(flavour = tolower(flavour)) %>%
  count(flavour) %>% arrange(-n) %>% 
  kable() %>% kable_styling(full_width = FALSE)
flavour n
orange 51
lemon 32
cherry 29
grape 27
coconut 19
banana 16
purple 6
lime 5
apple 4
white 4
bubblegum 3
red 3
strawberry 3
yellow white 3
blueberry 2
watermelon 2
yellow 2
yellow and white 2
yellow brown 2
cinnamon 1
cocnut 1
coffee 1
lemon/lime 1
licorice 1
marshmallow 1
pinapple 1
plum 1
raspberry 1
yellow + white 1
yellow and brown 1

Use what you learned to tidy the data. Here are some changes I’ll make before continuing the analysis.

jelly <- jelly %>% 
  mutate(flavour = tolower(flavour),
         flavour = case_when(flavour == "cocnut" ~ "coconut", 
                             TRUE ~ flavour), 
         flavour = str_replace(flavour, " and ", " "),
         flavour = str_replace(flavour, "[/+]", " "),
         flavour = str_squish(flavour))

25.6 Analyze a subset of your data

Let’s compute the average reaction time for correct and incorrect answers for each flavour and each treatment. We will count the number of observations for each grouping too.

jelly %>% group_by(flavour, group_choose_from_drop_down, accuracy_0_incorrect_1_correct) %>%
  summarize(count = n(),
            mean_reaction_time = mean(reaction_time_in_sec),
            .groups = "drop") %>% 
  kable() %>% kable_styling(full_width = FALSE)
flavour group_choose_from_drop_down accuracy_0_incorrect_1_correct count mean_reaction_time
apple experimental 0 4 14.660000
banana control 1 9 12.616667
banana experimental 0 4 24.222500
banana experimental 1 3 17.166667
blueberry control 0 1 5.980000
blueberry experimental 0 1 20.730000
bubblegum control 0 1 7.300000
bubblegum experimental 0 2 17.620000
cherry control 0 4 17.487500
cherry control 1 12 6.785000
cherry experimental 0 11 15.835455
cherry experimental 1 1 12.320000
cherry experimental NA 1 11.000000
cinnamon control 0 1 5.920000
coconut control 0 3 22.596667
coconut control 1 6 6.821667
coconut experimental 0 9 18.861111
coconut experimental 1 2 15.040000
coffee experimental 0 1 7.650000
grape control 0 6 23.903333
grape control 1 5 5.926000
grape Control 1 1 13.250000
grape experimental 0 12 17.655833
grape experimental 1 3 16.473333
lemon control 0 10 10.756000
lemon control 1 8 7.138750
lemon experimental 0 10 17.110000
lemon experimental 1 3 15.613333
lemon NA 1 1 5.460000
lemon lime experimental 1 1 10.800000
licorice experimental 0 1 8.460000
lime control 0 3 10.640000
lime experimental 0 2 12.760000
marshmallow experimental 0 1 12.050000
orange control 0 9 14.307778
orange control 1 16 6.546250
orange experimental 0 16 18.295000
orange experimental 1 10 10.086000
pinapple experimental 0 1 11.770000
plum control 0 1 15.200000
purple control 1 2 14.445000
purple experimental 0 3 14.756667
purple experimental 1 1 36.000000
raspberry experimental 0 1 16.150000
red control 1 1 12.120000
red experimental 0 2 24.000000
strawberry control 0 3 23.716667
watermelon control 0 2 23.800000
white control 1 2 3.300000
white experimental 0 1 21.500000
white experimental 1 1 7.480000
yellow control 1 2 9.265000
yellow brown control 1 2 15.800000
yellow brown experimental 0 1 9.340000
yellow white control 0 1 13.060000
yellow white control 1 1 15.500000
yellow white experimental 0 3 20.200000
yellow white experimental 1 1 11.000000

This is a fairly simple calculation, but it serves to demonstrate the “manual check” method. Filter out just the incorrect, experimental results for apple flavours. Check that the number of rows and average match.

jelly %>% filter(flavour == "apple",
                 accuracy_0_incorrect_1_correct == 0,
                 group_choose_from_drop_down == "experimental") %>% 
  kable() %>% kable_styling(full_width = FALSE)
initials group_choose_from_drop_down trial_number flavour reaction_time_in_sec accuracy_0_incorrect_1_correct
FACP experimental 1 apple 14.30 0
VP experimental 1 apple 21.35 0
NA experimental 2 apple 9.99 0
Tw experimental 1 apple 13.00 0

Looks good!

25.7 Use fake data, where you already know the answer

This example will be a bit contrived, but its a good example of the idea. I’ll create some simple data and then compute the averages using exactly the same code I wrote above. I’ll put in a NA in one row just to see what happens.

my_jelly <- tribble(
  ~group_choose_from_drop_down, ~ flavour, ~ reaction_time_in_sec, ~ accuracy_0_incorrect_1_correct,
  "experimental", "lime", 1, 1,
  "experimental", "lime", 2, 1,
  "experimental", "lemon", 3, 1,
  "control", "lemon", 4, 1,
)
my_jelly %>% kable() %>% kable_styling(full_width = FALSE)
group_choose_from_drop_down flavour reaction_time_in_sec accuracy_0_incorrect_1_correct
experimental lime 1 1
experimental lime 2 1
experimental lemon 3 1
control lemon 4 1

I know exactly what I’m expecting. Stop and decide for yourself before reading on.

my_jelly %>% group_by(flavour, group_choose_from_drop_down, accuracy_0_incorrect_1_correct) %>%
  summarize(count = n(),
            mean_reaction_time = mean(reaction_time_in_sec),
            .groups = "drop") %>% 
  kable() %>% kable_styling(full_width = FALSE)
flavour group_choose_from_drop_down accuracy_0_incorrect_1_correct count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

25.8 Do you calculations two different ways

Suppose we weren’t really confident that we knew what mean does? Even simple functions like sd (for standard deviation) and median might not do what you think they should. How can we check it? Let’s write our own function based on sum, length, and division.

my_mean = function(x) sum(x) / length(x)
my_jelly %>% group_by(flavour, 
                      group_choose_from_drop_down, 
                      accuracy_0_incorrect_1_correct) %>%
  summarize(count = length(flavour),
            mean_reaction_time = my_mean(reaction_time_in_sec),
            .groups = "drop") %>%
  kable() %>% kable_styling(full_width = FALSE)
flavour group_choose_from_drop_down accuracy_0_incorrect_1_correct count mean_reaction_time
lemon control 1 1 4.0
lemon experimental 1 1 3.0
lime experimental 1 2 1.5

Excellent! We got the same answers.

That’s an introduction to testing calculations. These are good checks to make as you calculate and test your analysis. The next step is to write computer code that checks your answers as well, by giving sample data to your calculation code along with the known answers. There is a package (of course!) for helping with that: testthat and tinytest (links below). We won’t be using these in the course, but you should take time to learn about them after the course.

25.9 At the end of your analysis

  • Provide a relatively simple dataset together with analysis results that can be used to verify your code is working the same way in the future, or that someone who develops a new way of analysing the data can use for comparison
  • Document any testing processes you used for your data and for your computer code so that later users will know what problems you looked for
  • Document any problems you found in the data and what steps you took to fix the problems
  • Describe any weaknesses in your method which anticipate might cause problems for future users

25.10 Lessons for this course

We are not integrating testing into our work in this course. This lesson is here to alert you to this problem, ensure you know many people think carefully about this, and to show you the first steps to developing a good quality assurance plan. Be aware that testing takes time – possibly as much time as “the rest” of the work you do for data analysis.

25.11 Further reading