Chapter 25 Checking your work | Data Visualization

25.1 An introduction to testing

You’ve made a data visualization. It looks great. All the calculations necessary are in a single Rmd file – reading the data, organizing the data, and creating the figure. You can revise the data and reproduce the plot any time you want.

But then a thought occurs to you – how do I know the visualization is correct? The computer saves us time by doing calculations for us, but the price is that you can’t keep track of everything it does.

What if there is a problem with the data? Maybe the data you analyzed has a lot of observations – too many to check by hand. Or maybe the data are fine, but something went wrong when you read them into R. There are lots of ways for errors to creep in. Missing values when you thought there were none. Unexpected levels of factors (too many or too few). Detectable errors in the data such as impossible values.

The best idea to counteract all these problems is testing. The secret is to get the computer to perform the tests for you. In this lesson we will discuss two kinds of testing: checking your data and checking your calculations.

A lot of testing is done for you before you even start R: most (we hope all) the packages you use are carefully tested to be sure they work as intended. Still – you might misunderstand what the packages are supposed to do. Or you might make a typo and use the wrong variable name somewhere. Or you might get the logic of your calculation wrong. So you should test your work in as many ways as you can.

The most important reason to test is that you will catch mistakes. Possibly the second most important reason for doing testing – and being explicit about it – is that it can help the people who use your analysis have more confidence in it. This includes you in the future!

25.2 Checking analysis and visualizations

Once you have a preliminary analysis, develop a testing plan. Methods that can help:

Check for errors in your data
Perform your analysis on a small subset of your data that you can understand without computer help. This is the “sample calculation” method of testing
Perform your analysis on simulated or fake “data” designed to test certain cases (independent variables, strong dependence, etc.) that will allow you to check your methods and interpretation
Perform your calculations or visualizations two different ways, to check your understanding. This is especially useful if you have a fast and a slow way of doing a calculation. Use the slow way as a check on the fast method.

25.3 Data errors

What kinds of errors can appear in data?

Misspellings, upper/lower case inconsistency, extra spaces
Duplicated observations
Miss-coded missing data (-999, 0)
Inconsistently formatted dates and times
Impossible values arising from typographical errors
Data in wrong columns
All the data can look correct, but the methods may have changed, creating “silent” errors

Why do data sometimes have errors?

To answer this, you need to know about the process used to create the data. Were some data output by a particular machine or software package that has errors? Were the data typed in by a single person? Were many different people, who may have had different understandings of the data collection goals, involved? Were the data collected over a long period of time, when machines, people, and goals may have changed?

How can you find errors in data?

Make summary tables and cross-check the summaries
Look for missing data
Check to be sure the correct number of variables (columns) and rows (observations) are present
Plot histograms and densities
Write tests to test data belong to a legitimate set of values

What do you do with errors?

Keep original data, so you can revert the change in case of a misunderstanding
Log changes and describe error
Write tests to check for future errros
Communicate with the people who collected the data and the people who will receive the analysis

25.3.1 Sample data to test

The following data were collected by a class of students evaluating their ability to identify a jelly bean flavour by blindfolded testing. Much of our sense of taste comes from smell, so there were two treatments: a control with the participant was blindfolded and a treatment where the participant was blindfolded and blocked their nose to reduce the sense of smell. Groups of students entered their data into a single Google docs spreadsheet which was exported to the csv file below.

This dataset has a lot of problems, but they are very typical for data entered by a group of people who are not directly involved in the systematic analysis of the data with a software package like R. (They are not attuned to the problems of data errors.) Take a look at the file and see how many problems you can find before continuing.

jelly <- read_csv(here("static/jelly-bean-data.csv")) %>% clean_names()

## Warning: Missing column names filled in: 'X7' [7], 'X8' [8], 'X9' [9],
## 'X10' [10], 'X11' [11], 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15],
## 'X16' [16], 'X17' [17], 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21],
## 'X22' [22], 'X23' [23], 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27]

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_logical(),
##   Initials = col_character(),
##   `Group (choose from drop down)` = col_character(),
##   `Trial #` = col_double(),
##   Flavour = col_character(),
##   `Reaction time (in sec)` = col_character(),
##   `Accuracy (0=incorrect, 1=correct)` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.

# jelly %>% kable() %>% scroll_box()

You should always look at data. For a first look, View, skim (in the skimr package), and glimpse functions are useful and will identify some problems.

glimpse(jelly)

## Rows: 261
## Columns: 27
## $ initials                       <chr> "PKL", "PKL", "PKL", "LAG", "LAG", "LAG…
## $ group_choose_from_drop_down    <chr> "control", "control", "control", "exper…
## $ trial_number                   <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, NA,…
## $ flavour                        <chr> "orange", "strawberry", "cherry", "oran…
## $ reaction_time_in_sec           <chr> "7.2", "7.6", "10.1", "10.15", "12.24",…
## $ accuracy_0_incorrect_1_correct <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, NA,…
## $ x7                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x8                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x9                             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x10                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x11                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x12                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x13                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x14                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x15                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x16                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x17                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x18                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x19                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x20                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x21                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x22                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x23                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x24                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x25                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x26                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ x27                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

# skimr::skim(jelly)

There are many columns that have no heading and are purely missing data. Before getting rid of those columns, let’s be sure they are all missing using the miss_var_summary function from the naniar package.

jelly %>% miss_var_summary() %>% kable() %>% kable_styling(full_width = FALSE)

variable	n_miss	pct_miss
x7	261	100.00000
x8	261	100.00000
x9	261	100.00000
x10	261	100.00000
x11	261	100.00000
x12	261	100.00000
x13	261	100.00000
x14	261	100.00000
x15	261	100.00000
x16	261	100.00000
x17	261	100.00000
x18	261	100.00000
x19	261	100.00000
x20	261	100.00000
x21	261	100.00000
x22	261	100.00000
x23	261	100.00000
x24	261	100.00000
x25	261	100.00000
x26	261	100.00000
x27	261	100.00000
initials	88	33.71648
group_choose_from_drop_down	34	13.02682
reaction_time_in_sec	34	13.02682
accuracy_0_incorrect_1_correct	34	13.02682
trial_number	33	12.64368
flavour	33	12.64368

Now we will get rid of those data.

jelly <- jelly %>% select(-(x7:x27))

The dlookr package provides a similar table, plus it adds a count of the number of distinct values for each variable.

diagnose(jelly) %>% kable() %>% kable_styling(full_width = FALSE)

variables	types	missing_count	missing_percent	unique_count	unique_rate
initials	character	88	33.71648	75	0.2873563
group_choose_from_drop_down	character	34	13.02682	4	0.0153257
trial_number	numeric	33	12.64368	4	0.0153257
flavour	character	33	12.64368	44	0.1685824
reaction_time_in_sec	character	34	13.02682	201	0.7701149
accuracy_0_incorrect_1_correct	numeric	34	13.02682	3	0.0114943

There are at least 12% missing values in each variable, so there may be some rows that are completely missing. If you take another look at the data you will see that a lot of rows are all NA. Let’s get rid of them. First find the rows that are all NA. First we count the number of missing data in each row (mutate) then we remove the rows where every variable is missing (filter). Whenever I clean data, I like to preview what will happen before modifying the dataset.

jelly %>% rowwise() %>%
  mutate(n_na = sum(is.na(across()))) %>% 
  filter(n_na == ncol(jelly)) %>%
  head() %>% kable()

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct	n_na
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6

Now remove them:

jelly <- jelly %>% rowwise() %>%
  mutate(n_na = sum(is.na(across()))) %>% 
  ungroup() %>%
  filter(n_na < ncol(jelly)) %>%
  select(-n_na)

We still don’t know what most of thse variables mean, but already one thing should stand out. The variable accuracy_0_incorrect_1_correct presumably should only be 0 or 1, but has three different values. We may not care about the initials of the investigator, but we might be concerned about the fact that there are 88 missing values (about 50 more than the all blank rows we removed.)

25.4 Data quality assurance

You can get report on your data using the pointblank package. (See the documentation linked in further reading for examples.) Let’s test a few things:

are any values of the accuracy variable not 0 or 1?
is reaction time and trial_number numeric? (We know the answer already, but I’ll show how to test this condition.)
are there any negative reaction times?
are all the rows distinct? (It’s unlikely, but not impossible to get the same result twice.)

First we will write the tests.

my_test <- jelly %>% 
  create_agent(actions = action_levels(warn_at = 1)) %>%
  col_vals_in_set(accuracy_0_incorrect_1_correct, c(0,1)) %>%
  col_is_numeric(vars(trial_number, reaction_time_in_sec)) %>%
  col_vals_gt(reaction_time_in_sec, 0) %>%
  rows_distinct()

Now you can use the tests to get a report. (The report is not shown. Experiment with these functions on your own.)

my_test %>% interrogate()

Reaction time (in seconds) is text not a number. Let’s fix that. Probably there will be some non-numeric values in there, because otherwise the data would have been read as numeric. So let’s look for those rows first. If you try to convert text to a number and the text doesn’t look like a number, you’ll get an NA instead. So filter out the NAs that get created when you convert from text to number. R will give us a message to alert us to the fact that there are some NAs created.

jelly %>% filter(is.na(as.numeric(reaction_time_in_sec))) %>%
  kable() %>% kable_styling(full_width = FALSE)

## Warning in mask$eval_all_filter(dots, env_filter): NAs introduced by coercion

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct
KS	experimental	3	Orange	Lemon	0
BP	NA	NA	NA	NA	NA
NA	control	3	yellow and brown	NA	1

One of those is obviously an error. (Lemon is not a number!) In a careful analysis, you would contact the person who recorded the data (that’s what the initials are for) and find out what went wrong. Same thing with the NA, where the time didn’t get recorded. For now, we’ll just throw away these numbers.

jelly <- jelly %>% 
  mutate(reaction_time_in_sec = as.numeric(reaction_time_in_sec)) %>%
  filter(!is.na(reaction_time_in_sec))

The pointblank package also has a helpful function to produce a structured report on your whole dataset. (Again the results are not shown here.)

jelly %>% scan_data()

25.5 Data cleaning

Freshly collected and tabulated data are often “dirty”: there are errors in the data such as typographical errors, impossible values, or simply more detail than is appropriate for your analysis. Preparing a dataset for analysis is called data cleaning and it can be a complex and lengthy task that is at least as complext as analyzing data. Although there are common tasks performed in data cleaning, every dataset presents its own set of challenges. That’s the reason we usually use “clean” data in courses.

Data cleaning tasks are very individual and can take considerable creativity. Here we’ll just try a few.

Let’s look at the values of flavour. It seems like there are a lot of them.

jelly %>% count(flavour) %>% arrange(-n) %>% 
  kable() %>% kable_styling(full_width = FALSE)

flavour	n
orange	32
lemon	23
cherry	19
Orange	19
grape	18
coconut	14
Cherry	10
Banana	9
Grape	9
Lemon	9
banana	7
Coconut	5
lime	4
purple	4
apple	3
strawberry	3
Yellow White	3
blueberry	2
bubblegum	2
Purple	2
red	2
white	2
White	2
yellow	2
yellow and white	2
Yellow Brown	2
Apple	1
Bubblegum	1
cinnamon	1
cocnut	1
Coffee	1
lemon/lime	1
licorice	1
Lime	1
marshmallow	1
Pinapple	1
Plum	1
raspberry	1
Red	1
watermelon	1
Watermelon	1
yellow + white	1
yellow and brown	1

We definitely have some duplication. I’ll do one correction – changing all letters to lower case so differences in capitalization don’t duplicate flavours. (This simple change eliminates 13 different values of flavour.) If you look closely at the data, you will find spelling errors, different codings (and, +, or neither), and one missing flavour.

jelly %>% mutate(flavour = tolower(flavour)) %>%
  count(flavour) %>% arrange(-n) %>% 
  kable() %>% kable_styling(full_width = FALSE)

flavour	n
orange	51
lemon	32
cherry	29
grape	27
coconut	19
banana	16
purple	6
lime	5
apple	4
white	4
bubblegum	3
red	3
strawberry	3
yellow white	3
blueberry	2
watermelon	2
yellow	2
yellow and white	2
yellow brown	2
cinnamon	1
cocnut	1
coffee	1
lemon/lime	1
licorice	1
marshmallow	1
pinapple	1
plum	1
raspberry	1
yellow + white	1
yellow and brown	1

Use what you learned to tidy the data. Here are some changes I’ll make before continuing the analysis.

jelly <- jelly %>% 
  mutate(flavour = tolower(flavour),
         flavour = case_when(flavour == "cocnut" ~ "coconut", 
                             TRUE ~ flavour), 
         flavour = str_replace(flavour, " and ", " "),
         flavour = str_replace(flavour, "[/+]", " "),
         flavour = str_squish(flavour))

25.6 Analyze a subset of your data

Let’s compute the average reaction time for correct and incorrect answers for each flavour and each treatment. We will count the number of observations for each grouping too.

jelly %>% group_by(flavour, group_choose_from_drop_down, accuracy_0_incorrect_1_correct) %>%
  summarize(count = n(),
            mean_reaction_time = mean(reaction_time_in_sec),
            .groups = "drop") %>% 
  kable() %>% kable_styling(full_width = FALSE)

flavour	group_choose_from_drop_down	accuracy_0_incorrect_1_correct	count	mean_reaction_time
apple	experimental	0	4	14.660000
banana	control	1	9	12.616667
banana	experimental	0	4	24.222500
banana	experimental	1	3	17.166667
blueberry	control	0	1	5.980000
blueberry	experimental	0	1	20.730000
bubblegum	control	0	1	7.300000
bubblegum	experimental	0	2	17.620000
cherry	control	0	4	17.487500
cherry	control	1	12	6.785000
cherry	experimental	0	11	15.835455
cherry	experimental	1	1	12.320000
cherry	experimental	NA	1	11.000000
cinnamon	control	0	1	5.920000
coconut	control	0	3	22.596667
coconut	control	1	6	6.821667
coconut	experimental	0	9	18.861111
coconut	experimental	1	2	15.040000
coffee	experimental	0	1	7.650000
grape	control	0	6	23.903333
grape	control	1	5	5.926000
grape	Control	1	1	13.250000
grape	experimental	0	12	17.655833
grape	experimental	1	3	16.473333
lemon	control	0	10	10.756000
lemon	control	1	8	7.138750
lemon	experimental	0	10	17.110000
lemon	experimental	1	3	15.613333
lemon	NA	1	1	5.460000
lemon lime	experimental	1	1	10.800000
licorice	experimental	0	1	8.460000
lime	control	0	3	10.640000
lime	experimental	0	2	12.760000
marshmallow	experimental	0	1	12.050000
orange	control	0	9	14.307778
orange	control	1	16	6.546250
orange	experimental	0	16	18.295000
orange	experimental	1	10	10.086000
pinapple	experimental	0	1	11.770000
plum	control	0	1	15.200000
purple	control	1	2	14.445000
purple	experimental	0	3	14.756667
purple	experimental	1	1	36.000000
raspberry	experimental	0	1	16.150000
red	control	1	1	12.120000
red	experimental	0	2	24.000000
strawberry	control	0	3	23.716667
watermelon	control	0	2	23.800000
white	control	1	2	3.300000
white	experimental	0	1	21.500000
white	experimental	1	1	7.480000
yellow	control	1	2	9.265000
yellow brown	control	1	2	15.800000
yellow brown	experimental	0	1	9.340000
yellow white	control	0	1	13.060000
yellow white	control	1	1	15.500000
yellow white	experimental	0	3	20.200000
yellow white	experimental	1	1	11.000000

This is a fairly simple calculation, but it serves to demonstrate the “manual check” method. Filter out just the incorrect, experimental results for apple flavours. Check that the number of rows and average match.

jelly %>% filter(flavour == "apple",
                 accuracy_0_incorrect_1_correct == 0,
                 group_choose_from_drop_down == "experimental") %>% 
  kable() %>% kable_styling(full_width = FALSE)

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec
FACP	experimental	1	apple	14.30
VP	experimental	1	apple	21.35
NA	experimental	2	apple	9.99
Tw	experimental	1	apple	13.00

Looks good!

25.7 Use fake data, where you already know the answer

This example will be a bit contrived, but its a good example of the idea. I’ll create some simple data and then compute the averages using exactly the same code I wrote above. I’ll put in a NA in one row just to see what happens.

my_jelly <- tribble(
  ~group_choose_from_drop_down, ~ flavour, ~ reaction_time_in_sec, ~ accuracy_0_incorrect_1_correct,
  "experimental", "lime", 1, 1,
  "experimental", "lime", 2, 1,
  "experimental", "lemon", 3, 1,
  "control", "lemon", 4, 1,
)
my_jelly %>% kable() %>% kable_styling(full_width = FALSE)

group_choose_from_drop_down	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct
experimental	lime	1	1
experimental	lime	2	1
experimental	lemon	3	1
control	lemon	4	1

I know exactly what I’m expecting. Stop and decide for yourself before reading on.

my_jelly %>% group_by(flavour, group_choose_from_drop_down, accuracy_0_incorrect_1_correct) %>%
  summarize(count = n(),
            mean_reaction_time = mean(reaction_time_in_sec),
            .groups = "drop") %>% 
  kable() %>% kable_styling(full_width = FALSE)

flavour	group_choose_from_drop_down	accuracy_0_incorrect_1_correct	count	mean_reaction_time
lemon	control	1	1	4.0
lemon	experimental	1	1	3.0
lime	experimental	1	2	1.5

25.8 Do you calculations two different ways

Suppose we weren’t really confident that we knew what mean does? Even simple functions like sd (for standard deviation) and median might not do what you think they should. How can we check it? Let’s write our own function based on sum, length, and division.

my_mean = function(x) sum(x) / length(x)
my_jelly %>% group_by(flavour, 
                      group_choose_from_drop_down, 
                      accuracy_0_incorrect_1_correct) %>%
  summarize(count = length(flavour),
            mean_reaction_time = my_mean(reaction_time_in_sec),
            .groups = "drop") %>%
  kable() %>% kable_styling(full_width = FALSE)

flavour	group_choose_from_drop_down	accuracy_0_incorrect_1_correct	count	mean_reaction_time
lemon	control	1	1	4.0
lemon	experimental	1	1	3.0
lime	experimental	1	2	1.5

Excellent! We got the same answers.

That’s an introduction to testing calculations. These are good checks to make as you calculate and test your analysis. The next step is to write computer code that checks your answers as well, by giving sample data to your calculation code along with the known answers. There is a package (of course!) for helping with that: testthat and tinytest (links below). We won’t be using these in the course, but you should take time to learn about them after the course.

25.9 At the end of your analysis

Provide a relatively simple dataset together with analysis results that can be used to verify your code is working the same way in the future, or that someone who develops a new way of analysing the data can use for comparison
Document any testing processes you used for your data and for your computer code so that later users will know what problems you looked for
Document any problems you found in the data and what steps you took to fix the problems
Describe any weaknesses in your method which anticipate might cause problems for future users

25.10 Lessons for this course

We are not integrating testing into our work in this course. This lesson is here to alert you to this problem, ensure you know many people think carefully about this, and to show you the first steps to developing a good quality assurance plan. Be aware that testing takes time – possibly as much time as “the rest” of the work you do for data analysis.

25.11 Further reading

Data cleaning overview
pointblank documentation and examples
dlookr documentation and examples
dataMaid documentation for data cleaning
For testing R code, look at tinytest and testthat

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct	n_na
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct	n_na
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6

initials	group_choose_from_drop_down	trial_number	flavour	reaction_time_in_sec	accuracy_0_incorrect_1_correct	n_na
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6
NA	NA	NA	NA	NA	NA	6