class: center, middle, inverse, title-slide

# Data Visualization
## Checking your work
### Andrew Irwin, a.irwin@dal.ca
### Math & Stats, Dalhousie University
### 2021-03-05 (updated: 2021-03-09)

---
class: middle

# Plan

* Why is testing important?
* Testing data
* Testing code
* Examples
* Application in this course

---
class: middle

### Data error

<img src="../static/error-excel-covid.png" width="1612" />

---
class: middle

### Computing error

<img src="../static/testing-error-austerity.png" width="2576" />

---
class: middle

### Why is testing important?

* Visualizations are powerful and help people draw conclusions
* Data errors and misunderstandings can corrupt your work
* Mistakes in analysis (summarizing, grouping, calculations) can lead to wrong conclusions
* Checking your work, both manually and automatically, improves confidence in the results
* If you return to a project later, or give it to someone else, misunderstandings can lead to misinterpretation

---
class: middle

### Testing data

Things to check:

* Were the data read correctly by R (numbers, text, dates, missing values)?
* Are the expected numbers of rows and columns present? Is there a way to check?
* Are any values impossible? (Negative counts or lengths.)
* Changes in units? (Lots of "outliers".)
* Human-coded numbers (commas, spaces, scales like M, K for millions and thousands)
* Spelling errors, abbreviations, or variants? Capitalization. Extra spaces.
* Duplicated data?
* Date formatting. Times and time zones.

---
class: middle

### How to test data?

The most powerful and easiest-to-use techniques:

* Summary tables: counts, means, ranges
* Simple visualizations: histograms, boxplots, scatter plots

Check the "obvious" things.

---
class: middle

### Testing calculations

* Test your code on sample or simulated data
* Perform a part of the calculation by hand to independently check a result
* Positive and negative controls (just like experiments)
* Provide a test dataset and correct report for future users to check

---
class: middle

### Example: Jelly bean data

Key variables: treatment, flavour, reaction time, accuracy.
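
Before counting categories, it can help to confirm that the data were read as expected. The sketch below shows one possible way to run a few of the checks from the "Testing data" slide. The data frame `jelly_demo`, its column names, and its values are invented for illustration and are not the course dataset; treating accuracy as a proportion between 0 and 1 is also an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the jelly bean data (invented values).
jelly_demo <- data.frame(
  treatment     = c("control", "Control", "experimental", NA, "control"),
  flavour       = c("cherry", "Cherry ", "lemon", "lime", "cherry"),
  reaction_time = c(1.2, 0.9, -0.4, 1.5, 1.2),
  accuracy      = c(0.8, 0.9, 0.7, 1.1, 0.8)
)

# Were the columns read with the expected types (text vs. numbers)?
str(jelly_demo)

# Are the expected numbers of rows and columns present?
stopifnot(nrow(jelly_demo) == 5, ncol(jelly_demo) == 4)

# Are any values impossible? Times cannot be negative; accuracy is
# assumed here to be a proportion between 0 and 1.
impossible <- jelly_demo %>%
  filter(reaction_time < 0 | accuracy < 0 | accuracy > 1)

# How many values are missing in each column?
missing_counts <- jelly_demo %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Are any rows exact duplicates of an earlier row?
duplicate_rows <- jelly_demo %>%
  filter(duplicated(jelly_demo))
```

Looking at `impossible`, `missing_counts`, and `duplicate_rows` gives a short list of rows to investigate before any summary or plot is made.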
```r
jelly %>%
  count(treatment) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> treatment </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 112 </td> </tr>
<tr> <td style="text-align:left;"> Control </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> experimental </td> <td style="text-align:right;"> 114 </td> </tr>
<tr> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> 1 </td> </tr>
</tbody>
</table>

---
class: middle

### Example: Jelly bean data

```r
jelly %>%
  count(flavour) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> flavour </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> apple </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> Apple </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> banana </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> Banana </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> blueberry </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> bubblegum </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> Bubblegum </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> cherry </td> <td style="text-align:right;"> 19 </td> </tr>
<tr> <td style="text-align:left;"> Cherry </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> cinnamon </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td
style="text-align:left;"> cocnut </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> coconut </td> <td style="text-align:right;"> 14 </td> </tr>
<tr> <td style="text-align:left;"> Coconut </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> Coffee </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> grape </td> <td style="text-align:right;"> 18 </td> </tr>
<tr> <td style="text-align:left;"> Grape </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> lemon </td> <td style="text-align:right;"> 23 </td> </tr>
<tr> <td style="text-align:left;"> Lemon </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> lemon/lime </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> licorice </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> lime </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> Lime </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> marshmallow </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> orange </td> <td style="text-align:right;"> 32 </td> </tr>
<tr> <td style="text-align:left;"> Orange </td> <td style="text-align:right;"> 20 </td> </tr>
<tr> <td style="text-align:left;"> Pinapple </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Plum </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> purple </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> Purple </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> raspberry </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> red </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> Red </td>
<td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> strawberry </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> watermelon </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Watermelon </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> white </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow + white </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> yellow and brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow and white </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> Yellow Brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> Yellow White </td> <td style="text-align:right;"> 3 </td> </tr>
</tbody>
</table>

---
class: middle

### Example: Jelly bean data

```r
jelly %>%
  count(tolower(flavour)) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> tolower(flavour) </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> apple </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> banana </td> <td style="text-align:right;"> 16 </td> </tr>
<tr> <td style="text-align:left;"> blueberry </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> bubblegum </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> cherry </td> <td style="text-align:right;"> 29 </td> </tr>
<tr> <td
style="text-align:left;"> cinnamon </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> cocnut </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> coconut </td> <td style="text-align:right;"> 19 </td> </tr>
<tr> <td style="text-align:left;"> coffee </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> grape </td> <td style="text-align:right;"> 27 </td> </tr>
<tr> <td style="text-align:left;"> lemon </td> <td style="text-align:right;"> 32 </td> </tr>
<tr> <td style="text-align:left;"> lemon/lime </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> licorice </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> lime </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> marshmallow </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> orange </td> <td style="text-align:right;"> 52 </td> </tr>
<tr> <td style="text-align:left;"> pinapple </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> plum </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> purple </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> raspberry </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> red </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> strawberry </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> watermelon </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> white </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> yellow </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow + white </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td
style="text-align:left;"> yellow and brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow and white </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow white </td> <td style="text-align:right;"> 3 </td> </tr>
</tbody>
</table>

---
class: middle

### Example: Sorting

```r
jelly %>%
  count(tolower(flavour)) %>%
  arrange(n) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> tolower(flavour) </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> cinnamon </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> cocnut </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> coffee </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> lemon/lime </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> licorice </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> marshmallow </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> pinapple </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> plum </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> raspberry </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> yellow + white </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> blueberry </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> watermelon </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow </td> <td
style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow and brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow and white </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> yellow brown </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> bubblegum </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> red </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> strawberry </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> yellow white </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> apple </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> white </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> lime </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> purple </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> banana </td> <td style="text-align:right;"> 16 </td> </tr>
<tr> <td style="text-align:left;"> coconut </td> <td style="text-align:right;"> 19 </td> </tr>
<tr> <td style="text-align:left;"> grape </td> <td style="text-align:right;"> 27 </td> </tr>
<tr> <td style="text-align:left;"> cherry </td> <td style="text-align:right;"> 29 </td> </tr>
<tr> <td style="text-align:left;"> lemon </td> <td style="text-align:right;"> 32 </td> </tr>
<tr> <td style="text-align:left;"> orange </td> <td style="text-align:right;"> 52 </td> </tr>
</tbody>
</table>

---
class: middle

### Summary: testing in this course

* We are not routinely testing data or code in this course
* You should be aware of the way errors get into analyses and that there are methods for guarding against them
* You should do some checking of the data in your term project, but it's not an assigned part of the work

---
class: middle

# Further reading

* Course notes

---
class: middle, inverse

## Task

* No task for this lesson
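
---
class: middle

### Example: testing a calculation

The advice on the "Testing calculations" slide can be sketched with simulated data whose answer is known in advance. This is a minimal illustration and not part of the course materials: the data frame `sim` and its values are invented, and the grouped mean stands in for whatever calculation you want to check.

```r
library(dplyr)

# Simulated data with a known answer: every "control" row is 1 and
# every "experimental" row is 3 (invented values).
sim <- data.frame(
  treatment     = rep(c("control", "experimental"), each = 10),
  reaction_time = rep(c(1, 3), each = 10)
)

# The calculation under test: mean reaction time per treatment.
group_means <- sim %>%
  group_by(treatment) %>%
  summarize(mean_time = mean(reaction_time), n = n())

# Independent check "by hand" using base R instead of the pipeline.
by_hand <- mean(sim$reaction_time[sim$treatment == "control"])

stopifnot(
  isTRUE(all.equal(group_means$mean_time, c(1, 3))),   # known answer
  isTRUE(all.equal(group_means$mean_time[1], by_hand)), # matches hand check
  sum(group_means$n) == nrow(sim)                       # no rows lost
)
```

If any of the conditions fails, `stopifnot()` stops with an error, so the check can sit at the top of an analysis script and run every time.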