Testing

We test code in order to address risk, and assessing that risk is your job, one that requires expert understanding of the research task. This section looks at how to find the risky parts of research applications and how to mitigate that risk.

Reading code for risk

The section on Organizing Code suggests that we arrange functions into functional classes:

  • Parsing and validation of arguments and settings files
  • Input and output
  • Data movement to transform data structures
  • Computational algorithms and modeling

Each of these functional classes is prone to a different kind of error. Code can fail to function in a few ways:

  • Syntax errors, where the language doesn’t do what you thought it did.

  • Logical errors, such as off-by-one errors, failure to check for NULL, or control flow that misses a condition.

  • Data errors, such as failure to validate inputs or reading the wrong column.

  • Mathematical errors, such as mis-grouped terms, the wrong choice of function calls, or inputs outside a function’s domain.

We can focus on mathematical errors in the computational part of the code and on data problems in the input and output sections.

We know from public health that risk is an interaction among the probability of an event, the severity of an event, and our ability to mitigate adverse outcomes. For code, we can understand this as fault, error, and failure.

A fault is the mistyped part of the code, the letters on the screen that are wrong. An error is the event in which the running code encounters that fault and computes a value that will affect the result. If that error raises an exception, or is caught by validation code, it becomes a failure. This leads to four kinds of risk, shown in the following table.

              not fail    fail
  ----------- ----------- ---------------
  faultless   great       annoying
  fault       terrible    fine, I guess

It’s the fault without the failure that causes the most problems. Sometimes it’s obvious:

# No failure here, but a fault: missing values silently become zeros,
# which shifts every downstream mean and sum.
df <- data.frame(vals = c(3, 7, NA, 9, 11))
df[is.na(df)] <- 0

Most times, it’s harder to spot than this.

There are tools to measure complexity in code, and they can highlight places where the code is hard to reason about. The most popular, SonarQube, still doesn’t appear to support R. By eye, you would look for two properties of code:

  1. Cyclomatic complexity - This is the number of paths through code, which corresponds to the number of if-then branches. Beware that each selection of indices in a dataframe is another branch in the code. In R, you can measure it with the cyclocomp package, as in the sketch after this list.

  2. Halstead complexity - This is the number of mathematical operations (addition, multiplication) and calls to math functions like sin. In other words, it’s where your math and models lie, but you were staring at that code anyway.
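A minimal sketch of the first check, assuming the cyclocomp package from CRAN is installed:

library(cyclocomp)

branchy <- function(x) {
  if (is.na(x)) {
    return(NA_real_)
  }
  if (x < 0) -x else x
}
cyclocomp(branchy)  # more paths through the code means more cases to test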

Addressing risk

There are three ways to find a bug: reading code, running code, or testing code. The pull request review process is about maximizing the effectiveness of readers. This section is about how to maximize the effectiveness of tests.

Better tests

We think of tests as higher or lower, where a low test examines a function that doesn’t call any other functions in your code. The highest-level test would simulate input to the research application, run the code, and check output, so it is end-to-end in its exercise of the code.

The term unit test refers to testing of small, or low, units of code. For R packages, it’s a misnomer, because testthat is called a unit-testing framework, but you can put any kind of test into it, whether the test exercises the whole application or a small part of it.

The second concept for unit testing, after the level of a test, is information. You can write a test that merely checks that there is a result.

test_that("the function runs", {
  assert(!is.null(safetyguide:::internalfunction(37)))
})

Another test could check a result in much more detail.

test_that("the function runs", {
  result <- safetyguide:::internalfunction(37)
  assert(result[["x"]] == 3)
  assert(result[["y"]] == 7)
})

The second test checks more about the code. The implementation of that function could change a lot and the first check would still pass, but the second one depends, in more detail, on the function working properly.

We can now discuss two regimes of testing that are, in some sense, best.

  1. Tests at the highest level of the code, which check lots of information, are very strong. This is what we, as researchers, do when we run code in a notebook to make graphs and charts. The trouble with these tests is that we generally can’t try all the possible inputs. Not even close. Even if every input we try works, we’ve hardly checked a hundredth of the inputs that will happen when we apply a function to a large dataset.

  2. There are functions in the code where we can check most of the relevant inputs to ensure they all work. Maybe we made a core mathematical function, such as our own implementation of the Langevin equation, or a custom regression. We can work through its corner cases pretty well, as sketched below.
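A sketch of this second regime, using a small hand-rolled inverse-logit as a stand-in for a core mathematical function:

library(testthat)

expit <- function(x) 1 / (1 + exp(-x))

test_that("expit behaves across its whole domain", {
  expect_equal(expit(0), 0.5)
  expect_equal(expit(Inf), 1)
  expect_equal(expit(-Inf), 0)
  # Monotone over a wide range, even where the function saturates.
  expect_true(all(diff(expit(seq(-40, 40, by = 0.5))) >= 0))
})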

There are other tests we might use during development, but these very high tests and these crucial intermediate tests stand out as powerful places to test research code.

Floating-point code

Sometimes it matters that the numbers in R’s code aren’t numbers on the real number line. I ran into this for life tables, but the general subject is the numerical analysis of IEEE floating point.

  • Finite precision in an infinite world, by Cameron and Chartier, Math Horizons, 2019.

  • Overton, Numerical Computing with IEEE Floating Point Arithmetic, is a short book available through the UW library, and it gives a basic how-to for getting through floating-point calculations when things look incorrect.

  • Higham’s Accuracy and Stability of Numerical Algorithms is a tome on the subject.
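The classic symptom is easy to reproduce at the R prompt:

0.1 + 0.2 == 0.3                    # FALSE, because of binary rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE, comparison with a tolerance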

Comparison of results

As researchers, most of how we understand the correctness of our code is by comparison with some other result that teaches us what to expect. There are many ways we compare code:

  • to a published dataset
  • to a published table
  • to a published graph
  • to the same code’s previous result (regression testing)
  • to a less-exact calculation
  • to samples of a more-detailed simulation
  • to a less efficient version

These become unit tests when you put the data into the testthat directory and choose an L1, L2, or L-infinity norm with which to compare the two results.
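A minimal sketch of such a regression test, assuming a reference result saved in the testthat directory and a hypothetical compute_rates() under test:

library(testthat)

test_that("output matches the archived run", {
  reference <- readRDS(test_path("testdata", "reference_rates.rds"))
  current <- compute_rates(test_path("testdata", "inputs.csv"))  # hypothetical
  # L-infinity norm: the largest absolute disagreement must stay small.
  expect_lt(max(abs(current - reference)), 1e-8)
})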

Comparison with theory

It gives me confidence in code when I can compare its output with an exact result. This may only work for particular settings, or for a reduced form of the problem, but it means that, in at least one instance, it was absolutely correct.

For example, we forecast future populations using Leslie matrices. If I synthesize a population with a constant growth rate, then the Leslie matrix, constructed from that population, should have a lead eigenvalue with that same growth rate. That’s a strong check.
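A minimal sketch of that check for a two-age-class Leslie matrix, whose dominant eigenvalue has a closed form:

library(testthat)

test_that("lead eigenvalue matches the closed-form growth rate", {
  f <- c(1.0, 1.5)  # age-specific fertilities, the first row
  s <- 0.5          # survival into the second age class, the subdiagonal
  L <- matrix(c(f[1], f[2],
                s,    0),
              nrow = 2, byrow = TRUE)
  # Characteristic equation: lambda^2 - f[1]*lambda - f[2]*s = 0.
  expected <- (f[1] + sqrt(f[1]^2 + 4 * f[2] * s)) / 2
  expect_equal(Re(eigen(L)$values[1]), expected, tolerance = 1e-12)
})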

But we discussed above the idea that some tests have more information than others. There is a closed-form solution for a population that oscillates in size. This Leslie matrix will have a second eigenvalue that describes that oscillation. If my simulation matches that solution, I have much more confidence that there aren’t errors.

Testing stochastic functions

If you need to test a function that returns random values, use your statistical knowledge to help with your software task.

Because unit tests aren’t for publication, there is no reason not to favor robust statistics. Go ahead and use a trimmed mean and trimmed variance to evaluate a custom distribution. These will be more stable against outliers. Nobody wants to have to go back and redo a quantile plot, once code is in use.

Before calling functions in a unit test, pin the random seed to a single value with set.seed(9273424). Otherwise, a unit test will sometimes fail after it has passed, and that kind of spurious signal teaches people to ignore failures. If you want testing to be more thorough, then, when you modify a function’s source code, run its stochastic tests with many more draws. Let them run an hour, if you want, before reducing the draws so that unit tests take a reasonable time.
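A minimal sketch combining both suggestions, with base R’s rnorm standing in for a custom generator:

library(testthat)

test_that("draws center on the requested mean", {
  set.seed(9273424)  # pin the seed so the test is deterministic
  draws <- rnorm(10000, mean = 3, sd = 2)
  # The trimmed mean is robust to a handful of extreme draws.
  expect_equal(mean(draws, trim = 0.05), 3, tolerance = 0.05)
})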

Symmetry of results

I can explain this with an example from demography. The variables used for life tables have units, and they appear paired as unitless quantities. For example, \(n\) and \({}_nm_x\) appear together as the product \(n\,{}_nm_x\), and the mean age at death, \(a\), appears over the interval width, as \(a/n\). Therefore, functions that act on these quantities should scale such that, if you double \(n\), you can double \(a\) and halve \(m\), and get the same answer.

That’s an example of a symmetry in a function. We don’t always think about symmetries when we write a function, and it’s unlikely that a function with a mathematical error will still obey such a symmetry.
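A minimal sketch of such a symmetry test, using the standard conversion from a central death rate to a probability of death:

library(testthat)

# q = n * m / (1 + (n - a) * m); both n*m and (n - a)*m are unitless.
mx_to_qx <- function(n, a, m) n * m / (1 + (n - a) * m)

test_that("qx is invariant when the age interval is rescaled", {
  # Double n and a, halve m: the result should not change.
  expect_equal(mx_to_qx(5, 2.5, 0.01), mx_to_qx(10, 5.0, 0.005))
})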

The software testing literature calls this metamorphic testing. We would respect the idea even without the obscurantist language.