Reproducibility
The reproducibility crisis is seen by some scientists, especially in the physical sciences, as more of a biology and psychology problem: small sample sizes producing conclusions that cannot be replicated in larger studies or by repeating the same experiment.
Physics is, at its very heart, reproducible. The laws of physics don’t (as far as we know) change from moment to moment (time translation invariance), a concept baked into the Standard Model of particle physics. However, as physicists we cannot simulate these things perfectly, and the assumptions and simplifications we make can sometimes cause us problems.
When was the last time you ran a piece of software you wrote twice on the same data, and compared the outputs? Never? I don’t want to alarm you, but it may be something you want to consider.
From the perspective of high performance computing, it can be daunting to check the reproducibility of large simulations or blocks of data analysis. Computing time is expensive in time, in money, and (on facilities that draw from non-renewable energy sources) in environmental impact. However, checking reproducibility is an important step in determining whether our programs are working correctly. Even small variations in computer programs that should be perfectly reproducible can affect the science.
It’s up to individuals to determine what level of variation is acceptable without affecting the science (e.g. floating point rounding accounts for relative differences of around 1 part in 10^7 at single precision). But how do you go about checking your work is reproducible on large scales?
Establishing a suite of tests that check outputs for consistency
Establishing how to test computer programs is as time-consuming as writing them. Some collaborations enforce external review of all science code and will have standardised tests, but if you are not in one of these collaborations, how do you do this? Some suggestions:
Establish a test data set. It should ideally be a subset of the data you actually want to run on. You can also feed in data that is ‘ideal’ rather than ‘realistic’ (e.g. if your program is designed to handle coloured noise with the odd non-stationary feature, feed in Gaussian white noise). You also want a test data set that touches all areas of the codebase - if you have a lot of conditional code, you need to define a test set for every condition.
What are the major data products going into your program? These should be checked for consistency (i.e. ensure the data looks the same going in; how you do this is up to you).
What are the data products that come out? If it is a science product, how do you determine that the outputs are the same when you do the same thing twice? Ideally, if you can show this as some sort of plot rather than just diff-ing files, it can help diagnose issues (a minimal sketch of such a check follows this list).
What are the intermediate products? Your program is probably going through several stages - identify tests that can be switched on and off to dump intermediate information and check that it is the same for each run with the same inputs*
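To make this concrete, here is a minimal sketch of what a ‘run it twice and compare’ check could look like. `run_pipeline` is a hypothetical stand-in for your own analysis code, and the tolerance is just an example - substitute your own acceptability criterion:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for your analysis code: takes a time series,
# returns some science product (replace with your own pipeline).
def run_pipeline(data):
    return np.abs(np.fft.rfft(data))

# Idealised test data: Gaussian white noise with a fixed seed, so the
# *input* is guaranteed identical between runs.
rng = np.random.default_rng(12345)
test_data = rng.standard_normal(2**16)

# Run the same thing twice on the same input.
out_a = run_pipeline(test_data)
out_b = run_pipeline(test_data)

# In this trivial example the two runs are bitwise identical; a real pipeline
# (GPUs, multi-threading, FFT planning) may not be, which is the point of the check.
print("max relative difference:", np.max(np.abs(out_a - out_b) / np.abs(out_a)))
np.testing.assert_allclose(out_a, out_b, rtol=1e-7)

# Plotting the residual is often more informative than diff-ing files.
plt.plot(out_a - out_b)
plt.xlabel("frequency bin")
plt.ylabel("run A - run B")
plt.savefig("reproducibility_residual.png")
```

The residual plot is often far more useful than a bare pass/fail message when something does start to drift.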
Implementing the suite of tests
Tests don’t work if you don’t implement them. Actually checking reproducibility (for example, establish a ‘base’ git branch, then compare non-science changes to your code, like optimisations, back to it) should be a fundamental part of the development process.
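One way to wire this into the development workflow is a regression test that compares the current output against a stored reference produced by the ‘base’ branch. A rough sketch using pytest conventions - `my_pipeline`, `load_test_data` and `base_output.npy` are all hypothetical placeholders for your own code and reference data:

```python
# test_reproducibility.py -- a sketch of a regression test you might run with pytest.
# Assumes you have saved a reference output ("base_output.npy") produced by the
# trusted 'base' branch on the agreed test data set.
import numpy as np

from my_pipeline import run_pipeline    # hypothetical: your own analysis code
from my_pipeline import load_test_data  # hypothetical: loads the fixed test set

def test_output_matches_base_branch():
    data = load_test_data()
    current = run_pipeline(data)
    reference = np.load("base_output.npy")
    # The tolerance is your acceptability criterion; tighten or loosen it deliberately.
    np.testing.assert_allclose(current, reference, rtol=1e-7)
```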
Interrogating issues that arise from testing
A good suite of tests will produce products (plots, reports) that can help you diagnose issues. You can even use issues that surface to define new and improved tests.
Consider simplifying things if human error is a problem
If you find you need to make lots of little tweaks here and there to get something to be perfectly reproducible, all of these need to be noted down, especially if you plan on publicly releasing your code. These tweaks should not be hidden away. Ensure that the version of the code and its initialisation that was actually used to obtain the result is shared, not just an idealised earlier version from before you made a ‘minor tweak’ to perfect something - especially if you plan on saying the result can be reproduced.
If you are releasing things publicly, also note down the hardware that was used, and how long the code took to actually execute whatever it does.
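A minimal sketch of recording that information automatically alongside the results, using only the Python standard library (the fields are just suggestions):

```python
import json
import platform
import subprocess
import time

start = time.time()
# ... run your analysis here ...
elapsed = time.time() - start

# Record exactly what ran, where, and for how long, next to the results.
provenance = {
    "git_commit": subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip(),
    "hostname": platform.node(),
    "machine": platform.machine(),
    "processor": platform.processor(),
    "python_version": platform.python_version(),
    "wall_time_seconds": elapsed,
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```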
POSSIBLY CONTROVERSIAL OPINION: Do not include examples on your GitHub/GitLab that are not easily tractable unless you give disclaimers in the documentation about run times and hardware requirements. No 1-month-to-run example scripts without a warning that they take a long time, especially if they are mingled in with an example that completes within ~10 minutes. I have encountered more than one piece of software that does this.
POSSIBLY CONTROVERSIAL OPINION: If it’s released on GitHub, ensure you’ve actually tested all your use cases, not just the mode that you use all the time. Untested code should not be released without a warning, and if you get a bug report, the answer to that report is not to ignore it just because you never use the software in that configuration.
Not so controversial: stop pushing untested changes to master to ‘tidy up’. You’re making a mess for the next person.
Once you’ve identified an issue, it needs to be resolved. Do not cross your fingers and hope it will go away. Even if it cannot be resolved (floating point precision etc.), it needs to be adequately understood so that it doesn’t cause a knock-on effect anywhere. Reproducibility (within acceptability criteria you define, and these should not be overstretched) is a necessary condition for your code to be producing the ‘correct’ result (but NOT a sufficient one! You should also compare your results to what is expected. Just because it reproduces doesn’t mean it is right; and just because it’s ‘right’ doesn’t mean it reproduces).
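As a toy illustration of the necessary-versus-sufficient distinction (the numbers here are made up):

```python
import numpy as np

# Two runs of the same code on the same inputs (hypothetical values).
run_1 = np.array([0.4999999, 2.0000001])
run_2 = np.array([0.5000000, 2.0000000])

# Necessary: the two runs agree within your acceptability criterion.
np.testing.assert_allclose(run_1, run_2, rtol=1e-6)

# Not sufficient: also check against what the answer *should* be
# (an analytic result, an independent implementation, injected signals, ...).
expected = np.array([0.5, 2.0])
np.testing.assert_allclose(run_1, expected, rtol=1e-6)
```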
I’ve amassed the following list over the past few months. It may be extended in the future as I find other issues in my own work.
Is there a variation in the raw data or pre-processing?
Sounds stupid, but the most obvious sanity check is to diff the information going into your computation. Human error is real. Is a random seed being initialised differently? Did a flag get set differently? Did you accidentally modify something you shouldn’t have? (A sketch of one way to check this follows below.)
Are you assuming something about your data that isn’t true? For example, are you assuming that you have stationary, Gaussian coloured noise when there is actually a large non-stationary feature in the data? If you are trying to whiten data with non-stationary features, it can cause problems down the line (see point 3).
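One low-effort way to do the input check is to hash every input and store the hashes with each run, so two runs can be compared without eyeballing large files. A minimal sketch - the file names are hypothetical:

```python
import hashlib
import json

def sha256_of(path, chunk_size=2**20):
    """Hash a file so inputs can be compared between runs without eyeballing them."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical inputs: raw data and the configuration that was actually used.
inputs = {
    "raw_data.hdf5": sha256_of("raw_data.hdf5"),
    "config.ini": sha256_of("config.ini"),
}
print(json.dumps(inputs, indent=2))
# Save this alongside each run's outputs; if two runs disagree, diff these first.
```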
Is the source of the variation in your algorithm?
Is there a race condition somewhere? Are there multiple asynchronous tasks that sometimes finish at slightly different times on different runs? Example: your program dumps data to a file every ~3 minutes, and that file is read in by other processes every ~30 minutes. Unless you carefully sync everything up, the file will sometimes be written before and sometimes after the reading process runs if you dump files by wall time rather than by, say, the number of samples analysed (though dumping by wall time is necessary for a lot of real-time applications).
Have you introduced some sort of randomness somewhere, e.g. a random seed that initialises differently on every run? If everything converges to the same answer at the end, you’re probably OK - but if you’re reading this section, it probably doesn’t.
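If randomness is deliberate, pin it down: seed every generator explicitly and record the seed with the outputs. A minimal Python sketch of the idea:

```python
import random
import numpy as np

SEED = 20240101  # record this with your outputs; don't leave it implicit

# Seed every source of randomness your code (and its dependencies) might touch.
random.seed(SEED)
rng = np.random.default_rng(SEED)   # preferred: pass `rng` around explicitly
np.random.seed(SEED)                # for libraries still using the legacy global state

noise = rng.standard_normal(1024)   # identical on every run with the same SEED
```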
Is the source of variation in another library you are using?
Machine precision is the most obvious culprit. GPUs/CUDA are known to have problems, and if you cut down to single precision, you’ll have more issues. You can tell if this is the core problem, or part of it, by looking at the level of variation: differences on the order of 1 in 10^7 (for single precision) are pretty consistent with machine precision. While not bitwise reproducible, this should not break your science (a short illustration follows this list).
FFTW has planner modes that do not yield the same output on every run, with differences at around floating point precision. FFTW_PATIENT and FFTW_MEASURE choose a plan based on measured performance, so if you run the same thing twice and on one day the computer is busier, the result will differ slightly, but for most applications this should not break the science. For FFTW3 to be bitwise reproducible, you need to either use FFTW_ESTIMATE or set up a wisdom file that pre-computes how to slice up the Fourier transform, specific to the hardware you’re using. FFTW_ESTIMATE seems to be the best middle ground if you run on different hardware, though (a rough sketch via the pyfftw Python bindings follows this list).
CUDA libraries sometimes cause issues beyond just the floating point error associated with the GPU itself - consider switching out atomicAdd.
If it happened one time, and you’re using GPUs, it was probably a cosmic ray
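To get a feel for the size of machine-precision effects, here is a toy illustration: summing the same single-precision numbers in a different order, and comparing against a double-precision accumulation.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000, dtype=np.float32)   # one million single-precision numbers

forward = np.sum(x)                      # single-precision sum in stored order
reverse = np.sum(x[::-1])                # same numbers, summed in reverse order
reference = np.sum(x, dtype=np.float64)  # accumulate in double precision instead

print("forward vs reverse:", abs(forward - reverse) / reference)
print("single vs double:  ", abs(forward - reference) / reference)
# Differences at roughly this level (single precision machine epsilon is about
# 1.2e-7) are consistent with rounding, not with a bug in your code.
```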
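And if you happen to drive FFTW from Python, the pyfftw bindings expose the planner flags and wisdom mentioned above. A rough sketch, assuming pyfftw is installed and that these are the relevant knobs for your own FFTW usage:

```python
import numpy as np
import pyfftw

data = np.random.default_rng(0).standard_normal(2**20)

# FFTW_ESTIMATE picks a plan heuristically, without timing candidate plans,
# so the chosen plan (and hence the rounding pattern) does not depend on how
# busy the machine happens to be.
fft_estimate = pyfftw.builders.rfft(data, planner_effort='FFTW_ESTIMATE')
spectrum = fft_estimate()

# Alternatively, plan once with a more aggressive planner, then save the
# wisdom and re-import it on later runs on the same hardware.
fft_measured = pyfftw.builders.rfft(data, planner_effort='FFTW_MEASURE')
_ = fft_measured()
wisdom = pyfftw.export_wisdom()   # tuple of byte strings; write these to disk

pyfftw.import_wisdom(wisdom)      # on subsequent runs, reuse the same plan
```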
Is the source of the variation your post processing and checking?
Check you have not inadvertently introduced some sort of issue when you post-process information. For example, are you reading in an output of your program and then doing something with it that involves a random seed that initialises differently each time?
While this isn’t my finest blog post ever, nor my most polished, I wanted to share some information. I imagine this list will actually grow and evolve over time. If you find something that should be added, I am happy to include it, with appropriate credit given to you (you can DM me on Twitter, @fipanther).
*A note on processes that are random by design (yes, if you’re using any popular inference software, this is directed at you): inference should still be reproducible. You should still get the same results - e.g. the same posterior - once your algorithm converges. More information about testing convergence can be found in this very informative blog post. If you are making any inference code public, and the intention is that it is reproducible, detailed information should also be provided so that anyone can get it to work. If it is only reproducible on a large compute cluster, that information should also be included - never assume the user knows this implicitly. Many people see public code and assume it will run out of the box on their own machine.
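To make the footnote concrete, here is a toy sketch of both senses of ‘reproducible’ for a stochastic sampler: with a fixed seed the chains are bitwise identical, and with different seeds the posterior summaries should agree within Monte Carlo error once the sampler has converged. This is a deliberately simple Metropolis-Hastings example in pure NumPy, not any particular inference package:

```python
import numpy as np

def sample_posterior(seed, n_steps=20_000):
    """Toy Metropolis-Hastings sampler for a standard normal 'posterior'."""
    rng = np.random.default_rng(seed)
    x, chain = 0.0, np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + rng.normal(scale=1.0)
        # accept with probability min(1, p(proposal)/p(x)) for p = N(0, 1)
        if np.log(rng.random()) < 0.5 * (x**2 - proposal**2):
            x = proposal
        chain[i] = x
    return chain[n_steps // 2:]   # discard the first half as burn-in

# Same seed -> bitwise identical chains (the strongest sense of reproducible).
assert np.array_equal(sample_posterior(1), sample_posterior(1))

# Different seeds -> the same posterior within Monte Carlo error once converged.
a, b = sample_posterior(1), sample_posterior(2)
print(a.mean(), b.mean())   # both close to 0
print(a.std(), b.std())     # both close to 1
```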