Reproducing intention in research programming

A research workflow can be invalid if it fails to implement the intention that motivated its construction. Here, I discuss the assessment of such failures, some methods for improving the likelihood of workflows that reproduce an intention, and approaches for dealing with mistakes.

I recently gave a seminar on “Reproducibility in research: Strategies for avoiding reproducibility pitfalls”. In the seminar, we focused on two key components of reproducibility: the capacity to reproduce a set of results from a fixed dataset and the capacity to write code that reproduces the intention of a computational process. Here, I would like to expand on the latter component—which I think could use more attention in the ongoing discussions around reproducibility.

This seminar was the first time that I used a web-based platform to prepare and deliver the material; I used reveal.js, rather than what I typically use (beamer). An advantage of using a web-based platform, rather than my typical PDF-based approach, is the ease of including multimedia content such as videos. This allowed me to have some fun preparing a title slide, shown above, around the video game Pitfall!—one of the first games I ever played. Note the presence of the ‘bug’ in the pit (the scorpion), and the accompanying—but potentially dangerous—snake (surely a Python).

We can think of a computational workflow for research as having three main components:

  1. Data. Immutable digital material.
  2. Process. A computational procedure that operates on the data.
  3. Results. Digital material that is created as a consequence of applying the process to the data.

There is increasing recognition of the importance of considering the computational reproducibility of such workflows—that is, can we run the workflow and produce the same results, now and into the future? For example, Rougier and Hinsen issued the Ten years reproducibility challenge as “an invitation for researchers to try to run the code they’ve created for a scientific publication that was published more than ten years ago” (also see Perkel, 2020).

I want to discuss here another aspect of reproducibility—whether the workflow reproduces the intention of its authors. A workflow is assembled with the aim of implementing some abstract intention, and it can be invalid and potentially misleading if it fails to reproduce the intention that motivated its construction. This failure of what I (somewhat awkwardly) refer to as “reproducibility of intention” can occur despite perfect computational reproducibility.

For example, let’s say that I have a relatively simple intention: to compute the sample standard deviation of an array of numbers. I might put together a workflow based on the Python programming language with the numpy package. Assuming the data is represented in a variable called data, the key step in the workflow process might use the np.std function:

data_std = np.std(a=data)

There is a problem, though—this workflow does not reproduce the intention that I had for its construction. This is because the np.std function computes the population standard deviation by default, rather than the sample standard deviation that I was intending to calculate.
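In this case, the fix is to use the ddof (‘delta degrees of freedom’) argument of np.std, which changes the divisor from N to N - ddof and so, with a value of 1, yields the sample standard deviation:

data_std = np.std(a=data, ddof=1)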

This sort of mistake terrifies me when I think about my own research workflows—it is easy to make and hard to detect. Considering that a given workflow will likely consist of anywhere from tens to tens of thousands of such lines of code, there is plenty of potential for such mistakes to silently corrupt the integrity of the results.

Here, we will consider methods for assessing whether a workflow reproduces its intention, strategies for improving the likelihood that it does, and approaches for dealing with the mistakes that will inevitably occur.

Methods for assessment

Failures of a workflow to reproduce an intention can be difficult to identify. Here, I will go over a few heuristics that can help in assessing whether such failures have occurred.

Note that I unfortunately do not have any formal evidence of the broad utility of these methods—just that I have found them to be useful. Additionally, I focus here on methods that I consider practical for a typical research scenario; additional methods that would undoubtedly be useful but that I think are more feasible for the creation of software libraries, such as test-driven development, are not covered.

Alternative workflows

It can be informative to replicate a workflow (or, even better, have someone else replicate it) using a different approach and then assess whether the same results are produced. If the cause of a mistake is specific to the approach taken in the original workflow, this comparison can reveal the failure. For example, I often find it useful to informally re-run an analysis that was implemented in a script-based workflow using a GUI such as JASP instead (where appropriate). Even within the same general approach, it can be useful to compare a few different implementations. For example, checking the output of np.std against the output of the Python built-in statistics.stdev would reveal a discrepancy (with the latter calculating the sample standard deviation).
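As a minimal sketch of such a cross-check (using a small made-up array), the two implementations can be run on the same data and their outputs compared:

import statistics
import numpy as np

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example values

print(np.std(data))            # population standard deviation: 2.0
print(statistics.stdev(data))  # sample standard deviation: approximately 2.14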

Expectations of others

Having one or more other people, whether they be members of the research team or external reviewers, have a careful read through the code is one of the best ways to identify problems. Even just knowing that someone else will be reading through the code can be enough impetus for the author to check the extra assumption or to undertake the additional introspection that allows a mistake to be identified. I am a strong believer in the usefulness of including code review as part of a research workflow—I don’t think we read each other’s code nearly enough, in general.

Known constraints and consequences

We can sometimes identify steps within our workflow process where values must satisfy particular known constraints or consequences if the workflow is consistent with our intention. For example, a probability needs to be between zero and one, a standard deviation parameter needs to be positive, a categorical factor can only take one of a limited set of potential values, a set of coefficients must sum to zero, etc. Even if we don’t have such hard constraints, we often have a vague idea of the expected scale or type of representation—which we can use as a coarse check.
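For example, a minimal sketch of such checks, using hypothetical variable names and values, might look like:

import numpy as np

probs = np.array([0.2, 0.5, 0.3])     # hypothetical probabilities
coefs = np.array([-1.0, 0.25, 0.75])  # hypothetical coefficients

# probabilities must lie between zero and one
if not np.all((probs >= 0) & (probs <= 1)):
    raise ValueError("Probabilities fall outside [0, 1]")

# the coefficients must sum to zero
if not np.isclose(coefs.sum(), 0.0):
    raise ValueError("Coefficients do not sum to zero")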

Intuition

This is when a part of the code, or the results that it produces, just doesn’t feel right. The feeling doesn’t identify a failure of reproducibility of intention in itself, but it can prompt the use of other methods to probe the alignment with intention. The basis for the intuition doesn’t quite reach the level of formality of a known constraint or consequence, as discussed above; it is just a general sense that something is awry. It is likely related to the concept of a “code smell”—some aspect of the code that, while not necessarily incorrect in itself, gives an impression of trouble or fragility elsewhere. The feeling can also be initiated by the results that are produced—maybe a control condition is giving values that are possible, but unlikely.

It is important to note that acting on intuition can also be dangerous by introducing bias into the recognition of mistakes. Mistakes that are identified via intuition typically involve some failure of the code or results to match expectations. Hence, mistakes that have the effect of violating expectations are more likely to be identified and resolved than mistakes where the code or results are consistent with expectations.

Strategies for improvement

There are strategies that we can adopt in developing our workflows that seem likely to reduce the probability of mistakes. Here, we will go over some strategies within the ecosystem of the programming languages that we work with, the mechanics of writing code, and some personal factors.

Language ecosystem

Programming languages do not exist in isolation; each has its associated tools, conventions, approaches, and culture—which we can broadly refer to as its ‘ecosystem’. An awareness of the ecosystem of a language can help us use it in ways that reduce the probability of mistakes.

Run a ‘linter’ over your code

Most programming languages have tools that can help to identify errors and to warn about potentially problematic code—often called ‘linters’. They will always be imperfect, but routinely using such tools is a great and inexpensive (in time and effort) way to pick up on potential issues. A few linting tools for different languages are pylint (Python), mlint (Matlab), eslint (JavaScript), and lintr (R).
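As a rough illustration (a made-up snippet, not from a real workflow), a linter such as pylint would flag both of the problems in the following function:

def summarise(data):
    total = sum(data)
    # a linter would report that 'n' is undefined (a likely typo)
    # and that 'mean' is assigned but never used
    mean = total / n
    return total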

Consider using an auto-formatter

Most programming languages also have tools that will automatically format code to follow a particular set of conventions. This is particularly useful in spotting problems because of how it facilitates visual pattern recognition—by standardising the formatting and presentation of the code, it makes the content more visible and helps to highlight any unusual elements. It is also very useful for code review by removing any distracting debates about formatting preferences.

I particularly like black for Python because it doesn’t really have any options—you just have to accept its conventions. This took a bit of getting used to at first as I learned to get over some of my formatting preferences, but the benefits of consistent formatting strongly outweigh such concerns. For other languages, there are prettier (JavaScript), MBeautifier (Matlab), and styler (R).
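As a rough indication of what this looks like in practice, black would turn an inconsistently formatted statement like the first line below into the second:

# before formatting
x=np.std( a=data,ddof = 1 )

# after running black
x = np.std(a=data, ddof=1)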

Be particularly careful with notebooks

Notebooks (e.g., jupyter) have become a common way to write code and allow it to be interleaved with prose and graphical output. Such notebooks have many appealing qualities, particularly for tutorials and demonstrations. However, the way that they are typically used can introduce potential issues. For example, Pimentel, Murta, Braganholo, & Freire (2021) highlight “hidden states, unexpected execution order with fragmented code, and bad practices in naming, versioning, testing, and modularizing code” (p. 2).

Writing code

There are many ways of writing code that implements a given computational process. However, some approaches to writing code are more conducive to producing valid and robust outcomes.

Use your language’s strengths and be careful with its weaknesses

An awareness of the strengths and weaknesses of your chosen programming language will allow you to capitalise on the former and exercise caution around the latter. For example, the ability to use named function arguments is a great feature of Python. It allows me to call a function via something like np.std(a=data, axis=0), rather than just np.std(data, 0). Capitalising on this strength of Python would mean using these named arguments wherever possible. Conversely, if I am using a language that doesn’t have this feature, then I would need to be particularly careful that the order in which I supply the function arguments matches the order of function parameters.
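As a small illustration (with a made-up array), the named form makes the intention explicit and does not rely on remembering the parameter order:

import numpy as np

data = np.random.rand(10, 3)  # made-up example data

sd_positional = np.std(data, 0)    # requires knowing that the second parameter is the axis
sd_named = np.std(a=data, axis=0)  # the intention is explicit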

Part of this is also choosing your data representations and abstractions carefully. For example, I typically find multi-dimensional arrays to be a great way to represent the data in my workflows. For a long time, I based this representation on the ndarray of numpy. I was always concerned, though, that my code would often include statements like:

mean = data.mean(axis=2)

That is, I would need to keep track of the axis indices to be able to specify which axis to operate along. This was always worrying, and I would often include many checking statements to make sure that the shape of the array was as I had expected.

I have since moved to mostly using the wrapper over numpy that is provided by xarray. This allows the use of named dimensions, which makes such operations much clearer and less error-prone:

mean = data.mean(dim="ppant")
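A minimal sketch of this kind of representation (with made-up dimension names and data) looks something like:

import numpy as np
import xarray as xr

# made-up data: 10 participants x 3 conditions x 50 trials
data = xr.DataArray(
    np.random.rand(10, 3, 50),
    dims=("ppant", "cond", "trial"),
)

# the named dimension makes the intention explicit
mean = data.mean(dim="ppant")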

Consider using assertions

Most programming languages have something equivalent to an assert statement, in which a logical expression is stated and an error is raised if it evaluates to false. This is a great way to document and enforce your expectations about the state of something in your code at a particular point (such as those described above in Known constraints and consequences).

For example, let’s say that I use a convention where each participant in an experiment is assigned an identifier that begins with the character ‘p’—such as "p1001". At some point in my Python code, I am working with such an identifier in a variable called ppant_id. To express my expectation about the structure of this identifier, I could include something like:

assert ppant_id.startswith("p")

This will raise an error if the contents of the variable do not match this requirement.

When using assertions, it is important to be mindful of the challenges of assessing floating-point equality. For example, this statement will typically cause an error even though it appears to be correct:

assert (0.1 + 0.2) == 0.3

Instead, we need to use a logical expression that is aware of the floating-point approximations, such as np.isclose, which will not raise an error:

assert np.isclose(a=(0.1 + 0.2), b=0.3)

Use comments sparingly and judiciously

Including comments in code does not necessarily make the code more understandable or reliable. Comments can sometimes just re-describe what is being implemented in the code, but with less precision and with the vulnerability to becoming outdated if the code changes. Instead, aim to make the code as clear as possible without requiring comments, by using informative variable names and expression structures, and reserve comments for explaining why the code has its particular form.

For example, say you are running a study in which participants respond to a prompt about whether they completed the study with appropriate diligence. A participant accidentally selected ‘no’ when they meant to select ‘yes’, and they immediately email you to report their mis-click. You decide to modify the data as it is being loaded so as to reflect their diligence as reported in the email (it is perhaps debatable whether this is an appropriate response), and you include something like the following in the function that loads the data:

if participant_id == "p1001":
    diligent = True

You might decide to add a comment to this code:

# set p1001's diligence to true
if participant_id == "p1001":
    diligent = True

However, this comment doesn’t really add anything beyond what is already being expressed in the code. Instead, a better use of comments here would be to give some context and explain why the code is there:

# participant p1001 emailed on 2022/1/6 to say that
# they mis-clicked on the diligence question
if participant_id == "p1001":
    diligent = True

This is also a good opportunity to include an assert statement. We have reached a point where we have a strong assumption about the system state (that diligent is False). If that is not actually the case, then there is some faulty logic somewhere (maybe it was a different participant who mis-clicked, or maybe we edited the raw data rather than doing it at the data loading stage) that we need to be aware of.

# participant p1001 emailed on 2022/1/6 to say that
# they mis-clicked on the diligence question
if participant_id == "p1001":
    assert not diligent
    diligent = True

A useful approach for preparing comments is to take the perspective of somebody (perhaps your future self) who has access to the code but does not have any additional information or context. If something in the code is surprising or has an unclear rationale under these conditions, it is likely to be a good candidate for a comment explaining why your current self, with your additional contextual knowledge, wrote the code in this way.

Consider ease of exploration

A common and productive way to develop research code is to combine writing code into a file (e.g., analysis.py) with interactive exploration via a read-eval-print loop (REPL) shell such as ipython. This approach is facilitated by structuring the code so that it can be easily explored in the interactive shell.

For example, say you are writing code to implement a data analysis workflow. You might structure your code as a single monolithic function:

def run_analysis():

    # load the data
    ...

    # compute descriptive statistics
    ...

    # compute inferential statistics
    ...

This structure makes it difficult to isolate the different components of the analysis so that they can be explored individually. A good indicator of this difficulty is if you find yourself inserting temporary return statements into the function so that you can explore variables that are defined within it.

Instead, it is often useful to break it up into a set of individual functions with much more limited responsibilities—e.g., load_data, calc_descriptives, and calc_inferential—each of which may itself be decomposed into separate functions. This increase in modularity makes exploration easier, and it is also likely to improve the overall understandability of the code.
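A minimal sketch of this structure, using the function names above (with their bodies omitted), might be:

def load_data():
    ...

def calc_descriptives(data):
    ...

def calc_inferential(data):
    ...

def run_analysis():
    data = load_data()
    descriptives = calc_descriptives(data)
    inferential = calc_inferential(data)
    return descriptives, inferential

Each piece can then be run on its own in the interactive shell (e.g., calling load_data() in ipython and inspecting its output) before moving on to the next step.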

Personal factors

The behavioural and cognitive aspects of programming can also affect the likelihood of producing a workflow that appropriately reproduces its intention.

Keep a mental record of past mistakes

There will inevitably be times when you recognise a failure of your workflow to reproduce your intention. It is very useful not to ‘waste’ such situations but instead to use them as an opportunity to reflect on how the mistake happened and how it can be avoided in the future. Without this reflection, it is likely that you will either repeat the mistake in the future or else infer some incorrect causal mechanism for it. It is particularly important to avoid being satisfied with having resolved the mistake without knowing why or how it occurred.

I have made many programming mistakes that stand out in my mind, but here are a few specific examples:

  • I was coding an experiment that involved interacting with an external audio device. For the initial version of the experiment, I was communicating with the device over the parallel port and this imposed a limitation on the available number of audio tracks that could be played (about 250). Mindful of this restriction, I stored the track numbers in an array with an 8-bit unsigned integer data type (i.e., a data type that could store values between 0 and 255). This limit later became overly restrictive, so we moved to a serial communication method that allowed many more tracks. But, unfortunately, I didn’t update the data type of the array. This had the consequence that when I would do something like track_nums[2] = 300, the value stored inside track_nums[2] would be 44 rather than 300 (an integer overflow). Luckily, I caught this problem during piloting—but I now avoid using such specific data types unless necessary (e.g., for images).
  • I was selecting some random image indices for display in a vision experiment and hard-coding them into the script. There were quite a few of them, and I had each index on a different line in the code. To make them line up nicely, I zero-padded the numbers so that they always had three digits (e.g., 32 was written as 032). Unbeknownst to me, these leading zeros actually caused Python to interpret the numbers as being in an octal numbering system—so my 32 was being stored as a 26. Thankfully, this behaviour was changed in Python 3 to instead raise an error. This taught me that programming languages can have some surprising quirks, and the importance of interactive exploration of data structures.
  • I had written some code to present visual stimuli in a functional magnetic resonance imaging (fMRI) scanner. I was participating in a pilot version of the experiment inside the scanner when I noticed that there was a timing bug in the code that had the effect of the visual presentation becoming subtly but increasingly out of sync with the scanner acquisition over time. Recognising this bug made me very annoyed with my mistake—this annoyance then had physical effects on me that caused the goggles that I was wearing to fog up so I couldn’t see the screen! This bug thus had both direct and indirect consequences for the validity of the data. It was a good lesson in the subtleties of timing, particularly across different display systems, and the importance of experiencing experiments yourself where possible.

Be careful when switching languages

Although we are likely to have a preferred language, we sometimes need to write, edit, or review code in other languages. For example, I prefer Python but need to be able to write in JavaScript for web situations, Matlab to collaborate with colleagues, and R to cross-check analyses under different approaches.

Unfortunately, switching between languages can introduce the potential for errors by incorrectly applying the conventions of one language in another. For example, I recently ruined a web experiment that I had written in JavaScript because I forgot to increment a counter variable. A likely contributor to this mistake was that I was still thinking in a Python-based approach, in which loops are typically constructed in ways that don’t require explicit counters. I also spent a long time in R recently trying to work out why I was receiving an obscure (to me) error message—until I remembered that R uses one-based indexing, unlike Python and JavaScript.

The above example about the mistake in the web experiment also exemplifies the role of Intuition described earlier. The data from this experiment had conformed to reasonable expectations, except for one aspect that just didn’t seem quite right. We had noticed it earlier but could not come up with any explanation for what sort of error in the code could have caused it. But the nagging feeling persisted and prompted me to do a thorough code audit, during which the mistake was discovered.

Other than simply being mindful that you are working in a different programming language than usual, it can also be helpful to include other contextual cues when switching. For example, you might consider using different text editors for different languages. Additionally, when languages offer equivalent options, you might deliberately make different choices in each language. For example, when I began switching my dominant language from Matlab to Python, I made the conscious decision to use the " symbol rather than the ' symbol for strings (both of which are acceptable in Python) because I was using ' in Matlab. For a similar reason, I name my variables using camelCase in JavaScript and snake_case in Python.

Be careful when coming back to code after a period of time

It is often interesting to come back to code that you wrote a while ago—it can be surprising how much of what seemed clear at the time is no longer readily understandable. In addition to being a good opportunity to reflect on the choice of coding structures and the use of comments so as to improve future intelligibility, this is also an indicator that caution is needed when making any changes to the code.

Responding to peer review of a manuscript can be a particularly dangerous time for introducing mistakes into a code base. A few months or more have typically passed since you last looked at the code, and now it needs to be changed to address the issues raised in review. By their nature, there is a good chance that the code wasn’t initially structured to readily accommodate such changes, so they need to be handled with particular care.

Dealing with mistakes

It is inevitable that mistakes will be made when writing code to implement a research workflow. Many of us have little to no formal training in computer programming, and the nature of research is that we are often doing things that haven’t been done before. When we make such a mistake, the best that we can do is to identify it as early as possible, resolve it, communicate it, learn from it, and move on. This is facilitated by being embedded in a research culture that has the minimisation of mistakes at its foundation—see Rouder, Haaf, & Snyder (2019) for a great perspective on such a culture within a lab.

It is useful to interpret mistakes that persist in a research workflow as a failure of process, rather than a personal failing. That is, there should be systems in place that minimise the frequency of mistakes and ensure their early identification and resolution; mistakes are not indicative of a lack of aptitude or diligence on the part of their author. Indeed, mistakes are an important part of learning and can be an indicator of a successful process; for example, a critical mistake that is identified by internal code review within the research team before being communicated externally.

It can be very difficult to attribute mistakes to anything other than a personal failure of competence. Indeed, this is something that I struggle with—it manifests in my often counter-productive tendency to over-complicate workflows and to react defensively and negatively to criticism. It is something that I recognise in myself and hope to improve in the future.

It is also important to recognise and acknowledge that we sometimes need to write code that can have substantial and tangible consequences for others. For example:

  • Auditory stimuli can affect hearing if at an excessively high level.
  • Visual stimuli can have negative effects, such as motion sickness with wide-field displays or the adverse effects of flicker.
  • The presentation and interpretation of study outcomes can affect factors such as policy, community sentiment, interventions, and funding directions.

We need to be mindful of such considerations and devote our resources such that the degree of caution involved is proportional to the severity of the potential consequences.

Summary

Here, we were concerned with the fundamental issue of code correctness within research workflows—does the workflow reproduce the intention for which it was constructed?

We first discussed how we can assess for failures of reproducibility of intention: evaluating the output of alternative workflows, drawing on the expectations of others, testing against known constraints and consequences, and listening to an intuition that something is awry.

We then discussed some strategies that we can apply to improve the probability of a workflow reproducing our intention. Within the domain of a programming language ecosystem, we discussed the use of linters and auto-formatters and promoted the cautious use of computational notebooks. For the mechanics of writing code, we examined the utilisation of language strengths, the usage of assertions, appropriate commenting strategies, and constructing our code with ease of exploration in mind. We then discussed some personal factors: how it is useful to keep a mental record of past mistakes and the particular caution that is needed when switching between languages and when coming back to code after a period of time.

Finally, we considered how to deal with the mistakes that will inevitably occur. We discussed the appropriate reaction to mistakes, with particular recognition of the role of processes rather than personal factors. We also considered the impact of mistakes, and how the allocation of resources can be proportional to the severity of the expected consequences of mistakes.

Overall, I hope that this has provided a framework for thinking about the validity of research workflows and has described some useful and concrete strategies and tools.

Acknowledgements

Thank you to A. Prof Tamara Watson for organising the seminar and the MARCS Institute for Brain, Behaviour and Development at Western Sydney University for hosting.

References

  1. Perkel, J.M. (2020) Challenge to scientists: does your ten-year-old code still run? Nature, 584, 656–658.
  2. Pimentel, J.F., Murta, L., Braganholo, V., & Freire, J. (2021) Understanding and improving the quality and reproducibility of Jupyter notebooks. Empirical Software Engineering, 26(65), 1–55.
  3. Rouder, J.N., Haaf, J.M., & Snyder, H.K. (2019) Minimizing mistakes in psychological science. Advances in Methods and Practices in Psychological Science, 2(1), 3–11.