Benefits of storing and accessing research data in zip files

2022-Aug-17

Embedding multiple files within a single zip file can simplify the verification of data integrity while still allowing the individual files to be readily accessible within code.

Data can often consist of multiple files. For example, there might be a raw data file for each participant that completed a study. Here, I am going to argue that it can be useful and convenient to store such a set of files in a single zip file rather than as separate files.

An important benefit of a single zip file is that it simplifies hash-based verification of data integrity. In this approach, a hash function (such as MD5) is used to produce a small summary (a hash) of the data file. Because virtually any change to the data file produces a change in the hash, we can use the hash as a simple method of verifying that the data file matches our expectations. (MD5 is adequate for catching accidental changes; it is not secure against deliberate tampering, and a function such as SHA-256 can be substituted if that matters.) This is an example of using a known constraint to improve computational reproducibility. With a single zip file, we only need to compute and check a single hash, rather than one for every separate file.
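
As a small illustration of how sensitive a hash is to the content, the sketch below hashes two byte strings that differ by a single character. The example data is made up purely for demonstration.

```python
import hashlib

# two made-up data snippets that differ in a single digit
original = b"participant,score\n1,0.52\n2,0.61\n"
altered = b"participant,score\n1,0.52\n2,0.62\n"

original_hash = hashlib.md5(original).hexdigest()
altered_hash = hashlib.md5(altered).hexdigest()

# each hash is a 32-character hexadecimal string
print(original_hash)
print(altered_hash)

# whether the two hashes are identical
print(original_hash == altered_hash)
```

The two printed hashes should look unrelated, even though the inputs are almost identical.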

In Python, we can compute the hash of a file via a function such as:

import hashlib
import pathlib


def compute_hash(path):

    path = pathlib.Path(path)
    path_bytes = path.read_bytes()
    path_md5 = hashlib.md5(path_bytes)
    path_hash = path_md5.hexdigest()

    return path_hash
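
The function above reads the whole file into memory, which is fine for most data sets. If the zip file is very large, one possible alternative is to feed the file to the hash function in chunks. This is only a sketch: the name compute_hash_chunked and the chunk_size parameter are my own inventions for illustration.

```python
import hashlib
import pathlib


def compute_hash_chunked(path, chunk_size=1024 * 1024):
    # hypothetical variant of `compute_hash`; `chunk_size` is in bytes

    path = pathlib.Path(path)
    path_md5 = hashlib.md5()

    with path.open("rb") as file_handle:
        # keep reading until an empty bytes object signals the end of the file
        for chunk in iter(lambda: file_handle.read(chunk_size), b""):
            path_md5.update(chunk)

    return path_md5.hexdigest()
```

It should return the same hash as the in-memory version for the same file.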

We can then create a function that compares a provided file path with an expected hash and raises an error if the hashes do not match:

def verify_path(path, valid_hash):

    path_hash = compute_hash(path=path)

    if path_hash != valid_hash:
        raise RuntimeError(
            f"Hash of file `{path}` ({path_hash}) does not match the "
            + f"valid hash ({valid_hash})."
        )

For example, we can define a function that is used to load the raw data for a study. After we have computed the hash of the zipped raw data file, we can use this hash within the function to check that the raw data is as we expect:

def load_data(path):

    required_hash = "a065568468ee5349222242497e5e205f"
    verify_path(path=path, valid_hash=required_hash)

    # ...

This is a useful way of verifying that the data going into an analysis matches an expectation. It can identify situations such as having an outdated or incomplete version of the data, and prevents a misleading analysis from proceeding under these circumstances.
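
To get the hash in the first place, the zip file needs to be created once and then hashed. The sketch below shows one way that could be done from Python; zip_directory is a hypothetical helper, and it assumes that zip_path lies outside source_dir so that the zip file does not try to include itself.

```python
import hashlib
import pathlib
import zipfile


def zip_directory(source_dir, zip_path):
    # hypothetical helper: zip every file under `source_dir`, return the MD5 hash

    source_dir = pathlib.Path(source_dir)
    zip_path = pathlib.Path(zip_path)

    with zipfile.ZipFile(
        file=zip_path, mode="w", compression=zipfile.ZIP_DEFLATED
    ) as path_zip:
        # sort so that the files are always added in the same order
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # store the path relative to `source_dir` inside the zip file
                arcname = file_path.relative_to(source_dir).as_posix()
                path_zip.write(file_path, arcname=arcname)

    return hashlib.md5(zip_path.read_bytes()).hexdigest()
```

The returned hash is what would be pasted into load_data. Note that zip files store file modification times, so zipping the same files again later may well give a different hash; the zip file itself should be treated as the fixed artefact.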

But how do we access the individual files, now that we have combined them into a single zip file? The neat thing is that we don’t have to extract them: we can read them directly from within the code (in Python, at least). We can do that by using the zipfile module from the Python standard library. For example:

import zipfile


def load_data(path):

    required_hash = "a065568468ee5349222242497e5e205f"
    verify_path(path=path, valid_hash=required_hash)

    with zipfile.ZipFile(file=path) as path_zip:
        # iterate over all the files in the zip file
        for zip_filename in path_zip.namelist():
            # open the file within the zip file for reading
            with path_zip.open(name=zip_filename) as file_handle:
                # read the raw bytes of the file
                file_bytes = file_handle.read()
                # parse the data file as required
                # ...
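
What the parsing step looks like depends on the data format. As one possible sketch, suppose that every file in the zip file is a UTF-8 CSV file with a header row (an assumption for illustration only). The file handle returned by the zip file is binary, so it can be wrapped in a text layer before being passed to the csv module. The name read_csv_rows is hypothetical.

```python
import csv
import io
import zipfile


def read_csv_rows(path):
    # hypothetical helper; assumes each entry is a UTF-8 CSV with a header row

    rows_by_file = {}

    with zipfile.ZipFile(file=path) as path_zip:
        for zip_info in path_zip.infolist():
            # zip files can contain directory entries; skip those
            if zip_info.is_dir():
                continue
            with path_zip.open(name=zip_info) as file_handle:
                # wrap the binary handle so that it yields text
                text_handle = io.TextIOWrapper(
                    file_handle, encoding="utf-8", newline=""
                )
                # each row becomes a dict keyed by the header names
                rows_by_file[zip_info.filename] = list(csv.DictReader(text_handle))

    return rows_by_file
```

The result maps each file name in the zip file to its list of rows, with all values left as strings.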

I can see two main potential downsides of this approach. The first is a possible performance cost, although, even if present, it is unlikely to be much of an issue for most analyses. The second is reduced visibility into the files that make up the data, although 'unzip -l ${ZIPFILE}' can be used on the command line to list the files within a zip file. Overall, I think that it can be beneficial to store data in a zip file rather than as separate files.
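
On the visibility point, the contents can also be listed from within Python. The sketch below returns the name and uncompressed size of each entry; list_zip_contents is a hypothetical helper name.

```python
import zipfile


def list_zip_contents(path):
    # hypothetical helper: (file name, uncompressed size in bytes) per entry

    with zipfile.ZipFile(file=path) as path_zip:
        return [
            (zip_info.filename, zip_info.file_size)
            for zip_info in path_zip.infolist()
        ]
```

The zipfile module also has a command-line interface, so 'python -m zipfile -l ${ZIPFILE}' gives a similar listing where unzip is not installed.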