Saving data

Objectives

  • Be able to check for the existence of files on disk.
  • Understand how to represent data to facilitate saving to disk.
  • Be able to save data to a file on disk.

The typical purpose of running an experiment is to generate some data that we can use to address our question of interest. Hence, it is important that we are able to store the data generated by our experiments to a file on disk.

Creating and checking a file location

The first step in saving data from an experiment is knowing where to save the data to. A common scenario is to save the data from each session (that is, each combination of subject, repeat, and perhaps condition) to a separate file. This approach maximises the flexibility for subsequent analyses. We may end up with a file name such as p1000_exp_cond_1_rep_1.tsv.
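As a rough sketch of how such a file name could be assembled, the example below builds it from a subject number, a condition, and a repeat. The function name make_data_path and its arguments are hypothetical and only follow the pattern of the example file name above; your own naming scheme may well differ.

```python
def make_data_path(subject_id, condition, repeat):
    """Build a data file name such as p1000_exp_cond_1_rep_1.tsv."""
    # hypothetical naming scheme, copied from the example file name
    return "p{s:d}_exp_cond_{c:d}_rep_{r:d}.tsv".format(
        s=subject_id, c=condition, r=repeat
    )


data_path = make_data_path(subject_id=1000, condition=1, repeat=1)

print(data_path)
```

Building the name in one place like this means that the same path can be reused for both the existence check and the eventual save.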

A useful step at this point is to check that a file with this name doesn't already exist, because it is very frustrating to lose data when it is overwritten! This check is best done right at the start of an experiment, before any data is collected, so that the impact of a wrong filename is minimised. We can check whether the file already exists by using os.path.exists:

import os

data_path = "p1000_exp_cond_1_rep_1.tsv"

data_path_exists = os.path.exists(data_path)

print(data_path_exists)
False

If the file already exists, a safe strategy would be to exit the program at this point and tell the user about the problem. One way to do this is to use the sys.exit function:

import os
import sys

data_path = "p1000_exp_cond_1_rep_1.tsv"

data_path_exists = os.path.exists(data_path)

# we will pretend that it does exist
data_path_exists = True

if data_path_exists:
    sys.exit("Filename " + data_path + " already exists!")
Filename p1000_exp_cond_1_rep_1.tsv already exists!
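If the check is needed in more than one script, it could be wrapped in a small function. The sketch below is one possible way of doing so; the name exit_if_exists is hypothetical and is not part of os or sys.

```python
import os
import sys


def exit_if_exists(data_path):
    """Stop the program with a message if data_path is already on disk."""
    if os.path.exists(data_path):
        sys.exit("Filename " + data_path + " already exists!")
    # the path is free, so hand it back for later use
    return data_path
```

Calling exit_if_exists(data_path) at the top of an experiment script would then stop the program before any trials are run if the file is already there.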

Organising data output

While the data you collect in an experiment may come in many forms, it is best to work out a way in which the data can be represented in a “tabular” form; that is, consisting of rows and columns. This is the easiest format with which to store data on disk.

For example, say your experiment consists of 10 trials where each trial shows a grating at a particular orientation and records whether the participant pressed the left or right arrow key in response. This could be organised in the data file as 10 rows and 2 columns; each row represents a trial, and the two columns give the grating orientation and the response on that particular trial. For example:

import random
import pprint

data = []

for trial in range(10):

    data.append(
        [
            random.uniform(0, 180),
            random.choice(["left", "right"])
        ]
    )

pprint.pprint(data)
[[170.2380134766284, 'left'],
 [21.43877741975255, 'left'],
 [44.29799618149051, 'right'],
 [123.78765580362592, 'right'],
 [90.4004819386519, 'left'],
 [65.2586502574488, 'left'],
 [67.08706605950744, 'left'],
 [18.31637412188478, 'right'],
 [36.415401853733, 'left'],
 [89.6258850472638, 'left']]

Tip

Note that we are using a “pretty printer” (import pprint) to show the contents of data, as it makes the tabular organisation clearer.

However, to make it easier to save and load data to disk, we often convert strings into numbers by coding them. For example, rather than having “left” and “right”, we might use the numbers 1 and 2 to refer to the keys that were pressed.

import random
import pprint

data = []

for trial in range(10):

    data.append(
        [
            random.uniform(0, 180),
            random.choice(["left", "right"])
        ]
    )

pprint.pprint(data)

coded_data = []

for data_row in data:

    # copy the row so that the original data list is not modified
    coded_row = list(data_row)

    if coded_row[1] == "left":
        coded_row[1] = 1
    elif coded_row[1] == "right":
        coded_row[1] = 2

    coded_data.append(coded_row)

pprint.pprint(coded_data)
[[15.841285395394317, 'left'],
 [77.24308806155922, 'right'],
 [120.31863826174141, 'left'],
 [68.20413796190277, 'right'],
 [140.60816319586255, 'left'],
 [29.668900179043224, 'left'],
 [90.33576145547305, 'left'],
 [162.39870297804484, 'left'],
 [22.306018494281254, 'left'],
 [53.76279809035283, 'left']]
[[15.841285395394317, 1],
 [77.24308806155922, 2],
 [120.31863826174141, 1],
 [68.20413796190277, 2],
 [140.60816319586255, 1],
 [29.668900179043224, 1],
 [90.33576145547305, 1],
 [162.39870297804484, 1],
 [22.306018494281254, 1],
 [53.76279809035283, 1]]
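An alternative to the if/elif chain is to keep the coding scheme in a dictionary, which also gives a single place to look up what each number means. This is only a sketch: the name response_codes is hypothetical, and the three rows of data are made up for illustration.

```python
# hypothetical lookup table; the codes 1 and 2 follow the text above
response_codes = {"left": 1, "right": 2}

# a few made-up rows: orientation and the key that was pressed
data = [
    [15.8, "left"],
    [77.2, "right"],
    [120.3, "left"],
]

coded_data = []

for (orientation, response) in data:
    # build a new row, leaving the original data list unchanged
    coded_data.append([orientation, response_codes[response]])

print(coded_data)
```

A response that is missing from the dictionary raises a KeyError, which can be a helpful way of noticing an unexpected key press early.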

Saving data

Once we have our data assembled in a suitable format (i.e. a list of lists, as above), we can save it to disk using the savetxt function from numpy (imported here as np). The first argument is the filename, and the second is the variable containing the data to be saved. We will also specify an optional argument called delimiter, which we will set as "\t". This tells the savetxt function that we want columns to be separated by a TAB character. This is a common way of storing files, and is reflected in the file extension of the data's filename (with "tsv" standing for "tab separated values").

import random
import numpy as np
import pprint

data = []

for trial in range(10):

    data.append(
        [
            random.uniform(0, 180),
            random.choice([1, 2])
        ]
    )

pprint.pprint(data)

np.savetxt(
    "p1000_exp_cond_1_run_2.tsv",
    data,
    delimiter="\t"
)
[[111.72308946245312, 2],
 [0.23890846120006692, 2],
 [175.28011985115702, 1],
 [112.06603323103002, 1],
 [86.06792584342334, 2],
 [44.22366893318996, 1],
 [101.4029141498375, 1],
 [115.703634373552, 1],
 [100.9374884735026, 1],
 [8.21493430720146, 2]]

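We can then load the data using np.loadtxt to verify that we are indeed able to do so. Note that the response codes come back as floating point numbers (2.0 rather than 2), because np.loadtxt reads every column as a float by default: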

import numpy as np
import pprint

data = np.loadtxt(
    "p1000_exp_cond_1_run_2.tsv",
    delimiter="\t"
)

pprint.pprint(data.tolist())
[[111.72308946245312, 2.0],
 [0.23890846120006692, 2.0],
 [175.28011985115702, 1.0],
 [112.06603323103002, 1.0],
 [86.06792584342334, 2.0],
 [44.22366893318996, 1.0],
 [101.4029141498375, 1.0],
 [115.703634373552, 1.0],
 [100.9374884735026, 1.0],
 [8.21493430720146, 2.0]]
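The saved file can be made a little easier to read by eye using two further optional arguments of np.savetxt: fmt, which controls how each column is written, and header, which adds a comment line at the top of the file. The sketch below uses made-up data, hypothetical column names, and a temporary directory so that no real data file is touched.

```python
import os
import tempfile

import numpy as np

# made-up rows: orientation in degrees, response coded as 1 or 2
data = [
    [111.72, 2],
    [0.24, 2],
    [175.28, 1],
]

# write into a temporary directory (an assumption for this example)
data_path = os.path.join(tempfile.mkdtemp(), "example.tsv")

np.savetxt(
    data_path,
    data,
    delimiter="\t",
    # one format per column: 4 decimal places, then a whole number
    fmt=["%.4f", "%d"],
    # written as a comment line, starting with "# "
    header="orientation\tresponse"
)

# loadtxt skips comment lines, so the header does not get in the way
loaded = np.loadtxt(data_path, delimiter="\t")

print(loaded)
```

Keep in mind that a format such as "%.4f" rounds the values as they are written, so it is only appropriate when that precision is sufficient for the later analysis.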