# Fundamentals of data and coding for psychology

## Data storage and navigation

We have considered how data is represented in a computer in binary form.
We usually don't want such data to be lost when the computer loses power, so we would like to store some data on a device such as a 'hard disk'.
How we go about storing our data is an important consideration for research projects&mdash;particularly when dealing with sensitive and confidential information, we need to know how and where the data is stored.

The learning objectives for this lesson are for you to be able to:
* Describe the components of filenames and file paths.
* Navigate a filesystem using absolute and relative paths.
* Develop a data storage/organisation strategy for a research project.

### Files

We can define a 'file' as a set of bytes that are stored on some medium, such as a disk, and the associated 'meta-data'.
Such meta-data can include things like the time the file was created and changed, permissions on who can access it, etc.
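
As a rough illustration of what such meta-data looks like in practice, the sketch below creates a small throwaway file and asks the filesystem about it. It uses Python's `pathlib` module, which is introduced properly later in this lesson; the filename `demo.txt` and the variable names are made up for this example.

```python
import datetime
import pathlib
import tempfile

# Work inside a temporary directory so nothing on your own disk is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    # 'demo.txt' is a hypothetical filename used only for this illustration.
    demo_path = pathlib.Path(tmp_dir) / "demo.txt"
    demo_path.write_bytes(b"hello")  # the file's contents: five bytes

    # `stat` returns the meta-data that the filesystem keeps about the file.
    info = demo_path.stat()
    size_in_bytes = info.st_size
    modified = datetime.datetime.fromtimestamp(info.st_mtime)

print(size_in_bytes)  # the number of bytes in the file
print(modified)       # when the file was last changed
```

Note that the size and the modification time are not part of the five bytes that we wrote; they are stored alongside the file by the filesystem.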

We are going to consider two aspects of the meta-data: the *filename* and the *file path*.

#### Filenames

You are all likely to be familiar with the concept of filenames: they are what you type when you want to save a file, for example.
They consist of two components: a *stem* and an (optional) *extension* (or *suffix*).

For example, consider the filename `my_essay.docx`.
Here, the stem would be `my_essay` and the extension would be `.docx`.

(A trend in some operating systems, such as Windows, is to *hide* the extension by default when using a viewer like Windows Explorer.
It is important to know that the extension is still there and is still part of the filename.
I recommend turning off this 'hide extension' functionality, if possible.)

Note that not all characters are allowed in filenames, and which characters are allowed depends on the specific filesystem, so it is best to keep filenames simple.

Spaces in filenames are typically allowed (e.g. `my image.jpg`), but I'd suggest avoiding them as they can be a bit of a pain.
A frequent convention is to replace them with underscores, like we have done previously (e.g., `my_image.jpg`), or dashes (e.g., `my-image.jpg`).

Files can also have multiple extensions, such as `my_image.tar.gz`.
This is less common on Windows platforms.

##### Filenames in Python

Let's have a look at how filenames are handled in Python.
Python has a module called `pathlib` that provides functionality for working with filenames and paths.
To make this additional functionality available to use, we use the Python statement `import`.

In [None]:
import pathlib

We can now access the `pathlib` functionality through the classes and functions that are available under the `pathlib` name.
Of particular interest to us is the `Path` object:

In [None]:
doc_path = pathlib.Path("my_essay.docx")

As you can see, we provide it with a string that contains the filename and it returns an object that we store in a variable we have called `doc_path`.
This variable then contains a bunch of useful methods and properties.
Like what?
There is a good list in the [Python documentation](https://docs.python.org/3/library/pathlib.html#methods-and-properties).

Let's have a look at a few:

In [None]:
print(doc_path.stem)

In [None]:
print(doc_path.suffix)

In [None]:
print(doc_path.name)

In [None]:
print(doc_path.suffixes)
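
The difference between `suffix` and `suffixes` is easier to see with a filename that has more than one extension, such as the `my_image.tar.gz` example from earlier. Here is a small sketch (the variable name `archive_path` is just my choice for this example):

```python
import pathlib

archive_path = pathlib.Path("my_image.tar.gz")

print(archive_path.suffixes)  # all of the extensions, as a list
print(archive_path.suffix)    # only the final extension
print(archive_path.stem)      # everything before the final extension
```

Notice that `stem` only strips the final extension, so it still contains `.tar` here.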

Another useful path method is `with_suffix`, which returns a path object where the suffix has been replaced.

Inspect the help for this method, and then use it to convert `my_essay.docx` into `my_essay.xcod`:

In [None]:
doc_path.with_suffix?

##### File extensions don't really matter for the data (but they do matter for people working with the files)

Something that I have noticed has come up in many previous years of this class is that students tend to think of file *contents* as being determined by file *extensions*.
These things often covary, but this is not necessary; your `my_essay.docx` is no less of a Word document if you rename it `my_essay.xcod`.
Relatedly, an image called `my_img.jpg` does not become a Word document if you rename it to `my_img.docx`.
As we have discussed, the critical thing is the underlying *bytes*&mdash;which are not changed by altering the extension.

Note 1: this is only true for *renaming* files. If you explicitly ask the program to save it in a different format (e.g., using 'Save As'), that is likely to change the underlying bytes.

Note 2: you might have difficulties opening your `my_essay.xcod` file your usual way.
This is because your operating system (Windows, Mac, Linux) uses the file extension to know what program to open it with.
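
If you would like to convince yourself that renaming leaves the bytes untouched, here is a minimal sketch that you could run. It assumes a throwaway file in a temporary directory, and the bytes it writes are only a stand-in (a real Word document contains many more bytes, in a particular format).

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    original_path = pathlib.Path(tmp_dir) / "my_essay.docx"
    # Stand-in contents; these are not the bytes of a real Word document.
    original_path.write_bytes(b"pretend these are the bytes of an essay")
    bytes_before = original_path.read_bytes()

    # Rename the file so that it has a different extension.
    renamed_path = pathlib.Path(tmp_dir) / "my_essay.xcod"
    original_path.rename(renamed_path)
    bytes_after = renamed_path.read_bytes()

print(renamed_path.name)
print(bytes_before == bytes_after)
```

The name has changed, but the comparison shows that the contents are identical before and after the rename.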

Self-contained files containing Python code (i.e., code that is just code, rather than being part of a notebook like this one) are typically given the `.py` extension.
However, as we've just gone through, this is not necessary: the file only needs to contain bytes that are valid Python code.

The above concerns the (lack of a necessary) relationship between a file's name and its data contents.
However, **file names are still super important**!
They are important because we as people need to interact with the files, and file names are a critical component of understanding what the file contains.
We will discuss this concept more later in the lesson.

#### File paths

The other aspect of files of interest is their *path*; that is, their location on the storage medium.

Paths are readily understood using a 'tree' metaphor.
At the top of the tree is the root node, which can contain files and also sub-trees called *directories* (or *folders*).
These directories can themselves contain sub-trees, resulting in a hierarchical structure.
Such hierarchies are shown visually in programs like 'Windows Explorer' or 'Mac Finder'.
However, we also need a way of expressing such structures in text.

The most important aspect is the use of *slashes* to represent directories, and the progression from left to right moving from higher-level (closer to the root node) to lower-level sub-directories.

Here are a few example paths:
* `C:\Users\damien\project\`
* `/Users/Damien/project/`
* `/home/damien/project/`

Note that the way these are written is a bit different on different operating systems.
On Windows computers, the slash character is the 'backslash' (`\`), whereas it is the 'forward slash' (`/`) on Mac and Linux computers.

(We will soon see some complications that ensue from this.)

##### File paths in Python

We can again use `pathlib` to represent file paths in Python:

In [None]:
project_path = pathlib.Path("/Users/Damien/project/")

We can also inspect some of the useful information it provides:

In [None]:
print(project_path.parts)

In [None]:
print(project_path.root)

In [None]:
print(project_path.parent)

In [None]:
print(list(project_path.parents))

We can also use a `/` operator to create new paths:

In [None]:
subproject_path = project_path / "study_1" / "exp_1"
print(subproject_path)

##### So why is `\` a problem?

Glad you asked!

It comes back to the representation of text that we encountered last lesson.
If you take a look at the ASCII standard again, you will see that there are a bunch of [control characters](https://en.wikipedia.org/wiki/ASCII#Control_characters).
These don't represent characters per se, but instead modify how the characters are presented.

The problem is that, in Python strings, many of these control characters are written as 'escape sequences' that begin with a backslash...

For example, say you wanted to indicate that there should be a new line partway through a string.
You can use the `\n` escape sequence to do that:

In [None]:
multiline = "Class name: Internship\nClass code: PSYC3361"
print(multiline)

You can see that Python has interpreted `\n` as a single 'thing', rather than separate `\` and `n` characters.

We can see that by inspecting the length of the string:

In [None]:
demo = "ab\ncd"
print(len(demo))

Another useful control character is `\t`, which inserts a 'tab'.
Raw data from experiments is often saved in 'tab-separated values' files, in which a tab sits between neighbouring values.
For example:

In [None]:
data = "Cond1\tCond2\tCond3"
print(data)

You can see how the combination of control characters and the direction of the slash used in Windows might be a problem for a path like `C:\new_project`.

Not to mention something like `C:\behav`.
What does it get represented as? Why?
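
To make the `C:\new_project` case concrete, here is a small sketch of what Python does with that string when it is typed as-is (the variable name is just for this example). I'll leave the `C:\behav` case for you to work out.

```python
# Typed as-is, the \n in the middle is read as a single newline character.
windows_style = "C:\new_project"

print(windows_style)
print(len(windows_style))
print("\n" in windows_style)
```

The path is printed over two lines, and the length is one less than the number of characters that we typed, because the `\` and the `n` have been merged into one newline character.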

There are a couple of ways around it.
First, we can prepend an `r` to the string to indicate that it is 'raw':

In [None]:
demo = r"ab\ncd"
print(demo, len(demo))

We can also add an additional `\`:

In [None]:
demo = "ab\\ncd"
print(demo, len(demo))

We can also often just use a `/`, even on Windows&mdash;Python is smart enough to know what we mean:

In [None]:
project_path = pathlib.PureWindowsPath("C:/Users/damien/project/")

(Note that we are explicitly asking for a Windows path here by using `PureWindowsPath`.
We need to specify the `Pure` bit because our notebook is actually running on a Linux computer in some data centre somewhere, and a `WindowsPath` can only be created on a computer that is running Windows.)
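
As a quick check that these three ways of writing a Windows path really are interchangeable, the sketch below builds the same path using a raw string, doubled backslashes, and forward slashes, and then compares the results. The variable names are my own, and `PureWindowsPath` is used so that it runs on any operating system.

```python
import pathlib

from_raw = pathlib.PureWindowsPath(r"C:\Users\damien\project")
from_doubled = pathlib.PureWindowsPath("C:\\Users\\damien\\project")
from_forward = pathlib.PureWindowsPath("C:/Users/damien/project")

# All three spellings describe the same Windows path.
print(from_raw == from_doubled == from_forward)

# Printing a Windows path shows it with backslashes, however it was typed.
print(from_forward)
print(from_forward.parts)

# `as_posix` gives the forward-slash version as a string.
print(from_forward.as_posix())
```

Note that typing `"C:\Users\damien\project"` as an ordinary string (without the `r` or the doubled backslashes) would not even run, because Python tries to interpret `\U` as the start of a special character code.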

##### Absolute and relative paths

A very useful concept when thinking about interacting with data in a research context is the difference between *absolute* and *relative* paths.

An *absolute* path includes the full directory structure, including the root node.
The paths listed earlier are all absolute paths:
* `C:\Users\damien\project\`
* `/Users/Damien/project/`
* `/home/damien/project/`

In [None]:
project_path = pathlib.Path("/home/damien/project/")

print(project_path.is_absolute())

A *relative* path is instead interpreted with respect to the 'current working directory':

In [None]:
# `cwd` belongs to the `Path` class itself; its result does not depend on any particular path
print(pathlib.Path.cwd())

When using relative paths, there are two important special directory names to be aware of:
* `.` refers to the current directory.
* `..` refers to the parent directory.

While the former (`.`) does not come up all that often in common usage, the latter (`..`) is very often encountered.

For example, say we are storing the code for an experiment in `/home/damien/project/code` and we will be running the code from within that directory.
We want that experiment to save some data in the directory `/home/damien/project/data`.

One option is to use an absolute path, as we have done before:

In [None]:
data_path_abs = pathlib.Path("/home/damien/project/data")

What's wrong with that?
The trouble is that it locks us into that particular complete directory structure.
What if another user wanted to run the code? They cannot access my home directory, so the code would be unable to run.

Instead, we can use a relative path for the data by using `..` to indicate the parent directory and then `data` to indicate the directory within that parent directory:

In [None]:
data_path_rel = pathlib.Path("../data/")

In [None]:
print(data_path_rel)

In [None]:
print(data_path_rel.is_absolute())

Because the path is relative, the code can accommodate differences in the structure of the parent directories; it only needs there to be a `data` directory alongside the directory that the code is run from (that is, inside its parent directory).
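
To see why the relative path is more portable, here is a minimal sketch that works out where `../data` would point for two different working directories. The second location (`/home/alex/studies/project/code`) is invented purely for illustration. The sketch uses `PurePosixPath` so that it behaves the same on any operating system, and `posixpath.normpath` from the standard library to collapse the `..` as plain text, without looking at the actual filesystem.

```python
import pathlib
import posixpath

data_path_rel = pathlib.PurePosixPath("../data")

# Two hypothetical places where the project's code directory might live.
code_dirs = [
    "/home/damien/project/code",
    "/home/alex/studies/project/code",
]

resolved_paths = []
for code_dir in code_dirs:
    # Join the relative path onto the directory that the code is run from...
    combined = pathlib.PurePosixPath(code_dir) / data_path_rel
    # ...and then collapse the `..` component to get the final location.
    resolved_paths.append(posixpath.normpath(str(combined)))

for resolved_path in resolved_paths:
    print(resolved_path)
```

The same relative path ends up in each user's own `project/data` directory, which is exactly what we want.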

We can also chain multiple `..` components to move further up the tree:

In [None]:
longer_data_path_rel = pathlib.Path("../../different_project/data")

Note that a filename just by itself is a relative path:

In [None]:
doc_path = pathlib.Path("my_essay.docx")

print(doc_path.is_absolute())

We can also use the `.` that we talked about earlier to explicitly indicate the current directory:

In [None]:
dot_doc_path = pathlib.Path("./my_essay.docx")

Which is the same thing:

In [None]:
print(dot_doc_path == doc_path)

(Note the use of `==` (double equals sign) in the above; in Python, the `==` operator tests for *equality*, whereas a single `=` assigns a value to a variable.)

### Data storage structure for research projects

With a bit more knowledge about how data is stored on media such as hard drives, we can start to think about how you might like to structure the data that your study requires and the data that it produces.
Things like directory structure and filename conventions are a very important part of running reliable experiments.

Unfortunately, there are very few standardised ways of approaching the development of a data storage structure.
This is likely due to the heterogeneity of the studies that are conducted, and the result is that everyone ends up doing things in different ways.

One exception is the neuroimaging (e.g., fMRI) research area, where researchers have recently proposed a standard called the "Brain Imaging Data Structure" (BIDS).
Here is an example image from their paper describing the standard ([The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments](http://www.nature.com/articles/sdata201644)):

![BIDS layout example](https://media.nature.com/lw926/nature-assets/sdata/2016/sdata201644/images_hires/sdata201644-f1.jpg)

On the left is a depiction of the filesystem 'tree' for the raw data that is produced by the scanning instrument.
Note the lack of meaning in the file and directory names.

On the right is a depiction of the filesystem 'tree' after the raw data has been converted into the BIDS format.
Note the meaningful hierarchy and file and directory names.

The upside is that the structure of this data can be readily interpreted if it is in BIDS format.
It could be easily shared with other researchers, but it also helps the primary research team ensure that the data is all stored correctly.

Note that they also have made available custom Python code ([pybids](https://bids-standard.github.io/pybids/)) for working with BIDS-structured data.
This can be installed and `import`ed to make its functionality available.

There is some more general advice in the [Project organization](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510#sec009) section of [Good enough practices in scientific computing](https://doi.org/10.1371/journal.pcbi.1005510).
In particular, have a look at their 'Box 3' for an example filesystem layout.

Another example comes from a minor discussion in [Minimizing Mistakes in Psychological Science](https://journals.sagepub.com/doi/full/10.1177/2515245918801915) (emphasis mine):

> A well-organized lab should have a specified organizational structure. Particular attention should be given to the following: standardization of experimental metadata, **standardization of folder-naming conventions**, and standardization in versioning. Standardization in metadata means that all records of all experiments should look similar. The lab should have a standard format for recording information on, for example, participants, sessions, and IRB protocols. Of course, variables differ across experiments, but standardization of the naming conventions across experiments is always helpful. **Likewise, we find it helpful to have a standardized naming convention across directories and files so that future understanding of projects is seamless.**
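
As one way of making such conventions concrete, here is a minimal sketch that builds a mock project tree in a temporary directory and names the data files using a single, fixed pattern. Everything in it is a hypothetical choice for illustration rather than a standard: the directory names, the `participant_filename` helper, and the `sub-`/`ses-` labels (which loosely echo the labels used in BIDS filenames).

```python
import pathlib
import tempfile

# Hypothetical layout; these directory names are illustrative choices only.
TOP_LEVEL_DIRS = ["code", "data/raw", "data/processed", "docs", "results"]
PARTICIPANT_IDS = [1, 2, 3]


def participant_filename(participant_id, session=1):
    """Build a data filename that follows one fixed, predictable pattern."""
    # Zero-padding keeps the files in numerical order when sorted by name.
    return f"sub-{participant_id:02d}_ses-{session:02d}_responses.csv"


with tempfile.TemporaryDirectory() as tmp_dir:
    project_path = pathlib.Path(tmp_dir) / "my_study"

    # Create the directory structure.
    for dir_name in TOP_LEVEL_DIRS:
        (project_path / dir_name).mkdir(parents=True)

    # Create an empty placeholder data file for each participant.
    for participant_id in PARTICIPANT_IDS:
        filename = participant_filename(participant_id)
        (project_path / "data" / "raw" / filename).touch()

    # Collect every file and directory, relative to the project root.
    tree = sorted(
        path.relative_to(project_path).as_posix()
        for path in project_path.rglob("*")
    )

for entry in tree:
    print(entry)
```

Because the filenames are generated by one function rather than typed by hand, every participant's file is guaranteed to follow the same convention.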

#### Exercise

Have a think about the data storage approach you will take for your internship study, and produce a mockup of a potential filesystem 'tree'.

Some questions to consider:

* Are there any examples from other projects in the lab to look at? For example, there might be a shared lab folder with current or past projects.
* Are there particular ethical requirements for the data associated with your study? Note that this is not just in the data produced, but also in the data required&mdash;some face image databases are protected, for example.
* Is the size of the data relevant for how you might structure your tree? Some research (including some of mine) involves hundreds of GB of data.
* Say you come back to your project a year later. Would you be able to look at your tree and understand what the data corresponds to?