<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Fundamentals-of-data-and-coding-for-psychology" data-toc-modified-id="Fundamentals-of-data-and-coding-for-psychology-1">Fundamentals of data and coding for psychology</a></span><ul class="toc-item"><li><span><a href="#Applying-your-new-knowledge-to-some-scenarios" data-toc-modified-id="Applying-your-new-knowledge-to-some-scenarios-1.1">Applying your new knowledge to some scenarios</a></span><ul class="toc-item"><li><span><a href="#Learning-outcomes-covered-so-far" data-toc-modified-id="Learning-outcomes-covered-so-far-1.1.1">Learning outcomes covered so far</a></span><ul class="toc-item"><li><span><a href="#Getting-started-with-notebooks" data-toc-modified-id="Getting-started-with-notebooks-1.1.1.1">Getting started with notebooks</a></span></li><li><span><a href="#Data-associated-with-psychology-research" data-toc-modified-id="Data-associated-with-psychology-research-1.1.1.2">Data associated with psychology research</a></span></li><li><span><a href="#Digital-representation" data-toc-modified-id="Digital-representation-1.1.1.3">Digital representation</a></span></li><li><span><a href="#Data-storage-and-navigation" data-toc-modified-id="Data-storage-and-navigation-1.1.1.4">Data storage and navigation</a></span></li><li><span><a href="#Representing-integer-numbers,-with-an-application-in-images" data-toc-modified-id="Representing-integer-numbers,-with-an-application-in-images-1.1.1.5">Representing integer numbers, with an application in images</a></span></li><li><span><a href="#Representing-decimal-numbers,-with-an-application-in-sounds" data-toc-modified-id="Representing-decimal-numbers,-with-an-application-in-sounds-1.1.1.6">Representing decimal numbers, with an application in sounds</a></span></li></ul></li><li><span><a href="#Scenarios" data-toc-modified-id="Scenarios-1.1.2">Scenarios</a></span><ul class="toc-item"><li><span><a href="#Code-review" data-toc-modified-id="Code-review-1.1.2.1">Code 
review</a></span></li><li><span><a href="#EEG-markers" data-toc-modified-id="EEG-markers-1.1.2.2">EEG markers</a></span></li><li><span><a href="#Face-database" data-toc-modified-id="Face-database-1.1.2.3">Face database</a></span></li><li><span><a href="#Random-noise-images" data-toc-modified-id="Random-noise-images-1.1.2.4">Random noise images</a></span></li><li><span><a href="#Feedback-sounds" data-toc-modified-id="Feedback-sounds-1.1.2.5">Feedback sounds</a></span></li><li><span><a href="#Alternative-feedback-sounds" data-toc-modified-id="Alternative-feedback-sounds-1.1.2.6">Alternative feedback sounds</a></span></li><li><span><a href="#Data-exclusions" data-toc-modified-id="Data-exclusions-1.1.2.7">Data exclusions</a></span></li><li><span><a href="#Analysis-and-visualisation-of-a-within-subjects-experiment" data-toc-modified-id="Analysis-and-visualisation-of-a-within-subjects-experiment-1.1.2.8">Analysis and visualisation of a within-subjects experiment</a></span></li><li><span><a href="#Effect-of-optional-stopping-on-$p$-values-(advanced)" data-toc-modified-id="Effect-of-optional-stopping-on-$p$-values-(advanced)-1.1.2.9">Effect of optional stopping on $p$ values (advanced)</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Fundamentals of data and coding for psychology

## Applying your new knowledge to some scenarios

In this final lesson of the series on data and coding for psychology, you will apply some of the knowledge and skills from previous lessons to scenarios in which they might be useful in psychology research.

### Learning outcomes covered so far

First, let's look at an inventory of all the learning outcomes that we have addressed thus far.
This is a reminder of all the things that you have learned, and also gives a quick overview if you need to go back to a lesson to review some of the details.

#### Getting started with notebooks

* Create, run, and change the type of notebook cells.
* Switch between 'Command mode' and 'Edit mode'.
* Describe the differences between `Markdown` and `Code` cells.
* Use `Markdown` cells to produce formatted prose.
* Use `Code` cells to run Python code.

#### Data associated with psychology research

* Identify the data associated with a research study by reading the journal article reporting its results.
* Describe the types of data that are associated with conducting research in psychology across a range of specialities.
* Use a notebook to store and share written work.

#### Digital representation

* Describe how data is represented in memory on computers.
* Use functions and methods in Python.
* Interpret binary sequences in different ways.

#### Data storage and navigation

* Describe the components of filenames and file paths.
* Navigate a filesystem using absolute and relative paths.
* Develop a data storage/organisation strategy for a research project.

#### Representing integer numbers, with an application in images

* Use `numpy` to create, manipulate, and access one- and multi-dimensional arrays of integers.
* Plot array-based data using methods appropriate to their dimensionality (line, scatter, image plots).
* Apply `for` loops to allow statements to be repeatedly executed.

#### Representing decimal numbers, with an application in sounds

* Compute with non-integers using the `float` datatype.
* Generate and interrogate sounds using `numpy` arrays of floats.
* Compute statistical tests and use simulations to demonstrate their properties.

### Scenarios

We will now go through a series of semi-realistic psychology research scenarios, and you will have the task of addressing some of the requirements for conducting the research (often, but not always, using Python).

A few points:

* I say 'semi-realistic' because I have tried to cobble together aspects from a few different areas of psychology, most of which I only have a vague familiarity with. They certainly shouldn't be taken as examples of meaningful studies or of good research practices.
* Each task includes a 'Potential solution' section, and many also have a 'Hints' section; I suggest only using these as necessary.
* You will often not know the answer immediately from memory&mdash;you will typically need to review the relevant lesson and your notes.
* The scenarios are independent, so you can skip a task that you are having trouble with (and come back to it later).
* There are also quite a few, so feel free to concentrate on the ones you find most interesting.
* Some aspects are marked as 'Advanced'; those in particular are good candidates for skipping on the first pass through.

#### Code review

##### Task

You are working on a project with a fellow researcher, and they have written some code to generate and save an image of random noise.
However, it is not working and they have asked you to take a look at their code.

Your task is to inspect their code and make it run as required (without actually saving anything).

In [None]:
import pathlib
import numpy as np

# I want to generate an image that is 100 pixels wide and 50 pixels high
my image = np.random.randomint(low=0, high=1, size=(100px, 50px))

# I want to save it in a directory called "images" above the current directory
# with the filename "image.txt"
path = pathlib.Path("c:\project\images\\image" .txt)

##### Potential solution

In [None]:
my image = np.random.randomint(low=0, high=1, size=(100px, 50px))

Problems:

* `my image`: can't have spaces in variable names.
* `np.random.randomint`: the function is called `np.random.randint`.
* `high=1`: the `high` value is exclusive, so `high=1` only ever produces zeros; we want the random integers to go up to 255, which needs `high=256`.
* `size=(100px, 50px)`: Python doesn't know about `px`.
* `size=(100px, 50px)`: image sizes are (height, width) and not (width, height).

In [None]:
my_image = np.random.randint(low=0, high=256, size=(50, 100))

In [None]:
path = pathlib.Path("c:\project\images\\image" .txt)

Problems:
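* `"c:\project\images\\image"`: backslashes in an ordinary string can be interpreted as the start of control characters (escape sequences), so they need care.
* `"c:\project\images\\image"`: an absolute path is used, even though the requirement (a directory above the current directory) is better suited to a relative path.
* `"c:\project\images\\image" .txt`: malformed syntax; the `.txt` extension sits outside the string.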

In [None]:
path = pathlib.Path("../images/image.txt")
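
As a quick check on the fixes above, the sketch below puts the corrected lines together and inspects the result. The printed checks (shape, value range, and path components) are additions for illustration and are not part of the original task.

```python
import pathlib

import numpy as np

# 50 rows (height) by 100 columns (width); `high` is exclusive, so 256 allows values up to 255
my_image = np.random.randint(low=0, high=256, size=(50, 100))

print(my_image.shape)  # (height, width)
print(my_image.min(), my_image.max())  # both within 0 to 255

# forward slashes avoid the backslash problem, and ".." means "one directory up"
path = pathlib.Path("../images/image.txt")

print(path.parts)
print(path.suffix)
```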

#### EEG markers

##### Task

You are planning on running an EEG study, in which a continuous record of electrical signals is recorded from the scalp of your participant.
You are presenting your trials on a different computer to the one that is recording the EEG, and you want a way of being able to indicate the precise onset time of a trial within the EEG record.

The EEG system provides a cable that you can connect to your computer for this purpose.
It has 8 'pins', and the state of each pin ('low'/0 or 'high'/1) can be set by sending an integer to the cable using Python.
The first pin is used to indicate that you want a marker to be set in the file; this only happens when the first pin goes from 'low' to 'high'.
The remaining pins are used to set the value of the marker (the 'marker code') as an unsigned integer.

For example, the moment that the signal on the first pin goes from low to high, the system reads the values of the other seven pins and uses those values to store a number at that point in the EEG record.

In developing your study, you realise that you can understand the operation of the cable using the binary representations of integers that you have covered in these lessons.
That is, each pin is a different 'bit', with the first pin being the first bit and the other pins being the remaining seven bits (going from left to right).

You also realise that you need to be able to answer the following:

* What does the following binary representation mean with respect to the state of the pins? `00101111`.
* How many unique marker codes are available to use?
* What integer number do you need to send to register a marker code of 76, assuming the first pin is 'low' before you send the number?
* Remembering that a marker is only recorded when the first pin goes from low to high, you realise that you need to 'reset' the first pin each time you record a marker code. What number do you need to send to reset the first pin?

##### Potential solutions

* What does the following binary representation mean, in terms of the state of the pins? `00101111`.

The first two pins are 'low', the third is 'high', the fourth is 'low', and the fifth through eighth pins are 'high'.

* How many unique marker codes are available to use?

The first pin is used to signal that a marker code is to be stored, so there are seven pins available to set the marker code.
Each pin can be either 0 or 1, so there are $2^7 = 128$ potential marker codes.

* What number do you need to send to register a marker code of 76, assuming the first pin is 'low' before you send the number?

We want to end up with a marker of 76.
In binary, this is:

In [None]:
bin(76)

But, we also want to set the first bit, so we actually want to send the integer corresponding to `11001100`:

In [None]:
int("11001100", base=2)

* Remembering that a marker is only recorded when the first pin goes from low to high, you realise that you need to 'reset' the first pin each time you record a marker code. What number do you need to send to reset the first pin?

The first pin is reset whenever the first binary digit is `0`.
That happens whenever the number is less than 128, so any number between 0 and 127 (inclusive) will work.
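
The pin logic above can be sketched in Python as follows. The function names `marker_to_value` and `pin_states` are hypothetical helpers made up for this illustration; they only do the arithmetic and do not communicate with any real EEG hardware.

```python
def marker_to_value(marker_code):
    """Return the integer that sets the first pin 'high' and the other pins to `marker_code`."""
    if not 0 <= marker_code <= 127:
        raise ValueError("marker codes must fit in seven bits (0 to 127)")
    # the first pin is the leftmost bit, which is worth 128 (2 ** 7)
    return 128 + marker_code


def pin_states(value):
    """Return the state of the eight pins (first pin first) as a list of 0s and 1s."""
    bits = format(value, "08b")  # eight binary digits, padded with leading zeros
    return [int(bit) for bit in bits]


print(marker_to_value(76))
print(pin_states(marker_to_value(76)))
print(pin_states(0b00101111))
```

Sending any value from 0 to 127 afterwards would return the first pin to 'low', ready for the next marker.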

#### Face database

##### Task

For an upcoming experiment, you would like to show images of faces.
You decide that you want to use the [AR Face Database](http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html), so you obtain permission to use it and then go ahead and download the files.

However, you soon run into a problem.
The database is a series of files with names like `w-060-9.raw`, and no image viewer on your computer is able to display them.

Having a closer read of the website, you see that it says:

> All images are stored in 10 different CD-ROMs as RGB RAW files (pixel information). Images are of 768 by 576 pixels and of 24 bits of depth.

Your task is to interpret this information to answer the following questions, which will help you work out how to use the files:

* What is '24 bits of depth' likely to mean?
* Assuming 1024 bytes in a kilobyte, how many kilobytes would each image contain?
* What would the shape of each image be if represented using a multidimensional array?
* (Advanced) Is there enough information in these two sentences to allow you to view the images?

##### Hint

* You may want to review the 'Digital representation' and the 'Representing integer numbers, with an application in images' lessons.

##### Potential solution

* What is '24 bits of depth' likely to mean?

Images typically have three channels (red, green, and blue) that are each represented by 8-bit unsigned integers.
Thus, '24 bits of depth' is likely to refer to 3 colour channels times 8 bits per channel.

* Assuming 1024 bytes in a kilobyte, how many kilobytes would each image contain?

Each pixel is 24 bits, which is 3 bytes (24 / 8).
There are 442,368 pixels (768 x 576), meaning there are 1,327,104 bytes (442368 x 3).
That means there are 1,296 kilobytes per image (1327104 / 1024).

* What would the shape of each image be if represented using a multidimensional array?

Images are typically stored in multidimensional arrays as (height, width, channels), so here it would be (576, 768, 3).
That is, if we assume '768 by 576 pixels' means 768 horizontal pixels.

* (Advanced) Is there enough information in these two sentences to allow you to view the images?

No, there isn't.
We don't know the order that pixels are stored in the binary representation.
For example, they could start with a particular pixel location and then have the 3 bytes for that pixel's colour before moving onto the next location, or they could start with a colour channel and then have all the bytes for the different locations before moving onto the next colour channel.
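
The arithmetic above, and the two candidate byte orders, can be sketched with `numpy`. This is only an illustration under stated assumptions: the array `raw` is a stand-in filled with zeros rather than the contents of a real database file, and neither layout has been confirmed against the AR Face Database documentation.

```python
import numpy as np

width = 768
height = 576
n_channels = 3

bytes_per_pixel = 24 // 8  # 8 bits in a byte
n_bytes = width * height * bytes_per_pixel

print(n_bytes, "bytes")
print(n_bytes / 1024, "kilobytes")

# stand-in for the bytes of one file (an assumption; not real image data)
raw = np.zeros(n_bytes, dtype=np.uint8)

# possibility 1: the three colour bytes for a pixel are stored together
interleaved = raw.reshape((height, width, n_channels))

# possibility 2: all of one colour channel is stored before the next channel
planar = raw.reshape((n_channels, height, width))
# move the channel axis to the end to get (height, width, channels)
planar = np.moveaxis(planar, 0, -1)

print(interleaved.shape, planar.shape)
```

Both candidates end up with the same shape, so the shape alone cannot tell you which byte order is right; you would need to view the result (or find further documentation) to decide.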

#### Random noise images

##### Task

For an upcoming experiment, you would like to present images of faces for a very short duration.
However, you are finding that there is a perceptual 'afterimage' that is lingering after the image disappears.
You would like to generate an image of coloured random noise that you can display after the face, in an attempt to disrupt such an afterimage.

Your task is to generate the random noise image and display it in the notebook.
The image is to be 96 pixels horizontally and 64 pixels vertically, and each pixel in each colour channel (RGB) is to be drawn at random from the integers 0 to 255, inclusive.

##### Hint

* Think about the `size` of the array that you want to produce.
* Consider using `np.random.randint`.

##### Potential solution

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np

img = np.random.randint(low=0, high=256, size=(64, 96, 3), dtype=np.uint8)

plt.imshow(img);

#### Feedback sounds

##### Task

You would like to indicate to a participant whether their response was correct or incorrect by playing different sounds.
A correct response is to be followed by a 'beep' and an incorrect response is to be followed by a 'boop'.

Your task is to generate the two sounds, with a notebook player for each.
They are both to be pure tones of 200ms duration, sampled at 44100 Hz, and single channel (mono rather than stereo).
The 'beep' is 1000 Hz and the 'boop' is 500 Hz.

In [None]:
# here is a start

import numpy as np

import IPython.display

sample_rate = 44100
duration_s = 0.2

num_samples = int(sample_rate * duration_s)

# the rest of your solution goes here; use as many cells as you need

##### Hint

* Consider starting by making an array that indicates the time (in seconds) of each sample; so here it will vary between 0 and 0.2
* Then, use the same principles we have covered previously to calculate the argument to the `sin` function.

##### Potential solution

In [None]:
import numpy as np

import IPython.display

In [None]:
sample_rate = 44100
duration_s = 0.2

num_samples = int(sample_rate * duration_s)

In [None]:
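# `endpoint=False` leaves out the final time point (0.2 s), so that the
# spacing between samples is 1 / sample_rate seconds
t = np.linspace(start=0.0, stop=duration_s, num=num_samples, endpoint=False)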

In [None]:
beep_freq = 1000.0
boop_freq = 500.0

In [None]:
beep = np.sin(t * 2.0 * np.pi * beep_freq)

In [None]:
player = IPython.display.Audio(data=beep, rate=sample_rate)

In [None]:
player

In [None]:
boop = np.sin(t * 2.0 * np.pi * boop_freq)

In [None]:
player = IPython.display.Audio(data=boop, rate=sample_rate)

In [None]:
player
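
Because the two sounds differ only in their frequency, the steps above could also be wrapped in a function. This is a sketch rather than part of the original solution: the name `make_tone` is hypothetical, and it computes the sample times with `np.arange` (each sample index divided by the sample rate) instead of `np.linspace`.

```python
import numpy as np


def make_tone(freq_hz, duration_s=0.2, sample_rate=44100):
    """Return a mono pure tone as a one-dimensional array of floats."""
    num_samples = int(sample_rate * duration_s)
    # time of each sample, in seconds
    t = np.arange(num_samples) / sample_rate
    return np.sin(2.0 * np.pi * freq_hz * t)


beep = make_tone(freq_hz=1000.0)
boop = make_tone(freq_hz=500.0)

print(beep.shape, boop.shape)
```

Each array could then be passed to `IPython.display.Audio` with `rate=44100`, as in the cells above.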

#### Alternative feedback sounds

##### Task

When piloting, you realised that a 'boop' isn't sending a clear enough signal to the participants that their response was incorrect.
You decide to instead use 'white noise'.

Your task is to generate such a white noise sound.
It is again 200ms in duration and mono, but now the samples are to be drawn from a uniform distribution (see `np.random.uniform`) between -0.1 and +0.1.

##### Hint

* Inspect the help for `np.random.uniform` and think about what arguments you would need to supply to the parameters to meet the requirements.

##### Potential solution

In [None]:
import numpy as np

import IPython.display

In [None]:
sample_rate = 44100
duration_s = 0.2

num_samples = int(sample_rate * duration_s)

In [None]:
noise = np.random.uniform(low=-0.1, high=+0.1, size=num_samples)

In [None]:
# note that, depending on the version of `IPython`, this player may
# automatically (annoyingly!) 'normalise' the waveform amplitude
player = IPython.display.Audio(data=noise, rate=sample_rate)

In [None]:
player

#### Data exclusions

##### Task

You are running an experiment in which you are collecting response times, and you have the data from one participant in an array called `rt_data`.
Your goal is to calculate the mean response time across the set of trials.

However, you have noticed that some of the values in the array are `999.0`.
After speaking to your supervisor, you learn that some trials were skipped and the response times on those trials were given a value of `999.0` to indicate this.

You want to be able to identify the 'skipped' trials, so that you can ignore them in your mean calculation.

* What is a potential pitfall in identifying those values in the array that equal `999.0`?
* What is a 'safe' approach for identifying the 'skipped' trials?
* How many skipped trials are there in `rt_data`?

In [None]:
# don't worry about understanding this cell - it reads the example data and loads it into an array
!curl https://webutils.psy.unsw.edu.au/internship_coding/notebooks/scenarios/rt_data.txt -o rt_data.txt > /dev/null 2>&1
rt_data = np.loadtxt("rt_data.txt")

##### Hint

* Think about one of the potentially surprising aspects of how computers represent non-integer numbers.

##### Potential solution

* What is a potential pitfall in identifying those values in the array that equal `999.0`?

The danger is that the values might not be *exactly* `999.0`, so a simple test of equality (using `==`) might be misleading.
Note that it isn't in this instance&mdash;it produces the correct output.
However, the goal here is to identify *potential* problems.

* What is a 'safe' approach for identifying the 'skipped' trials?

A safer approach is to use a function that is aware of such issues, such as `np.isclose`:

In [None]:
np.isclose(rt_data, 999.0)
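
To see why `==` can mislead where `np.isclose` does not, here is the classic illustration using values that cannot be represented exactly in binary. The numbers are arbitrary examples rather than values from `rt_data`.

```python
import numpy as np

total = 0.1 + 0.2

print(total)
print(total == 0.3)  # False: the two values differ in their final binary digits
print(np.isclose(total, 0.3))  # True: equal to within a small tolerance
```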

* How many skipped trials are there in `rt_data`?

We can see from the above that there are 5 `True` entries.
Via code:

In [None]:
print(np.sum(np.isclose(rt_data, 999.0)))
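
With the skipped trials identified, the mean can be calculated over the remaining trials only. The sketch below uses a small made-up array (`rt_example`, not the real `rt_data`) and relies on two features that may not have been covered in the lessons: the `~` operator, which flips `True` and `False`, and indexing an array with an array of booleans.

```python
import numpy as np

# made-up response times (in seconds), with 999.0 marking skipped trials
rt_example = np.array([0.52, 999.0, 0.61, 0.48, 999.0])

skipped = np.isclose(rt_example, 999.0)
n_skipped = np.sum(skipped)

# keep only the trials that were not skipped
valid_rts = rt_example[~skipped]
mean_rt = np.mean(valid_rts)

print(n_skipped)
print(valid_rts)
print(mean_rt)
```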

#### Analysis and visualisation of a within-subjects experiment

##### Task

You have run a within-subjects experiment with two conditions, with 30 participants.
You have the data stored in the `ws_data` variable.

Your tasks are to:
* Run a paired-sample $t$-test on the data.
* Produce a 'scatter plot', where each point shows a single participant's score on both conditions.
* Advanced: add a 'line of equality' to the plot; that is, a line that indicates where the points would lie if both the conditions were equal.

In [None]:
# don't worry about understanding this cell - it reads the example data and loads it into an array
!curl https://webutils.psy.unsw.edu.au/internship_coding/notebooks/scenarios/ws_data.txt -o ws_data.txt > /dev/null 2>&1
ws_data = np.loadtxt("ws_data.txt")

##### Hints

* Inspect the help for `scipy.stats.ttest_rel` for the paired-sample $t$ test.
* Review the end of the 'images' lesson for information on scatter plots.
* For the 'line of equality', think about the (x, y) coordinates of the two endpoints of the line. Then, use the two $x$ values as the first argument to `plt.plot` and the two $y$ values as the second argument to `plt.plot`.

##### Potential solution

In [None]:
import scipy.stats
(t, p) = scipy.stats.ttest_rel(a=ws_data[:, 0], b=ws_data[:, 1])
print(t, p)

In [None]:
plt.figure();
plt.scatter(ws_data[:, 0], ws_data[:, 1]);
plt.plot([-2.5, 2.5], [-2.5, 2.5]);
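
A paired-sample $t$-test depends only on the difference between each participant's two scores, which is also what the distance from the line of equality shows in the plot. The sketch below computes the $t$ statistic by hand from those differences, using made-up data (`example_data` is hypothetical and stands in for `ws_data`). Applied to the same array, `scipy.stats.ttest_rel` should report the same statistic.

```python
import numpy as np

# made-up scores for 5 participants (rows) in 2 conditions (columns)
example_data = np.array(
    [
        [1.2, 0.8],
        [0.4, 0.1],
        [-0.3, -0.9],
        [2.0, 1.1],
        [0.7, 0.9],
    ]
)

# difference between the conditions for each participant
diffs = example_data[:, 0] - example_data[:, 1]
n = len(diffs)

# t is the mean difference divided by its standard error;
# `ddof=1` gives the sample standard deviation (n - 1 in the denominator)
mean_diff = np.mean(diffs)
std_err = np.std(diffs, ddof=1) / np.sqrt(n)
t_by_hand = mean_diff / std_err

print(t_by_hand)
```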

#### Effect of optional stopping on $p$ values (advanced)

##### Task

You are planning on running a study where you will ask two different groups to rate their current wellbeing on a free scale (i.e., they can give any number).
You aren't sure how many participants to run, so you are thinking of running 25 participants per group and then running an independent-samples $t$-test to test the hypothesis that the group means are different.
You figure that if such a test is not significant, you will run another 25 participants per group and then test again with all of the data (50 participants per group).

You have a vague inkling that this might affect your false positive rate, but you aren't sure whether it does or by how much (if it does).

Your task is to run a simulation to identify the false positive rate when the null hypothesis is true and the planned 'optional stopping' strategy is applied.
That is, how many of 10,000 simulated experiments will be declared as statistically significant (at $\alpha = 0.05$) when each simulated experiment could have a second round of data collection?

Note that there are two new Python aspects that would be useful in a solution.

First, note that `if` statements can be followed by `else` statements.
The statements indented after the `else` statement are executed if the boolean following the `if` is `False`.

For example:

In [None]:
p = 0.4

if p < 0.05:
    print("Significant")
else:
    print("Not significant")

Second, the `np.append` function can be used to extend an array with new elements.

For example:

In [None]:
test_array = np.random.uniform(low=0, high=1, size=3)
print(test_array)

test_array = np.append(test_array, np.ones(2))
print(test_array)

##### Potential solution

In [None]:
import numpy as np
import scipy.stats

# null is true, so groups have the same mean
group_a_mean = 50.0
group_b_mean = 50.0
group_sd = 10.0

n_per_set = 25

n_sims = 10000

n_significant = 0

for i_sim in range(n_sims):
    
    group_a_data = np.random.normal(loc=group_a_mean, scale=group_sd, size=n_per_set)
    group_b_data = np.random.normal(loc=group_b_mean, scale=group_sd, size=n_per_set)
    
    (t, p) = scipy.stats.ttest_ind(a=group_a_data, b=group_b_data)
    
    if p < 0.05:
        n_significant += 1
        
    else:
        # if not significant, collect a new set of data and combine it with the old set
        group_a_data = np.append(group_a_data, np.random.normal(loc=group_a_mean, scale=group_sd, size=n_per_set))
        group_b_data = np.append(group_b_data, np.random.normal(loc=group_b_mean, scale=group_sd, size=n_per_set))
    
        (t, p) = scipy.stats.ttest_ind(a=group_a_data, b=group_b_data)
        
        if p < 0.05:
            n_significant += 1

In [None]:
n_significant / n_sims
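
To help interpret the number above, it can be compared against the false positive rate without optional stopping. The sketch below wraps the same simulation logic in a function so that it can be run with and without the second round of data collection. The function name `simulate_false_positive_rate` and its parameters are made up for this illustration. It also uses a seeded `np.random.default_rng` generator (an assumption, chosen so that the result is repeatable) rather than the `np.random.normal` function used above, and a smaller number of simulations so that it runs quickly.

```python
import numpy as np
import scipy.stats


def simulate_false_positive_rate(allow_second_round, n_sims=2000, n_per_set=25, seed=0):
    """Return the proportion of simulated null experiments declared significant."""
    rng = np.random.default_rng(seed)
    n_significant = 0

    for i_sim in range(n_sims):
        # null is true, so both groups share the same mean and SD
        group_a = rng.normal(loc=50.0, scale=10.0, size=n_per_set)
        group_b = rng.normal(loc=50.0, scale=10.0, size=n_per_set)
        (t, p) = scipy.stats.ttest_ind(a=group_a, b=group_b)

        if p >= 0.05 and allow_second_round:
            # not significant, so collect more data and test everything again
            group_a = np.append(group_a, rng.normal(loc=50.0, scale=10.0, size=n_per_set))
            group_b = np.append(group_b, rng.normal(loc=50.0, scale=10.0, size=n_per_set))
            (t, p) = scipy.stats.ttest_ind(a=group_a, b=group_b)

        if p < 0.05:
            n_significant += 1

    return n_significant / n_sims


print(simulate_false_positive_rate(allow_second_round=False))
print(simulate_false_positive_rate(allow_second_round=True))
```

The first rate should sit close to the nominal 0.05, while the second should be noticeably higher, because each simulated experiment gets two chances to reach significance.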