# Fundamentals of data and coding for psychology

## Digital representation

As amazing as it seems, all data (and indeed instructions / programs) on your computer are fundamentally represented by 0s and 1s.
That is a *binary* system, and the smallest unit of information (whether something is a 0 or a 1) is called a *bit*.

Here, we are going to examine this fundamental representation, and see how we can interpret sequences of bits in different ways.

The learning objectives for this lesson are for you to be able to:

* Describe how data is represented in memory on computers.
* Use functions and methods in Python.
* Interpret binary sequences in different ways.

### Creating an example file on the (virtual) computer

To begin exploring digital representations, we are going to create a file and store some text in it:

* Within your 'project' on Azure Notebooks, click on `New` and then select `Blank file`.
* An empty text box will be created within your file list. Give the file the name `plain.txt` and press `Enter` to create it.
* Click on `plain.txt` to open it inside an editor.
* Click on `Edit file`.
* In the editor, type `This is some text.` and click on `Save file`.

### Loading the file contents using Python

Now, we are going to use Python to investigate this file and its contents.

First, we will read the contents of the file into a variable called `contents`:

In [None]:
file = open(file="plain.txt", mode="rb")
contents = file.read()
file.close()

A lot of new stuff happened in those three lines of code!

Let's work through it line-by-line.

#### Opening the file

In the first line, we 'open' a file - we make it available for us to interrogate.


```python
file = open(file="plain.txt", mode="rb")
```

We do this by using the special Python command `open` (such commands are called "functions").
The `open` function has a set of "parameters" (the pieces of information that it can accept), and we supply values for them using "arguments".

(Hint: We will encounter quite a few new terms in this lesson. Remember about the "Glossary" section in your learning log notebook - consider adding terms and their definitions, in your own words, as we go through.)

Here, we supply arguments for two parameters: `file` and `mode`.

The `file` parameter gives the name of the file to be opened as a sequence of characters called a "string".
We can create a string in Python by enclosing the letters in quotation marks.

The `mode` parameter sets the way in which we would like to interact with the file.
The `r` means that we only want to read from the file, not write to it.
The `b` means that we want to open it in 'binary' mode - that will be explained more below.

You can see how we provide the arguments for the parameters - via the parameter name and then an equals sign and then the argument.
That is not the only way to pass arguments though, and we will soon come across others.

The `open` function then returns an object that represents the opened file, and we assign it to the variable called `file`.
We chose the name `file` - we could have called it something else if we wanted to.
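The main restrictions are that variable names can only contain letters, digits, and underscores (so no spaces), can't start with a digit, and can't be one of Python's reserved words (like `if` or `for`).
Note that `open` is not a reserved word, so Python would let us use it as a variable name - but doing so would hide the built-in `open` function, so it is best avoided.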
By convention, variables of this sort in Python are typically written in lowercase and with underscores (`_`) as a separator, if necessary.

How do you know what parameters a function has?
For many functions, you can obtain interactive 'help' by typing the function name followed by `?` and then evaluating the cell.
Note that the help can often be overwhelming! It is important to learn to skip over parts that are excessively technical.

Try evaluating the cell below and skim through the help:

In [None]:
open?

#### Reading the file contents

```python
contents = file.read()
```

The variable in which we stored the result of the `open` function (`file`) has functions attached to it - these are called "methods".
Here, we use the method `read` to extract the contents of the file.

Note that even though we don't need to pass any arguments to `read`, we still use the parentheses to indicate that we wish to use ("call") the function.

Again, have a look at the help for this method:

In [None]:
file.read?

The information that is returned is stored in the variable `contents` - the naming of this was also up to us.

#### Closing the file

```python
file.close()
```

The last thing we need to do is to leave the file in a nice state by "closing" it, by using its `close` method.

In [None]:
file.close?

(An aside, but note that this is not the best and most ["Pythonic"](https://blog.startifact.com/posts/older/what-is-pythonic.html) way of doing what we want to do here. However, it is much easier to work through when beginning Python - remind me to go through the better way with you later in the course.)
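
For reference, here is a minimal sketch of one common alternative: a `with` block, which closes the file automatically when the indented code finishes. The file name `demo.txt` and the variable names are hypothetical, and the first `with` block writes the file only so that the sketch can run on its own. This may not be exactly the approach we cover later in the course.

```python
# Create a small demonstration file (hypothetical name) so this sketch is self-contained.
with open(file="demo.txt", mode="wb") as demo_file:
    demo_file.write(b"This is some text.")

# Read it back: the file is closed automatically at the end of the `with` block.
with open(file="demo.txt", mode="rb") as demo_file:
    demo_contents = demo_file.read()

print(demo_contents)
print(demo_file.closed)  # True, even though we never called `close` ourselves
```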

### Bits and bytes

OK, so now we have read the contents of the file `plain.txt` into our variable called `contents`.

As mentioned above, the fundamental unit of computing is binary, and that is typically represented as 0 and 1.
A single binary element is called a *bit*.
By convention, we often group sets of bits together into chunks of 8 elements - these 8-bit chunks are called *bytes*.

Accordingly, Python represents the contents of `plain.txt` as a sequence of bytes.
We can see how many by using the `len` ("length") function (in combination with the `print` function, to output the result):

In [None]:
print(len(contents))

We can see that there are 18 bytes in the file (there might also be 19 bytes, depending on whether the editor stores a 'newline' at the end of the sentence).
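
To see where the 18 or 19 comes from without relying on the file, here is a small sketch that uses byte sequences typed directly into the code (a `b` before the quotation marks creates one). The variable names are just for illustration, and `\n` is how a newline is written inside quotation marks.

```python
without_newline = b"This is some text."
with_newline = b"This is some text.\n"

print(len(without_newline))  # 18
print(len(with_newline))     # 19
```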

In Python, we can access individual elements of a sequence using square brackets (`[]`) notation.
Inside the square brackets, we can put the number of the element that we would like to access - this number is called an *index*.
Note that the sequence indices begin at zero!

Let's see what the binary representation (using the `bin` function) of the first byte looks like:

In [None]:
print(bin(contents[0]))

The `0b` at the start just indicates that the following digits are binary.
Then, we have seven 0s or 1s.
Why only seven when I said there were eight bits in a byte? This is because `bin` leaves out any 0 values at the start, just as we would write 42 rather than 042.
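
If you would like to see all eight bits, one option is the built-in `format` function. This is a sketch rather than part of the lesson: the format code `"08b"` asks for binary digits (`b`), padded with zeros (`0`) to a width of eight (`8`). The variable `example` is a stand-in for `contents`, typed in directly so that the code runs without the file.

```python
example = b"This is some text."  # stand-in for the file contents
first_byte = example[0]

print(bin(first_byte))            # 0b1010100
print(format(first_byte, "08b"))  # 01010100
```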

These are the 'atoms' of digital representation - every single file and application on your computer can be boiled down to sequences of 0s and 1s.

### Decimal interpretation

We can *interpret* binary sequences in different ways.
Perhaps the most familiar is to express them as "decimal" (base-10) numbers.
To do this, we treat each 'slot' in the 8-bit sequence as indicating the presence (`1`) or absence (`0`) of a corresponding power of 2.

We can convert our first item in `contents` (`01010100`) using the following process:

* We start out with giving our decimal version a value of `0`.
* Looking at the first slot on the right, it is `0` - so there is no $2^0 (1)$.
* Same with the next slot, so there is no $2^1 (2)$.
* Now we do encounter a `1`, so there is a $2^2 (4)$, and our value becomes `4`.
* There is no $2^3 (8)$.
* There is a $2^4 (16)$, so our value becomes $4 + 16 = 20$.
* There is no $2^5 (32)$.
* There is a $2^6 (64)$, so our value becomes $4 + 16 + 64 = 84$.
* There is no $2^7 (128)$.

So our decimal representation is `84`.


We can get this in Python using the `int` ("integer") function.

Note that the first parameter to the `int` function does not accept being "named", so we just provide its argument by the position within the parentheses.

In [None]:
print(int(contents[0]))
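
For comparison, the step-by-step process described above can also be written out in code. This is only a sketch: it uses a `for` loop and an `if` statement, which we have not covered yet, and the variable names are made up for illustration.

```python
bits = "01010100"

value = 0  # we start out with a decimal value of 0
power = 0  # the slot on the far right corresponds to 2 ** 0

# Work through the slots from right to left.
for bit in reversed(bits):
    if bit == "1":
        value = value + 2 ** power
    power = power + 1

print(value)  # 84
```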

### Hexadecimal interpretation

Another useful way of expressing bytes is by "hexadecimal" (hex) representations.
By hex, we mean using a base-16 system.

We won't go through the process of converting binary to hex manually - we will just use the function `hex`:


In [None]:
print(hex(contents[0]))

The `0x` at the start flags that it is in hex format, and then the `54` is the hex representation.

It is easiest to think about it as separately encoding the binary byte in two 4-bit sections: `0101` and `0100`.

In [None]:
print(int("0101", base=2), int("0100", base=2))

Note that a 4-bit sequence can represent decimals between 0 and 15 ($2^4 = 16$ in total).
Here we've seen a 4 and a 5 - but what happens when we need to go above 9?

In [None]:
print(hex(10))

We get `a`.
In hex, the allowable digits are `0123456789abcdef`, with `a` to `f` standing for the decimal values 10 to 15.

In [None]:
print(int("f", base=16))
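
To see every value that fits in four bits alongside its hex digit, we could step through them all. This sketch uses a `for` loop with `range(16)`, which produces the numbers 0 to 15; neither has been introduced yet in this lesson.

```python
# Print each 4-bit value next to its hex representation.
for value in range(16):
    print(value, hex(value))
```

The last line of output pairs `15` with `0xf`, the largest single hex digit.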

Those of you who have done any graphic design, photo editing, website creation, etc., will likely have come across hex before - probably in the context of colours.
For example, you might have seen white expressed as `#ffffff`, or red as `#ff0000`.
We'll be talking more about those concepts in a future lesson.
But now, you should have more of an understanding of what those hex values represent in terms of the underlying binary.
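
As a brief preview, here is a sketch of how such a colour code could be taken apart. It assumes the common convention that the six hex digits are three bytes giving the amounts of red, green, and blue, and it uses "slicing" (such as `colour[1:3]`) to pick out pairs of characters, which we have not covered yet. The variable names are illustrative.

```python
colour = "#ff0000"  # red

# Each pair of hex digits after the `#` is one byte.
red = int(colour[1:3], base=16)
green = int(colour[3:5], base=16)
blue = int(colour[5:7], base=16)

print(red, green, blue)  # 255 0 0
```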

### Letter interpretation

Finally, we can also represent bytes in terms of letters - after all, that is the way we entered the data into the file in the first place!

We can do this because we have agreed on the mapping between bytes and letters.
We will have (a lot) more to say on this topic later, but for now it is sufficient to know that there is a standard called [ASCII](https://en.wikipedia.org/wiki/ASCII) that defines the mapping in this case.

We can represent the bytes as ASCII letters using the `chr` function:

In [None]:
print(chr(contents[0]))

That matches what we would expect, based on the first character in our file.
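
The same idea works for the other bytes. In this sketch, `example` is a stand-in for `contents` that is typed in directly, and the built-in `ord` function (not otherwise used in this lesson) goes in the opposite direction to `chr`.

```python
example = b"This is some text."  # stand-in for the file contents

# Look at the first four bytes as numbers and as ASCII letters.
for index in range(4):
    print(example[index], chr(example[index]))

# `ord` goes from a letter back to its number.
print(ord("T"))  # 84
```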

### Summary

The key points are:

* The binary *bit* is the fundamental unit of data representation.
* These are often chunked into 8-bit *bytes*.
* These bytes can be interpreted in different ways. The following are *different ways of looking at the same thing*:
    * Binary: `01010100`
    * Decimal: `84`
    * Hex: `54`
    * Letters (ASCII): `T`
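
As a quick check that these really are the same thing, this sketch starts from three of the representations and converts each one to a number. The variable names are illustrative, and `ord` is a built-in function that gives the number for a letter.

```python
from_binary = int("01010100", base=2)
from_hex = int("54", base=16)
from_letter = ord("T")

print(from_binary, from_hex, from_letter)  # 84 84 84
```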

### Exercises

* There are 8 bits in a byte, 1024 bytes in a kilobyte, 1024 kilobytes in a megabyte, and 1024 megabytes in a gigabyte. Imagine that you have 4 GB of RAM ('random access memory') free on your laptop. How many bits do you have available to use?

  Note that these definitions might be a bit controversial! Sometimes multiples of 1000 are used rather than multiples of 1024 - so there are 1000 megabytes in a gigabyte instead of 1024.

* Using the [ASCII table of characters](https://en.wikipedia.org/wiki/ASCII#Printable_characters), predict what the different representations of a given item in `contents` will be (e.g., what will its binary, decimal, and hex representations be). Then, use Python code to test your predictions.

* If you read that something is stored as an 8-bit integer, how many unique values can that data point store? How about a 32-bit integer? 64-bit?