# Data—arrays and `numpy`

Objectives

• Be able to create arrays using numpy functions.
• Know how to access subsets of array elements using indexing.
• Understand how operations are applied to both the items in an array and over items in an array.
• Be able to save/load arrays to/from disk.

In this lesson, we will be introducing a critical data type for data analysis—the array.

We will be using an external Python package called numpy for our array functionality. As you may recall, we need to use an `import` statement to make such additional functionality available to our Python scripts. For numpy, we do something slightly different: because we will be using it so much, it is conventional to shorten `numpy` to `np` in our code. We modify our usual `import` statement to be:

```
import numpy as np
```

This code imports the numpy functionality and allows us to refer to it as `np`. Typically, this line will appear at the top of all the Python scripts we will use in these lessons.

## Creating arrays

There are quite a few ways of producing arrays. First, we will consider the array analogue of the `range` function that we used previously to produce a list. The equivalent function in numpy is `arange`:

```
import numpy as np

r_list = range(10)
print type(r_list), r_list

r_array = np.arange(10)
print type(r_array), r_array
```
```
<type 'list'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<type 'numpy.ndarray'> [0 1 2 3 4 5 6 7 8 9]
```

As you can see, both methods have produced a collection of integers from 0 through 9. However, they are different data types—the `range` function produces a `list`, whereas the `arange` function produces an `array`. This distinction is important, because the different data types have different functionality associated with them.
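
As a small illustration of that difference in functionality, the sketch below applies the same `*` operator to a list and to an array. The variable names are made up for this example, and `list(range(5))` is used (rather than plain `range(5)`) only so that the code behaves the same under Python 3.

```python
import numpy as np

# list() keeps this a list under Python 3 as well as Python 2
r_list = list(range(5))
r_array = np.arange(5)

# multiplying a list repeats its contents...
doubled_list = r_list * 2
# ...whereas multiplying an array multiplies each item
doubled_array = r_array * 2

# print(...) with parentheses works in both Python 2 and Python 3
print(doubled_list)   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(doubled_array)  # [0 2 4 6 8]
```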

One of the most useful things about arrays is that they can have multiple dimensions. A frequently encountered form of data is a table, with a number of rows and a number of columns. We can represent such a structure by creating a two-dimensional array. For example, the `ones` function creates an array of a particular size with all elements having the value of 1—we can use this to create a data structure with 5 rows and 3 columns.

```
import numpy as np

data = np.ones(shape=[5, 3])

print data
```
```
[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
```

Here, we’ve provided the `ones` function with a keyword argument called `shape`, which is a list that specifies the number of items along each dimension. As you can see, it creates a structure with 5 rows and 3 columns.
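
The `shape` argument works the same way for other array-creating functions. As a brief sketch, `np.zeros` (a numpy function not otherwise covered in this lesson) builds an array filled with 0 instead of 1; the variable name `blank` is just an illustrative choice.

```python
import numpy as np

# same shape argument as np.ones, but every element is 0
blank = np.zeros(shape=[2, 4])

# print(...) with parentheses works in both Python 2 and Python 3
print(blank)
print(blank.shape)  # (2, 4)
```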

Tip

You can see how this would be a useful data structure if you think that the rows might represent individual participants in an experiment and the columns might represent the 3 conditions in a within-subject design.

Once created, we can access various useful properties of the array. For example, `.shape` returns a representation of the number of items along each dimension of the array, `.size` returns the total number of items in the array, and `.ndim` returns the number of dimensions in the array:

```
import numpy as np

data = np.ones(shape=[5, 3])

print data.shape
print data.size
print data.ndim
```
```
(5, 3)
15
2
```
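
The same properties are available for arrays with any number of dimensions. The sketch below compares a one-dimensional and a three-dimensional array; interpreting the three dimensions as sessions, participants, and conditions is only a hypothetical example.

```python
import numpy as np

vector = np.arange(10)

# hypothetical layout: 2 sessions x 5 participants x 3 conditions
volume = np.ones(shape=[2, 5, 3])

# print(...) with parentheses works in both Python 2 and Python 3
print(vector.shape)  # (10,)
print(vector.ndim)   # 1

print(volume.shape)  # (2, 5, 3)
print(volume.size)   # 30
print(volume.ndim)   # 3
```

Note that the shape of a one-dimensional array is still a tuple, just with a single entry.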

## Indexing arrays

We can access the items in an array using techniques similar to those we used with lists:

```
import numpy as np

r_array = np.arange(10)

print r_array[:5]
```
```
[0 1 2 3 4]
```
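
Other list-style slices carry over too. The following sketch shows a few more that behave the same way on a one-dimensional array; the variable names are illustrative.

```python
import numpy as np

r_array = np.arange(10)

last_three = r_array[-3:]   # negative indices count from the end
middle = r_array[3:6]       # start is included, stop is excluded
every_other = r_array[::2]  # a step of 2 skips alternate items

# print(...) with parentheses works in both Python 2 and Python 3
print(last_three)   # [7 8 9]
print(middle)       # [3 4 5]
print(every_other)  # [0 2 4 6 8]
```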

Indexing becomes more powerful when we have arrays with more than one dimension. Here, we will use another function to generate a multidimensional array: `np.random.random`, the numpy equivalent of the `random.random` function that we encountered earlier. We can access individual items in the array by giving one index per dimension, separated by commas:

```
import numpy as np

data = np.random.random(size=[5, 3])

print data

# first row, first column
print data[0, 0]
# second row, first column
print data[1, 0]
# second row, last column
print data[1, -1]
```
```
[[ 0.90168308  0.17133111  0.57730204]
 [ 0.83406998  0.38666967  0.3698428 ]
 [ 0.48163748  0.43250003  0.95742637]
 [ 0.17254113  0.40056304  0.35827753]
 [ 0.95301857  0.13373673  0.69812848]]
0.9016830831
0.834069979484
0.36984279585
```

Importantly, we can also access all the items along a particular dimension:

```
import numpy as np

data = np.random.random(size=[5, 3])

print data

# all rows, first column
print data[:, 0]
# first row, all columns
print data[0, :]
```
```
[[ 0.1961776   0.53541907  0.7453791 ]
 [ 0.57018969  0.62871156  0.04582693]
 [ 0.83923676  0.29456444  0.72562237]
 [ 0.77830901  0.50678319  0.18410882]
 [ 0.56276205  0.0579545   0.92955741]]
[ 0.1961776   0.57018969  0.83923676  0.77830901  0.56276205]
[ 0.1961776   0.53541907  0.7453791 ]
```
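
Slices can also be given on more than one dimension at once, to pull out a rectangular block. Because random data make the result hard to check by eye, this sketch instead builds a predictable 5 x 3 table with `np.arange` and the array's `.reshape` method (not otherwise covered in this lesson).

```python
import numpy as np

# a 5 x 3 table holding the integers 0 to 14, filled row by row
data = np.arange(15).reshape(5, 3)

# first two rows, all columns
first_two_rows = data[:2, :]
# all rows, last column
last_column = data[:, -1]
# second and third rows, first two columns
block = data[1:3, :2]

# print(...) with parentheses works in both Python 2 and Python 3
print(first_two_rows)
print(last_column)  # [ 2  5  8 11 14]
print(block)
```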

We can also extract items using arrays of boolean values. For example, if we use a `>` operator on an array, it returns an array of booleans. If we then use this boolean array to index the data, it returns those items where the corresponding item in the boolean array is `True`:

```
import numpy as np

data = np.random.random(10)

print data

gt_point_five = data > 0.5

print gt_point_five

print data[gt_point_five]
```
```
[ 0.41374836  0.20518716  0.78958201  0.97271562  0.37304847  0.42976821
  0.02028967  0.91801355  0.05767821  0.73583786]
[False False  True  True False False False  True False  True]
[ 0.78958201  0.97271562  0.91801355  0.73583786]
```
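
Boolean arrays can be combined before they are used as an index. In the sketch below, the `&` operator joins two conditions element by element; each comparison needs its own parentheses. Summing a boolean array counts its `True` values, which is a handy way to count matching items. The data here are simple integers so that the result is easy to verify.

```python
import numpy as np

data = np.arange(10)

# True only where both conditions hold
in_range = (data > 2) & (data < 7)
selected = data[in_range]

# each True counts as 1 when summed
n_above_five = np.sum(data > 5)

# print(...) with parentheses works in both Python 2 and Python 3
print(selected)      # [3 4 5 6]
print(n_above_five)  # 4
```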

## Operations on arrays

We can use the conventional maths operators with arrays; unlike with lists, they operate on each item in the array. For example:

```
import numpy as np

data = np.ones(4)

print data + 1
print data * 3
print data - 2
```
```
[ 2.  2.  2.  2.]
[ 3.  3.  3.  3.]
[-1. -1. -1. -1.]
```
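
The same operators also work between two arrays of the same shape, in which case items in matching positions are paired up. A minimal sketch, with illustrative variable names:

```python
import numpy as np

a = np.arange(4)     # 0, 1, 2, 3
b = np.ones(4) * 10  # 10, 10, 10, 10

# items in matching positions are combined
total = a + b
product = a * b

# print(...) with parentheses works in both Python 2 and Python 3
print(total)
print(product)
```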

We can also use functions that operate over the items in an array, combining them into a single value. For example, we could add together all the items in an array:

```
import numpy as np

data = np.ones(4)

print data
print np.sum(data)
```
```
[ 1.  1.  1.  1.]
4.0
```
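
numpy provides other summary functions that are used in the same way as `np.sum`. The sketch below tries `np.mean`, `np.min`, and `np.max` on a small array of integers.

```python
import numpy as np

data = np.arange(5)  # 0, 1, 2, 3, 4

average = np.mean(data)
smallest = np.min(data)
largest = np.max(data)

# print(...) with parentheses works in both Python 2 and Python 3
print(average)   # 2.0
print(smallest)  # 0
print(largest)   # 4
```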

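When applied to multidimensional arrays, such functions can typically be given an `axis` argument. This argument specifies the axis over which the operation is applied: `axis=0` sums over the rows, leaving one total per column, whereas `axis=1` sums over the columns, leaving one total per row. For example, to sum over rows and over columns: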

```
import numpy as np

data = np.ones([4, 3])
data[1, :] = 2
data[2, :] = 3
data[3, :] = 4

print data

print "Rows:"
print np.sum(data, axis=0)

print "Columns:"
print np.sum(data, axis=1)
```
```
[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]
 [ 4.  4.  4.]]
Rows:
[ 10.  10.  10.]
Columns:
[  3.   6.   9.  12.]
```
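
The `axis` argument works the same way with other summary functions. As a sketch, suppose a 5 x 3 array holds one row per participant and one column per condition (a hypothetical layout, with made-up values from `np.arange` and `.reshape`); `np.mean` can then give a mean per condition or a mean per participant.

```python
import numpy as np

# hypothetical data: 5 participants (rows) x 3 conditions (columns)
data = np.arange(15).reshape(5, 3)

# average over rows: one mean per condition (column)
condition_means = np.mean(data, axis=0)
# average over columns: one mean per participant (row)
participant_means = np.mean(data, axis=1)

# print(...) with parentheses works in both Python 2 and Python 3
print(condition_means.shape)    # (3,)
print(participant_means.shape)  # (5,)
```

The axis that is averaged over disappears from the result, which is why the first result has 3 items and the second has 5.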

## Saving and loading arrays

When using one- or two-dimensional arrays, a straightforward way to load and save data is in the form of a text file. This can be opened with any editor, and maximises the interoperability of the data with other programs. To do so, we use `np.savetxt` and `np.loadtxt`. For example, we can save some random data to a text file and then load it back in again to verify that its contents have not changed:

```
import numpy as np

data = np.random.random([3, 2])

print data

np.savetxt("data.txt", data)

# read the file back into a new array
saved_data = np.loadtxt("data.txt")

print saved_data
```
```
[[ 0.92961609  0.31637555]
 [ 0.18391881  0.20456028]
 [ 0.56772503  0.5955447 ]]
[[ 0.92961609  0.31637555]
 [ 0.18391881  0.20456028]
 [ 0.56772503  0.5955447 ]]
```
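
Rather than comparing the printed values by eye, the round trip can also be checked in code. This is a minimal sketch that assumes a temporary directory is an acceptable place for the file; `np.allclose` (not otherwise covered in this lesson) reports whether two arrays are equal to within a small tolerance.

```python
import os
import tempfile

import numpy as np

data = np.random.random([3, 2])

# write to a temporary directory so the working directory is not cluttered
path = os.path.join(tempfile.mkdtemp(), "data.txt")

np.savetxt(path, data)
saved_data = np.loadtxt(path)

# True if every item matches to within a small tolerance
matches = np.allclose(data, saved_data)

# print(...) with parentheses works in both Python 2 and Python 3
print(matches)
```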

We can also inspect the file that is saved on disk, `data.txt`:

```
9.296160928171478544e-01 3.163755545817859005e-01
1.839188116770944514e-01 2.045602785530397094e-01
5.677250290816866496e-01 5.955447029792515501e-01
```

Two aspects of this file are worth noting:

1. The data are represented in exponential notation.
2. Columns are separated by a space character.

To make it easier to visually inspect the data file, and for compatibility with other programs that are expecting a CSV (‘comma separated values’) format, we can provide arguments to the `np.savetxt` function:

```
import numpy as np

data = np.random.random([3, 2])

np.savetxt("data2.txt", data, fmt="%.4f", delimiter=",")
```
```
0.9296,0.3164
0.1839,0.2046
0.5677,0.5955
```
• The `fmt` argument allowed us to specify 4 decimal places (see the Python documentation on its format specification mini-language for more details).
• The `delimiter` argument allowed us to specify a comma as the separator (‘delimiter’) between columns.
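
To read such a file back in, `np.loadtxt` needs to be told about the comma as well, via its own `delimiter` argument. The sketch below (again writing to a temporary directory, which is an assumption of this example) also shows the cost of the shorter format: the loaded values agree with the originals only to about 4 decimal places.

```python
import os
import tempfile

import numpy as np

data = np.random.random([3, 2])

# write to a temporary directory so the working directory is not cluttered
path = os.path.join(tempfile.mkdtemp(), "data2.txt")

np.savetxt(path, data, fmt="%.4f", delimiter=",")

# the delimiter must match the one used when saving
loaded = np.loadtxt(path, delimiter=",")

# largest difference introduced by rounding to 4 decimal places
max_error = np.max(np.abs(data - loaded))

# print(...) with parentheses works in both Python 2 and Python 3
print(loaded.shape)  # (3, 2)
print(max_error)
```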

If the array has more than two dimensions, then saving as a plain text file often isn’t practical. In such circumstances, you can use `np.save` and `np.load` with array files, which are typically given the extension `.npy`.
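
As a minimal sketch of that approach, the following saves a three-dimensional array to a `.npy` file and loads it back. The file name and the use of a temporary directory are choices made for this example; `np.array_equal` (not otherwise covered in this lesson) checks that two arrays have the same shape and contents.

```python
import os
import tempfile

import numpy as np

# a three-dimensional array, which would be awkward to store as plain text
volume = np.random.random([2, 5, 3])

# write to a temporary directory so the working directory is not cluttered
path = os.path.join(tempfile.mkdtemp(), "volume.npy")

np.save(path, volume)
restored = np.load(path)

# True if the shape and every item are identical
identical = np.array_equal(volume, restored)

# print(...) with parentheses works in both Python 2 and Python 3
print(restored.shape)  # (2, 5, 3)
print(identical)
```

Unlike the text format, a `.npy` file is binary, so it cannot be inspected in a text editor.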