Data—arrays and numpy

Objectives

  • Be able to create arrays using numpy functions.
  • Know how to access subsets of array elements using indexing.
  • Understand how operations can be applied both to each item in an array and over the items in an array.
  • Be able to save/load arrays to/from disk.

In this lesson, we will be introducing a critical data type for data analysis—the array.

We will be using an external Python package called numpy for our array functionality. As you may recall, we need to use the import command to make such additional functionality available to our Python scripts. For numpy, we do something slightly different; because we will be using it so much, it is conventional to shorten numpy to np in our code. We modify our usual import statement to be:

import numpy as np

This code imports the numpy functionality and allows us to refer to it as np. Typically, this line will appear at the top of all the Python scripts we will use in these lessons.

Creating arrays

There are quite a few ways of producing arrays. First, we will consider the array analogue of the range function that we used previously to produce a list. The equivalent function in numpy is arange:

import numpy as np

r_list = range(10)
print type(r_list), r_list

r_array = np.arange(10)
print type(r_array), r_array
<type 'list'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<type 'numpy.ndarray'> [0 1 2 3 4 5 6 7 8 9]

As you can see, both methods have produced a collection of integers from 0 through 9. However, they are different data types—the range function produces a list, whereas the arange function produces an array. This distinction is important, because the different data types have different functionality associated with them.
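As a small illustration of that difference in functionality, the sketch below contrasts something a list can do with properties that only the array carries. It is not part of the original lesson: it wraps range in list() and uses the print(...) call form so that it runs unchanged on both Python 2 and Python 3, and the dtype attribute is shown only as an example of array-specific information.

```python
import numpy as np

r_list = list(range(10))
r_array = np.arange(10)

# A list can grow in place with its append method
r_list.append(10)
print(len(r_list))

# An array has no append method, but it knows its shape
# and the data type shared by all of its items
print(hasattr(r_array, "append"))
print(r_array.shape)
print(r_array.dtype)
```

The exact integer type reported by dtype can differ between platforms, so no output is shown for it here.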

One of the most useful things about arrays is that they can have multiple dimensions. A frequently encountered form of data is a table, with a number of rows and a number of columns. We can represent such a structure by creating a two-dimensional array. For example, the ones function creates an array of a particular size with all elements having the value of 1—we can use this to create a data structure with 5 rows and 3 columns.

import numpy as np

data = np.ones(shape=[5, 3])

print data
[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]

Here, we’ve provided the ones function with a keyword argument called shape, which is a list that specifies the number of items along each dimension. As you can see, it creates a structure with 5 rows and 3 columns.
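Two other creation routes may be useful alongside ones. The sketch below assumes we want an all-zero array of the same shape, and a small table typed in by hand; np.zeros and np.array are standard numpy functions, but the variable names and values are made up for illustration.

```python
import numpy as np

# Same shape convention as np.ones, but filled with zeros
blank = np.zeros(shape=[5, 3])
print(blank.shape)

# Build an array with 2 rows and 3 columns from a list of lists
# (the values are made up for illustration)
table = np.array([[1, 2, 3], [4, 5, 6]])
print(table)
print(table.shape)
```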

Tip

You can see how this would be a useful data structure if you think that the rows might represent individual participants in an experiment and the columns might represent the 3 conditions in a within-subject design.

Once created, we can access various useful properties of the array. For example, .shape returns a representation of the number of items along each dimension of the array, .size returns the total number of items in the array, and .ndim returns the number of dimensions in the array:

import numpy as np

data = np.ones(shape=[5, 3])

print data.shape
print data.size
print data.ndim
(5, 3)
15
2
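These three properties are related to one another, which the following sketch tries to make explicit using a hypothetical three-dimensional array: the number of dimensions is the length of the shape, and the size is the product of the entries in the shape.

```python
import numpy as np

# A hypothetical three-dimensional array: 2 x 5 x 3
data = np.ones(shape=[2, 5, 3])

print(data.shape)

# ndim is the number of entries in shape
print(data.ndim == len(data.shape))

# size is the product of the entries in shape
print(data.size == 2 * 5 * 3)
```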

Indexing arrays

We can access the items in an array using techniques similar to those we used with lists:

import numpy as np

r_array = np.arange(10)

print r_array[:5]
[0 1 2 3 4]
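A few more list-style slices are sketched below as a reminder of the notation; they behave on a one-dimensional array just as they would on a list.

```python
import numpy as np

r_array = np.arange(10)

print(r_array[5:])    # from index 5 to the end
print(r_array[-1])    # the last item
print(r_array[::2])   # every second item
```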

Indexing becomes more advanced when we have arrays with more than one dimension. Here, we will use another function to generate a multidimensional array—the numpy equivalent of random.random that we encountered earlier. We can access individual items in the array by separating the dimensions by a comma:

import numpy as np

data = np.random.random(size=[5, 3])

print data

# first row, first column
print data[0, 0]
# second row, first column
print data[1, 0]
# second row, last column
print data[1, -1]
[[ 0.90168308  0.17133111  0.57730204]
 [ 0.83406998  0.38666967  0.3698428 ]
 [ 0.48163748  0.43250003  0.95742637]
 [ 0.17254113  0.40056304  0.35827753]
 [ 0.95301857  0.13373673  0.69812848]]
0.9016830831
0.834069979484
0.36984279585

Importantly, we can also access all the items along a particular dimension:

import numpy as np

data = np.random.random(size=[5, 3])

print data

# all rows, first column
print data[:, 0]
# first row, all columns
print data[0, :]
[[ 0.1961776   0.53541907  0.7453791 ]
 [ 0.57018969  0.62871156  0.04582693]
 [ 0.83923676  0.29456444  0.72562237]
 [ 0.77830901  0.50678319  0.18410882]
 [ 0.56276205  0.0579545   0.92955741]]
[ 0.1961776   0.57018969  0.83923676  0.77830901  0.56276205]
[ 0.1961776   0.53541907  0.7453791 ]
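Slices can also be used along each dimension to pull out a block of rows and columns. Because random values make the result hard to check by eye, the sketch below uses predictable values instead; it relies on the reshape method, which is not covered in this lesson and is used here only to arrange the numbers 0 to 14 into 5 rows and 3 columns.

```python
import numpy as np

# reshape (not covered in the lesson) arranges 0..14 into 5 rows and 3 columns
data = np.arange(15).reshape(5, 3)
print(data)

# all rows, first column
first_column = data[:, 0]
print(first_column)

# second and third rows, first two columns
block = data[1:3, :2]
print(block)
```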

We can also extract items using arrays of boolean values. For example, if we use a > operator on an array, it returns an array of booleans. If we then use this boolean array to index the data, it returns those items where the corresponding item in the boolean array is True:

import numpy as np

data = np.random.random(10)

print data

gt_point_five = data > 0.5

print gt_point_five

print data[gt_point_five]
[ 0.41374836  0.20518716  0.78958201  0.97271562  0.37304847  0.42976821
  0.02028967  0.91801355  0.05767821  0.73583786]
[False False  True  True False False False  True False  True]
[ 0.78958201  0.97271562  0.91801355  0.73583786]
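Boolean arrays can be used for more than extraction. As a sketch with made-up fixed values (so the result is reproducible), the example below counts how many items pass a test and combines two tests with the & operator, which compares boolean arrays item by item.

```python
import numpy as np

data = np.array([0.41, 0.21, 0.79, 0.97, 0.37])

gt_point_five = data > 0.5

# True counts as 1, so summing the boolean array counts the matches
n_matches = np.sum(gt_point_five)
print(n_matches)

# Combine two conditions item by item with & (the parentheses are required)
in_range = (data > 0.3) & (data < 0.8)
print(data[in_range])
```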

Operations on arrays

We can use the conventional maths operators with arrays where, unlike lists, they operate on each item in the array. For example:

import numpy as np

data = np.ones(4)

print data + 1
print data * 3
print data - 2
[ 2.  2.  2.  2.]
[ 3.  3.  3.  3.]
[-1. -1. -1. -1.]
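To make the contrast with lists concrete, the following sketch applies the same * operator to a list and to an array, and then adds two arrays of the same shape together. The variable names are illustrative only.

```python
import numpy as np

ones_list = [1, 1, 1, 1]
ones_array = np.ones(4)

# Multiplying a list repeats it, giving a longer list
tripled_list = ones_list * 3
print(len(tripled_list))

# Multiplying an array multiplies each item
tripled_array = ones_array * 3
print(tripled_array)

# Two arrays of the same shape are combined item by item
summed = ones_array + tripled_array
print(summed)
```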

We can also use functions that operate over the items in an array. For example, we could add together all the items in an array:

import numpy as np

data = np.ones(4)

print data
print np.sum(data)
[ 1.  1.  1.  1.]
4.0
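Summing is only one such function. As a brief sketch, np.mean, np.min and np.max are used in the same way; the data values here are made up so that the results are easy to check.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

total = np.sum(data)
average = np.mean(data)

print(total)
print(average)
print(np.min(data))
print(np.max(data))
```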

When applied to multidimensional arrays, such functions can typically be given an axis argument. This argument specifies the axis over which the operation is applied. For example, summing over the rows (axis=0) gives one total per column, and summing over the columns (axis=1) gives one total per row:

import numpy as np

data = np.ones([4, 3])
data[1, :] = 2
data[2, :] = 3
data[3, :] = 4

print data

print "Rows:"
print np.sum(data, axis=0)

print "Columns:"
print np.sum(data, axis=1)
[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]
 [ 4.  4.  4.]]
Rows:
[ 10.  10.  10.]
Columns:
[  3.   6.   9.  12.]
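A useful way to keep track of the axis argument is that the axis you name is the one that disappears from the result. The sketch below applies this to np.mean, assuming (purely for illustration) that the rows are 4 participants and the columns are 3 conditions.

```python
import numpy as np

# Hypothetical scores: 4 participants (rows) by 3 conditions (columns)
data = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
    [4.0, 4.0, 4.0],
])

# Averaging over axis 0 (the rows) leaves one value per condition
condition_means = np.mean(data, axis=0)
print(condition_means.shape)
print(condition_means)

# Averaging over axis 1 (the columns) leaves one value per participant
participant_means = np.mean(data, axis=1)
print(participant_means.shape)
print(participant_means)
```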

Loading and saving arrays from/to disk

When using one- or two-dimensional arrays, a straightforward way to load and save data is in the form of a text file. Such a file can be opened with any editor, which maximises the interoperability of the data with other programs. To do so, we use np.savetxt and np.loadtxt. For example, we can save some random data to a text file and then load it back in again to verify that its contents have not changed:

import numpy as np

data = np.random.random([3, 2])

print data

np.savetxt("data.txt", data)

saved_data = np.loadtxt("data.txt")

print saved_data
[[ 0.92961609  0.31637555]
 [ 0.18391881  0.20456028]
 [ 0.56772503  0.5955447 ]]
[[ 0.92961609  0.31637555]
 [ 0.18391881  0.20456028]
 [ 0.56772503  0.5955447 ]]
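Comparing the two printouts by eye works for a small array, but the check can also be done in code. The sketch below is one way to do it, assuming we are happy to write to a temporary folder (created with the standard tempfile module) rather than the current one; np.allclose reports whether every pair of items matches to within floating-point tolerance.

```python
import os
import shutil
import tempfile

import numpy as np

data = np.random.random([3, 2])

# Write into a temporary folder, which is removed again below
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")

np.savetxt(path, data)
saved_data = np.loadtxt(path)

shutil.rmtree(tmp_dir)

# True if every item matches to within floating-point tolerance
print(np.allclose(data, saved_data))
```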

We can also inspect the file that is saved on disk, data.txt:

9.296160928171478544e-01 3.163755545817859005e-01
1.839188116770944514e-01 2.045602785530397094e-01
5.677250290816866496e-01 5.955447029792515501e-01

Two aspects of the file are worth noting:

  1. The data are represented in exponential notation.
  2. Columns are separated by a space character.

To make it easier to visually inspect the data file, and for compatibility with other programs that are expecting a CSV (‘comma separated values’) format, we can provide arguments to the np.savetxt function:

import numpy as np

data = np.random.random([3, 2])

np.savetxt("data2.txt", data, fmt="%.4f", delimiter=",")

The resulting file, data2.txt, then has the following form (shown here for the same values as in the earlier example):

0.9296,0.3164
0.1839,0.2046
0.5677,0.5955

  • The fmt argument allowed us to specify 4 decimal places (see the Python documentation on its format specification mini-language for more details).
  • The delimiter argument allowed us to specify a comma as the separator (‘delimiter’) between columns.
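Two consequences of these arguments are worth sketching. First, np.loadtxt needs to be told about the comma delimiter when reading the file back. Second, because only 4 decimal places were written, the loaded values match the originals only approximately. The example below assumes a temporary folder and an illustrative tolerance of 1e-4.

```python
import os
import shutil
import tempfile

import numpy as np

data = np.random.random([3, 2])

# Write into a temporary folder, which is removed again below
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data2.txt")

np.savetxt(path, data, fmt="%.4f", delimiter=",")

# loadtxt needs the same delimiter in order to split the columns
loaded = np.loadtxt(path, delimiter=",")

shutil.rmtree(tmp_dir)

print(loaded.shape)

# The file kept only 4 decimal places, so the match is approximate
print(np.allclose(data, loaded, atol=1e-4))
```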

If the array has more than two dimensions, then saving as a plain text file often isn’t practical. In such circumstances, you can use np.save and np.load with array files, which are typically given the extension .npy.
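As a minimal sketch of that approach, the example below saves a hypothetical three-dimensional array to a .npy file in a temporary folder and loads it back. Unlike the 4-decimal text file, this binary format stores the values exactly, so np.array_equal can be used for the comparison.

```python
import os
import shutil
import tempfile

import numpy as np

# A hypothetical three-dimensional array: 2 x 5 x 3
data = np.random.random([2, 5, 3])

# Write into a temporary folder, which is removed again below
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.npy")

np.save(path, data)
loaded = np.load(path)

shutil.rmtree(tmp_dir)

print(loaded.shape)
print(np.array_equal(data, loaded))
```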