# Data Analysis with Jupyter Notebooks.

# Tutorial 3

Benjamin J. Morgan, University of Bath.

# Contents

- [Data Types](#data_types)
    - [Integers and Floats](#numbers)
        - [Scientific Notation](#scientific_notation)
    - [Strings](#strings)
    - [Lists](#lists)
- [Numpy and arrays](#numpy)

## Data types<a id="data_types"></a>

So far we have talked about &ldquo;data&rdquo; and &ldquo;results&rdquo;, but what are the pieces of information that we want to manipulate? Typically numbers (or groups of numbers) or text (or lists of text). Different kinds of data can be used for different things: numbers can be combined in mathematical expressions, text can be printed, searched, or reorganised; numbers can be arranged by magnitude, names can be arranged by alphabetical order. In Python, these differences are represented by different **data types**.

### Numbers: *int* and *float*<a id="numbers"></a>

We will discuss two numeric types: integers and floating point numbers. Python has other built-in numeric data types, including complex numbers, which are useful in specialised cases.

Whole numbers without decimal points are integers, e.g. <span style="color:#108714; font-family:monospace">1</span>, <span style="color:#108714; font-family:monospace">6</span>, <span style="color:#108714; font-family:monospace">2331</span>.  
Numbers with decimal points are floating point numbers or &ldquo;floats&rdquo;, e.g. <span style="color:#108714; font-family:monospace">1.0</span>, <span style="color:#108714; font-family:monospace">232.141</span>, <span style="color:#108714; font-family:monospace">1.3e5</span>.  
That last example, <span style="color:#108714; font-family:monospace">1.3e5</span>, uses scientific notation and is shorthand for <span style="color:#108714; font-family:monospace">130000.0</span>.
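
One way to check which type Python has assigned to a number is the built-in `type()` function. The following is a small illustrative sketch; the variable names (`whole`, `decimal`, `scientific`) are made up for this example.

```python
# Inspect the data type of a few numbers with the built-in type() function.
whole = 6
decimal = 232.141
scientific = 1.3e5   # scientific notation always produces a float

print(type(whole))       # <class 'int'>
print(type(decimal))     # <class 'float'>
print(type(scientific))  # <class 'float'>
print(scientific)        # 130000.0
```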

#### Scientific Notation<a id="scientific_notation"></a>
Very large and very small numbers can be written using **scientific notation**. For example, instead of 0.0000241, we would normally write 2.41&times;10<sup>-5</sup>. In Python this would be written `2.41e-5` or `2.41e-05`.

>```python
2.41e-5 == 0.0000241
```
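
Scientific notation can also be used when displaying a number. The sketch below assumes you are happy to use Python's f-string formatting, which is not covered elsewhere in this tutorial; the `.2e` format code asks for two decimal places in scientific notation.

```python
# Writing and displaying numbers in scientific notation.
small = 2.41e-5

print(small == 0.0000241)   # True: both forms describe the same number
print(f"{small:.2e}")       # 2.41e-05
print(f"{130000.0:.1e}")    # 1.3e+05
```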

### Strings<a id="strings"></a>

Strings are any sequence of text. We indicate that a sequence of text is a string, and not a Python command, by enclosing it in single or double quotes. Being able to use either quote type allows strings that themselves contain quotes.

>```python
'this is a string using single quotes'
```

>```python
"this is a string using double quotes"
```

>```python
'this string has "nested quotes"'
```
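
The same idea works the other way round: a string containing an apostrophe can be enclosed in double quotes. This is a short illustrative sketch with made-up variable names.

```python
# Choose the outer quote type so that it differs from the quotes inside the string.
with_apostrophe = "it's a string"                 # single quote inside double quotes
with_quotes = 'this string has "nested quotes"'   # double quotes inside single quotes

print(with_apostrophe)
print(with_quotes)
print(type(with_quotes))   # <class 'str'>
```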

### Lists<a id="lists"></a>

Python also contains built-in data types for collections of things. For data analysis we often deal with sets of numbers. These can be collected in **lists**.

A list is written as a series of items separated by commas and enclosed in square brackets:

>```python
my_list = [ 1, 2, 3, 4 ]
my_list
```

although lists can contain any set of Python objects, even other lists:

>```python
my_other_list = [ 4, 1.5, 'peach' ]
my_other_list
```

>```python
both_lists = [ my_list, my_other_list ]
both_lists
```

To refer to one element in a list, use the **index** of that element. Index numbering counts the number of jumps along the sequence, so starts at zero.

>```python
# 1st element (zero jumps along the sequence)
print( my_other_list[0] )
# 2nd element (one jump along the sequence)
print( my_other_list[1] ) 
# 3rd element (two jumps along the sequence)
print( my_other_list[2] ) 
```

Using an index outside the range of elements in the list will produce an error. For example, `my_other_list` has three elements, but `my_other_list[3]` tries to return the *4th* element (which does not exist):

>```python
print( my_other_list[3] )
```
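
If you want to see the error without stopping your code, one option is to catch it. The sketch below assumes a `try`/`except` block, which has not been introduced in this tutorial, and uses `len()` to find the largest valid index.

```python
# The largest valid index is always one less than the length of the list.
my_other_list = [4, 1.5, 'peach']
print(len(my_other_list))   # 3

try:
    print(my_other_list[3])          # there is no 4th element
except IndexError as error:
    print("IndexError:", error)      # IndexError: list index out of range

last_valid_index = len(my_other_list) - 1
print(my_other_list[last_valid_index])   # peach
```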

You can also refer to a sequence of elements by giving a *range* as the index:

In [None]:
# run this cell to create the list `alphabet`
alphabet = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 
             'h', 'i', 'j', 'k', 'l', 'm', 'n', 
             'o', 'p', 'q', 'r', 's', 't', 'u',
             'v', 'w', 'x', 'y', 'z' ]

>```python
alphabet[3:8]
```

→ start from three jumps, stop before eight jumps, i.e. the 4th to the 8th elements (the element at index 8 is not included).

Negative numbers count backwards from the end of the sequence.

>```python
alphabet[-8:-3]
```

→ 8th from the end up to, and including, the 4th from the end (the element at index -3 is not included).

And leaving out one of the numbers in the range will include all elements up to the start or end of the sequence.

>```python
alphabet[14:]
```

>```python
alphabet[:14]
```
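
The results of these slices can be checked directly. The sketch below builds the same `alphabet` list using `string.ascii_lowercase` from the standard library (a shortcut assumed here for brevity, not the method used above).

```python
import string

# Build the 26-letter list, equivalent to typing out 'a' to 'z' by hand.
alphabet = list(string.ascii_lowercase)

middle = alphabet[3:8]      # indices 3, 4, 5, 6, 7
from_end = alphabet[-8:-3]  # 8th from the end up to the 4th from the end
tail = alphabet[14:]        # index 14 to the end of the list
head = alphabet[:14]        # start of the list up to, not including, index 14

print(middle)     # ['d', 'e', 'f', 'g', 'h']
print(from_end)   # ['s', 't', 'u', 'v', 'w']
print(len(head), len(tail))   # 14 12
```

Because the end index is excluded, `alphabet[:14]` and `alphabet[14:]` split the list into two parts with no element repeated or missing.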

## Numpy and arrays<a id='numpy'></a>

Although lists can be very useful for handling ordered collections of things, for data manipulation we usually deal with ordered lists of only numbers. The flexibility of lists means using them is (relatively) computationally slow. This is not an issue for small data sets, but can be prohibitive for large data sets, with perhaps millions or more entries.

An alternative data type, specifically designed for manipulating (large) numerical data sets, is the **numpy array**. `numpy` is a module for numerical scientific computing with Python, and is conventionally imported via

```python
import numpy as np
```

This is similar to the <span style="color:#108714; font-family:monospace; font-weight:bold">import</span> <span style="font-family:monospace">math</span> we saw [above](#functions_and_modules), but uses the <span style="color:#108714; font-family:monospace; font-weight:bold">as</span> keyword to make `numpy` more convenient to work with.

>```python
import math
math.sqrt(4)
```

In [None]:
import math

Having imported `numpy` (as `np`) we can store lists of numbers as `numpy` arrays.

>```python
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
a
```

You can think of a 1-dimensional `numpy` array as a vector, and we can use very compact code to perform *vector* mathematical operations on the entire array.

>```python
a + 1
```

>```python
a**2
```

Remember that `**` is the *power* operator. This code calculates $a^2$ for every number stored in `a`.

In both these cases, the mathematical operation (add one; square) is applied to every element in the array, and a new array with *all* the results is returned.
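
To see why this is convenient, it can help to compare with a plain Python list. The sketch below is illustrative only: it uses a list comprehension (not covered in this tutorial) to square each element of a list, and shows that `*` on a list repeats the list rather than multiplying its elements.

```python
import numpy as np

values = [1, 2, 3, 4]

# With a list, we have to visit each element ourselves.
squares_list = [v**2 for v in values]

# With a numpy array, the operation is applied to every element at once.
a = np.array(values)
squares_array = a**2

print(squares_list)    # [1, 4, 9, 16]
print(squares_array)   # [ 1  4  9 16]

# Multiplying a list by 2 repeats it; multiplying an array by 2 doubles each element.
repeated = values * 2
doubled = a * 2
print(repeated)        # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled)         # [2 4 6 8]
```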

If the mathematical expression contains two (or more) arrays, then an **element-by-element** operation is performed:

e.g. vector addition:

>```python
b = np.array( [ 5, 6, 7, 8 ] )
a + b
```

>```python
a * b
```
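
Element-by-element operations pair up the first elements, the second elements, and so on, so the arrays need compatible lengths. The sketch below repeats the two operations above and then shows what happens for arrays of length 4 and 3; the array `short` is made up for this example, and the error is caught with `try`/`except` so the code keeps running.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

total = a + b      # 1+5, 2+6, 3+7, 4+8
product = a * b    # 1*5, 2*6, 3*7, 4*8
print(total)       # [ 6  8 10 12]
print(product)     # [ 5 12 21 32]

# Arrays with 4 and 3 elements cannot be paired up element by element.
short = np.array([1, 2, 3])
try:
    a + short
except ValueError as error:
    print("ValueError:", error)
```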

Let us try to calculate the square root of all the numbers in `a`:

>```python
from math import sqrt
sqrt(a)
```

In [None]:
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
a
np.sqrt(a)

This gives an error.  

The `sqrt` function provided by the `math` module is written to work on single numbers and, because `numpy` is not part of the standard Python library, it does not know how to treat a `numpy` array of numbers. To do what we want we can use the `sqrt` function in `numpy` instead.

>```python
np.sqrt(a)
```
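
The two versions of `sqrt` can be compared side by side. In this sketch the error from `math.sqrt` is caught with `try`/`except` (an assumption made so that the example runs to the end); the exact wording of the error message may differ between `numpy` versions.

```python
import math
import numpy as np

a = np.array([1, 2, 3, 4])

# math.sqrt expects a single number, so passing a 4-element array fails.
try:
    math.sqrt(a)
except TypeError as error:
    print("TypeError:", error)

# np.sqrt is applied to every element and returns a new array.
roots = np.sqrt(a)
print(roots)

# math.sqrt still works on one element at a time.
print(math.sqrt(a[3]))   # 2.0
```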

<div class="alert alert-success">
Edit the previous code cell to use the <span style='font-family:monospace;'>numpy</span> version of <span style='font-family:monospace;'>sqrt</span> instead of the standard function from the <span style='font-family:monospace;'>math</span> module.
</div>

In [None]:
# This cell tests your answers from the three previous code cells.
# You do not need to edit it
assert _[0] == math.sqrt(1)
assert _[1] == math.sqrt(2)
assert _[2] == math.sqrt(3)
assert _[3] == math.sqrt(4)

`numpy` contains a great many functions for performing mathematical operations on arrays of numbers, which are all listed on the [`numpy` website](https://docs.scipy.org/doc/numpy/reference/routines.math.html).

To limit the number of decimal places in our result we can use `np.round()`:

>```python
np.round( np.sqrt(a), 2 ) # round the result to 2 decimal places
```

Notice that here the first argument is `np.sqrt(a)`, which is itself a call to a `numpy` function. This is analogous to a function of a function in mathematics: $f(g(x))$. Nesting functions like this helps write compact code without storing the intermediate results. Nesting several functions can make your code confusing to read, however, and your primary goal should be to write clear, understandable code.
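
For comparison, the same calculation can be written with the intermediate result stored in its own variable. This is a sketch with made-up variable names (`nested`, `roots`, `stepwise`); both versions give the same array.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Nested version: the result of np.sqrt(a) is passed straight to np.round.
nested = np.round(np.sqrt(a), 2)

# Step-by-step version: the intermediate result is stored first.
roots = np.sqrt(a)
stepwise = np.round(roots, 2)

print(nested)
print(np.array_equal(nested, stepwise))   # True
```

The step-by-step version is longer, but each line does one thing, which can make longer calculations easier to check.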

### Generating sequences of numbers

Often, we will want to use `numpy` arrays to store experimental data. Other times we might just want a list of numbers, e.g. from 1 to 20. We could write these out to create the array:

```python
one_to_twenty = np.array( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ] )
```

To save typing (and make your code easier to read) `numpy` contains a function for creating lists of numbers:

>```python
n = np.arange(1,21)
n
```

Notice that `arange` gives us numbers starting from 1, up to, but not including, 21.  

We can generate lists of numbers with different spacings by providing a step-size (which has a default value of 1):

>```python
m = np.arange(2,21,2)
m
```
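
The step-size does not have to be a positive whole number. The sketch below shows a negative step (counting down) and a fractional step; the array names are made up for this example.

```python
import numpy as np

evens = np.arange(2, 21, 2)       # 2, 4, ..., 20 (21 is not included)
countdown = np.arange(10, 0, -1)  # 10, 9, ..., 1 (0 is not included)
quarters = np.arange(0, 1, 0.25)  # 0.0, 0.25, 0.5, 0.75 (1 is not included)

print(evens)
print(countdown)
print(quarters)
print(len(evens), len(countdown), len(quarters))   # 10 10 4
```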

Another way to generate an evenly spaced list of numbers is to use [`linspace()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html).

>```python
p = np.linspace(0,10,50)
p
```

`linspace()` takes three arguments: the starting number, the end number, and the total number of values in the sequence.  

`linspace()` is particularly useful for generating evenly spaced points that are not integers.
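
One difference from `arange` worth checking is that `linspace()` includes the end number by default, so the spacing between points is the range divided by one less than the number of points. The sketch below illustrates this with made-up variable names.

```python
import numpy as np

# Five points from 0 to 1, with both ends included.
q = np.linspace(0, 1, 5)
print(q)   # [0.   0.25 0.5  0.75 1.  ]

# Fifty points from 0 to 10 gives 49 gaps, so the spacing is 10/49, not 10/50.
p = np.linspace(0, 10, 50)
spacing = p[1] - p[0]
print(len(p))    # 50
print(spacing)   # approximately 0.204
```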