# Guide to using Python with Jupyter

This file collects some of the most important things about how Python works, along with functions that might be helpful when getting started. It also includes examples of how they work.

About using this document: run the cell found in **section 2** first (every time you use this document) and then move on to the section or example you want to check out. To make this possible, some functions are introduced multiple times throughout the document, so don't let that confuse you.

If you can't remember how to do something in the notebook, just press **H** while not in edit mode and you will see a list of the shortcuts you can use in Jupyter.

1. [At first](#first)
2. [Modules](#modules)
3. [Data types and modifying data](#data)
4. [Basic calculus and syntax](#basics)
5. [Creating random data](#random)
6. [Plotting diagrams](#plot)
7. [Animations](#anim)
8. [Maps and heatmaps](#maps)
9. [Problems? Check this](#prblm)


### 1. At first

In programming you can save values into **variables**, which you can use or change later. Common variable types include integers (int), floating-point numbers (float) and strings (str). In Python creating variables is easy, since you don't have to declare their type beforehand: you just assign a value.
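As a small illustration of the variable types mentioned above, here is a minimal sketch. The variable names and values are made up for this example only; `type()` shows what kind of value a variable currently holds.

```python
# Hypothetical example values; Python infers the type from the value
count = 42            # integer (int)
mass = 91.19          # floating-point number (float)
particle = 'muon'     # string (str)

print(type(count), type(mass), type(particle))

# A variable can later be given a new value, even of a different type
count = count + 0.5
print(count, type(count))
```

Note that after the last assignment `count` is no longer an integer but a float, because the result of the addition has a decimal part.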

Sometimes bits of memory can be 'left' in the **kernel** running the program, which makes the program not run correctly. It happens regularly, and is nothing to worry about. Just press ***Kernel*** from the top bar menu and choose ***Restart & Clear output***. This resets the kernel memory and clears all output, after which you can start over again. This doesn't affect any changes in the text or code, so it's not for fixing those errors.


### 2. Modules

Python is widely used in the scientific community for computing, modifying and analyzing data, and the tools it offers for these purposes are well optimized. A central part of using Python is *modules*, which are files containing definitions (functions) and statements. Modules are imported using the **import** command, and even if at first it seems like some kind of magic which modules to import, it gets easier with time.

If you check the materials used in the Open Data project, you'll probably notice that each GitHub folder contains a text file 'requirements.txt'. It lists the modules used in the notebooks, so that for example [MyBinder](https://mybinder.org) can build a working platform for Jupyter. The most important modules we're going to use are:

In [None]:
# Most essential modules:

import pandas as pd # includes tools used in reading data
import numpy as np # includes tools for numerical calculus
import matplotlib.pyplot as plt # includes tools used in plotting data

# Other useful modules:

import random as rand # includes functions in generating random data
from scipy import stats # includes tools for statistical analysis
from scipy.stats import norm # tools for normal distribution
import matplotlib.mlab as mlab # more plotting tools for more complicated diagrams

# Not a module, but essential command which makes the output look prettier in notebooks
%matplotlib inline

Remember to run the cell above if you want the examples in this notebook to work. 
You can also write the imports above without the **as** part, which only gives the module a shorter name, but the short names make your future much easier. If you want to read more about these modules, select 'Help' from the top bar and you can find some links to documentation.
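To see what **as** actually does, here is a minimal sketch. It uses the standard `math` module as a stand-in (an assumption made only for this example), since it is always available; the same idea applies to `pandas as pd` or `numpy as np`.

```python
# Both imports refer to the same module; 'as' only gives it a shorter name
import math
import math as m

print(math.sqrt(16))   # using the full module name
print(m.sqrt(16))      # the same function through the alias
print(m is math)       # the alias points to the very same module
```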

Of course there are a lot of other modules as well, which you can easily google if need be. Thanks to Python being used so widely, you can find thousands of examples online. If you have problems or questions, [StackExchange](https://stackexchange.com/) and [StackOverflow](https://stackoverflow.com/) are good places to start. Chances are that someone has run into the exact same problem you are facing.


### 3. Data types and modifying data

**Summary of data-manipulation:**

Reading .csv $\rightarrow$ 
``` Python 
name = pd.read_csv('path', varargin)
``` 
Reading tables $\rightarrow$ 
``` Python
pd.read_table('path', varargin)
``` 
Checking what's in the file $\rightarrow$ 
``` Python
name.head(n) 
``` 
Length $\rightarrow$ 
``` Python
len(name) 
``` 
Shape $\rightarrow$ 
``` Python
name.shape 
``` 
Columns $\rightarrow$ 
``` Python
name.column 
name['column'] 
``` 
Choosing data within limits $\rightarrow$ 
``` Python
name[(name.column >= lower_limit) & (name.column <= upper_limit)] 
``` 
Searching for text $\rightarrow$ 
``` Python
name['column'].str.contains('part_of_text') 
``` 
Add columns $\rightarrow$ 
``` Python
name = name.assign(column = info) 
``` 
Remove columns $\rightarrow$ 
``` Python
name.drop(['column1','column2'...], axis = 1)
``` 




The open data from the CMS experiment is stored in .csv (comma-separated values) files. This kind of data is easy to read with the *pandas* module. Saving the file that was read into a variable makes the variable a *dataframe*. If you're interested in dataframes and what you can do with them, you can check [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for more information.

The easiest ways to read data are **pandas.read_csv** and **pandas.read_table**. If the data is nice (as in separated by commas, with sensible headings and without too exotic characters), you don't usually need any extra steps.

In [None]:
# Let's load a dataset about particles and save it in to a variable:

doublemu = pd.read_csv('http://opendata.cern.ch/record/545/files/Dimuon_DoubleMu.csv')

This kind of form ('...//opendata.cern...') fetches the data directly from the website. It could also be of the form **'Dimuon_doubleMu.csv'**, if the data you want to read is in the same folder with the notebook. Or if the file is in another folder it could be of the form **'../folder/data.csv'**.

If the data is not in .csv, you can read it using the more general **pandas.read_table** command, which can read other kinds of delimited text files and not just csv. The most common problem is data being separated with something other than a comma, such as ; or -. In this case you can put an extra argument in the command: **pandas.read_table('path', sep='x')**, with x being the separator. Another common problem is that the row numbering starts from something other than zero, or that the column headings are somewhere other than on the first row. In this case you might want to add the argument **header = n**, where n is the number of the row the headings are on. NOTE! In computing you usually start counting at zero, unless otherwise mentioned.

More information about possible arguments [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). 
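The `sep` and `header` arguments described above can be tried out without any file. The following is a minimal sketch under stated assumptions: the text in `raw` is made up for this example, and `io.StringIO` is used only so that pandas can read a string as if it were a file.

```python
import io
import pandas as pd

# Made-up data: a note on the first row (row number 0), headings on the
# second row (row number 1), and values separated by semicolons
raw = "measured in the lab\ntime;speed\n0;1.5\n1;2.5\n2;4.0\n"

# sep tells which character separates the values,
# header tells on which row (counting from zero) the headings are
example = pd.read_table(io.StringIO(raw), sep=';', header=1)
print(example)
```

Everything above the header row is skipped, so the note on the first row does not end up in the data.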

Below you can see an example of data which doesn't have a header row. The file contains data about the Sun since 1992. You can check the meaning of each column [here](http://sidc.oma.be/silso/infosndhem).

In [None]:
# Load a set of Sun's data and name it the way we want

sunDat = pd.read_table('http://sidc.oma.be/silso/INFO/sndhemcsv.php', sep = ';', encoding = "ISO-8859-1")

For clarity let's see what the data looks like. For this the command **name.head(n)** is nice, as it shows the first n rows of the chosen data. By default n = 5, in case you don't give it any value.

In [None]:
doublemu.head()

In [None]:
sunDat.head()

Above you can see that the first real row of the **sunDat** variable has been used as the header, which is nasty because 1) the headings are confusing and 2) we are missing one line of data. Let's solve this by giving read_table a header argument like this: **read_table('path', header = None)**, which tells the program that there is no header row. The command then automatically creates headings out of running numbers.

In [None]:
sunDat = pd.read_table('http://sidc.oma.be/silso/INFO/sndhemcsv.php', sep = ';', encoding = "ISO-8859-1", header = None)

In [None]:
sunDat.head()

Which isn't very informative for us... We can of course rename the columns to make them easier (for a human) to read by using the **names = ['name1', 'name2', 'name3'..]** argument.

In [None]:
# The r in front of some names makes them raw strings, so that \s is not read as an escape sequence
sunDat = pd.read_table('http://sidc.oma.be/silso/INFO/sndhemcsv.php', sep=';', encoding = "ISO-8859-1", header = None, 
names = ['Year','Month','Day','Fraction','$P_{tot}$','$P_{nrth}$','$P_{sth}$',r'$\sigma_{tot}$',r'$\sigma_{nrth}$',
 r'$\sigma_{sth}$','$N_{tot}$','$N_{nrth}$','$N_{sth}$','Prov'])

In [None]:
sunDat.head()

Apart from the **name.head()** command there are a couple more commands which are useful when checking out the shape of data. **len(name)** tells you the number of rows (the length of the variable) and **name.shape** tells the number of both rows and columns.

In [None]:
# Usually the code cells show only the value of the last line of the code in the output. With the print() command you can
# make more values visible. You can try what happens if you remove the print().

print (len(sunDat))
print (sunDat.shape)

When the data is saved in a variable, we can start to modify it the way we want. More often than not, we are interested in a single variable in the data. In this case you want to be able to take single columns of the data, or choose just the rows where the values are within certain limits.

You can choose a column by writing **data_name.column** or **data_name['column']**. The latter is needed if the column name starts with a number or contains spaces, since such a name can't be written after the dot (a leading number would be read as a number, not as a name). If you want to make your life a bit easier and don't care about the other columns or rows, you might want to save the chosen ones in a new variable (with _very_ large datasets this might cause memory-related problems, but you probably don't have to worry about it). You can do this by writing **new_variable_name = data_name.column** and using the new variable instead. Using the new variable also helps with tracking down errors, as a smaller amount of data is easier to handle and possible mistakes are easier to notice (for example if the program starts to draw a histogram of multiple variables and looks like it's stuck in an infinite loop).
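The two ways of choosing a column can be compared with a tiny dataframe. This is only a sketch: the dataframe `runs` and its column names are made up for the example.

```python
import pandas as pd

# Made-up data; the column name '2nd_try' starts with a number
runs = pd.DataFrame({'speed': [1.0, 2.0, 3.0], '2nd_try': [1.1, 2.1, 2.9]})

speed = runs.speed            # dot style works for simple names
second = runs['2nd_try']      # bracket style works for any column name

print(speed.head())
print(second.head())
```

Writing `runs.2nd_try` would give a syntax error, which is why the bracket style is the safer habit.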

In [None]:
# Let's save the data of invariant masses (column named M in the data) in to a new variable 

invMass = doublemu.M

In [None]:
invMass.head()

An easy way to choose certain rows is to create a new variable, in which you save the values from the original data that fulfill certain conditions. In this case choosing values between limits would look like this:
```Python
new_var = name[(name.column >= lower_limit) & (name.column <= upper_limit)]
```
Of course the condition might be any other logical element, such as a certain number (value == number) or a piece of text in a non-numerical data.

In [None]:
# As an example, let's isolate the rows from the original data where both particles' energies are at least 30 GeV

highEn = doublemu[(doublemu.E1 >= 30) & (doublemu.E2 >= 30)]

In [None]:
highEn.head()

In [None]:
print ('Number of events where both energies are >= 30 GeV: ', len(highEn))
print ('Total number of events: ',len(doublemu))

If you want to search for text, you can try the **name.loc[ ]** indexer:
```Python
new_var = old_var.loc[old_var['column'] == 'wanted_thing']
```

In this case you of course have to know exactly what you're looking for. If you want to choose rows more loosely (as in you know what the column _might_ contain), you can try the **str.contains** function. Note that str.contains() returns a True/False value for every row depending on whether the column contains the text or not, which is why we use it to pick the rows for which the statement is true:

```Python
new_var = old_var[old_var['column'].str.contains('contained_text')]
```
This creates a new variable that contains all the rows in which the 'column' contains 'contained_text' somewhere in its value. By default str.contains() is case-sensitive, but this can be turned off:
```Python
new_var = old_var[old_var['column'].str.contains('contained_text', case = False)]
```
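To see what str.contains() gives back before it is used to pick rows, here is a minimal sketch. The dataframe `firms` and the names in it are made up for illustration; the `~` operator used at the end flips every True to False and vice versa.

```python
import pandas as pd

# Made-up company names for illustration
firms = pd.DataFrame({'Name': ['Brewery Oy', 'Small Cider', 'OY Distillers', 'Winehouse']})

# One True/False value for every row
mask = firms['Name'].str.contains('oy', case=False)
print(mask)

with_oy = firms[mask]       # rows where the mask is True
without_oy = firms[~mask]   # ~ flips the mask, keeping the other rows
print(with_oy)
print(without_oy)
```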

Negation works as well, as in the example below where we delete all Ltd companies (Oy or Oyj in Finnish, Ab in Swedish) from a dataset containing all Finnish companies producing alcoholic beverages. (This may also delete companies having -oy- somewhere in the name, so you should be careful with this method.)

In [None]:
alcBev = pd.read_csv('http://avoindata.valvira.fi/alkoholi/alkoholilupa_valmistus.csv', 
 sep = ';', encoding = "ISO-8859-1", na_filter = False)

In [None]:
# Sorry about the Finnish headings!

alcBev.head()

In [None]:
# ~ negates the condition, so we keep the rows whose name does NOT contain Oy or Ab
producers = alcBev[~alcBev['Nimi'].str.contains('Oy|Ab')]
producers.head()


If you want to add or remove columns from the data, you can use **name = name.assign(column = information)** to add columns and 
**name.drop(['column1', 'column2',...], axis = 1)** to drop columns. In drop, the **axis** argument matters: axis = 1 makes the command target columns instead of rows.

In [None]:
# Removing a column with .drop.
# Note that .drop doesn't change the original variable but returns a new dataframe without the column, so let's save the
# result to the old variable

alcBev = alcBev.drop(['Nimi'], axis = 1)
alcBev.head()

In [None]:
# Inserting a column using assign
# Let's insert a column R with some numbers in it. Remember to check that the length of the column is correct

numb = np.linspace(0, 100, len(alcBev))
 
alcBev = alcBev.assign(R = numb)
alcBev.head()



### 4. Basic calculus and syntax

**Summary of basic calculus:**

Absolute values $\rightarrow$ 
```Python
abs(x) 
``` 
Square root $\rightarrow$
```Python
np.sqrt(x) 
``` 
Addition $\rightarrow$ 
```Python
x + y 
``` 
Subtraction $\rightarrow$
```Python
x - y 
``` 
Division $\rightarrow$
```Python
x/y 
``` 
Multiplying $\rightarrow$
```Python
x*y 
``` 
Powers $\rightarrow$ 
```Python
x**y 
``` 
Maximum value $\rightarrow$
```Python
max(x) 
``` 
Minimum value $\rightarrow$ 
```Python
min(x) 
``` 
Creating own function $\rightarrow$ 
```Python
def name(input):
 do something to input
 return 
 
``` 


The basic operations work the way you would write them in any computer-based calculator. If you want the program to print out more than one thing, remember to use **print()**. You can also combine text and numbers: the function **repr(number)** comes in handy, as it turns the number into a string that can be joined with other text. [Here](https://docs.python.org/3/library/functions.html) you can find all the functions you can use in Python without importing any modules, and [here](https://docs.python.org/3/library/stdtypes.html) you can find pretty much all the data types built into the Python interpreter, in case you're interested.

In [None]:
# You can change what kind of calculation (result) is saved in to the 'num'-variable

num = 14*2+5/2**2
text = 'The result of the day is: '
print (text + repr(num))
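Besides `repr()`, there are a few other common ways to combine text and numbers. The sketch below shows them side by side; the value of `num` is chosen only for this example.

```python
num = 7 / 2

# Three ways to join text and a number into one printable string
print('The result is: ' + repr(num))
print('The result is: ' + str(num))
print(f'The result is: {num}')

# print() also accepts several values separated by commas
print('The result is:', num)
```

The f-string on the third line is often the easiest to read when there are several numbers to place inside the text.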

In [None]:
# max() finds the largest number in the set

bunch_of_numbers = [3,6,12,67,578,2,5,12,-34]

print('The largest number is: ' + repr(max(bunch_of_numbers)))

The more interesting case is creating your own functions for your own needs. This works by **defining** the function as follows:

``` Python
def funcName(input): 
 do stuff
 return
```

A function doesn't actually have to return anything if, for example, it's only used to print stuff.
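The difference between printing and returning is easiest to see by saving the result of a function call in a variable. This is a minimal sketch; the function names are hypothetical and chosen for this illustration only.

```python
# Hypothetical helper names, chosen for this illustration only
def half_printed(a):
    print(a / 2)        # shows the value, but gives nothing back

def half_returned(a):
    return a / 2        # gives the value back to the caller

x = half_printed(6)     # prints the half, and x becomes None
y = half_returned(6)    # prints nothing, and y holds the half
print(x, y)
```

Only a returned value can be used in later calculations, so a function that just prints is mostly useful for showing results.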

In [None]:
# Let's create a function that prints out half of the given number

def divide_2(a):
 print(a/2)
 
divide_2(6)

In [None]:
# Let's make an addition function that asks the user for integers

def add(x, y):
 summ = x + y
 text = '{} and {} together are {}.'.format(x, y, summ)
 print(text)

def ownChoice():
 a = int(input("Give an integer: "))
 b = int(input("And another one: "))
 add(a, b)

ownChoice() 

In [None]:
# How about a function that converts a given list of radians to degrees? The while-loop goes through the list from the
# first element (i = 0) to the last one (i = len(a) - 1) and does the operation to each one

def angling(a):
    b = a.copy() # list.copy() is useful so the original list doesn't change
    i = 0
    while i < len(a):
        b[i] = b[i]*360/(2*np.pi)
        i += 1
    return b

rads = [5,2,4,2,1,3]
angles = angling(rads)
print('Radians: ', rads)
print('Angles: ', angles)

In [None]:
# The same using for-loop:

def angling2(a):
    b = a.copy()
    for i in range(0,len(a)):
        b[i] = b[i]*360/(2*np.pi)
    return b
 
rad = [1,2,3,5,6]
angle = angling2(rad)
print('Radians: ', rad)
print('Angles: ', angle)
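The same conversion can also be written without an explicit index. The following sketch is one possible alternative, not the only way: it uses a list comprehension together with `math.degrees` from the standard library.

```python
import math

rads = [5, 2, 4, 2, 1, 3]

# List comprehension: build a new list by converting every element in turn
angles = [math.degrees(r) for r in rads]

print('Radians: ', rads)
print('Angles: ', angles)
```

Because a new list is built, the original list stays unchanged and no copy is needed.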


### 5. Creating random data

**Summary:**

Random integer between lower and upper $\rightarrow$ 
```Python
rand.randint(lower,upper)
``` 
Random float between 0 and 1 $\rightarrow$ 
```Python
rand.random() 
``` 
Choose a random (non-uniform) sample $\rightarrow$
```Python
rand.choices(set, probability, k = amount) 
``` 
Generate a random sample of a given size $\rightarrow$ 
```Python
rand.sample(set, k = amount) 
``` 
Normal distribution $\rightarrow$
```Python
rand.normalvariate(mean, standard deviation) 
``` 
Evenly spaced numbers over interval $\rightarrow$ 
```Python
np.linspace(begin, end, num = number of samples) 
``` 
Numbers with a fixed step size over an interval $\rightarrow$ 
```Python
np.arange(begin, end, stepsize)
``` 


It is sometimes interesting and useful to generate simulated or random data alongside real data. Although more complex simulations (such as [Monte Carlo](https://en.wikipedia.org/wiki/Monte_Carlo_method) methods) are outside the goals of this guide, we can still look at different ways to generate random numbers. Remember that the usual random generation methods are pseudorandom, so you should not use them to protect your bank accounts or to generate security codes. Leave that to more complex and heavier methods (or, better yet, to professionals).
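What pseudorandom means in practice can be shown with a seed. In this minimal sketch the seed value 1234 is an arbitrary choice for the example; the `secrets` module is the standard library's tool for security-related random values.

```python
import random as rand
import secrets

# The same seed reproduces the same 'random' numbers
rand.seed(1234)
first = [rand.randint(1, 100) for _ in range(5)]
rand.seed(1234)
second = [rand.randint(1, 100) for _ in range(5)]
print(first == second)

# For anything security related, the standard library has the secrets module
token = secrets.token_hex(8)   # 8 random bytes written as 16 hexadecimal characters
print(token)
```

Seeding is handy in teaching, since everyone running the notebook gets the same "random" data.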

In [None]:
# Let's generate a random integer between 1 and 100

lottery = rand.randint(1,100)
text = 'Winning number of the day is: '
print (text + repr(lottery))

In [None]:
# Generate a random float number between 0 and 1 and multiply it by 5

num = rand.random()*5
print(num)

In [None]:
# Let's pick random elements from a list, but make certain elements more likely

kids = ['Pete','Jack','Ida','Nelly','Paula','Bob']
probabilities = [10,30,20,20,5,5]

# k is how many we want to choose, choices-command might take the same name multiple times

names = rand.choices(kids, weights = probabilities, k = 3)
print(names)

In [None]:
# Let's do the same without multiple choices (this is useful for teachers to pick 'volunteers')

volunteers = rand.sample(kids, k = 3)
print (volunteers)

In [None]:
# Random number from a given normal distribution (mean, standard dev.)

num = rand.normalvariate(3, 0.1)
print(num)

In [None]:
# Let's create an evenly spaced list of numbers between 1 and 10, and randomize it a bit

numbers = np.linspace(1, 10, 200)

def randomizer(a):
    b = a.copy()
    
    for i in range(0,len(b)):
        b[i] = b[i]*rand.uniform(0,b[i])
    return b

result = randomizer(numbers)
# print(numbers)
# print(result)

fig = plt.figure(figsize=(15, 10))
plt.plot(result,'g*')
plt.show()

In [None]:
# Another way to create a list of evenly spaced numbers in [a,b[ is arange(a,b,c), where c is the step size.
# Notice that b is not included in the result. (The result might be inconsistent if c is not an integer)

numbers = np.arange(1,10,2)
print(numbers)
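The warning about non-integer step sizes can be tried out directly. This is a small sketch with values chosen for the example: because of floating-point rounding, the number of elements `arange` returns may differ from what you expect, whereas `linspace` always returns exactly the number of points you ask for.

```python
import numpy as np

# With a non-integer step, rounding decides whether a value near the end is included
a = np.arange(1, 1.3, 0.1)
print(a, len(a))

# linspace takes the number of points instead of the step size,
# and includes both ends of the interval
b = np.linspace(1, 1.3, 4)
print(b, len(b))
```

When the number of points matters more than the exact step, `linspace` is usually the safer choice.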


### 6. Plotting diagrams

**Summary:**

Basic plot $\rightarrow$
```Python
plt.plot(name, 'style and colour', varargin)
``` 

Scatterplot $\rightarrow$
```Python
plt.scatter(x-data, y-data, marker = 'markerstyle', color = 'colour', varargin)
```

Histogram $\rightarrow$
```Python
plt.hist(data, amount of bins, range = (begin,end), varargin)
```

Legend $\rightarrow$
```Python
plt.legend()
```

show plot $\rightarrow$
```Python
plt.show()
```

Fitting normal distribution in data $\rightarrow$
```Python
(mu, sigma) = norm.fit(data)
... et cetera
```

Formatting $\rightarrow$
```Python
plt.xlabel('x-axis name')
plt.title('title name')
fig = plt.figure(figsize = (horizontal size, vertical size))
```

Plotting errors$\rightarrow$
```Python
plt.errorbar(val1, val2, xerr = err1, yerr = err2, fmt = 'none')
```

Diagrams might very well be the reason to use programming in scientific teaching. Even for bigger datasets it is somewhat quick and effortless to create clarifying visualizations. Next we're going to see how plotting works with Python.

You can freely (and easily) change the colours and markers of the diagrams. [Here](https://matplotlib.org/api/markers_api.html?highlight=markers#module-matplotlib.markers) you can find the different marker styles, which are among the most important things used in plotting data.

In [None]:
# Basic diagram with plot-function. If you give it only one set of data, the x-axis shows the ordinal numbers (0, 1, 2, ...)
numbers = [1,3,54,45,52,34,4,1,2,3,2,4,132,12,12,21,12,12,21,34,2,8]
plt.plot(numbers, 'b*')

# plt.show() should always be used if you want to see what the plot looks like. Otherwise the output shows the memory
# location of the picture among other things, which we probably don't want to look at. So use this

plt.show()

In [None]:
# It's good practice to name different plots, so the readers can understand what's going on
# Here you can see how to name different sets

# Two random datasets

result1 = np.linspace(10, 20, 50)*rand.randint(2,5)
result2 = np.linspace(10, 20, 50)*rand.randint(2,5)

# Draw them both

plt.plot(result1, 'r^', label = 'Measurement 1')
plt.plot(result2, 'b*', label = 'Measurement 2')

# Name the axes and title, with fontsize-parameter you can change the size of the font
plt.xlabel('Time (s)', fontsize = 15)
plt.ylabel('Speed (m/s)', fontsize = 15)
plt.title('Measurements of speed \n', fontsize = 15) # \n creates a new line to make the picture look prettier

# Let's add a legend. If the loc-parameter is not defined, the legend is automatically placed where it usually fits best

plt.legend(loc='upper left', fontsize = 15)

# and show the plot

plt.show()

In [None]:
# Just as easily we can plot trigonometric functions
# Let the x-axis be an evenly spaced number line

x = np.linspace(0, 10, 100)

# Define the functions we're going to plot

y1 = np.sin(x)
y2 = np.cos(x)

# and draw

plt.plot(x, y1, color = 'b', label = 'sin(x)')
plt.plot(x, y2, color = 'g', label = 'cos(x)')

plt.legend()

plt.show()

In [None]:
# The default size of the pictures looks somewhat small. The figsize argument helps us make them the size we want

x = np.linspace(0, 10, 100)

y1 = np.sin(x)
y2 = np.cos(x)

# Here we define the size, you can try what different sizes look like

fig = plt.figure(figsize=(15, 10))

plt.plot(x, y1, color = 'b', label = 'sin(x)')
plt.plot(x, y2, color = 'g', label = 'cos(x)')

plt.legend()

plt.show()

Another traditional diagram is a [scatterplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html), where both axes are variables. This is very common in for example physics research.

In [None]:
def randomizer(a):
    b = a.copy()
    for i in range(0,len(b)):
        b[i] = b[i]*rand.uniform(0,1)
    return b

# Let's generate random data, where one value is between 0 and 5, and the other between 0 and 20

val1 = randomizer(np.linspace(3,5,100))
val2 = randomizer(np.linspace(10,20,100))

fig = plt.figure(figsize=(10,5))
plt.scatter(val1, val2, marker ='*', color = 'b')
plt.show()

In [None]:
# Another scatter-example. Now both values are scattered by normal distribution, not uniformly random

def randomizer(a):
    b = a.copy()
    for i in range(0,len(b)):
        b[i] = b[i]*rand.normalvariate(1, 0.1)
    return b

val1 = randomizer(np.linspace(3,5,100))
val2 = randomizer(np.linspace(10,20,100))

fig = plt.figure(figsize=(10,5))
plt.scatter(val1, val2, marker ='*', color = 'b', label = 'Measurements')

# Just for fun: let's fit a line there using linear regression

slope, intercept, r_value, p_value, std_err = stats.linregress(val1, val2)
plt.plot(val1, intercept + slope*val1, 'r', label='Linreg. fit')

plt.legend(fontsize = 15)
plt.show()

# If you want to know more about the fitted line, you can write print(slope), print(r_value) etc.

Another significant diagram is a histogram, which represents the amount of different results in the data. Histograms are fairly common, for example in (particle) physics, medical science and social sciences.

In [None]:
# Let's make a random age distribution and create a histogram out of it
def agegenerator(a):
    b = a.copy()
    for i in range(0, len(b)):
        b[i] = b[i]*rand.randint(1,100)
    return b

ages = agegenerator(np.ones(1000))

fig = plt.figure(figsize = (10,5))
plt.hist(ages, bins = 100, range = (0,110))

plt.xlabel('Ages', fontsize = 15)
plt.ylabel('Amount', fontsize = 15)
plt.title('Age distribution in a sample of %i people \n' %(len(ages)), fontsize = 15 ) 

plt.show()

In [None]:
# Let's see what a histogram for particle collisions looks like
doublemu = pd.read_csv('http://opendata.cern.ch/record/545/files/Dimuon_DoubleMu.csv')

# So this histogram is about the distribution of invariant masses (column M of the data)

fig = plt.figure(figsize = (10,5))
plt.hist(doublemu.M, bins = 300, range = (0,150))

plt.xlabel('Invariant mass(GeV/$c^2$)', fontsize = 15)
plt.ylabel('Number of events', fontsize = 15)
plt.title('Distribution of invariant masses from muons \n', fontsize = 15 ) 

plt.show()

In [None]:
# Let's focus on the bump between 80 and 100 GeV. We could just set range = (80,100), but for the sake of example
# we're going to crop the data and choose only the events in the specific range

part = doublemu[(doublemu.M >= 80) & (doublemu.M <= 100)]


fig = plt.figure(figsize = (10,5))
plt.hist(part.M, bins = 200)

plt.xlabel('Invariant mass (GeV/$c^2$)', fontsize = 15)
plt.ylabel('Number of events', fontsize = 15)
plt.title('Invariant mass distribution from muons \n', fontsize = 15 ) 

plt.show()

In general, making non-linear fits to the results requires quite a bit of coding, but in the case of distributions (for example the normal distribution, which the invariant mass peak resembles) Python has quite a lot of commands to make your life easier.

In [None]:
# Here we set the limits for the fit. It is good practice to set these in variables in case you want to change them later, 
# makes it much easier

lower = 87
upper = 95

piece = doublemu[(doublemu.M > lower) & (doublemu.M < upper)]

fig = plt.figure(figsize=(15,10))

# Above are the limits for the normal fit, below are the limits for how wide a range of the histogram we are going to draw.
# Note that the fit doesn't cover everything that's seen on the histogram

shw_lower = 80
shw_upper = 100

area = doublemu[(doublemu.M > shw_lower) & (doublemu.M < shw_upper)]

# Because the shown histogram's area is equal to 1, we have to calculate a multiplier for the fitted curve

multip = len(piece)/len(area)

# mean and standard deviation for the fit

(mu, sigma) = norm.fit(piece.M)

# Let's draw the histogram

n, bins, patches = plt.hist(area.M, 300, density = 1, facecolor = 'g', alpha=0.75, histtype = 'stepfilled')

# And make the fit as well

y_fit = multip*norm.pdf(bins, mu, sigma)
line = plt.plot(bins, y_fit, 'r--', linewidth = 2)

# This heading looks bad in the code, but beautiful in the final picture. 

plt.title(r'$\mathrm{Histogram\ of\ invariant\ masses\ normed\ to\ one:}\ \mu=%.3f,\ \sigma=%.3f$'
 %(mu,sigma),fontsize=15)

# While we're at it, let's give the plot a grid!

plt.grid(True)

plt.show()

We can also draw a histogram out of data which has no numbers. Let's take a look at [collision data from London](http://roads.data.tfl.gov.uk).

In [None]:
# Here's all the collisions from 2016, a bit over 40 000 different vehicles. Same events have the same AREFNO.

traffic = pd.read_table('http://roads.data.tfl.gov.uk/AccidentStats/Prod/2016-gla-data-extract-vehicle.csv', sep = ",")
casualties = pd.read_table('http://roads.data.tfl.gov.uk/AccidentStats/Prod/2016-gla-data-extract-casualty.csv', sep = ",")

In [None]:
traffic.head()

In [None]:
casualties.head()

In [None]:
# Let's check the collisions for ages between certain limits

lower = 18
upper = 25

age_collisions = traffic.loc[(traffic['Driver Age'] <= upper) & (traffic['Driver Age'] >= lower)]

In [None]:
# What does the vehicle distribution with this age group look like?

fig = plt.figure(figsize=(10,5))
plt.hist(age_collisions['Vehicle Type'])

# We have to rotate the xticks to see what kind of vehicles are actually used
plt.xticks(rotation = 40, ha='right')

plt.show()

Cars seem to dominate this statistic, which isn't too surprising. But a ridden horse? We could dig deeper into this:

In [None]:
# Let's take out all the horses from the data:

horses = traffic.loc[traffic['Vehicle Type'] == '16 Ridden Horse']
horses.head()

In [None]:
# Hmm, same AREFNO, so the horses seem to have collided with each other (Veh. Impact: Front hit first, back hit first)
# How severe was this collision?

horseCasualties = casualties.loc[casualties['AREFNO'] == '0116TW60237']
horseCasualties.head()

# Protip: manually entering the ref# is not a good practice, particularly when working with larger datasets. In that 
# case you should make a reference to another table and compare the ref# to make this work automatically.
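The idea in the protip can be sketched with `isin()`, which compares a column against a whole set of reference numbers at once. The two miniature tables below are made up for this example: only the column names 'AREFNO' and 'Vehicle Type' and the label '16 Ridden Horse' come from the real data, while the 'Severity' column, the other labels and the reference numbers are assumptions.

```python
import pandas as pd

# Made-up miniature versions of the vehicle and casualty tables
vehicles = pd.DataFrame({'AREFNO': ['A1', 'A1', 'B2', 'C3'],
                         'Vehicle Type': ['16 Ridden Horse', '16 Ridden Horse', '09 Car', '09 Car']})
injured = pd.DataFrame({'AREFNO': ['A1', 'B2', 'C3'],
                        'Severity': ['Slight', 'Serious', 'Slight']})

# Reference numbers of all collisions that involve a ridden horse
horse_refs = vehicles.loc[vehicles['Vehicle Type'] == '16 Ridden Horse', 'AREFNO']

# Keep the casualty rows whose reference number is among them
horse_injured = injured[injured['AREFNO'].isin(horse_refs)]
print(horse_injured)
```

This way no reference number has to be typed by hand, and the same two lines keep working if the data changes.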

Luckily the collision wasn't too severe, and only one of the riders got hurt slightly.

A word on errors when plotting data: in reality there's always some uncertainty in how accurate a measurement is, or even in how accurately you can measure something. These precision limits can be found using statistical methods on the fits made for the data, or they can be known for each data point separately (which is often the case in measurements made in schools). Let's make an example of this.

In [None]:
# As you may have noticed by now, we have defined randomizer multiple times in this document. That's not how it should 
# be done, as it takes away the idea of functions. However it's done this way if someone wants to check out only
# this example and not the first one where this function was introduced.

def randomizer(a):
    b = a.copy()
    for i in range(0,len(b)):
        b[i] = b[i]*rand.normalvariate(1, 0.1)
    return b

# Let's generate the random data

val1 = randomizer(np.linspace(3,5,100))
val2 = randomizer(np.linspace(10,20,100))

# And let's give each datapoint a random error

err1 = (1/5)*randomizer(np.ones(len(val1)))
err2 = randomizer(np.ones(len(val2)))

fig = plt.figure(figsize=(10,5))

plt.scatter(val1, val2, marker ='*', color = 'b', label = 'Measurements')
plt.errorbar(val1, val2, xerr = err1, yerr = err2, fmt = 'none')

# Let's throw in a fit based on linear regression as well

slope, intercept, r_value, p_value, std_err = stats.linregress(val1, val2)
plt.plot(val1, intercept + slope*val1, 'r', label='Fit')

plt.legend(fontsize = 15)
plt.show()

# If you want to know more of the mathematical values of the fit, you can write print(slope), print(std_err), etc..


### 7. Animations

You can pretty easily also create animations using Python. This can be done with multiple different modules, but we recommend **NOT** to use plotly with Notebooks, as it slows down everything to the point nothing can be done. In this example we're going to create an animation of a histogram which nicely shows why more data = better results.

In [None]:
data = pd.read_csv('http://opendata.cern.ch/record/545/files/Dimuon_DoubleMu.csv')

iMass = data.M

In [None]:
# Let's define the function that's going to update the histogram
# variable num is basically the frame number
# So the way animations work is that this function calculates a new histogram for each frame 

def updt_hist(num, iMass):
 plt.cla()
 axes = plt.gca()
 axes.set_ylim(0,8000)
 axes.set_xlim(0,200)
 plt.hist(iMass[:num*480], bins = 120)

NOTE: cells including animations are $\Large \textbf{ slow }$ to run. The more frames the more time it takes to run.

In [None]:
# Required for animations
import matplotlib.animation

In [None]:
%%capture
fig = plt.figure()
 
# fargs tells which extra variables the function (updt_hist) is going to take in. The trailing comma is required
# so that (iMass, ) is read as a tuple with one element. The first argument of the function is automatically
# the current frame number
anim = matplotlib.animation.FuncAnimation(fig, updt_hist, frames = 200, fargs=(iMass, ) )

# anim.to_jshtml() converts the animation to (JavaScript) HTML, so it can be shown in the Notebook
from IPython.display import HTML
HTML(anim.to_jshtml())

The cell above doesn't give any output because of the ```%%capture``` magic command. Without it we'd get two different pictures of the animation, so it looks prettier this way.

In [None]:
HTML(anim.to_jshtml())


### 8. Maps and heatmaps

Using interactive maps in Jupyter Notebook so you can plot data on them? Yes please! Using them is much simpler than it sounds, and this example shows how. The data you're going to plot just needs to have latitude and longitude columns (or coordinates in some other system from which latitude and longitude can be calculated).

In [None]:
# Folium has maps:
import folium

# We're also going to need a way to plot a heatmap:
from folium.plugins import HeatMap

In [None]:
# The data includes all earthquake data from the last month, chances are that the newest data of the set are from
# last night or this morning
quakeData = pd.read_csv('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv')
quakeData.head()

In [None]:
# This is required because our data is in a dataframe, and the HeatMap function reads lists

# First let's make a list that is long enough, this is the variable we're going to save the data in
dat = [0]*len(quakeData)

# The list is going to consist of smaller lists containing latitude, longitude and magnitude
# (magnitude is not required, but it's nice to have in case you want to plot only quakes above
# a certain magnitude for example)
for i in range(0, len(quakeData)):
 dat[i] = [quakeData['latitude'][i], quakeData['longitude'][i], quakeData['mag'][i]]

In [None]:
# Some (at least one) of the earthquakes don't have a magnitude (it's saved as NaN), so
# we have to remove these values

dat = [x for x in dat if not np.isnan(x[2])]
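
The same kind of list can be built without a loop by letting pandas do the work. This is only a sketch of one alternative: the tiny `quakes` table and the 4.0 magnitude cut are made up for illustration, and only the column names match the USGS data above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for quakeData with the same column names
quakes = pd.DataFrame({
    'latitude':  [10.0, 20.0, 30.0],
    'longitude': [-70.0, -80.0, -90.0],
    'mag':       [4.5, np.nan, 2.0],
})

# Drop rows without a magnitude, then keep only the stronger quakes
strong = quakes.dropna(subset=['mag'])
strong = strong[strong['mag'] >= 4.0]

# A plain list of [latitude, longitude, magnitude] rows, like dat above
dat_strong = strong[['latitude', 'longitude', 'mag']].values.tolist()
print(dat_strong)
```

A list like `dat_strong` could then be given to HeatMap in the same way as `dat`.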

In [None]:
# Different map tiles: https://deparkes.co.uk/2016/06/10/folium-map-tiles/
# world_copy_jump = True means that the map can be scrolled to the side and the data can be seen there as well
# If you want the map to be a 'single' map you can add an extra argument no_wrap = True
# With control_scale you can see the scale in the bottom left corner

m = folium.Map([15., -75.], tiles='openstreetmap', zoom_start=3, world_copy_jump = True, control_scale = True)

HeatMap(dat, radius = 15).add_to(m)

m

Let's check another example where we have to change the coordinate system. This dataset uses an easting-northing system which isn't too different from the easting-northing known from [UTM](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system). You can find more about the coordinate system on page 38 of [this guide](https://www.ordnancesurvey.co.uk/docs/support/guide-coordinate-systems-great-britain.pdf) and see that the conversion isn't trivial, if it interests you.

In [None]:
collData = pd.read_csv('https://files.datapress.com/london/dataset/road-casualties-severity-borough/TFL-road-casualty-data-since-2005.csv')
collData.head()

In [None]:
# Luckily someone has encountered this grid system before and we don't have to do the conversion ourselves
from OSGridConverter import grid2latlong

In [None]:
# Ignore collisions where the severity is 'slight'
part = collData[collData['Casualty_Severity'] != '3 Slight']

In [None]:
# In this example the conversion is done in two steps just to show how it's done, which also makes
# the code more readable

# coords is used to temporarily store the lat & lon data, as the grid2latlong function handles one row at a time
coords = [0]*len(part)

# And we have to iterate over the whole dataset row by row..
# The coordinates in the dataset don't include the area letters (TQ, where London is located).
# Instead the area is given by the first digit of each easting and northing value, so we have
# to pick everything else from them using the syntax (name)[1:], which skips the first character
# ALSO they are saved as integers and grid2latlong takes in strings, so we have to change
# the datatype by using str(value)
# This cell might run for a while
i = 0
for index, row in part.iterrows():
 coords[i] = grid2latlong('TQ' + str(row['Easting'])[1:] + str(row['Northing'])[1:])
 i += 1
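
To see what the string built inside the loop looks like, here's a sketch with made-up easting and northing values (the numbers are invented, only the steps are the same as above). It stops before the call to grid2latlong.

```python
# Made-up easting and northing values, saved as integers like in the dataset
easting = 530000
northing = 180000

# str(...) turns the integer into text, [1:] drops its first character
east_part = str(easting)[1:]
north_part = str(northing)[1:]

# The area letters are put in front of the remaining digits
grid_ref = 'TQ' + east_part + north_part
print(grid_ref)
```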

In [None]:
# Because of the type grid2latlong returns, we have to create a new variable (list) so we can use
# the values with the map
latlong = [[0,0]]*len(coords)

# For each value in coords we pick its latitude and longitude values and save them
# in the i-th row of latlong
for i in range (0,len(coords)):
 latlong[i] = [coords[i].latitude,coords[i].longitude]

In [None]:
m = folium.Map([51.5,-0.1], zoom_start=9, world_copy_jump = True, control_scale = True)

HeatMap(latlong, radius = 10).add_to(m)

m

Wouldn't it be nice if everyone used the same coordinate system?


### 9. Problems? Check here

**Summary:** 

Bohoo, I can't? 
Cell seems stuck and doesn't draw the plot or run the code? 
I get an error 'name is not defined' or 'name does not exist'? 
I tried to save something in to a variable but print(name) tells me None? 
My data won't load? 
The data I loaded contains some NaN values? 
I combined pieces of data but now I can't do things with the new variable? 
My code doesn't work, even if it's correctly written? 
The dates in the data are confusing the program, how do I fix this? 
I copied the data in to a new variable, but the changes to it also changes the original data?

#### Bohoo, I can't?

No problem, nobody starts as a champion. You learn by doing and errors are part of it (some say 90% of coding is fixing errors..). 

One great thing about Python is that there are A LOT of users. No matter the problem, chances are someone has faced it already and posted a solution online. Googling the problem usually gives the right answer within the first few results.

Here are fixes to some common problems (which we faced when making this document).

#### Cell seems stuck and doesn't draw the plot or run the code?

If running the cell takes longer than a few seconds, without it being needlessly complicated or handling **large** datasets, it's probably stuck in an infinite loop. You should stop the kernel (by choosing ***Kernel $\rightarrow$ Interrupt*** from the top bar or pressing the square right below it) and check your code for possible errors. If you can't find the problem, try to simplify the code until you're positive there's nothing wrong with it. (Sometimes just restarting the kernel and running the cells again also makes it work.)

Another common cause is a mistake in the code that makes the program do something other than what you meant. For example: you're drawing a histogram but forgot to choose a specific column. Now the program tries to create a histogram of the whole data, which it can't do sensibly without further specifications.

#### I get an error 'name is not defined' or 'name does not exist'? 

The variable you're referring to doesn't exist. Check that you've run the cell where the variable is defined during this session. Also make sure that the variable name is correct, as they are case-sensitive. 
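
A minimal sketch of the case-sensitivity problem, with a made-up variable name. The try/except is there only so the cell keeps running and shows the error message instead of stopping.

```python
Mass = 91.2   # note the capital M

try:
    print(mass)   # lower-case m is a different name that was never defined
except NameError as err:
    message = str(err)
    print("NameError:", message)

print(Mass)       # the correctly spelled name works
```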

#### I tried to save something in to a variable but print(name) tells me None? 

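Then there really isn't anything in the variable. This usually happens when you save the result of a function that changes the data in place and returns nothing. A related mistake is forgetting to save the result of an operation in to a variable. Write, for example,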

```Python
var = load(data)
var = var*2
```

and not 
```Python
var = load(data)
var*2
```

Make sure that the operation does what you expect, so it doesn't delete the data by accident (or do anything else unexpected).
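
One way to end up with None is to save the result of a method that changes the data in place. A small sketch with a plain Python list (the variable names are made up):

```python
values = [3, 1, 2]

# list.sort() sorts the list in place and returns None
lost = values.sort()
print(lost)      # None
print(values)    # [1, 2, 3]

# sorted() returns a new list, so saving its result keeps the data
kept = sorted([3, 1, 2])
print(kept)      # [1, 2, 3]
```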

#### My data won't load? 

You can check what text-based data (such as .csv) looks like with even the most basic text editor. This shows how the values are separated, which rows contain the information you want, and whether the dataset is even the one you wanted.

Separators, headers and such can be defined in the arguments of the read_csv and read_table functions, for example
```Python
pd.read_csv('file.csv', sep = ';')
```
would load a csv file named file.csv (in the same folder), with ';' as the separator. You can find more on this in section 3 of this document.
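
Here's a sketch of how those arguments work together. The file contents are made up and read from a string with io.StringIO, so the cell doesn't need an actual file.

```python
import io
import pandas as pd

# Made-up file contents: semicolon-separated, with one comment line on top
text = """# measurements from a made-up experiment
time;height
0.0;1.5
0.5;1.2
1.0;0.4
"""

# sep sets the separator, skiprows skips the comment line before the header
df = pd.read_csv(io.StringIO(text), sep=';', skiprows=1)
print(df)
print(df.columns.tolist())
```

If the separator is wrong, everything usually ends up in a single column, which is easy to spot with df.head().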

#### The data I loaded contains some NaN values? 

NaN stands for Not a Number, and it's commonly used in computing. Either the value at that point is undefined (like sqrt(-1)) or it simply doesn't exist.

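Many pandas functions (such as mean() and sum()) skip NaN values by default. Plain NumPy functions generally don't, but there are NaN-aware versions such as np.nanmean(), and you can also remove the missing values with dropna().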
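
A short sketch of the difference, using a made-up series with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

pandas_mean = s.mean()                  # pandas skips NaN by default
plain_mean = np.mean(s.to_numpy())      # plain NumPy mean gives nan
nan_mean = np.nanmean(s.to_numpy())     # NaN-aware NumPy version
n_missing = s.isna().sum()              # number of missing values
cleaned = s.dropna().tolist()           # the values without NaN

print(pandas_mean, plain_mean, nan_mean, n_missing, cleaned)
```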

#### I combined pieces of data but now I can't do things with the new variable? 

Did you combine different kinds of data types? Usually this isn't a problem with Python, as it automatically decides the type of each variable, but combining integer, float and string variables can sometimes cause trouble. In datasets even numbers are sometimes saved as strings, which is unfortunate to notice only after doing something with the data. Python has built-in functions for checking the type of a variable, such as type() and isinstance(), and for a dataframe you can check the types of the columns with varname.dtypes.

Did you combine the data correctly? If you wanted the columns next to each other, you probably shouldn't combine them so that they end up on top of each other. You can check what your variable holds with varname.shape (no parentheses) or varname.head().
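
Both problems can be seen in a small sketch. The two dataframes are made up, and pd.concat is just one of the ways to combine data in pandas.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'y': ['4', '5', '6']})   # numbers saved as text

side_by_side = pd.concat([a, b], axis=1)   # columns next to each other
stacked = pd.concat([a, b], axis=0)        # rows on top of each other

print(side_by_side.shape)   # (3, 2)
print(stacked.shape)        # (6, 2), with NaN where a column is missing

# dtypes shows that y holds text, pd.to_numeric turns it in to numbers
print(side_by_side.dtypes)
side_by_side['y'] = pd.to_numeric(side_by_side['y'])
total = side_by_side['x'] + side_by_side['y']
print(total.tolist())
```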

#### My code doesn't work, even if it's correctly written? 

Check the code once more. A comma in the wrong place or a letter in the wrong case in a variable's name is enough to create problems.

If the code _really_ doesn't work, even if it should, the reason might be in the kernel. Try ***Restart & Clear Output*** from the Kernel menu in the top bar, which usually fixes it.

#### The dates in the data are confusing the program, how do I fix this? 

As you're probably aware, different date formats are used around the world. If the default settings don't make the data behave correctly, check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) of pandas.read_csv() to see how to change the date settings. **parse_dates** and **dayfirst** might solve the problem (as might **date_parser**, which newer pandas versions have deprecated in favour of **date_format**). There's also a Python module named **[time](https://docs.python.org/3/library/time.html)**, which has tools for these kinds of situations.
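
If the dates have already been loaded as text, pd.to_datetime can convert them afterwards. A sketch with two made-up dates written day first:

```python
import pandas as pd

dates = pd.Series(['03/04/2019', '15/04/2019'])

# dayfirst=True reads 03/04/2019 as the 3rd of April, not the 4th of March
parsed = pd.to_datetime(dates, dayfirst=True)
print(parsed)

# Giving the exact format removes the guessing altogether
parsed_fmt = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed_fmt.dt.month.tolist())
```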

#### I copied the data in to a new variable, but the changes to it also changes the original data?

Instead of saving the actual data in to the new variable, Python stores a _reference_ (similar to a pointer) there. A reference tells where the data is saved in memory. When creating a new variable like this
```Python
new_var = old_var
```
Python just copies the reference, and the two variables are practically the same. Taking only a part of the original data doesn't always help either: slicing a list creates a new list, but slicing a NumPy array gives a view that still shares its data with the original. If you want the whole data in two places (if for example you want one variable to have the data multiplied and compare it to the original data, not sure why someone would want to do this but you never know when working with humans), you should use the command .copy():
```Python
new_var = old_var.copy()
```
This copies the actual data in to a new memory location, and changes to new_var won't affect old_var.
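
A small sketch of the difference, first with a plain list and then with a NumPy array (all the variable names are made up):

```python
import numpy as np

original = [1, 2, 3]

alias = original                # a second name for the same list
alias[0] = 100
print(original)                 # [100, 2, 3], the original changed too

independent = original.copy()   # a separate list with the same values
independent[1] = 200
print(original)                 # [100, 2, 3], unchanged this time
print(independent)              # [100, 200, 3]

arr = np.array([1, 2, 3])
view = arr[:2]                  # a slice of a NumPy array shares its data
view[0] = 99
print(arr)                      # [99  2  3]

part = arr[:2].copy()           # .copy() makes the slice independent
part[1] = -1
print(arr)                      # still [99  2  3]
```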