{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Analysis with Pandas\n",
"\n",
"Data can often be stored in a multiple of file formats:\n",
"\n",
"- Excel spreadsheets (`.xlsx`);\n",
"- Comma seperated files (`.csv`);\n",
"- Json (`.json`);\n",
"- ...\n",
"\n",
"Simlarly you might want to store data in any of the above data formats. This is where the [Pandas](http://pandas.pydata.org/) library can be useful:\n",
"\n",
"> \"... easy-to-use data structures and data analysis tools for the Python programming language.\"\n",
"\n",
"In this section we will see how to:\n",
"\n",
"- Read in data files;\n",
"- Query those data files;\n",
"- Write to data files.\n",
"\n",
"## Reading in data files\n",
"\n",
"Consider the file [goldbach.xlsx](data/goldbach.xlsx) which contains rows of data confirming the [Goldbach conjecture](https://en.wikipedia.org/wiki/Goldbach's_conjecture):\n",
"\n",
"> Every even integer greater than 2 can be expressed as the sum of two primes.\n",
"\n",
"The data is made up of 3 columns: $N$, $a$ and $b$ and there is a row for every possible expression of $N = a + b$ for $a \\leq b$ prime. Note that this data was made using [Appendix A](A1 - Appendix creating Goldbach data.ipynb).\n",
"\n",
"Let us start by importing Pandas and reading in the data file. To do this you will need to know the path to the file on your computer:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_excel(\"data/goldbach.xlsx\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This reads in the excel file as a Pandas data frame. Let us take a look at the first few rows of the data frame (the `head`):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
N
\n",
"
a
\n",
"
b
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
4
\n",
"
2
\n",
"
2
\n",
"
\n",
"
\n",
"
1
\n",
"
6
\n",
"
3
\n",
"
3
\n",
"
\n",
"
\n",
"
2
\n",
"
8
\n",
"
3
\n",
"
5
\n",
"
\n",
"
\n",
"
3
\n",
"
10
\n",
"
3
\n",
"
7
\n",
"
\n",
"
\n",
"
4
\n",
"
10
\n",
"
5
\n",
"
5
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" N a b\n",
"0 4 2 2\n",
"1 6 3 3\n",
"2 8 3 5\n",
"3 10 3 7\n",
"4 10 5 5"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us look at the last few rows (the `tail`):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
N
\n",
"
a
\n",
"
b
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2591
\n",
"
500
\n",
"
151
\n",
"
349
\n",
"
\n",
"
\n",
"
2592
\n",
"
500
\n",
"
163
\n",
"
337
\n",
"
\n",
"
\n",
"
2593
\n",
"
500
\n",
"
193
\n",
"
307
\n",
"
\n",
"
\n",
"
2594
\n",
"
500
\n",
"
223
\n",
"
277
\n",
"
\n",
"
\n",
"
2595
\n",
"
500
\n",
"
229
\n",
"
271
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" N a b\n",
"2591 500 151 349\n",
"2592 500 163 337\n",
"2593 500 193 307\n",
"2594 500 223 277\n",
"2595 500 229 271"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying our data\n",
"\n",
"A quick way of getting a summary of the data we have is with `.describe()`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
N
\n",
"
a
\n",
"
b
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
2596.000000
\n",
"
2596.000000
\n",
"
2596.000000
\n",
"
\n",
"
\n",
"
mean
\n",
"
311.466872
\n",
"
72.297766
\n",
"
239.169106
\n",
"
\n",
"
\n",
"
std
\n",
"
127.817425
\n",
"
58.087254
\n",
"
111.180401
\n",
"
\n",
"
\n",
"
min
\n",
"
4.000000
\n",
"
2.000000
\n",
"
2.000000
\n",
"
\n",
"
\n",
"
25%
\n",
"
214.000000
\n",
"
23.000000
\n",
"
157.000000
\n",
"
\n",
"
\n",
"
50%
\n",
"
329.000000
\n",
"
59.000000
\n",
"
239.000000
\n",
"
\n",
"
\n",
"
75%
\n",
"
420.000000
\n",
"
107.000000
\n",
"
317.000000
\n",
"
\n",
"
\n",
"
max
\n",
"
500.000000
\n",
"
241.000000
\n",
"
491.000000
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" N a b\n",
"count 2596.000000 2596.000000 2596.000000\n",
"mean 311.466872 72.297766 239.169106\n",
"std 127.817425 58.087254 111.180401\n",
"min 4.000000 2.000000 2.000000\n",
"25% 214.000000 23.000000 157.000000\n",
"50% 329.000000 59.000000 239.000000\n",
"75% 420.000000 107.000000 317.000000\n",
"max 500.000000 241.000000 491.000000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although in this particular data set that is not terribly useful, that's more useful in the case of statistical analysis.\n",
"\n",
"Let us take a closer look at a specific number ($N=322$) and the ways it can be written as the sum of two primes."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
N
\n",
"
a
\n",
"
b
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1250
\n",
"
322
\n",
"
5
\n",
"
317
\n",
"
\n",
"
\n",
"
1251
\n",
"
322
\n",
"
11
\n",
"
311
\n",
"
\n",
"
\n",
"
1252
\n",
"
322
\n",
"
29
\n",
"
293
\n",
"
\n",
"
\n",
"
1253
\n",
"
322
\n",
"
41
\n",
"
281
\n",
"
\n",
"
\n",
"
1254
\n",
"
322
\n",
"
53
\n",
"
269
\n",
"
\n",
"
\n",
"
1255
\n",
"
322
\n",
"
59
\n",
"
263
\n",
"
\n",
"
\n",
"
1256
\n",
"
322
\n",
"
71
\n",
"
251
\n",
"
\n",
"
\n",
"
1257
\n",
"
322
\n",
"
83
\n",
"
239
\n",
"
\n",
"
\n",
"
1258
\n",
"
322
\n",
"
89
\n",
"
233
\n",
"
\n",
"
\n",
"
1259
\n",
"
322
\n",
"
131
\n",
"
191
\n",
"
\n",
"
\n",
"
1260
\n",
"
322
\n",
"
149
\n",
"
173
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" N a b\n",
"1250 322 5 317\n",
"1251 322 11 311\n",
"1252 322 29 293\n",
"1253 322 41 281\n",
"1254 322 53 269\n",
"1255 322 59 263\n",
"1256 322 71 251\n",
"1257 322 83 239\n",
"1258 322 89 233\n",
"1259 322 131 191\n",
"1260 322 149 173"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['N'] == 322]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To briefly explain what is happening there: `df['N'] == 322` is returning a series of `True` and `False`, identify the positions in our data where $N=322$. We directly pass that series to the data frame to get out those numbers `df[df['N'] == 322]`.\n",
"\n",
"Let us find out how many decompositions exist for each number in our data frame. We will do this using the `value_counts()` statement:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"420 30\n",
"480 29\n",
"462 28\n",
"450 27\n",
"390 27\n",
"456 24\n",
"468 24\n",
"330 24\n",
"486 23\n",
"474 23\n",
"498 23\n",
"360 22\n",
"492 22\n",
"378 22\n",
"414 21\n",
"444 21\n",
"300 21\n",
"438 21\n",
"426 21\n",
"396 21\n",
"324 20\n",
"408 20\n",
"354 20\n",
"490 19\n",
"336 19\n",
"294 19\n",
"210 19\n",
"270 19\n",
"432 19\n",
"384 19\n",
" ..\n",
"58 4\n",
"80 4\n",
"88 4\n",
"92 4\n",
"152 4\n",
"42 4\n",
"26 3\n",
"30 3\n",
"24 3\n",
"40 3\n",
"22 3\n",
"98 3\n",
"52 3\n",
"56 3\n",
"62 3\n",
"128 3\n",
"44 3\n",
"38 2\n",
"18 2\n",
"10 2\n",
"14 2\n",
"16 2\n",
"20 2\n",
"68 2\n",
"28 2\n",
"32 2\n",
"12 1\n",
"8 1\n",
"6 1\n",
"4 1\n",
"Name: N, Length: 249, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = df['N'].value_counts()\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above is a series (ordered by count), let us rename the count variable and create a new dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
counts
\n",
"
\n",
" \n",
" \n",
"
\n",
"
420
\n",
"
30
\n",
"
\n",
"
\n",
"
480
\n",
"
29
\n",
"
\n",
"
\n",
"
462
\n",
"
28
\n",
"
\n",
"
\n",
"
450
\n",
"
27
\n",
"
\n",
"
\n",
"
390
\n",
"
27
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" counts\n",
"420 30\n",
"480 29\n",
"462 28\n",
"450 27\n",
"390 27"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(s.rename('counts'))\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have created a dataframe with an index given by `N`. Let us create a new variable which is the normalised count:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df[\"normalised_counts\"] = df['counts'] / df.index"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"