{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Analysis with Pandas\n", "\n", "Data can often be stored in a multiple of file formats:\n", "\n", "- Excel spreadsheets (`.xlsx`);\n", "- Comma seperated files (`.csv`);\n", "- Json (`.json`);\n", "- ...\n", "\n", "Simlarly you might want to store data in any of the above data formats. This is where the [Pandas](http://pandas.pydata.org/) library can be useful:\n", "\n", "> \"... easy-to-use data structures and data analysis tools for the Python programming language.\"\n", "\n", "In this section we will see how to:\n", "\n", "- Read in data files;\n", "- Query those data files;\n", "- Write to data files.\n", "\n", "## Reading in data files\n", "\n", "Consider the file [goldbach.xlsx](data/goldbach.xlsx) which contains rows of data confirming the [Goldbach conjecture](https://en.wikipedia.org/wiki/Goldbach's_conjecture):\n", "\n", "> Every even integer greater than 2 can be expressed as the sum of two primes.\n", "\n", "The data is made up of 3 columns: $N$, $a$ and $b$ and there is a row for every possible expression of $N = a + b$ for $a \\leq b$ prime. Note that this data was made using [Appendix A](A1 - Appendix creating Goldbach data.ipynb).\n", "\n", "Let us start by importing Pandas and reading in the data file. To do this you will need to know the path to the file on your computer:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_excel(\"data/goldbach.xlsx\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This reads in the excel file as a Pandas data frame. Let us take a look at the first few rows of the data frame (the `head`):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nab
0422
1633
2835
31037
41055
\n", "
" ], "text/plain": [ " N a b\n", "0 4 2 2\n", "1 6 3 3\n", "2 8 3 5\n", "3 10 3 7\n", "4 10 5 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at the last few rows (the `tail`):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nab
2591500151349
2592500163337
2593500193307
2594500223277
2595500229271
\n", "
" ], "text/plain": [ " N a b\n", "2591 500 151 349\n", "2592 500 163 337\n", "2593 500 193 307\n", "2594 500 223 277\n", "2595 500 229 271" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying our data\n", "\n", "A quick way of getting a summary of the data we have is with `.describe()`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nab
count2596.0000002596.0000002596.000000
mean311.46687272.297766239.169106
std127.81742558.087254111.180401
min4.0000002.0000002.000000
25%214.00000023.000000157.000000
50%329.00000059.000000239.000000
75%420.000000107.000000317.000000
max500.000000241.000000491.000000
\n", "
" ], "text/plain": [ " N a b\n", "count 2596.000000 2596.000000 2596.000000\n", "mean 311.466872 72.297766 239.169106\n", "std 127.817425 58.087254 111.180401\n", "min 4.000000 2.000000 2.000000\n", "25% 214.000000 23.000000 157.000000\n", "50% 329.000000 59.000000 239.000000\n", "75% 420.000000 107.000000 317.000000\n", "max 500.000000 241.000000 491.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although in this particular data set that is not terribly useful, that's more useful in the case of statistical analysis.\n", "\n", "Let us take a closer look at a specific number ($N=322$) and the ways it can be written as the sum of two primes." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nab
12503225317
125132211311
125232229293
125332241281
125432253269
125532259263
125632271251
125732283239
125832289233
1259322131191
1260322149173
\n", "
" ], "text/plain": [ " N a b\n", "1250 322 5 317\n", "1251 322 11 311\n", "1252 322 29 293\n", "1253 322 41 281\n", "1254 322 53 269\n", "1255 322 59 263\n", "1256 322 71 251\n", "1257 322 83 239\n", "1258 322 89 233\n", "1259 322 131 191\n", "1260 322 149 173" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['N'] == 322]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To briefly explain what is happening there: `df['N'] == 322` is returning a series of `True` and `False`, identify the positions in our data where $N=322$. We directly pass that series to the data frame to get out those numbers `df[df['N'] == 322]`.\n", "\n", "Let us find out how many decompositions exist for each number in our data frame. We will do this using the `value_counts()` statement:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "420 30\n", "480 29\n", "462 28\n", "450 27\n", "390 27\n", "456 24\n", "468 24\n", "330 24\n", "486 23\n", "474 23\n", "498 23\n", "360 22\n", "492 22\n", "378 22\n", "414 21\n", "444 21\n", "300 21\n", "438 21\n", "426 21\n", "396 21\n", "324 20\n", "408 20\n", "354 20\n", "490 19\n", "336 19\n", "294 19\n", "210 19\n", "270 19\n", "432 19\n", "384 19\n", " ..\n", "58 4\n", "80 4\n", "88 4\n", "92 4\n", "152 4\n", "42 4\n", "26 3\n", "30 3\n", "24 3\n", "40 3\n", "22 3\n", "98 3\n", "52 3\n", "56 3\n", "62 3\n", "128 3\n", "44 3\n", "38 2\n", "18 2\n", "10 2\n", "14 2\n", "16 2\n", "20 2\n", "68 2\n", "28 2\n", "32 2\n", "12 1\n", "8 1\n", "6 1\n", "4 1\n", "Name: N, Length: 249, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = df['N'].value_counts()\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is a series (ordered by count), let us rename the count variable and create a new dataframe:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
counts
42030
48029
46228
45027
39027
\n", "
" ], "text/plain": [ " counts\n", "420 30\n", "480 29\n", "462 28\n", "450 27\n", "390 27" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(s.rename('counts'))\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have created a dataframe with an index given by `N`. Let us create a new variable which is the normalised count:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df[\"normalised_counts\"] = df['counts'] / df.index" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countsnormalised_counts
420300.071429
480290.060417
462280.060606
450270.060000
390270.069231
\n", "
" ], "text/plain": [ " counts normalised_counts\n", "420 30 0.071429\n", "480 29 0.060417\n", "462 28 0.060606\n", "450 27 0.060000\n", "390 27 0.069231" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now look a bit closer at this using `describe`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countsnormalised_counts
count249.000000249.000000
mean10.4257030.051769
std5.8979480.029921
min1.0000000.017588
25%6.0000000.030973
50%9.0000000.044776
75%13.0000000.062500
max30.0000000.250000
\n", "
" ], "text/plain": [ " counts normalised_counts\n", "count 249.000000 249.000000\n", "mean 10.425703 0.051769\n", "std 5.897948 0.029921\n", "min 1.000000 0.017588\n", "25% 6.000000 0.030973\n", "50% 9.000000 0.044776\n", "75% 13.000000 0.062500\n", "max 30.000000 0.250000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also directly plot our data, but just as when we used Sympy we need to tell Jupyter to display the plots in the notebook:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a quick histogram of the counts:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "nbval-ignore-output" ] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAERFJREFUeJzt3W+MXGd1x/HvqZOI1Atx0iRb10m7oUQRiC2hWUWp0qLdhNKUVI0rJYgoRQ5Ku5UKUSpcFZcXBaoimarhz4uKyiWAKwEbK0BthQiw3GwpUgnYJOAElwYiN8Rx7VLswKIIZDh9sdeqd9n1zOzM7Owcvh/Jmrl3npl7jq735+tn770TmYkkafj93KALkCT1hoEuSUUY6JJUhIEuSUUY6JJUhIEuSUUY6JJUhIEuSUUY6JJUxDmrubGLL744x8bGFqz7wQ9+wPr161ezjL6q1g/U68l+1r5qPXXbz4EDB76TmZe0GreqgT42Nsb+/fsXrJudnWVycnI1y+irav1AvZ7sZ+2r1lO3/UTEf7UzzikXSSrCQJekIgx0SSrCQJekIgx0SSrCQJekIgx0SSrCQJekItoK9IjYEBEPRMR/RMShiPiNiLgoIvZGxJPN44X9LlaStLx2rxR9P/CZzLw1Is4Dfh54G7AvM7dHxDZgG/DWPtX5M2ls26cHst3D228eyHYldaflEXpEvAh4FXAfQGb+KDNPArcAO5thO4HN/SpSktRaO1MuLwb+B/hwRDwaER+MiPXAaGYeBWgeL+1jnZKkFiIzzz4gYgL4InB9Zj4SEe8HvgfcnZkbzhh3IjN/ah49IqaBaYDR0dFrZmZmFrw+NzfHyMhI142sFb3s5+CR53ryOZ0a33TBgmX30dpWrR+o11O3/UxNTR3IzIlW49oJ9F8EvpiZY83ybzE/X/4SYDIzj0bERmA2M68622dNTEykd1ts31qZQ3cfrW3V+oF6PfXgbottBXrLKZfM/G/g2xFxOqxvBL4O7AG2NOu2ALtXWKskqQfaPcvlbuCjzRkuTwFvZP4fg10RcRfwNHBbf0qUJLWjrUDPzMeApQ73b+xtOZKklfJKUUkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkq4px2BkXEYeD7wI+BU5k5EREXAfcDY8Bh4HWZeaI/ZUqSWunkCH0qM6/OzIlmeRuwLzOvBPY1y5KkAelmyuUWYGfzfCewuftyJEkr1W6gJ/C5iDgQEdPNutHMPArQPF7ajwIlSe2JzGw9KOKXMvPZiLgU2AvcDezJzA1njDmRmRcu8d5pYBpgdHT0mpmZmQWvz83NMTIy0l0Xa0gv+zl45LmefE6nxjddsGDZfbS2VesH6vXUbT9TU1MHzpjuXlZbgb7gDRHvAOaAPwYmM/NoRGwEZjPzqrO9d2JiIvfv379g3ezsLJOTkx3VsJb1sp+xbZ/uyed06vD2mxcsu4/Wtmr9QL2euu0nItoK9JZTLhGxPiJeePo58BrgcWAPsKUZtgXYveJqJUlda+e0xVHgUxFxevzHMvMzEfFlYFdE3AU8DdzWvzIlSa20DPTMfAp4xRLr/xe4sR9FSZI655WiklSEgS5JRRjoklSEgS5JRRjoklSEgS5JRRjoklSEgS5JRRjoklSEgS5JRRjoklSEgS5JRbT1JdH62bL4Puxbx09x5yrdm33xvdgltc8jdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCI8bbGFxafwtbKap/hJ0pk8QpekIgx0SSrCQJekIgx0SSrCQJekItoO9IhYFxGPRsSDzfIVEfFIRDwZEfdHxHn9K1OS1EonR+j3AIfOWH438N7MvBI4AdzVy8IkSZ1pK9Aj4jLgZuCDzXIANwAPNEN2Apv7UaAkqT3tHqG/D/gL4CfN8i8AJzPzVLP8DLCpx7VJkjoQmXn2ARG/B7w2M/80IiaBPwfeCPx7Zr6kGXM58FBmji/x/mlgGmB0dPSamZmZBa/Pzc0xMjLSg1b64+CR5zoaP3o+HHu+T8UMyGr2NL7pgr5vY63/netUtX6gXk/d9jM1NXUgMydajWvn0v/rgd+PiNcCLwBexPwR+4aIOKc5Sr8MeHapN2fmDmAHwMTERE5OTi54fXZ2lsXr1pJOL+PfOn6Kew/WuqPCavZ0+I7Jvm9jrf+d61S1fqBeT6vVT8spl8z8y8y8LDPHgNcD/5KZdwAPA7c2w7YAu/tWpSSppW7OQ38r8JaI+Cbzc+r39aYkSdJKdPT/6MycBWab508B1/a+JEnSSnilqCQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhEGuiQVYaBLUhHnDLoA6Uxj2z7d921sHT/FnYu2c3j7zX3frtRvHqFLUhEGuiQV0TLQI+IFEfGliPhqRDwREe9s1l8REY9ExJMRcX9EnNf/ciVJy2nnCP2HwA2Z+QrgauCmiLgOeDfw3sy8EjgB3NW/MiVJrbQM9Jw31yye2/xJ4AbggWb9TmBzXyqUJLWlrTn0iFgXEY8Bx4G9wLeAk5l5qhnyDLCpPyVKktoRmdn+4IgNwKeAvwI+nJkvadZfDjyUmeNLvGcamAYYHR29ZmZmZsHrc3NzjIyMrLiBfjt45LmOxo+eD8ee71MxA1Ktp6X6Gd90wWCK6YG1/jO0EtV66rafqampA5k50WpcR+ehZ+bJiJgFrgM2RMQ5zVH6ZcCzy7xnB7ADYGJiIicnJxe8Pjs7y+J1a8ni85Vb2Tp+insP1jq9v1pPS/Vz+I7JwRTTA2v9Z2glqvW0Wv20c5bLJc2RORFxPvBq4BDwMHBrM2wLsLtfRUqSWmvnsGsjsDMi1jH/D8CuzHwwIr4OzETE3wCPAvf1sU5JUgstAz0zvwa8con1TwHX9qMoSVLnvFJUkoow0CWpCANdkoow0CWpCANdkoow0CWpCANdkoow0CWpCANdkoow0CWpCANdkoow0CWpCANdkoow0CWpCANdkooYmu8VG+vwq+Ak6WeNR+iSVISBLklFGOiSVISBLklFGOiSVISBLklFGOiSVISBLklFGOiSVETLQI+IyyPi4Yg4FBFPRMQ9zfqLImJvRDzZPF7Y/3IlSctp5wj9FLA1M18KXAe8KSJeBmwD9mXmlcC+ZlmSNCAtAz0zj2bmV5rn3wcOAZuAW4CdzbCdwOZ+FSlJaq2jOfSIGANeCTwCjGbmUZgPfeDSXhcnSWpfZGZ7AyNGgH8F3pWZn4yIk5m54YzXT2TmT82jR8Q0MA0wOjp6zczMzILX5+bmGBkZabn9g0eea6vOQRs9H449P+gqeqtaT0v1M77pgsEU0wPt/gwNk2o9ddvP1NTUgcycaDWurUCPiHOBB4HPZuZ7mnXfACYz82hEbARmM/Oqs33OxMRE7t+/f8G62dlZJicnW9YwLLfP3Tp+insPDs1didtSrael+jm8/eYBVdO9dn+Ghkm1nrrtJyLaCvR2znIJ4D7g0Okwb+wBtjTPtwC7V1KoJKk32jnsuh54A3AwIh5r1r0N2A7sioi7gKeB2/pToiSpHS0DPTO/AMQyL9/Y23IkSSvllaKSVISBLklFGOiSVISBLklFGOiSVISBLklFGOiSVISBLklF1LlBhzSkur1P0dbxU9y5gs8Y5vvXaGkeoUtSEQa6JBVhoEtSEQa6JBVhoEtSEQa6JBXhaYsSw/MVh9LZeIQuSUUY6JJUhIEuSUUY6JJUhIEuSUUY6JJUhIEuSUUY6JJUhIEuSUW0DPSI+FBEHI+Ix89Yd1FE7I2IJ5vHC/tbpiSplXaO0D8C3LRo3TZgX2ZeCexrliVJA9Qy0DPz88B3F62+BdjZPN8JbO5xXZKkDq10Dn00M48CNI+X9q4kSdJKRGa2HhQxBjyYmS9vlk9m5oYzXj+RmUvOo0fENDANMDo6es3MzMyC1+fm5hgZGWlZw8Ejz7UcsxaMng/Hnh90Fb1VrSf7mTe+6YLeF9Mj7ebCsOi2n6mpqQOZOdFq3Epvn3ssIjZm5tGI2AgcX25gZu4AdgBMTEzk5OTkgtdnZ2dZvG4pK/lW80HYOn6Kew/WuitxtZ7sZ97hOyZ7X0yPtJsLw2K1+lnplMseYEvzfAuwuzflSJJWquU/6xHxcWASuDgingHeDmwHdkXEXcDTwG39LFJS7w3ySz0Ob795YNuurGWgZ+bty7x0Y49rkSR1wStFJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJakIA12SijDQJamIlt8pKklVDOqLsT9y0/pV2Y5H6JJUhIEuSUU45SJp1bWa+tg6foo7BzQ9Msw8QpekIgx0SSqiq0CPiJsi4hsR8c2I2NaroiRJnVtxoEfEOuDvgd8FXgbcHhEv61VhkqTOdHOEfi3wzcx8KjN/BMwAt/SmLElSp7oJ9E3At89YfqZZJ0kagMjMlb0x4jbgdzLzj5rlNwDXZubdi8ZNA9PN4lXANxZ91MXAd1ZUxNpUrR+o15P9rH3Veuq2n1/JzEtaDermPPRngMvPWL4MeHbxoMzcAexY7kMiYn9mTnRRx5pSrR+o15P9rH3VelqtfrqZcvkycGVEXBER5wGvB/b0pixJUqdWfISemaci4s3AZ4F1wIcy84meVSZJ6khXl/5n5kPAQ13WsOx0zJCq1g/U68l+1r5qPa1KPyv+pagkaW3x0n9JKmJggV7xtgERcTgiDkbEYxGxf9D1dCoiPhQRxyPi8TPWXRQReyPiyebxwkHW2KllenpHRBxp9tNjEfHaQdbYiYi4PCIejohDEfFERNzTrB/K/XSWfoZ5H70gIr4UEV9tenpns/6KiHik2Uf3NyeT9Hbbg5hyaW4b8J/AbzN/+uOXgdsz8+urXkwPRcRhYCIzh/L82Yh4FTAH/FNmvrxZ97fAdzNze/MP74WZ+dZB1tmJZXp6BzCXmX83yNpWIiI2Ahsz8ysR8ULgALAZuJMh3E9n6ed1DO8+CmB9Zs5FxLnAF4B7gLcAn8zMmYj4B+CrmfmBXm57UEfo3jZgDcrMzwPfXbT6FmBn83wn8z9sQ2OZnoZWZh7NzK80z78PHGL+Cu2h3E9n6Wdo5by5ZvHc5k8CNwAPNOv7so8GFehVbxuQwOci4kBzhWwFo5l5FOZ/+IBLB1xPr7w5Ir7WTMkMxfTEYhExBrwSeIQC+2lRPzDE+ygi1kXEY8BxYC/wLeBkZp5qhvQl8wYV6LHEugqn21yfmb/O/B0o39T8d19rzweAXwWuBo4C9w62nM5FxAjwCeDPMvN7g66nW0v0M9T7KDN/nJlXM38F/bXAS5ca1uvtDirQ27ptwLDJzGebx+PAp5jfkcPuWDPPeXq+8/iA6+laZh5rfuB+AvwjQ7afmnnZTwAfzcxPNquHdj8t1c+w76PTMvMkMAtcB2yIiNPX/vQl8wYV6OVuGxAR65tf6hAR64HXAI+f/V1DYQ+wpXm+Bdg9wFp64nTwNf6AIdpPzS/c7gMOZeZ7znhpKPfTcv0M+T66JCI2NM/PB17N/O8GHgZubYb1ZR8N7MKi5jSk9/H/tw1410AK6ZGIeDHzR+UwfwXux4atp4j4ODDJ/J3hjgFvB/4Z2AX8MvA0cFtmDs0vGZfpaZL5/8oncBj4k9Pzz2tdRPwm8G/AQeAnzeq3MT/vPHT76Sz93M7w7qNfY/6XnuuYP2jelZl/3WTEDHAR8Cjwh5n5w55u2ytFJakGrxSVpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkqwkCXpCIMdEkq4v8AxUqGct8FCcUAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df['counts'].hist();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The normalised counts:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "nbval-ignore-output" ] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAD31JREFUeJzt3X+MHPV5x/H3U1zCDyvYYHJFNskZxU1L4lYpV0SDkp7jSAHcxkgFiYimhro6qaUUFUuN2/yBVCmS8wdNqVQ1siDFkaIYQlBBNWlLHTZR/zCtDRQDbmrHccHYhYRf6RGU9Jqnf+wcvR5n7m5m9sb39fslrW5m9jszzz4efzw3uzuOzESSVK6f6roASdJgGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwi3pugCAFStW5PDwcNdlDNzrr7/O2Wef3XUZnbIHffbBHkxq0od9+/Z9PzPPn23cSRH0w8PD7N27t+syBq7X6zE6Otp1GZ2yB332wR5MatKHiPiPuYzz0o0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXupPhmbBPDW3d1tu8j2zZ0tm9JmivP6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVLhZgz4ivhgRL0bEU1OWnRsRD0fEwern8mp5RMRfRMShiHgyIn5pkMVLkmY3lzP6u4Erpi3bCuzOzDXA7moe4EpgTfUYA/6qnTIlSXXNGvSZ+S3g5WmLNwI7qukdwNVTln8p+/YAyyLigraKlSTNX91r9EOZeRyg+vmuavlK4Lkp445WyyRJHVnS8vZihmU548CIMfqXdxgaGqLX69Xa4Za1E7XWa8N8ax4fH6/9OkthD/rsgz2YtBB9qBv0L0TEBZl5vLo082K1/Chw4ZRxq4BjM20gM7cD2wFGRkZydHS0ViE3bN1Va702HLl+dF7je70edV9nKexBn32wB5MWog91L908CGyqpjcBD0xZ/lvVp28uA16bvMQjSerGrGf0EfEVYBRYERFHgduAbcC9EbEZeBa4thr+EHAVcAj4IXDjAGqWJM3DrEGfmZ88wVPrZxibwE1Ni5IktcdvxkpS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwjYI+Iv4wIp6OiKci4isRcUZErI6IRyPiYETcExGnt1WsJGn+agd9RKwE/gAYycwPAKcB1wGfAz6fmWuAV4DNbRQqSaqn6aWbJcCZEbEEOAs4DnwUuK96fgdwdcN9SJIaiMysv3LELcBngTeAfwBuAfZk5nur5y8Evl6d8U9fdwwYAxgaGrpk586dtWrY//xr9YpvwdqV58xr/Pj4OEuXLh1QNYuDPeizD/ZgUpM+rFu3bl9mjsw2bkmtrQMRsRzYCKwGXgW+Clw5w9AZ/yXJzO3AdoCRkZEcHR2tVccNW3fVWq8NR64fndf4Xq9H3ddZCnvQZx/swaSF6EOTSzcfA76bmd/LzP8G7gc+BCyrLuUArAKONaxRktRAk6B/FrgsIs6KiADWA88AjwDXVGM2AQ80K1GS1ETtoM/MR+m/6foYsL/a1nbg08CtEXEIOA+4q4U6JUk11b5GD5CZtwG3TVt8GLi0yXYlSe3xm7GSVDiDXpIK1+jSzalueJ4f7dyydqKVj4Me2bah8TYknTo8o5ekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBWuUdBHxLKIuC8i/i0iDkTEr0TEuRHxcEQcrH4ub6tYSdL8NT2jvwP4u8z8OeAXgQPAVmB3Zq4BdlfzkqSO1A76iHgn8BHgLoDM/HFmvgpsBHZUw3YAVzctUpJUX5Mz+ouA7wF/HRGPR8SdEXE2MJSZxwGqn+9qoU5JUk2RmfVWjBgB9gCXZ+ajEXEH8APg5sxcNmXcK5n5luv0ETEGjAEMDQ1dsnPnzlp17H/+tVrrdWHoTHjhjebbWbvynOYb6cj4+DhLly7tuozO2Qd7MKlJH9atW7cvM0dmG9ck6H8G2JOZw9X8h+lfj38vMJqZxyPiAqCXme97u22NjIzk3r17a9UxvHVXrfW6sGXtBLfvX9J4O0e2bWihmm70ej1GR0e7LqNz9sEeTGrSh4iYU9DXvnSTmf8JPBcRkyG+HngGeBDYVC3bBDxQdx+SpOaanl7eDHw5Ik4HDgM30v/H496I2Aw8C1zbcB+SpAYaBX1mPgHM9GvD+ibblSS1x2/GSlLhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFaxz0EXFaRDweEX9bza+OiEcj4mBE3BMRpzcvU5JUVxtn9LcAB6bMfw74fGauAV4BNrewD0lSTY2CPiJWARuAO6v5AD4K3FcN2QFc3WQfkqRmmp7R/znwR8BPqvnzgFczc6KaPwqsbLgPSVIDS+quGBG/BryYmfsiYnRy8QxD8wTrjwFjAENDQ/R6vVp1bFk7Mfugk8TQme3UW7dXJ4Px8fFFXX9b7IM9mLQQfagd9MDlwCci4irgDOCd9M/wl0XEkuqsfhVwbKaVM3M7sB1gZGQkR0dHaxVxw9Zdtdbrwpa1E9y+v0nL+45cP9q8mI70ej3q/lmXxD7Yg0kL0Yfal24y848zc1VmDgPXAd/IzOuBR4BrqmGbgAcaVylJqm0Qn6P/NHBrRByif83+rgHsQ5I0R82vIwCZ2QN61fRh4NI2titJas5vxkpS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCtfLxSi2s4Q6/DXxk24bO9i2pHs/oJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpXO+gj4sKIeCQiDkTE0xFxS7X83Ih4OCIOVj+Xt1euJGm+mpzRTwBbMvPngcuAmyLiYmArsDsz1wC7q3lJUkdqB31mHs/Mx6rp/wIOACuBjcCOatgO4OqmRUqS6ovMbL6RiGHgW8AHgGczc9mU517JzLdcvomIMWAMYGho6JKdO3fW2vf+51+rtV4Xhs6EF97ouopm1q48p9H64+PjLF26tKVqFi/7YA8mNenDunXr9mXmyGzjGgd9RCwFvgl8NjPvj4hX5xL0U42MjOTevXtr7X94665a63Vhy9oJbt+/pOsyGjmybUOj9Xu9HqOjo+0Us4jZB3swqUkfImJOQd/oUzcR8dPA14AvZ+b91eIXIuKC6vkLgBeb7EOS1EyTT90EcBdwIDP/bMpTDwKbqulNwAP1y5MkNdXkOsLlwKeA/RHxRLXsT4BtwL0RsRl4Fri2WYmSpCZqB31m/hMQJ3h6fd3tSpLa5TdjJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkq3OK+8YoWXNN7C21ZO8ENNbbR9B470qnMM3pJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhfMWCFoUmt56oS5vvaASeEYvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKtxAvjAVEVcAdwCnAXdm5rZB7EcatEF9UWsu/3euX9ZSW1o/o4+I04C/BK4ELgY+GREXt70fSdLcDOKM/lLgUGYeBoiIncBG4JkB7EsqVle3fVgoM/1W09VvMV32+u4rzh74PgZxjX4l8NyU+aPVMklSByIz291gxLXAxzPzd6r5TwGXZubN08aNAWPV7PuAb7dayMlpBfD9rovomD3osw/2YFKTPrwnM8+fbdAgLt0cBS6cMr8KODZ9UGZuB7YPYP8nrYjYm5kjXdfRJXvQZx/swaSF6MMgLt38C7AmIlZHxOnAdcCDA9iPJGkOWj+jz8yJiPh94O/pf7zyi5n5dNv7kSTNzUA+R5+ZDwEPDWLbi9wpdanqBOxBn32wB5MG3ofW34yVJJ1cvAWCJBXOoG9BRFwREd+OiEMRsXWG598REfdUzz8aEcPV8uGIeCMinqgeX1jo2ts0hz58JCIei4iJiLhm2nObIuJg9di0cFW3q2EP/mfKsbCoP8Awhz7cGhHPRMSTEbE7It4z5blT5Vh4ux60eyxkpo8GD/pvOH8HuAg4HfhX4OJpY34P+EI1fR1wTzU9DDzV9WtYwD4MA78AfAm4Zsryc4HD1c/l1fTyrl/TQvagem6869ewgH1YB5xVTf/ulL8Tp9KxMGMPBnEseEbf3Ju3fMjMHwOTt3yYaiOwo5q+D1gfEbGANS6EWfuQmUcy80ngJ9PW/TjwcGa+nJmvAA8DVyxE0S1r0oOSzKUPj2TmD6vZPfS/bwOn1rFwoh60zqBvbi63fHhzTGZOAK8B51XPrY6IxyPimxHx4UEXO0BNbn1Rym0zmr6OMyJib0TsiYir2y1tQc23D5uBr9dc92TVpAfQ8rEwkI9XnmJmOjOf/lGmE405Drw7M1+KiEuAv4mI92fmD9oucgHMpQ+DWPdk0vR1vDszj0XERcA3ImJ/Zn6npdoW0pz7EBG/CYwAvzrfdU9yTXoALR8LntE3N5dbPrw5JiKWAOcAL2fmjzLzJYDM3Ef/mt7PDrziwZjTrS8GsO7JpNHryMxj1c/DQA/4YJvFLaA59SEiPgZ8BvhEZv5oPusuAk160P6x0PWbFov9Qf+3osPAav7vTZf3TxtzE///zdh7q+nzgdOq6YuA54Fzu35Ng+rDlLF389Y3Y79L/8235dX0outDwx4sB95RTa8ADjLtzbvF8pjj34kP0j+xWTNt+SlzLLxND1o/FjpvSAkP4Crg36s/tM9Uy/6U/r/SAGcAXwUOAf8MXFQt/w3g6eogeAz49a5fy4D78Mv0z3ReB14Cnp6y7m9X/TkE3Nj1a1noHgAfAvZXx8J+YHPXr2XAffhH4AXgierx4Cl4LMzYg0EcC34zVpIK5zV6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuH+F44rT3+v3OcWAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df['normalised_counts'].hist();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing our data to files\n", "\n", "Finally let us write this data to a 'comma seperated value' (`.csv`) in case we wanted to look at it later:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"data/goldbach_counts.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This section briefly showed us how to use pandas to read, manipulate and write data. We also took a brief look at plotting with pandas but this should only be used in an exploratary way. For higher quality plots [matplotlib](http://matplotlib.org/) is recommended." ] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Tags", "kernelspec": { "display_name": "Python [conda env:mwp]", "language": "python", "name": "conda-env-mwp-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }