{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![title](header.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas comes with some neat functions that are essential to know for importing data into Python. We will take a quick look at the two main file types you will recieve data in - CSV and Excel Files, and how to import them as pandas dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing CSV Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CSV files (known as Comma Seperated Value Files) are a common format to see data in. Pandas has a built in method for importing this data. For this example, we will be using the example files found [here](https://github.com/stmball/HiPy-2017/tree/master/Importing%20Files%20With%20Pandas) - you will need to download this file to see the example code below work, or you can find your own CSV files and try it for yourself. Note that Python looks by default at the directory the notebook file is in - so try and keep the jupyter notebook file and all imported files in the same folder! The command for this to work is as follows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c d\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('example.csv')\n", "\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the same function to skip downloading CSV files and passing the arguement of the download link to the read_csv function:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index-entry-number entry-number entry-timestamp key \\\n", "0 357 357 2017-01-26T12:34:10Z KIN \n", "1 356 356 2016-11-01T14:16:54Z GLA \n", "2 355 355 2016-10-31T12:59:03Z LND \n", "3 354 354 2016-10-21T16:11:20Z WOT \n", "4 353 353 2016-10-21T16:11:20Z MSS \n", "\n", " local-authority-eng local-authority-type name \\\n", "0 KIN NMD King's Lynn and West Norfolk \n", "1 GLA SRA Greater London \n", "2 LND CC City of London \n", "3 WOT NMD Worthing \n", "4 MSS NMD Mid Sussex \n", "\n", " official-name start-date end-date \n", "0 Borough Council of King's Lynn and West Norfolk NaN NaN \n", "1 Greater London Authority 1905-06-22 NaN \n", "2 City of London Corporation 1905-06-28 NaN \n", "3 Worthing Borough Council 1905-06-25 NaN \n", "4 Mid Sussex District Council 1974-04-01 NaN \n" ] } ], "source": [ "df = pd.read_csv('https://local-authority-eng.register.gov.uk/records.csv?page-size=5000')\n", "\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Excel Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excel files can be imported into Python very easily using the following function - the file we're targeting can be found [here](https://github.com/stmball/HiPy-2017/tree/master/Importing%20Files%20With%20Pandas)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c d\n", "0 0 1 0 1\n", "1 1 0 1 0\n", "2 0 1 0 1\n", "3 1 0 1 0\n" ] } ], "source": [ "df = pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1')\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Worked Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to work with the local authoity CSV, cleaning it up ready for some very basic analysis.\n", "\n", "First, we're going to load in our modules and import our data:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " index-entry-number entry-number entry-timestamp key \\\n", "0 357 357 2017-01-26T12:34:10Z KIN \n", "1 356 356 2016-11-01T14:16:54Z GLA \n", "2 355 355 2016-10-31T12:59:03Z LND \n", "3 354 354 2016-10-21T16:11:20Z WOT \n", "4 353 353 2016-10-21T16:11:20Z MSS \n", "\n", " local-authority-eng local-authority-type name \\\n", "0 KIN NMD King's Lynn and West Norfolk \n", "1 GLA SRA Greater London \n", "2 LND CC City of London \n", "3 WOT NMD Worthing \n", "4 MSS NMD Mid Sussex \n", "\n", " official-name start-date end-date \n", "0 Borough Council of King's Lynn and West Norfolk NaN NaN \n", "1 Greater London Authority 1905-06-22 NaN \n", "2 City of London Corporation 1905-06-28 NaN \n", "3 Worthing Borough Council 1905-06-25 NaN \n", "4 Mid Sussex District Council 1974-04-01 NaN \n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "df = pd.read_csv('https://local-authority-eng.register.gov.uk/records.csv?page-size=5000')\n", "\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to create a new dataframe with just the local authority code." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 KIN\n", "1 GLA\n", "2 LND\n", "3 WOT\n", "4 MSS\n", "Name: local-authority-eng, dtype: object\n" ] } ], "source": [ "df = df['local-authority-eng'] #selecting the local authority code\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we now have a list of all local authority codes in the UK. Let's try and get a frequency chart of the first letter of each authority's code. There are a lot of different ways of doing this, we will be using the most clear way - first we define a function that returns the first letter of the local authority code, then we use the \".apply\" method to run this function on every entry in our series above:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def getFirstLetter(string):\n", " return string[0]\n", "\n", "counts = df.apply(getFirstLetter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we use the \".value_counts\" method to count how many times each letter appears:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "S 52\n", "W 38\n", "B 35\n", "C 30\n", "N 26\n", "H 25\n", "E 21\n", "R 19\n", "T 18\n", "M 15\n", "L 14\n", "D 13\n", "G 10\n", "K 8\n", "A 7\n", "P 7\n", "F 5\n", "O 4\n", "I 4\n", "V 1\n", "U 1\n", "Y 1\n", "Name: local-authority-eng, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts = counts.value_counts()\n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we could use matplotlib to graph the frequency charts of each letter - but a quicker and easier way to do this is to use panda's built in visualisation - we'll see how to use this later:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEaJJREFUeJzt3X2wXHV9x/H3lwCCFiqRS5oR460FdZAOaK9UxbYjkTYW\nNLFVClXJVDRtxyfQjhOwVamtE6yjWM04jYqNVnmwiskoRfFSplUpmgAWFWqoJioNJAIO2FrLw7d/\nnJOy3O7uOftws7k/3q+ZnXuefud8797dz/72t2fPjcxEkrTw7TfpAiRJ42GgS1IhDHRJKoSBLkmF\nMNAlqRAGuiQVwkCXpEIY6JJUCANdkgqx/9482OGHH57T09N785CStOBt3br1R5k51bTdXg306elp\ntmzZsjcPKUkLXkTsaLOdQy6SVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklSI\nvfpN0bmm136+57rt607Zi5VI0sJnD12SCmGgS1IhDHRJKkSrMfSI2A7cCzwA3J+ZMxGxGLgUmAa2\nA6dl5t3zU6YkqckgPfTnZebxmTlTz68FZjPzaGC2npckTcgoQy4rgY319EZg1ejlSJKG1TbQE/hS\nRGyNiDX1siWZubOevh1YMvbqJEmttT0P/bmZeVtEHAFcFRG3dK7MzIyI7NawfgFYA7Bs2bKRipUk\n9daqh56Zt9U/dwGXAycAd0TEUoD6564ebTdk5kxmzkxNNf5LPEnSkBoDPSIeExGH7JkGfhP4JrAZ\nWF1vthrYNF9FSpKatRlyWQJcHhF7tv9kZl4ZEV8HLouIs4AdwGnzV6YkqUljoGfmd4Hjuiy/E1g+\nH0VJkgbnN0UlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RC\nGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSB\nLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSpE60CPiEURcUNEfK6eXxwRV0XEtvrn\nYfNXpiSpySA99DcAN3fMrwVmM/NoYLaelyRNSKtAj4gjgVOAD3csXglsrKc3AqvGW5okaRBte+gX\nAm8GHuxYtiQzd9bTtwNLxlmYJGkwjYEeEacCuzJza69tMjOB7NF+TURsiYgtu3fvHr5SSVJfbXro\nJwIviojtwCXASRHxd8AdEbEUoP65q1vjzNyQmTOZOTM1NTWmsiVJczUGemaem5lHZuY0cDpwdWa+\nHNgMrK43Ww1smrcqJUmNRjkPfR1wckRsA55fz0uSJmT/QTbOzGuAa+rpO4Hl4y9JkjQMvykqSYUw\n0CWpEAa6JBVioDH0fcn02s/3Xb993Sl7qRJJ2jfYQ5ekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmF\nMNAlqRAGuiQVwkCXpEIY6JJUCANdkgqxYK/lMgqvAyOpRPbQJakQBrokFcJAl6RCGOiSVAgDXZIK\nYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKkRjoEfEQRHxtYj4\nRkR8KyLOr5cvjoirImJb/fOw+S9XktRLmx76z4CTMvM44HhgRUQ8C1gLzGbm0cBsPS9JmpDGQM/K\nT+rZA+pbAiuBjfXyjcCqealQktRKq39BFxGLgK3AUcD6zLwuIpZk5s56k9uBJT3argHWACxbtmz0\niifMf18naV/V6kPRzHwgM48HjgROiIhj56xPql57t7YbMnMmM2empqZGLliS1N1AZ7lk5o+BfwRW\nAHdExFKA+ueu8ZcnSWqrzVkuUxHx2Hr6YOBk4BZgM7C63mw1sGm+ipQkNWszhr4U2FiPo+8HXJaZ\nn4uIa4HLIuIsYAdw2jzWKUlq0BjomfmvwNO7LL8TWD4fRUmSBuc3RSWpEAa6JBXCQJekQhjoklQI\nA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVotX10DU+/a6n7rXUJY3CHrokFcJAl6RCGOiSVAjH\n0BcQx98l9WMPXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJ\nKoSBLkmFMNAlqRAGuiQVwkCXpEI0Xg89Ip4AfAxYAiSwITPfFxGLgUuBaWA7cFpm3j1/pWoU/a6l\nDl5PXSpBmx76/cCbMvMY4FnAayLiGGAtMJuZRwOz9bwkaUIaAz0zd2bm9fX0vcDNwOOBlcDGerON\nwKr5KlKS1GygMfSImAaeDlwHLMnMnfWq26mGZCRJE9I60CPi54BPA2dn5j2d6zIzqcbXu7VbExFb\nImLL7t27RypWktRbq0CPiAOowvwTmfmZevEdEbG0Xr8U2NWtbWZuyMyZzJyZmpoaR82SpC4aAz0i\nAvgIcHNmvqdj1WZgdT29Gtg0/vIkSW01nrYInAi8ArgpIm6sl50HrAMui4izgB3AafNToiSpjcZA\nz8wvA9Fj9fLxlqN9keewSwuD3xSVpEIY6JJUCANdkgrR5kNRaST9xuAdf5fGxx66JBXCQJekQhjo\nklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCK/lon2a14GR2rOHLkmFMNAl\nqRAGuiQVwjF0Fcv/hapHGnvoklQIA12SCmGgS1IhHEOXunD8XQuRPXRJKoSBLkmFMNAlqRAGuiQV\nwkCXpEIY6JJUCANdkgrRGOgRcVFE7IqIb3YsWxwRV0XEtvrnYfNbpiSpSZse+t8CK+YsWwvMZubR\nwGw9L0maoMZAz8x/Au6as3glsLGe3gisGnNdkqQBDTuGviQzd9bTtwNLxlSPJGlII1/LJTMzIrLX\n+ohYA6wBWLZs2aiHk/Z5XgdGkzJsD/2OiFgKUP/c1WvDzNyQmTOZOTM1NTXk4SRJTYYN9M3A6np6\nNbBpPOVIkobV5rTFi4FrgadExA8j4ixgHXByRGwDnl/PS5ImqHEMPTPP6LFq+ZhrkSSNwG+KSlIh\nDHRJKoSBLkmF8H+KSvuYfuexew67+rGHLkmFMNAlqRAGuiQVwjF0qSCjjL97DZqFzx66JBXCQJek\nQhjoklQIA12SCmGgS1IhDHRJKoSBLkmF8Dx0SSPzHPZ9gz10SSqEgS5JhTDQJakQjqFLmqhRx9+9\nfvxD7KFLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFGOla\nLhGxAngfsAj4cGauG0tVkrQXjHIdmUm17WfoHnpELALWAy8AjgHOiIhjht2fJGk0owy5nADcmpnf\nzcz/AS4BVo6nLEnSoEYJ9McDP+iY/2G9TJI0AZGZwzWMeAmwIjNfVc+/AvjVzHztnO3WAGvq2acA\n/9Znt4cDPxqqoEde20ke2995YbSd5LEXYttJHrup7RMzc6pxL5k51A14NvCFjvlzgXOH3V+9jy22\n3feP7e+8MNou1Lq9v4a/jTLk8nXg6Ij4xYg4EDgd2DzC/iRJIxj6tMXMvD8iXgt8geq0xYsy81tj\nq0ySNJCRzkPPzCuAK8ZUC8AG2y6IY/s7L4y2kzz2Qmw7yWOPWjcwwoeikqR9i1/917yIiJHe/Uka\nnIGu+fK1SRcg7Wsi4oqImJ6v/U8k0CPimRHxCx3zZ0bEpoj464hY3ND27Ig4YRI9wIg4KiJO7LL8\nxIj4pb1dzz4uJl2AICKeGxHrJ11Hk4g4KCKOrW8H7YXjre/2XN4LPgp8MSLeEhEHjHvnk3pb/DfA\n8wEi4teBdcDrgOOpPhx4SZ+2RwIXAk+NiJuArwBfBb6amXcNU0xEHA7cmc0fKFxIdb79XPfU6144\nzPHnW0S8OTPfVU+/NDM/1bHunZl53hD73A84IzM/0WOTqYh4Y6/2mfmeQY85LhFxdmZe2HLbKYDM\n3N1y+5XAkZm5vp6/DtjzhZA3Z+bf92n7hMz8QY91p2bm51rW8HTg94GXAt8DPtOizUHAHwFHATcB\nH8nM+1seb1lmfr/Ntl3a7g+8E3glsIOqI/CEiPgo8JbMvG+Y/bbwHeDdEbEUuAy4ODNvaNu4y2M7\nqb4Y9OXM/F6vdpn5qYj4B+DPgC0R8XHgwY71Iz0vJvKhaER8IzOPq6fXA7sz8+31/I2ZeXyLfRwI\nzADPofqS07OBH2dm3wuERcSzqF5A7gLeAXyc6lta+wFnZuaVfdp+PTOf2WPdTZn5y33avrVPWZmZ\n72io+/1UD5peO3h9n7bXZ+Yz5k53m+/S9lDgNVSXddgMXAW8FngT8I3M7Hr9nojYCXyQHj31zDy/\n1zE79tH3ew2Z+aKmffTY7/czc1mf9QG8jer33I/qd7gfeH9m/nnDvr8CnL4nmCPiRmA58Bjgo5m5\nvE/bW6i+fb19zvJXUoVbz3eBEfFk4Iz69iPgUuBPMvOJ/ertaH8pcB/wz1QX3NuRmW9o2bbz8fXp\nzPzdNu3q7d8LHAKck5n31ssOBd4N/LRfDRFxL92fE0H1nDq0xfGfSPUdmtOBg4GLqcL9Ow3t3tZl\n8WLgt4C3Z+YlfdoeCKyletG9lIcHeuPzop9J9dAXRcT+dQ9gOQ9dGmCQmg4GDgV+vr79B1XPoskH\ngPPqNlcDL8jMf4mIp1L9MXsGOvDYhnr6+c8uyx4NvAp4HNWLSz9bOqbPpwqctqLHdLf5uT4O3A1c\nS1XreXWbVZl5Y592O5vCr4VnU10v6GLguha1ttW0n3OAE4Fn7ultRcSTgA9GxDmZ+d4+bQ+c08v+\ncmbeCdwZEY9pOO4bqd6On5KZ2+rjnkv1xP+Nhra3UIXxqZl5a932nIY2nY7Z0yGJiI8w2Gcgnffn\nkwZoB3Aq8OTOd8eZeU9E/DHV79Qz0DPzkAGP1W0fO4ALgAvqdzYXAW+l+m5Nv3Zdg7ceMv4S1cUK\nu61fAbyHqnP0jMz8r+Gr717YXr8Bb6EaKtkE3MBD7xSOAr7S0HZD3fZKqmB7AXDYAMe+sWP65jnr\nbmhoezHw6i7LXwVcOkANhwB/SvV2+ALgiAHvv751dtn++m7T3ea7tL2pY3oRsAs4aNw19tjHImAF\nsLF+nPwF8LQx7Pf7TbUDh3dZPtXiMXJrn3X/3qK25cCtwLFUw3hfbfP4BlZRhcgPgA/V+/neMI+R\nNo+Lto+vFm2/M8y6cd2oOpAvBD4B3F7fhytH3GfPxwjVi+7Ij+Fet4n00DPzLyNiFlgKfDHr35Tq\n7e3rGpovAx4FbANuo7rK448HOPyDHdM/nVtaQ9uzgcsj4mXA1nrZDHAg8OKmA9ev3m8EXkYVUs/I\nzLvbFD1gnXMdFxH3UPWkDq6nqeebPoD6vzHMzHwgIn6Ymf/d4pg9hxbayswHqF64r4yIR1ENJ1wT\nEedn5gf6tW14O970buqAzPx/F0rKzN0tPsi6LiJenZkfmlPPH9Ki15uZsxHxB8A1VGF+Upv7OzM/\nC3y2fhewkuqxekREfBC4PDO/2LCL4+Y8Lg7ueMxk9h++6Pf4amr77Yg4MzM/1rkwIl5O1UOfFxFx\nMtXj6bep/i6XAGsys9s76UH2+zyqd7RdZeavjbL/xuM/lKULRz3G+TSq8fPnUPVm7gKuzcy+QxER\n8QDV8MeeJ/aetzxB1fNs/OS5/qMdW89+KzOvbtHmr4DfoXqHsT4zf9LUps+++o57j1PH/QUPv89a\nj1OOePxHAadQPfmmqd6qXpSZt83jMXvevy0+czgC+CzwM+D6evGvUHVCVmXmHX3a7nkRinr7+4AH\nGPK+jojDqD4Y/b3sM3Y/SRHxeKoPbX/KwztJBwMvnq+/c0RcDXwS+PQwnar6hIy54bmYauj3zMyc\ntxejfhZkoO8REUdSjXU+h2os7nGZ2W+ce2Ii4kGqJ/n9PPyB0OrJOqfH+Wge/kI078E6CRHxMaoX\nziuASzLzm3vpuJ0vYg9bRfsX/ZOoOh3Q8kX/kWzO/fXtzJydZD1N6g9TOyXVmXIj9fBHteACPSJe\nz0M98/uoT1msbzdl5oN9mmsBqV8E9zxBBn4RlB5pFuLXs6eBT1Gd5rRzwrVoHmWm32SWBrDgeuiS\npO7sAUlSIQx0SSqEgS5JhTDQJakQBrokFeJ/Ae43jJAkTGEaAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "counts.plot.bar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cool! Do you think there's a reason that S is the most common letter? (If you have an idea, test it out using python!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mini Project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Feburary costs of the Department of Education over £25,000 can be found at this link (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/594112/DfE_and_EFA_Spend_January_2017.csv) in csv format. Can you find out how much money the Department of Education spent in Feburary (in costs of over £25000)? Can you break these down by the expense type, or supplier? (Hint: Look at the groupby function) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More information " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Govenment data found at: https://data.gov.uk/\n", "* Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }