{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Acquisition and Cleaning\n", "\n", "This is by far and away the most valuable skill in Data Science.\n", "\n", "Not the fancy graphs, not the clever machine learning...\n", "\n", "The ability to import and manipulate data from a format that you get, to a format that you want, is **the** data science skill.\n", "\n", "Don't let anyone tell you differently.\n", "\n", "![](img/math_skill.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do we need to clean data? \n", "\n", "Because a lot of the time, no one expected machines to read the data you want to play with.\n", "\n", "\n", "* A computer might only read one of these header rows\n", "![](img/multirow.png)\n", "* A computer doesn't automatically know what a `\\*` or `#` means, and will assume the whole column is just strings\n", "![](img/starman.png)\n", "![](img/hashs.png)\n", "* A computer doesn't automatically know that these aren't \"valid\" data rows, or how to interpret them\n", "![](img/footer.png)\n", "\n", "Long story short, half of the battle of doing Data Science is translating between human/business expectation and computer/programming clarity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where can we get data?\n", "\n", "All over the place\n", "\n", "* [Kaggle](https://www.kaggle.com/datasets)\n", "* [fivethirtyeight](https://github.com/fivethirtyeight/data)\n", "* [NASA](https://earthdata.nasa.gov/)\n", "\n", "And hundreds more. But for our purposes, I like to start locally.\n", "\n", "![https://www.opendatani.gov.uk/](img/opendata-fp.png)\n", "\n", "https://www.opendatani.gov.uk/\n", "\n", "### Sidebar\n", "\n", "I am both a massive proponent and regular critic of Open Data in Northern Ireland. The team are great but some of the data that's been put on the platform is woefully inadequate and really difficult to work with. \n", "\n", "See below for a rant.\n", "\n", "[![](http://img.youtube.com/vi/mtrIEW2nCMc/0.jpg)](http://www.youtube.com/watch?v=mtrIEW2nCMc) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Data\n", "\n", "We're going to start off with a relatively 'easy' one; [Population Estimates for Northern Ireland by Parliamentary Constituency](https://www.opendatani.gov.uk/dataset/population-estimates-for-northern-ireland/resource/67c25586-b9aa-4717-9a4b-42de21a403f2)\n", "\n", "![](img/pop_cons.png)\n", "\n", "Spoiler Alert: _While this data set doesn't have any (known) errors, it does require some manipulation to make sense of. 
Also, the fact that it includes Geographic Data means that we can do some really cool stuff in the Data Visualisation section...._\n", "\n", "However, now that we're starting to do some real code, we need a mascot....\n", "\n", "![](img/panda.png)\n", "\n", "`pandas` is a python _package_ (sometimes called a _module_), and you can think of packages as big (or small) boxes of functionality that we can bring to bear to solve a problem: custom toolboxes for different operations, for example.\n", "\n", "One of the major strengths of the Python Data Science ecosystem is the range of mature and well-maintained packages that sit outside of the 'standard library' that python ships with.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # this is just a convention, 'pandas' is the special word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the above line doesn't work, you will need to install `pandas` in your environment. \n", "\n", "This should be as simple as one of the following jupyter lines" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): ...working... done\n", "Solving environment: ...working... done\n", "\n", "# All requested packages already installed.\n", "\n", "\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%conda install pandas" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in c:\\users\\me\\anaconda3\\lib\\site-packages (1.0.5)\n", "Requirement already satisfied: python-dateutil>=2.6.1 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (2.8.1)\n", "Requirement already satisfied: pytz>=2017.2 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (2020.1)\n", "Requirement already satisfied: numpy>=1.13.3 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (1.18.5)\n", "Requirement already satisfied: six>=1.5 in c:\\users\\me\\anaconda3\\lib\\site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_The trick here is that the `%` symbol tells `jupyter` that the rest of the line isn't ordinary `python`: `%conda` and `%pip` are 'magic' commands that hand the line over to the conda/pip installers for the environment your kernel is running in_\n", "\n", "Now that we've got that sorted, we can get to downloading the data set and seeing what `pandas` can do for us..." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading...\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "0 Belfast East N06000001 2001 All persons 0 \n", "1 Belfast East N06000001 2001 All persons 1 \n", "2 Belfast East N06000001 2001 All persons 2 \n", "3 Belfast East N06000001 2001 All persons 3 \n", "4 Belfast East N06000001 2001 All persons 4 \n", "\n", " Population_Estimate \n", "0 827 \n", "1 1045 \n", "2 1159 \n", "3 1032 \n", "4 1106 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "\n", "\n", "# Jump to the 'asides' to understand the operation of the walrus (:=)\n", "\n", "# This is really 'clever' and fun but don't worry about it, it just says \n", "# \"If I've already downloaded the file, read it from there, otherwise, set\n", "# the source to be that url\"\n", "if not (source:= Path('data/ni_pop.csv')).exists():\n", " print('Downloading...')\n", " source = 'https://www.opendatani.gov.uk/dataset/62e7073f-e924-4d3f-81a5-ad45b5127682/resource/67c25586-b9aa-4717-9a4b-42de21a403f2/download/parliamentary-constituencies-by-single-year-of-age-and-gender-mid-2001-to-mid-2019.csv'\n", "\n", "df = pd.read_csv(source) # `read_csv` can read from URLs as well as from local files\n", "df.to_csv('data/ni_pop.csv', index=False) # Stash for later\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(93366, 6)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`head()` gives us the first 5 entries, and `shape` tells us that there are more than 90 thousand rows and 6 data columns.\n", "\n", "`pandas` also provides simple aggregation functions." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "68166520" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoa... That doesn't look right...\n", "\n", "Maybe check that...\n", "\n", "![](img/ni_pop.png)\n", "\n", "So what's going wrong here?\n", "\n", "![](img/ni_pop_info.png)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,\n", " 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], dtype=int64)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Mid_Year_Ending'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply logical conditions to the columns (or to several columns at once).\n", "\n", "Using this, `pandas` allows you to select within a dataframe (like a table in Excel, except it supports a near-infinite number of columns/rows)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "93361 True\n", "93362 True\n", "93363 True\n", "93364 True\n", "93365 True\n", "Name: Mid_Year_Ending, Length: 93366, dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Mid_Year_Ending'] == 2019" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93361 West Tyrone N06000018 2019 Males 86 \n", "93362 West Tyrone N06000018 2019 Males 87 \n", "93363 West Tyrone N06000018 2019 Males 88 \n", "93364 West Tyrone N06000018 2019 Males 89 \n", "93365 West Tyrone N06000018 2019 Males 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93361 103 \n", "93362 113 \n", "93363 80 \n", "93364 52 \n", "93365 196 \n", "\n", "[4914 rows x 6 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Mid_Year_Ending'] == 2019] ## select * from df where 'mid_year_ending' = 2019" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3787334" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Mid_Year_Ending'] == 2019]['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so we're at least in the correct order of magnitude, but still very wrong... Any ideas?\n", "\n", "`pandas` data frames are stored in columns. Each column is assumed to be a particular `type`. \n", "\n", "When you read in from a CSV file, pandas tries its best to guess what type each incoming column is.\n", "\n", "For instance, we can see that 'Mid_Year_Ending', 'Age', and 'Population_Estimate' all get the `int64` type, which means each value is stored in memory as a 64-bit Integer. \n", "\n", "The `object` types, however, indicate that the guessing has just given up and treated those columns as boring strings. But we know that 'Gender' and 'Geo_Name' are really `Categories`, sometimes called 'enums' or enumerables, i.e. a small set of valid values, like a dropdown box.\n", "\n", "These categories can often indicate particular slices and design decisions made by data providers." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Geo_Name object\n", "Geo_Code object\n", "Mid_Year_Ending int64\n", "Gender object\n", "Age int64\n", "Population_Estimate int64\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at Gender, we can see there are three available 'values': Males, Females, and All persons.\n", "\n", "From context we can guess that 'All persons' is a double-count of the Males/Females breakdown."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['All persons', 'Females', 'Males'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Gender'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things slightly more lightweight (and significantly faster when used correctly...), we can create a new categorical type that's more efficient using `pd.CategoricalDtype`\n", "\n", "This is useful for situations where there are a very small number of possible (usually string) options, but a very large number of entries. \n", "\n", "Good examples of this are found in the clothing industry:\n", "\n", "* Size (X-Small, Small, Medium, Large, X-Large)\n", "* Color (Red, Black, White)\n", "* Style (Short sleeve, long sleeve)\n", "* Material (Cotton, Polyester)\n", "\n", "This categorisation is a form of 'encoding' where, behind the scenes, `pandas` will create a very small lookup table between a set of numbers, ${0,1,2}$ for instance, and will use those values in the in-memory dataframe, replacing the very redundant existing labels: {'All persons', 'Males', 'Females'}\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 93366 entries, 0 to 93365\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Gender 93366 non-null category\n", "dtypes: category(1)\n", "memory usage: 91.4 KB\n" ] } ], "source": [ "df['Gender'].astype('category').to_frame().info()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 93366 entries, 0 to 93365\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Gender 93366 non-null object\n", "dtypes: object(1)\n", "memory usage: 729.5+ KB\n" ] } ], "source": [ "df['Gender'].to_frame().info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It would be completely reasonable to ask \"What's the point of this? 
This is a tiny dataset and size doesn't matter\".\n", "\n", "However, having a solid understanding of the data you're working with, and being able to embed your assumptions and contexts in to the data itself as 'metadata' is extremely valuable for tripping you up if you do something stupid.\n", "\n", "For instance, if at some point during your analysis you accidentally swap columns around and instead of querying to the 'minimum age', you try to ask for the 'minimum gender'; as these are non-lexical categories, this makes no sense, and `pandas` will helpfully tell you off for trying to do something so silly" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Categorical is not ordered for operation min\nyou can use .as_ordered() to change the Categorical to an ordered one\n", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Gender'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'category'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36mstat_func\u001b[1;34m(self, axis, skipna, level, numeric_only, **kwargs)\u001b[0m\n\u001b[0;32m 11212\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlevel\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11213\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_agg_by_level\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m> 11214\u001b[1;33m return self._reduce(\n\u001b[0m\u001b[0;32m 11215\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mnumeric_only\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11216\u001b[0m )\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\series.py\u001b[0m in \u001b[0;36m_reduce\u001b[1;34m(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)\u001b[0m\n\u001b[0;32m 3870\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3871\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdelegate\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mCategorical\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3872\u001b[1;33m 
\u001b[1;32mreturn\u001b[0m \u001b[0mdelegate\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_reduce\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3873\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdelegate\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mExtensionArray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3874\u001b[0m \u001b[1;31m# dispatch to ExtensionArray interface\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36m_reduce\u001b[1;34m(self, name, axis, **kwargs)\u001b[0m\n\u001b[0;32m 2123\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfunc\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2124\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mf\"Categorical cannot perform the operation {name}\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2125\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2126\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2127\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mdeprecate_kwarg\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mold_arg_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"numeric_only\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnew_arg_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"skipna\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\util\\_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 212\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 213\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mnew_arg_name\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnew_arg_value\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 214\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 215\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 216\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mcast\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mF\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36mmin\u001b[1;34m(self, skipna, **kwargs)\u001b[0m\n\u001b[0;32m 2146\u001b[0m \"\"\"\n\u001b[0;32m 2147\u001b[0m 
\u001b[0mnv\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalidate_min\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2148\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcheck_for_ordered\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"min\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2149\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2150\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_codes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36mcheck_for_ordered\u001b[1;34m(self, op)\u001b[0m\n\u001b[0;32m 1491\u001b[0m \u001b[1;34m\"\"\" assert that we are ordered \"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1492\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mordered\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1493\u001b[1;33m raise TypeError(\n\u001b[0m\u001b[0;32m 1494\u001b[0m \u001b[1;34mf\"Categorical is not ordered for operation {op}\\n\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1495\u001b[0m \u001b[1;34m\"you can use .as_ordered() to change the \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: Categorical is not ordered for operation min\nyou can use .as_ordered() to change the Categorical to an ordered one\n" ] } ], "source": [ "df['Gender'].astype('category').min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas.api.types import CategoricalDtype\n", "\n", "_gender_type = CategoricalDtype(\n", " categories=df['Gender'].unique(), \n", " ordered=False\n", ")\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "93361 False\n", "93362 False\n", "93363 False\n", "93364 False\n", "93365 False\n", "Length: 93366, dtype: bool" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['Mid_Year_Ending'] == 2019) & (df['Gender'] == 'All persons')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1893667" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/great_success.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding Data: \n", "\n", "This population data set has:\n", "* Multiple years of data put together with temporal 'duplicates'\n", "* Multiple geographic regions put together\n", "* Gender counted 'twice'.\n", "\n", "## Question 1: Binary Computation\n", "\n", "How does this dataset treat people with non binary gender identity?\n", "\n", "How can you test that?\n", "\n", "Try it out for yourself and check Q1.ipynb for an answer.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Finally! Some Statistics!" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all = df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]\n", "df_2019_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of data manipulation is removing columns that aren't relevant to us. \n", "We've created a new slimmed-down version of the dataframe with just the 2019 data, disregarding gender.\n", "\n", "But we can see that it still takes up a bit of memory." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1638 entries, 4914 to 93183\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Geo_Name 1638 non-null object\n", " 1 Geo_Code 1638 non-null object\n", " 2 Mid_Year_Ending 1638 non-null int64 \n", " 3 Gender 1638 non-null object\n", " 4 Age 1638 non-null int64 \n", " 5 Population_Estimate 1638 non-null int64 \n", "dtypes: int64(3), object(3)\n", "memory usage: 376.4 KB\n" ] } ], "source": [ "df_2019_all.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we forgot about (or 'dropped') the Gender and Mid_Year_Ending columns, because we don't need them, and the 'Geo_Code' one, because we're not going to use it?"
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "\"['Gender' 'Mid_Year_Ending' 'Geo_Code'] not found in axis\"", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf_2019_all\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Gender'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'Mid_Year_Ending'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'Geo_Code'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\frame.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[0;32m 3988\u001b[0m \u001b[0mweight\u001b[0m \u001b[1;36m1.0\u001b[0m \u001b[1;36m0.8\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3989\u001b[0m \"\"\"\n\u001b[1;32m-> 3990\u001b[1;33m return super().drop(\n\u001b[0m\u001b[0;32m 3991\u001b[0m \u001b[0mlabels\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3992\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[0;32m 3934\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;32min\u001b[0m \u001b[0maxes\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3935\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3936\u001b[1;33m \u001b[0mobj\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_drop_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3937\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3938\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0minplace\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m_drop_axis\u001b[1;34m(self, labels, axis, level, errors)\u001b[0m\n\u001b[0;32m 3968\u001b[0m \u001b[0mnew_axis\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3969\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3970\u001b[1;33m \u001b[0mnew_axis\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3971\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreindex\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0maxis_name\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mnew_axis\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3972\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\base.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, errors)\u001b[0m\n\u001b[0;32m 5016\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0many\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5017\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[1;34m\"ignore\"\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 5018\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mf\"{labels[mask]} not found in axis\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 5019\u001b[0m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mindexer\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m~\u001b[0m\u001b[0mmask\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5020\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdelete\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: \"['Gender' 'Mid_Year_Ending' 'Geo_Code'] not found in axis\"" ] } ], "source": [ "df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoops! `drop` defaults to thinking about rows, when we want columns; " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "4915 Belfast East 1 1163\n", "4916 Belfast East 2 1170\n", "4917 Belfast East 3 1159\n", "4918 Belfast East 4 1150\n", "... ... ... ...\n", "93179 West Tyrone 86 266\n", "93180 West Tyrone 87 255\n", "93181 West Tyrone 88 208\n", "93182 West Tyrone 89 158\n", "93183 West Tyrone 90 608\n", "\n", "[1638 rows x 3 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, this is important; **we have not changed either the original `df` data frame or the `df_2019_all` data frame**, what is displayed above is just the *result* of the `drop` function." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clearest way to do what we mean to do is to simply reassign the dataframe back to 'itself' (If you want to know the innards of this, it's complicated, but [here](https://realpython.com/python-memory-management/) is a good read)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "df_2019_all = df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'], axis=1)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "4915 Belfast East 1 1163\n", "4916 Belfast East 2 1170\n", "4917 Belfast East 3 1159\n", "4918 Belfast East 4 1150\n", "... ... ... ...\n", "93179 West Tyrone 86 266\n", "93180 West Tyrone 87 255\n", "93181 West Tyrone 88 208\n", "93182 West Tyrone 89 158\n", "93183 West Tyrone 90 608\n", "\n", "[1638 rows x 3 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1638 entries, 4914 to 93183\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Geo_Name 1638 non-null object\n", " 1 Age 1638 non-null int64 \n", " 2 Population_Estimate 1638 non-null int64 \n", "dtypes: int64(2), object(1)\n", "memory usage: 229.2 KB\n" ] } ], "source": [ "df_2019_all.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously this didn't exactly save the world in this case, but it's useful to know about..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Statistics\n", "\n", "### Population by age\n", "`pandas.groupby` allows you to segment the dataset by a particular column (or multiples as we'll see later).\n", "\n", "So, the way to read the below is:\n", "> For each value of Age, give me the sum of the Population_Estimate" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "0 22721\n", "1 23408\n", "2 24206\n", "3 25074\n", "4 24960\n", " ... \n", "86 5789\n", "87 5005\n", "88 4375\n", "89 3617\n", "90 13734\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all.groupby('Age')['Population_Estimate'].sum() # select sum(population_estimate) from df_2019_all group by Age " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to think about this is by taking each group in turn" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "10101 Belfast North 0 1399\n", "15288 Belfast South 0 1246\n", "20475 Belfast West 0 1260\n", "25662 East Antrim 0 857\n", "30849 East Londonderry 0 1097\n", "36036 Fermanagh and South Tyrone 0 1425\n", "41223 Foyle 0 1303\n", "46410 Lagan Valley 0 1320\n", "51597 Mid Ulster 0 1412\n", "56784 Newry and Armagh 0 1678\n", "61971 North Antrim 0 1268\n", "67158 North Down 0 865\n", "72345 South Antrim 0 1208\n", "77532 South Down 0 1458\n", "82719 Strangford 0 936\n", "87906 Upper Bann 0 1693\n", "93093 West Tyrone 0 1155" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for age_value, group in df_2019_all.groupby('Age'):\n", " break # Interrupts the flow of the loop\n", "group" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "22721" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the most populated age group?\n", "\n", "This starts off as an easy one, but depends on your definition of 'age group'....\n", "\n", "_Now we start to see that Statistics can be used by unscrupulous people to change a narrative to suit themselves... be on the watch for politicians carrying statistics..._" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "0 22721\n", "1 23408\n", "2 24206\n", "3 25074\n", "4 24960\n", " ... \n", "86 5789\n", "87 5005\n", "88 4375\n", "89 3617\n", "90 13734\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_pop = df_2019_all.groupby('Age')['Population_Estimate'].sum()\n", "s_age_pop" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "89 3617\n", "88 4375\n", "87 5005\n", "86 5789\n", "85 6219\n", " ... \n", "7 26274\n", "52 26301\n", "53 26344\n", "51 26499\n", "54 26776\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_pop.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Easy peasy, 54 is the most common age, and it *looks like* the 50's are well represented in this data." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4914 NaN\n", "4915 (0.0, 18.0]\n", "4916 (0.0, 18.0]\n", "4917 (0.0, 18.0]\n", "4918 (0.0, 18.0]\n", " ... 
\n", "93179 (50.0, 90.0]\n", "93180 (50.0, 90.0]\n", "93181 (50.0, 90.0]\n", "93182 (50.0, 90.0]\n", "93183 (50.0, 90.0]\n", "Name: Age, Length: 1638, dtype: category\n", "Categories (4, interval[int64]): [(0, 18] < (18, 30] < (30, 50] < (50, 90]]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(df_2019_all.Age, [0,18,30,50,90])" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(0, 10] 252321\n", "(10, 20] 233470\n", "(20, 30] 239596\n", "(30, 40] 250481\n", "(40, 50] 243976\n", "(50, 60] 253225\n", "(60, 70] 190199\n", "(70, 80] 136211\n", "(80, 90] 71467\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop = df_2019_all.groupby(\n", " pd.cut(df_2019_all.Age, [0,10,20,30,40,50,60,70,80,90])\n", ")['Population_Estimate'].sum()\n", "s_age_grp_pop" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(80, 90] 71467\n", "(70, 80] 136211\n", "(60, 70] 190199\n", "(10, 20] 233470\n", "(20, 30] 239596\n", "(40, 50] 243976\n", "(30, 40] 250481\n", "(0, 10] 252321\n", "(50, 60] 253225\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it's a real toss up between 50/60's and the children!\n", "\n", "But what if we modified the age groups to something like [Marketing people care about](https://marketingartfully.com/customer-demographics-age-demographics-for-advertising/)? " ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(0, 12] 303398\n", "(12, 17] 114586\n", "(17, 25] 183719\n", "(25, 35] 250458\n", "(35, 45] 239822\n", "(45, 55] 260036\n", "(55, 65] 223084\n", "(65, 90] 295843\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop = df_2019_all.groupby(\n", " pd.cut(df_2019_all.Age, [0,12,17,25,35,45,55,65,90])\n", ")['Population_Estimate'].sum()\n", "s_age_grp_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about if we wanted to estimate the 'average age'?\n", "\n", "We do it with 'normal' numbers by simply summing all the values and dividing by the number of values, but we're not dealing with individual values here, we're dealing with 'bucketised' values. And we can do that too!" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "73899465" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is the total number of years lived by everyone in northern ireland\n", "# It's approximately the same time ago as when the meteorite killed the dinosaurs\n", "(df_2019_all.Age*df_2019_all.Population_Estimate).sum() " ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39.02453018402919" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_2019_all.Age*df_2019_all.Population_Estimate).sum() / df_2019_all.Population_Estimate.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh! 
Well, that's different! Depending on how you say things, the following statistics are all true:\n", "1. The average age in Northern Ireland is about 39\n", "2. More of the population are 54 than any other age\n", "3. Under-12s are the most populous age group\n", "\n", "Is there anything else we can say that's similarly contradictory?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning outcomes\n", "\n", "So far we've worked out:\n", "* How to read CSV files into `pandas` dataframes\n", "* How to validate data against expectations\n", "* How to select and filter data in these dataframes based on columns and their values\n", "* How to make basic aggregations across data\n", "* How to make aggregations across groups of data\n", "\n", "Now we're gonna make them pretty... in the next worksheet." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df.to_csv('ihatemyself.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }