{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Acquisition and Cleaning\n", "\n", "This is by far and away the most valuable skill in Data Science.\n", "\n", "Not the fancy graphs, not the clever machine learning...\n", "\n", "The ability to import and manipulate data from a format that you get, to a format that you want, is **the** data science skill.\n", "\n", "Don't let anyone tell you differently.\n", "\n", "![](img/math_skill.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do we need to clean data? \n", "\n", "Because a lot of the time, no one expected machines to read the data you want to play with.\n", "\n", "\n", "* A computer might only read one of these header rows\n", "![](img/multirow.png)\n", "* A computer doesn't automatically know what a `\\*` or `#` means, and will assume the whole column is just strings\n", "![](img/starman.png)\n", "![](img/hashs.png)\n", "* A computer doesn't automatically know that these aren't \"valid\" data rows, or how to interpret them\n", "![](img/footer.png)\n", "\n", "Long story short, half of the battle of doing Data Science is translating between human/business expectation and computer/programming clarity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where can we get data?\n", "\n", "All over the place\n", "\n", "* [Kaggle](https://www.kaggle.com/datasets)\n", "* [fivethirtyeight](https://github.com/fivethirtyeight/data)\n", "* [NASA](https://earthdata.nasa.gov/)\n", "\n", "And hundreds more. But for our purposes, I like to start locally.\n", "\n", "![https://www.opendatani.gov.uk/](img/opendata-fp.png)\n", "\n", "https://www.opendatani.gov.uk/\n", "\n", "### Sidebar\n", "\n", "I am both a massive proponent and regular critic of Open Data in Northern Ireland. The team are great but some of the data that's been put on the platform is woefully inadequate and really difficult to work with. \n", "\n", "See below for a rant.\n", "\n", "[![](http://img.youtube.com/vi/mtrIEW2nCMc/0.jpg)](http://www.youtube.com/watch?v=mtrIEW2nCMc) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Data\n", "\n", "We're going to start off with a relatively 'easy' one; [Population Estimates for Northern Ireland by Parliamentary Constituency](https://www.opendatani.gov.uk/dataset/population-estimates-for-northern-ireland/resource/67c25586-b9aa-4717-9a4b-42de21a403f2)\n", "\n", "![](img/pop_cons.png)\n", "\n", "Spoiler Alert: _While this data set doesn't have any (known) errors, it does require some manipulation to make sense of. 
Also, the fact that it includes Geographic Data means that we can do some really cool stuff in the Data Visualisation section...._\n", "\n", "However, now that we're starting to do some real code, we need a mascot....\n", "\n", "![](img/panda.png)\n", "\n", "`pandas` is a python _package_ (sometimes called a _module_), and you can think of packages as big (or small) boxes of functionality that we can bring to bear to solve a problem: custom toolboxes for different operations, for example.\n", "\n", "One of the major strengths of the Python Data Science ecosystem is the range of mature and well-maintained packages that sit outside of the 'standard library' that python ships with.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # this is just a convention, 'pandas' is the special word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the above line doesn't work, you will need to install `pandas` in your environment. \n", "\n", "This should be as simple as one of the following jupyter lines" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): ...working... done\n", "Solving environment: ...working... done\n", "\n", "# All requested packages already installed.\n", "\n", "\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%conda install pandas" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in c:\\users\\me\\anaconda3\\lib\\site-packages (1.0.5)\n", "Requirement already satisfied: python-dateutil>=2.6.1 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (2.8.1)\n", "Requirement already satisfied: pytz>=2017.2 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (2020.1)\n", "Requirement already satisfied: numpy>=1.13.3 in c:\\users\\me\\anaconda3\\lib\\site-packages (from pandas) (1.18.5)\n", "Requirement already satisfied: six>=1.5 in c:\\users\\me\\anaconda3\\lib\\site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_The trick here is that the `%` symbol tells `jupyter` that the rest of the line isn't ordinary `python`: `%conda` and `%pip` are 'magic' commands that hand the line over to the conda/pip installers for the environment your kernel is running in_\n", "\n", "Now that we've got that sorted, we can get to downloading the data set and seeing what `pandas` can do for us..." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading...\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "0 Belfast East N06000001 2001 All persons 0 \n", "1 Belfast East N06000001 2001 All persons 1 \n", "2 Belfast East N06000001 2001 All persons 2 \n", "3 Belfast East N06000001 2001 All persons 3 \n", "4 Belfast East N06000001 2001 All persons 4 \n", "\n", " Population_Estimate \n", "0 827 \n", "1 1045 \n", "2 1159 \n", "3 1032 \n", "4 1106 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "\n", "\n", "# Jump to the 'asides' to understand the operation of the walrus (:=)\n", "\n", "# This is really 'clever' and fun but don't worry about it, it just says \n", "# \"If I've already downloaded the file, read it from there, otherwise, set\n", "# the source to be that url\"\n", "if not (source:= Path('data/ni_pop.csv')).exists():\n", " print('Downloading...')\n", " source = 'https://www.opendatani.gov.uk/dataset/62e7073f-e924-4d3f-81a5-ad45b5127682/resource/67c25586-b9aa-4717-9a4b-42de21a403f2/download/parliamentary-constituencies-by-single-year-of-age-and-gender-mid-2001-to-mid-2019.csv'\n", "\n", "df = pd.read_csv(source) # `read_csv` can read from URLs as well as from local files\n", "df.to_csv('data/ni_pop.csv', index=False) # Stash for later\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(93366, 6)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`head()` gives us the first 5 entries, and `shape` tells us that there are more than 90 thousand rows and 6 data columns.\n", "\n", "`pandas` also provides simple aggregation functions." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "68166520" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoa... That doesn't look right...\n", "\n", "Maybe check that...\n", "\n", "![](img/ni_pop.png)\n", "\n", "So what's going wrong here?\n", "\n", "![](img/ni_pop_info.png)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,\n", " 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], dtype=int64)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Mid_Year_Ending'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply logical conditions to the columns (or to several columns at once).\n", "\n", "Using this, `pandas` allows you to select within a dataframe (like a table in Excel, except it supports a near-infinite number of columns/rows)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "93361 True\n", "93362 True\n", "93363 True\n", "93364 True\n", "93365 True\n", "Name: Mid_Year_Ending, Length: 93366, dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Mid_Year_Ending'] == 2019" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93361 West Tyrone N06000018 2019 Males 86 \n", "93362 West Tyrone N06000018 2019 Males 87 \n", "93363 West Tyrone N06000018 2019 Males 88 \n", "93364 West Tyrone N06000018 2019 Males 89 \n", "93365 West Tyrone N06000018 2019 Males 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93361 103 \n", "93362 113 \n", "93363 80 \n", "93364 52 \n", "93365 196 \n", "\n", "[4914 rows x 6 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Mid_Year_Ending'] == 2019] ## select * from df where 'mid_year_ending' = 2019" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3787334" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Mid_Year_Ending'] == 2019]['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so we're at least in the correct order of magnitude, but still very wrong... Any ideas?\n", "\n", "`pandas` data frames are stored in columns. Each column is assumed to be a particular `type`. \n", "\n", "When you read in from a CSV file, pandas tries its best to guess what type each incoming column is.\n", "\n", "For instance, we can see that 'Mid_Year_Ending', 'Age', and 'Population_Estimate' all get the `int64` type, which means each value is stored in memory as a 64-bit Integer. \n", "\n", "The `object` types, however, indicate that the guessing has just given up and treated those columns as boring strings. But we know that 'Gender' and 'Geo_Name' are really `Categories`, sometimes called 'enums' or enumerables, i.e. a small set of valid values, like a dropdown box.\n", "\n", "These categories can often indicate particular slices and design decisions made by data providers." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Geo_Name object\n", "Geo_Code object\n", "Mid_Year_Ending int64\n", "Gender object\n", "Age int64\n", "Population_Estimate int64\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at Gender, we can see there are three available 'values': Males, Females, and All persons.\n", "\n", "From context we can guess that 'All persons' is a double-count of the Males/Females breakdown."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['All persons', 'Females', 'Males'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Gender'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things slightly more lightweight (and significantly faster when used correctly...), we can create a new categorical type that's more efficient using `pd.CategoricalDtype`\n", "\n", "This is useful for situations where there are a very small number of possible (usually string) options, but a very large number of entries. \n", "\n", "Good examples of this are found in the clothing industry:\n", "\n", "* Size (X-Small, Small, Medium, Large, X-Large)\n", "* Color (Red, Black, White)\n", "* Style (Short sleeve, long sleeve)\n", "* Material (Cotton, Polyester)\n", "\n", "This categorisation is a form of 'encoding' where, behind the scenes, `pandas` will create a very small lookup table between a set of numbers, ${0,1,2}$ for instance, and will use those values in the in-memory dataframe, replacing the very redundant existing labels: {'All persons', 'Males', 'Females'}\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 93366 entries, 0 to 93365\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Gender 93366 non-null category\n", "dtypes: category(1)\n", "memory usage: 91.4 KB\n" ] } ], "source": [ "df['Gender'].astype('category').to_frame().info()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 93366 entries, 0 to 93365\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Gender 93366 non-null object\n", "dtypes: object(1)\n", "memory usage: 729.5+ KB\n" ] } ], "source": [ "df['Gender'].to_frame().info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It would be completely reasonable to ask \"What's the point of this? 
This is a tiny dataset and size doesn't matter\".\n", "\n", "However, having a solid understanding of the data you're working with, and being able to embed your assumptions and contexts in to the data itself as 'metadata' is extremely valuable for tripping you up if you do something stupid.\n", "\n", "For instance, if at some point during your analysis you accidentally swap columns around and instead of querying to the 'minimum age', you try to ask for the 'minimum gender'; as these are non-lexical categories, this makes no sense, and `pandas` will helpfully tell you off for trying to do something so silly" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Categorical is not ordered for operation min\nyou can use .as_ordered() to change the Categorical to an ordered one\n", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Gender'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'category'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36mstat_func\u001b[1;34m(self, axis, skipna, level, numeric_only, **kwargs)\u001b[0m\n\u001b[0;32m 11212\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlevel\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11213\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_agg_by_level\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m> 11214\u001b[1;33m return self._reduce(\n\u001b[0m\u001b[0;32m 11215\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mnumeric_only\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11216\u001b[0m )\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\series.py\u001b[0m in \u001b[0;36m_reduce\u001b[1;34m(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)\u001b[0m\n\u001b[0;32m 3870\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3871\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdelegate\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mCategorical\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3872\u001b[1;33m 
\u001b[1;32mreturn\u001b[0m \u001b[0mdelegate\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_reduce\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3873\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdelegate\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mExtensionArray\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3874\u001b[0m \u001b[1;31m# dispatch to ExtensionArray interface\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36m_reduce\u001b[1;34m(self, name, axis, **kwargs)\u001b[0m\n\u001b[0;32m 2123\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfunc\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2124\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mf\"Categorical cannot perform the operation {name}\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2125\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2126\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2127\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mdeprecate_kwarg\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mold_arg_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"numeric_only\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnew_arg_name\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"skipna\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\util\\_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 212\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 213\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mnew_arg_name\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnew_arg_value\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 214\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 215\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 216\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mcast\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mF\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36mmin\u001b[1;34m(self, skipna, **kwargs)\u001b[0m\n\u001b[0;32m 2146\u001b[0m \"\"\"\n\u001b[0;32m 2147\u001b[0m 
\u001b[0mnv\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalidate_min\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2148\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcheck_for_ordered\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"min\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2149\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2150\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_codes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\arrays\\categorical.py\u001b[0m in \u001b[0;36mcheck_for_ordered\u001b[1;34m(self, op)\u001b[0m\n\u001b[0;32m 1491\u001b[0m \u001b[1;34m\"\"\" assert that we are ordered \"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1492\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mordered\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1493\u001b[1;33m raise TypeError(\n\u001b[0m\u001b[0;32m 1494\u001b[0m \u001b[1;34mf\"Categorical is not ordered for operation {op}\\n\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1495\u001b[0m \u001b[1;34m\"you can use .as_ordered() to change the \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: Categorical is not ordered for operation min\nyou can use .as_ordered() to change the Categorical to an ordered one\n" ] } ], "source": [ "df['Gender'].astype('category').min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas.api.types import CategoricalDtype\n", "\n", "_gender_type = CategoricalDtype(\n", " categories=df['Gender'].unique(), \n", " ordered=False\n", ")\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "93361 False\n", "93362 False\n", "93363 False\n", "93364 False\n", "93365 False\n", "Length: 93366, dtype: bool" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['Mid_Year_Ending'] == 2019) & (df['Gender'] == 'All persons')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1893667" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/great_success.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding Data: \n", "\n", "This population data set has:\n", "* Multiple years of data put together with temporal 'duplicates'\n", "* Multiple geographic regions put together\n", "* Gender counted 'twice'.\n", "\n", "## Question 1: Binary Computation\n", "\n", "How does this dataset treat people with non binary gender identity?\n", "\n", "How can you test that?\n", "\n", "Try it out for yourself and check Q1.ipynb for an answer.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Finally! Some Statistics!" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all = df[\n", " (df['Mid_Year_Ending'] == 2019) & \n", " (df['Gender'] == 'All persons')\n", "]\n", "df_2019_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of data manipulation is removing columns that aren't relevant to us. \n", "We've created a new slimmed-down version of the dataframe with just the 2019 data, disregarding gender.\n", "\n", "But we can see that it still takes up a bit of memory." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1638 entries, 4914 to 93183\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Geo_Name 1638 non-null object\n", " 1 Geo_Code 1638 non-null object\n", " 2 Mid_Year_Ending 1638 non-null int64 \n", " 3 Gender 1638 non-null object\n", " 4 Age 1638 non-null int64 \n", " 5 Population_Estimate 1638 non-null int64 \n", "dtypes: int64(3), object(3)\n", "memory usage: 376.4 KB\n" ] } ], "source": [ "df_2019_all.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we forgot about (or 'dropped') the Gender and Mid_Year_Ending columns, because we don't need them, and the 'Geo_Code' one, because we're not going to use it?"
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "\"['Gender' 'Mid_Year_Ending' 'Geo_Code'] not found in axis\"", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf_2019_all\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Gender'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'Mid_Year_Ending'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'Geo_Code'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\frame.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[0;32m 3988\u001b[0m \u001b[0mweight\u001b[0m \u001b[1;36m1.0\u001b[0m \u001b[1;36m0.8\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3989\u001b[0m \"\"\"\n\u001b[1;32m-> 3990\u001b[1;33m return super().drop(\n\u001b[0m\u001b[0;32m 3991\u001b[0m \u001b[0mlabels\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3992\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[0;32m 3934\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;32min\u001b[0m \u001b[0maxes\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3935\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlabels\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3936\u001b[1;33m \u001b[0mobj\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_drop_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3937\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3938\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0minplace\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m_drop_axis\u001b[1;34m(self, labels, axis, level, errors)\u001b[0m\n\u001b[0;32m 3968\u001b[0m \u001b[0mnew_axis\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[0mlevel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3969\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3970\u001b[1;33m \u001b[0mnew_axis\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3971\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreindex\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0maxis_name\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mnew_axis\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3972\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\base.py\u001b[0m in \u001b[0;36mdrop\u001b[1;34m(self, labels, errors)\u001b[0m\n\u001b[0;32m 5016\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0many\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5017\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[1;34m\"ignore\"\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 5018\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mf\"{labels[mask]} not found in axis\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 5019\u001b[0m \u001b[0mindexer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mindexer\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m~\u001b[0m\u001b[0mmask\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5020\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdelete\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: \"['Gender' 'Mid_Year_Ending' 'Geo_Code'] not found in axis\"" ] } ], "source": [ "df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoops! `drop` defaults to thinking about rows, when we want columns; " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "4915 Belfast East 1 1163\n", "4916 Belfast East 2 1170\n", "4917 Belfast East 3 1159\n", "4918 Belfast East 4 1150\n", "... ... ... ...\n", "93179 West Tyrone 86 266\n", "93180 West Tyrone 87 255\n", "93181 West Tyrone 88 208\n", "93182 West Tyrone 89 158\n", "93183 West Tyrone 90 608\n", "\n", "[1638 rows x 3 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, this is important; **we have not changed either the original `df` data frame or the `df_2019_all` data frame**, what is displayed above is just the *result* of the `drop` function." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Geo_Code Mid_Year_Ending Gender Age \\\n", "4914 Belfast East N06000001 2019 All persons 0 \n", "4915 Belfast East N06000001 2019 All persons 1 \n", "4916 Belfast East N06000001 2019 All persons 2 \n", "4917 Belfast East N06000001 2019 All persons 3 \n", "4918 Belfast East N06000001 2019 All persons 4 \n", "... ... ... ... ... ... \n", "93179 West Tyrone N06000018 2019 All persons 86 \n", "93180 West Tyrone N06000018 2019 All persons 87 \n", "93181 West Tyrone N06000018 2019 All persons 88 \n", "93182 West Tyrone N06000018 2019 All persons 89 \n", "93183 West Tyrone N06000018 2019 All persons 90 \n", "\n", " Population_Estimate \n", "4914 1141 \n", "4915 1163 \n", "4916 1170 \n", "4917 1159 \n", "4918 1150 \n", "... ... \n", "93179 266 \n", "93180 255 \n", "93181 208 \n", "93182 158 \n", "93183 608 \n", "\n", "[1638 rows x 6 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clearest way to do what we mean to do is to simply reassign the dataframe back to 'itself' (If you want to know the innards of this, it's complicated, but [here](https://realpython.com/python-memory-management/) is a good read)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "df_2019_all = df_2019_all.drop(['Gender','Mid_Year_Ending','Geo_Code'], axis=1)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "4915 Belfast East 1 1163\n", "4916 Belfast East 2 1170\n", "4917 Belfast East 3 1159\n", "4918 Belfast East 4 1150\n", "... ... ... ...\n", "93179 West Tyrone 86 266\n", "93180 West Tyrone 87 255\n", "93181 West Tyrone 88 208\n", "93182 West Tyrone 89 158\n", "93183 West Tyrone 90 608\n", "\n", "[1638 rows x 3 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1638 entries, 4914 to 93183\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Geo_Name 1638 non-null object\n", " 1 Age 1638 non-null int64 \n", " 2 Population_Estimate 1638 non-null int64 \n", "dtypes: int64(2), object(1)\n", "memory usage: 229.2 KB\n" ] } ], "source": [ "df_2019_all.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously this didn't exactly save the world in this case, but it's useful to know about..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Statistics\n", "\n", "### Population by age\n", "`pandas.groupby` allows you to segment the dataset by a particular column (or multiples as we'll see later).\n", "\n", "So, the way to read the below is:\n", "> For each value of Age, give me the sum of the Population_Estimate" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "0 22721\n", "1 23408\n", "2 24206\n", "3 25074\n", "4 24960\n", " ... \n", "86 5789\n", "87 5005\n", "88 4375\n", "89 3617\n", "90 13734\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2019_all.groupby('Age')['Population_Estimate'].sum() # select sum(population_estimate) from df_2019_all group by Age " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to think about this is by taking each group in turn" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Geo_Name Age Population_Estimate\n", "4914 Belfast East 0 1141\n", "10101 Belfast North 0 1399\n", "15288 Belfast South 0 1246\n", "20475 Belfast West 0 1260\n", "25662 East Antrim 0 857\n", "30849 East Londonderry 0 1097\n", "36036 Fermanagh and South Tyrone 0 1425\n", "41223 Foyle 0 1303\n", "46410 Lagan Valley 0 1320\n", "51597 Mid Ulster 0 1412\n", "56784 Newry and Armagh 0 1678\n", "61971 North Antrim 0 1268\n", "67158 North Down 0 865\n", "72345 South Antrim 0 1208\n", "77532 South Down 0 1458\n", "82719 Strangford 0 936\n", "87906 Upper Bann 0 1693\n", "93093 West Tyrone 0 1155" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for age_value, group in df_2019_all.groupby('Age'):\n", " break # Interrupts the flow of the loop\n", "group" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "22721" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group['Population_Estimate'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the most populated age group?\n", "\n", "This starts off as an easy one, but depends on your definition of 'age group'....\n", "\n", "_Now we start to see that Statistics can be used by unscrupulous people to change a narrative to suit themselves... be on the watch for politicians carrying statistics..._" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "0 22721\n", "1 23408\n", "2 24206\n", "3 25074\n", "4 24960\n", " ... \n", "86 5789\n", "87 5005\n", "88 4375\n", "89 3617\n", "90 13734\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_pop = df_2019_all.groupby('Age')['Population_Estimate'].sum()\n", "s_age_pop" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "89 3617\n", "88 4375\n", "87 5005\n", "86 5789\n", "85 6219\n", " ... \n", "7 26274\n", "52 26301\n", "53 26344\n", "51 26499\n", "54 26776\n", "Name: Population_Estimate, Length: 91, dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_pop.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Easy peasy, 54 is the most common age, and it *looks like* the 50's are well represented in this data." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4914 NaN\n", "4915 (0.0, 18.0]\n", "4916 (0.0, 18.0]\n", "4917 (0.0, 18.0]\n", "4918 (0.0, 18.0]\n", " ... 
\n", "93179 (50.0, 90.0]\n", "93180 (50.0, 90.0]\n", "93181 (50.0, 90.0]\n", "93182 (50.0, 90.0]\n", "93183 (50.0, 90.0]\n", "Name: Age, Length: 1638, dtype: category\n", "Categories (4, interval[int64]): [(0, 18] < (18, 30] < (30, 50] < (50, 90]]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(df_2019_all.Age, [0,18,30,50,90])" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(0, 10] 252321\n", "(10, 20] 233470\n", "(20, 30] 239596\n", "(30, 40] 250481\n", "(40, 50] 243976\n", "(50, 60] 253225\n", "(60, 70] 190199\n", "(70, 80] 136211\n", "(80, 90] 71467\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop = df_2019_all.groupby(\n", " pd.cut(df_2019_all.Age, [0,10,20,30,40,50,60,70,80,90])\n", ")['Population_Estimate'].sum()\n", "s_age_grp_pop" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(80, 90] 71467\n", "(70, 80] 136211\n", "(60, 70] 190199\n", "(10, 20] 233470\n", "(20, 30] 239596\n", "(40, 50] 243976\n", "(30, 40] 250481\n", "(0, 10] 252321\n", "(50, 60] 253225\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it's a real toss up between 50/60's and the children!\n", "\n", "But what if we modified the age groups to something like [Marketing people care about](https://marketingartfully.com/customer-demographics-age-demographics-for-advertising/)? " ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age\n", "(0, 12] 303398\n", "(12, 17] 114586\n", "(17, 25] 183719\n", "(25, 35] 250458\n", "(35, 45] 239822\n", "(45, 55] 260036\n", "(55, 65] 223084\n", "(65, 90] 295843\n", "Name: Population_Estimate, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s_age_grp_pop = df_2019_all.groupby(\n", " pd.cut(df_2019_all.Age, [0,12,17,25,35,45,55,65,90])\n", ")['Population_Estimate'].sum()\n", "s_age_grp_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about if we wanted to estimate the 'average age'?\n", "\n", "We do it with 'normal' numbers by simply summing all the values and dividing by the number of values, but we're not dealing with individual values here, we're dealing with 'bucketised' values. And we can do that too!" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "73899465" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is the total number of years lived by everyone in northern ireland\n", "# It's approximately the same time ago as when the meteorite killed the dinosaurs\n", "(df_2019_all.Age*df_2019_all.Population_Estimate).sum() " ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39.02453018402919" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_2019_all.Age*df_2019_all.Population_Estimate).sum() / df_2019_all.Population_Estimate.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh! 
Well, that's different! Depending on how you say things, the following statistics are all true:\n", "1. The average age in Northern Ireland is about 39\n", "2. More of the population are 54 than any other age\n", "3. Under-12s are the most populous age group\n", "\n", "Is there anything else we can say that's similarly contradictory?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning outcomes\n", "\n", "So far we've worked out:\n", "* How to read CSV files into `pandas` dataframes\n", "* How to validate data against expectations\n", "* How to select and filter data in these dataframes based on columns and their values\n", "* How to make basic aggregations across data\n", "* How to make aggregations across groups of data\n", "\n", "Now we're gonna make them pretty... in the next worksheet." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df.to_csv('ihatemyself.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }