{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"float:left\">\n",
    "    <h1 style=\"width:450px\">Practical 8: Dimensions in Data</h1>\n",
    "    <h2 style=\"width:450px\">Transformation &amp; Dimensionality Reduction</h2>\n",
    "</div>\n",
    "<div style=\"float:right\"><img width=\"100\" src=\"https://github.com/jreades/i2p/raw/master/img/casa_logo.jpg\" /></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Complete | Part 1: Foundations | Part 2: Data | Part 3: Analysis |     |\n",
    "| :------- | :------------------ | :----------- | :--------------- | --: |\n",
    "| 70% | &#9619;&#9619;&#9619;&#9619;&#9619;&#9619;&#9619;&#9619; | &#9619;&#9619;&#9619;&#9619;&#9619;&#9619; | &#9619;&#9617;&#9617;&#9617;&#9617;&#9617; | 8/10\n",
    "\n",
    "In this session the focus is on MSOA-level Census data from 2011. We're going to explore this as a *possible* complement to the InsideAirbnb data. Although it's not ideal to use 2011 data with scraped from Airbnb this year, we: \n",
    "\n",
    "1. Have little choice as the 2021 data is only just starting to come through from the Census and the London Data Store hasn't been updated (still!); and \n",
    "2. Could usefully do a bit of thinking about whether the situation in 2011 might in some way help us to 'predict' the situation now...\n",
    "\n",
    "Ultimately, however, you don't *need* to use this for your analysis, this practical is intended as a demonstration of how transformation and dimensionality reduction work in practice and the kinds of issues that come up."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: There are a <em>lot</em> of links across sessions now, as well as some <em>forward links</em> to stuff we've not yet covered (see: <tt>pandas.merge</tt>). We'll pick these up as we move through the notebook.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preamble\n",
    "\n",
    "Let's start with the usual bits of code to ensure plotting works, to import packages and load the data into memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "import seaborn as sns\n",
    "\n",
    "import matplotlib.cm as cm\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import RobustScaler\n",
    "from sklearn.preprocessing import PowerTransformer\n",
    "\n",
    "import umap\n",
    "from kneed import knee_locator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from requests import get\n",
    "from urllib.parse import urlparse\n",
    "\n",
    "def cache_data(src:str, dest:str) -> str:\n",
    "    \"\"\"Downloads and caches a remote file locally.\n",
    "    \n",
    "    The function sits between the 'read' step of a pandas or geopandas\n",
    "    data frame and downloading the file from a remote location. The idea\n",
    "    is that it will save it locally so that you don't need to remember to\n",
    "    do so yourself. Subsequent re-reads of the file will return instantly\n",
    "    rather than downloading the entire file for a second or n-th itme.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    src : str\n",
    "        The remote *source* for the file, any valid URL should work.\n",
    "    dest : str\n",
    "        The *destination* location to save the downloaded file.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    str\n",
    "        A string representing the local location of the file.\n",
    "    \"\"\"\n",
    "    \n",
    "    url = urlparse(src) # We assume that this is some kind of valid URL \n",
    "    fn  = os.path.split(url.path)[-1] # Extract the filename\n",
    "    dfn = os.path.join(dest,fn) # Destination filename\n",
    "    \n",
    "    # Check if dest+filename does *not* exist -- \n",
    "    # that would mean we have to download it!\n",
    "    if not os.path.isfile(dfn):\n",
    "        \n",
    "        print(f\"{dfn} not found, downloading!\")\n",
    "\n",
    "        # Convert the path back into a list (without)\n",
    "        # the filename -- we need to check that directories\n",
    "        # exist first.\n",
    "        path = os.path.split(dest)\n",
    "        \n",
    "        # Create any missing directories in dest(ination) path\n",
    "        # -- os.path.join is the reverse of split (as you saw above)\n",
    "        # but it doesn't work with lists... so I had to google how\n",
    "        # to use the 'splat' operator! os.makedirs creates missing\n",
    "        # directories in a path automatically.\n",
    "        if len(path) >= 1 and path[0] != '':\n",
    "            os.makedirs(os.path.join(*path), exist_ok=True)\n",
    "            \n",
    "        # Download and write the file\n",
    "        with open(dfn, \"wb\") as file:\n",
    "            response = get(src)\n",
    "            file.write(response.content)\n",
    "            \n",
    "        print('Done downloading...')\n",
    "\n",
    "    else:\n",
    "        print(f\"Found {dfn} locally!\")\n",
    "\n",
    "    return dfn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1. Loading MSOA Census Data\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: By this point you should be fairly familiar with the UK's census geographies as you'll have encountered them in your GIS module. But in case you need a refresher, here's what the <a href=\"https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies\">Office for National Statistics</a> says.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>&#128161; Tip:</b> We're going to mix in the London's MSOA 'Atlas' from the <a href=\"https://data.london.gov.uk/dataset\">London Data Store</a>. I would <em>strongly</em> suggest that you have a look around the London Data Store as you develop your thinking for the group assessment -- you will likely find useful additional data there!\n",
    "</div>\n",
    "\n",
    "Once you see how we deal with this MSOA Atlas data you will be in a position to work with any other similarly complex (in terms of the headings and indexes) data set. If you're feeling particularly ambitious you can actually do this _same_ work at the LSOA scale using the [LSOA Atlas](https://data.london.gov.uk/dataset/lsoa-atlas) and LSOA boundaries... the process should be the same, though you will have smaller samples in each LSOA than you do in the MSOAs and calculations will take a bit longer to complete. You could also search on the ONS Census web site for data from 2021.\n",
    "\n",
    "There is a CSV file for the MSOA Atlas that would be easier to work with; however, the Excel file is useful for demonstrating how to work with multi-level indexes (an extension of last week's work). Notice that below we do two new things when reading the XLS file:\n",
    "\n",
    "1. We have to specify a sheet name because the file contains multiple sheets.\n",
    "2. We have to specify not just _one_ header, we actually have to specify three of them which generates a multi-level index (row 0 is the top-level, row 1 is the second-level, etc.)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1.1: Load MSOA Excel File\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>\n",
    "\n",
    "You might like to load the cached copy of the file into Excel so that you can see how the next bit works. You can find the rest of the MSOA Atlas [here](https://data.london.gov.uk/dataset/msoa-atlas)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "src_url   = 'https://data.london.gov.uk/download/msoa-atlas/39fdd8eb-e977-4d32-85a4-f65b92f29dcb/msoa-data.xls'\n",
    "dest_path = os.path.join('data','msoa')\n",
    "\n",
    "excel_atlas = pd.read_excel(\n",
    "    cache_data(src_url, dest_path), \n",
    "    ???, # Which sheet is the data in?\n",
    "    header=[0,1,2])            # Where are the column names... there's three of them!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the format of the output and notice that all of the empty cells in the Excel sheet have come through as `Unnamed: <col_no>_level_<level_no>`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Shape of the MSOA Atlas data frame is: {excel_atlas.shape[0]:,} x {excel_atlas.shape[1]:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get: `Shape of the MSOA Atlas data frame is: 984 x 207`, but how on earth are you going to access the data?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 1.2: Accessing MultiIndexes\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate, though the difficulty is conceptual, not technical.\n",
    "</div>\n",
    "\n",
    "Until now we have understood the pandas index as a single column-like 'thing' in a data frame, but pandas also supports hierarchical and grouped indexes that allow us to interact with data in more complex ways... should we need it. Generally:\n",
    "\n",
    "- MultiIndex == hierarchical index on *columns*\n",
    "- DataFrameGroupBy == iterable pseudo-hierarchical index on *rows*\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: We'll be looking at <b>Grouping Data</b> in much more detail in <a href=\"https://jreades.github.io/fsds/sessions/week9.html#lectures\">next week</a>, so the main thing to remember is that grouping is for rows, multi-indexing is about columns.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2.1 Direct Access\n",
    "\n",
    "Of course, one way to get at the data is to use `.iloc[...]` since that refers to columns by *position* and ignores the complexity of the index. Try printing out the the first five rows of the first column using `iloc`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.iloc[???]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get:\n",
    "\n",
    "```\n",
    "0    E02000001\n",
    "1    E02000002\n",
    "2    E02000003\n",
    "3    E02000004\n",
    "4    E02000005\n",
    "Name: (Unnamed: 0_level_0, Unnamed: 0_level_1, MSOA Code), dtype: object\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2.2 Named Access\n",
    "\n",
    "But to do it by name is a little trickier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.columns.tolist()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how asking for the first five columns has given us a list of... what exactly?\n",
    "\n",
    "So to get the **same output** by column *name* what do you need to copy from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.loc[0:5, ???]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's *really* awkward, so we're going to look for a better way..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2.3 Grouped Access\n",
    "\n",
    "Despite this, *one* way that MultiIndexes can be useful is for accessing column-slices from a wide dataframe. We can, for instance, select all of the Age Structure columns in one go and it will be *simpler* than what we did above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.loc[0:5, ('Age Structure (2011 Census)')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2.4 Understanding Levels\n",
    "\n",
    "This works because the MultiIndex tracks the columns using *levels*, with level 0 at the 'top' and level 2 (in our case) at the bottom. These are the unique *values* for the top level:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.columns.levels[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are the *values* for those levels across the actual columns in the data frame, notice the repeated 'Age Structure (2011 Census)':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.columns.get_level_values(0)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas.columns.get_level_values(1)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By extension, if we drop a level 0 index then *all* of the columns that it supports at levels 1 and 2 are _also_ dropped: so when we drop `Mid-year Estimate totals` from level 0 then all 11 of the 'Mid-year Estimate totals (2002...2012)' columns are dropped in one go."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "excel_atlas[['Mid-year Estimate totals']].head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test = excel_atlas.drop(columns=['Mid-year Estimate totals'], axis=1, level=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Excel source had {excel_atlas.shape[1]} columns.\")\n",
    "print(f\"Test now has {test.shape[1]} columns.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tidy up\n",
    "if 'test' in locals():\n",
    "    del(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2.5 Questions\n",
    "\n",
    "- What data type is used for storing/accessing MultiIndexes? \n",
    "- *Why* is this is the appropriate data type?\n",
    "- How (conceptually) are the header rows in Excel are mapped on to levels in pandas?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1.3: Tidying Up\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low, though there's a lot of dealing with column names.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3.1 Dropping Named Levels\n",
    "\n",
    "There's a *lot* of data in the data frame that we don't need for our Airbnb work, so let's go a bit further with the dropping of column-groups using the MultiIndex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_drop = ['Mid-year Estimate totals','Mid-year Estimates 2012, by age','Religion (2011)',\n",
    "           'Land Area','Lone Parents (2011 Census)','Central Heating (2011 Census)','Health (2011 Census)',\n",
    "           'Low Birth Weight Births (2007-2011)','Obesity','Incidence of Cancer','Life Expectancy',\n",
    "           'Road Casualties']\n",
    "tidy = excel_atlas.drop(to_drop, axis=1, level=0)\n",
    "print(f\"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This should drop you down to `984 x 111`. Notice below that the multi-level _index_ has not changed but the multi-level _values_ remaining have!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"There are {len(tidy.columns.levels[0].unique())} categories.\") # The categories\n",
    "print(f\"But only {len(tidy.columns.get_level_values(0).unique())} values.\") # The actual values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3.2 Selecting Columns using a List Comprehension\n",
    "\n",
    "Now we need to drop all of the percentages from the data set. These can be found at level 1, though they are specified in a number of different ways so you'll need to come up with a way to find them in the level 1 values using a list comprehension... \n",
    "\n",
    "I'd suggest looking for: \"(%)\", \"%\", and \"Percentages\". You may need to check both start and end. You could also use a regular expression here instead of multiple tests. Either *works*, but have a think about the tradeoffs between intelligibility, speed, and what *you* understand...\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: See how regular expressions keep coming <a href=\"https://jreades.github.io/fsds/sessions/week7.html#lectures\">baaaaaaaaack</a>? That said, you can also use simple string functions like <tt>startswith</tt> and <tt>endswith</tt> for this problem.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_drop = [x for x in tidy.columns.get_level_values(1) if (???)]\n",
    "print(to_drop)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print([x for x in tidy.columns.get_level_values(1) if re.search(???, x)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get:\n",
    "```\n",
    "['Percentages', 'Percentages', 'Percentages', 'Percentages', 'Percentages', 'White (%)', 'Mixed/multiple ethnic groups (%)', 'Asian/Asian British (%)', 'Black/African/Caribbean/Black British (%)', 'Other ethnic group (%)', 'BAME (%)', 'United Kingdom (%)', 'Not United Kingdom (%)', '% of people aged 16 and over in household have English as a main language', '% of households where no people in household have English as a main language', 'Owned: Owned outright (%)', 'Owned: Owned with a mortgage or loan (%)', 'Social rented (%)', 'Private rented (%)', 'Household spaces with at least one usual resident (%)', 'Household spaces with no usual residents (%)', 'Whole house or bungalow: Detached (%)', 'Whole house or bungalow: Semi-detached (%)', 'Whole house or bungalow: Terraced (including end-terrace) (%)', 'Flat, maisonette or apartment (%)', 'Economically active %', 'Economically inactive %', '% of households with no adults in employment: With dependent children', '% living in income deprived households reliant on means tested benefit', '% of people aged over 60 who live in pension credit households', 'No cars or vans in household (%)', '1 car or van in household (%)', '2 cars or vans in household (%)', '3 cars or vans in household (%)', '4 or more cars or vans in household (%)']\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3.3 Drop by Level\n",
    "\n",
    "You now need to drop these columns using the `level` keyword as part of your drop command. You have plenty of examples of how to drop values in place, but I'd suggest _first_ getting the command correct (maybe duplicate the cell below and change the code so that the result is saved to a dataframe called `test` before overwriting `tidy`?) and _then_ saving the change."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidy = tidy.drop(to_drop, axis=1, level=???)\n",
    "print(f\"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data frame is now `984 x 76`. This is a _bit_ more manageable though still a _lot_ of data columns. Depending on what you decide to do for your final project you might want to revisit some of the columns that we dropped above... "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3.4 Flattening the Index\n",
    "\n",
    "Although this ia big improvement, you'll have trouble saving or linking this data to other inputs. The problem is that Level 2 of the multi-index is mainly composed of 'Unnamed' values and so we need to merge it with Level 1 to simplify our data frame, and then merge _that_ with level 0..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidy.columns.values[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use code to sort this out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_cols = []\n",
    "for c in tidy.columns.values:\n",
    "    \n",
    "    #print(f\"Column label: {c}\")\n",
    "    l1 = f\"{c[0]}\"\n",
    "    l2 = f\"{c[1]}\"\n",
    "    l3 = f\"{c[2]}\"\n",
    "    \n",
    "    # The new column label\n",
    "    clabel = ''\n",
    "    \n",
    "    # Assemble new label from the levels\n",
    "    if not l1.startswith(\"Unnamed\"):\n",
    "        l1 = l1.replace(\" (2011 Census)\",'').replace(\" (2011)\",'').replace(\"Household \",'').replace(\"House Prices\",'').replace(\"Car or van availability\",'Vehicles').replace(' (2011/12)','')\n",
    "        l1 = l1.replace('Age Structure','Age').replace(\"Ethnic Group\",'').replace('Dwelling type','').replace('Income Estimates','')\n",
    "        clabel += l1\n",
    "    if not l2.startswith(\"Unnamed\"):\n",
    "        l2 = l2.replace(\"Numbers\",'').replace(\" House Price (£)\",'').replace(\"Highest level of qualification: \",'').replace(\"Annual Household Income (£)\",'hh Income').replace('Whole house or bungalow: ','').replace(' qualifications','')\n",
    "        l2 = l2.replace('At least one person aged 16 and over in household has English as a main language',\"1+ English as a main language\").replace(\"No people in household have English as a main language\",\"None have English as main language\")\n",
    "        clabel += ('-' if clabel != '' else '') + l2\n",
    "    if not l3.startswith(\"Unnamed\"):\n",
    "        clabel += ('-' if clabel != '' else '') + l3\n",
    "    \n",
    "    # Replace other commonly-occuring verbiage that inflates column name width\n",
    "    clabel = clabel.replace('--','-').replace(\" household\",' hh').replace('Owned: ','')\n",
    "    \n",
    "    #clabel = clabel.replace(' (2011 Census)','').replace(' (2011)','').replace('Sales - 2011.1','Sales - 2012')\n",
    "    #clabel = clabel.replace('Numbers - ','').replace(' (£)','').replace('Car or van availability','Vehicles')\n",
    "    #clabel = clabel.replace('Household Income Estimates (2011/12) - ','').replace('Age Structure','Age')\n",
    "    \n",
    "    new_cols.append(clabel)\n",
    "\n",
    "print(new_cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "    <b>&#9888; Stop</b>: Make sure you understand what is happening here before just moving on to the next thing. Try adding <tt>print()</tt> statements if it will help it to make sense. This sort of code comes up a <em>lot</em> in the real world.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidy.columns = new_cols  # <- Blow away complex index, replace with simple\n",
    "tidy.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might want to have a look at _what_ the code below drops first before just running it... remember that you can pull apart any complex code into pieces:\n",
    "\n",
    "```python\n",
    "tidy['MSOA Code'].isna()\n",
    "tidy[tidy['MSOA Code'].isna()].index\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidy.drop(index=tidy[tidy['MSOA Code'].isna()].index, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1.4: Add Inner/Outer London Mapping\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate, since I'm not giving you many clues.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: We touched on <tt>lambda</tt> functions last week; it's a 'trivial' function that we don't even want to bother defining with <tt>def</tt>. We also used the lambda function in the context of <tt>apply</tt> so this is just another chance to remind yourself how this works. This is quite advanced Python, so don't panic if you don't get it right away and have to do some Googling... \n",
    "</div>\n",
    "\n",
    "We want to add the borough name and a 'subregion' name. We already have the borough name buried in a *separate* column, so step 1 is to extract that from the MSOA Name. Step 2 is to use the borough name as a lookup to the subregion name using a **lambda** function. The format for a lambda function is usually `lambda x: <code that does something with x and returns a value>`. Hint: you've got a dictionary and you know how to use it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.4.1 Add Boroughs\n",
    "\n",
    "We first need to extract the borough names from one of the existing fields in the data frame... a *regex* that does *replacement* would be fastest and easiest: focus on what you *don't* need from the MSOA Name **string** and **replacing** that using a **regex**..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidy['Borough'] = tidy['MSOA Name'].???\n",
    "tidy.Borough.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get:\n",
    "```\n",
    "array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',\n",
    "       'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',\n",
    "       'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',\n",
    "       'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',\n",
    "       'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',\n",
    "       'Lewisham', 'Merton', 'Newham', 'Redbridge',\n",
    "       'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',\n",
    "       'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.4.2 Map Boroughs to Subregions\n",
    "\n",
    "And now you need to understand how to *apply* the mapping ot the Borough field using a `lambda` function. It's fairly straightforward once you know the syntax: just a dictionary lookup. But as usual, you might want to first create a new cell and experiment with the output from the apply function before using it to write the `Subregion` field of the data frame..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mapping = {}\n",
    "for b in ['Enfield','Waltham Forest','Redbridge','Barking and Dagenham','Havering','Greenwich','Bexley']:\n",
    "    mapping[b]='Outer East and North East'\n",
    "for b in ['Haringey','Islington','Hackney','Tower Hamlets','Newham','Lambeth','Southwark','Lewisham']:\n",
    "    mapping[b]='Inner East'\n",
    "for b in ['Bromley','Croydon','Sutton','Merton','Kingston upon Thames']:\n",
    "    mapping[b]='Outer South'\n",
    "for b in ['Wandsworth','Kensington and Chelsea','Hammersmith and Fulham','Westminster','Camden','City of London']:\n",
    "    mapping[b]='Inner West'\n",
    "for b in ['Richmond upon Thames','Hounslow','Ealing','Hillingdon','Brent','Harrow','Barnet']:\n",
    "    mapping[b]='Outer West and North West'\n",
    "\n",
    "tidy['Subregion'] = tidy.Borough.apply(???)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.4.3 And Save\n",
    "\n",
    "There's a little snipped of useful code to work out here: we need to check if the `clean` directory exists in the `data` directory; if we don't then the `tidy.to_parquet()` call will fail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "if not os.path.???(os.path.join('data','clean')):\n",
    "    os.makedirs(os.path.join('data','clean'))\n",
    "tidy.to_parquet(os.path.join('data','clean','MSOA_Atlas.parquet'))\n",
    "print(\"Done.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.4.4 Questions\n",
    "\n",
    "- What are the advantages to `apply` and `lambda` functions over looping and named functions? \n",
    "- When might you choose a named function over a lambda function?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1.5: Merge Data & Geography\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low, except for plotting.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: We'll cover joins (of which a <tt>merge</tt> is just one type) in the <a href=\"https://jreades.github.io/fsds/sessions/week10.html#lectures\">final week's lectures</a>, but between what you'd done in GIS and what we have here there should be enough here for you to start being able to make sense of how they work so that you don't have to wait until Week 10 to think about how this could help you with your Group Assessment.\n",
    "</div>\n",
    "\n",
    "First, we need to download the MSOA source file, which is a zipped archive of a Shapefile:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Oh look, we can read a Shapefile without needing to unzip it!\n",
    "msoas = gpd.read_file(\n",
    "    cache_data('https://github.com/jreades/fsds/blob/master/data/src/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip?raw=true', \n",
    "               os.path.join('data','geo')), driver='ESRI Shapefile')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.5.1 Identifying Matching Columns\n",
    "\n",
    "Looking at the first few columns of each data frame, which one might allow us to link the two files together? You've done this in GIS. *Remember*: the column *names* don't need to match for us to use them in a join, it's the *values* that matter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Column names: {', '.join(tidy.columns.tolist()[:5])}\")\n",
    "tidy.iloc[:3,:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Column names: {', '.join(msoas.columns.tolist()[:5])}\")\n",
    "msoas.iloc[:3,:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.5.2 Merge\n",
    "\n",
    "One more thing: if you've got more than one choice I'd *always* go with a code over a name because one is intended for matching and other is not..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf = pd.merge(msoas, tidy, left_on=???, right_on=???, how='inner')\n",
    "gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])\n",
    "\n",
    "print(f\"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get `Final data frame has shape 983 x 86`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.5.3 Plot Choropleth\n",
    "\n",
    "Let's plot the median income in 2011 column using the `plasma` colour ramp... The rest is to show you how to customise a legend."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col = 'Median-2011'\n",
    "fig = gdf.plot(column=???, cmap='???', \n",
    "         scheme='FisherJenks', k=7, edgecolor='None', \n",
    "         legend=True, legend_kwds={'frameon':False, 'fontsize':8},\n",
    "         figsize=(9,8));\n",
    "plt.title(col.replace('-',' '));\n",
    "\n",
    "# Now to modify the legend: googling \"geopandas format legend\"\n",
    "# brings me to: https://stackoverflow.com/a/56591102/4041902\n",
    "leg = fig.get_legend()\n",
    "leg._loc = 3\n",
    "\n",
    "for lbl in leg.get_texts():\n",
    "    label_text = lbl.get_text()\n",
    "    [low, hi] = label_text.split(', ')\n",
    "    new_text = f'£{float(low):,.0f} - £{float(hi):,.0f}'\n",
    "    lbl.set_text(new_text)\n",
    "\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.5.4 Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf.to_geoparquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.5.5 Questions\n",
    "\n",
    "- Try changing the colour scheme, classification scheme, and number of classes to see if you feel there's a *better* opeion than the one shown above... Copy the cell (click on anywhere outside the code and then hit `C` to copy. Then click on this cell *once*, and hit `V` to paste."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2. Splitting into Test & Train\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: Here you will be using a standard approach in Machine Learningknown as <em>test/train  split</em>, although we <em>won't</em> be doing any actual ML (&#128517;). Here we are just going to use it to explore some issues raised by normalisation and standardisation in <a href=\"https://jreades.github.io/fsds/sessions/week8.html#lectures\">this week's lectures</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A standard approach to Machine Learning, and something that is becoming more widely used elsewhere, is the splitting of a large data into set into testing and training components. Typically, you would take 80-90% of your data to 'train' your algorithm and withold between 10-20% for validation ('testing'). An even 'stricter' approach, in the sense of trying to ensure the robustness of your model against outlier effects, is [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) such as [k-folds cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).\n",
    "\n",
    "Sci-Kit Learn is probably *the* most important reason Python has become the *de fact* language of data science. Test/train-split is used when to avoid over-fitting when we are trying to **predict** something; so here Sci-Kit Learn _expects_ that you'll have an `X` which is your **predictors** (the inputs to your model) and a `y` which is the thing you're **trying to predict** (because: $y = \\beta X + \\epsilon$). \n",
    "\n",
    "We're not building a model here (that's for Term 2!) so we'll just 'pretend' that we're trying to predict the price of a listing and will set that up as our `y` data set *so that* we can see how the choice of normalisation/standardisation technique affects the robustness of the model against 'new' data. Notice too that you can pass a data frame directly to Sci-Kit Learn and it will split it for you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2.1: Reload\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>&#128161; Tip</b>: In future 'runs' of this notebook you can now just pick up here and skip all of Task 1.\n",
    "</div>\n",
    "\n",
    "On subsequent runs of this notebook you might just want to start here!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Notice this handy code: we check if the data is already\n",
    "# in memory. And notice this is just a list comprehension\n",
    "# to see what is locally loaded.\n",
    "if 'gdf' not in locals():\n",
    "    gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))\n",
    "print(gdf.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categoricals = ['Borough','Subregion']\n",
    "for c in categoricals:\n",
    "    gdf[c] = gdf[c].astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2.2: Split\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low!\n",
    "</div>\n",
    "\n",
    "For our purposes this is a little bit overkill as you could also use pandas' `sample(frac=0.2)` and the indexes, but it's useful to see how this works. You use test/train split to get **four** data sets out: the training data gives you two (predictors + target as separate data sets) and the testing data gives you two as well (predictors + target as separate data sets). These are sized accoridng to the `test_size` specfied in the `test_train_split` parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split \n",
    "\n",
    "pdf = gdf['Median-2011'].copy() # pdf for Median *P*rice b/c we need *something*\n",
    "\n",
    "# df == *data* frame (a.k.a. predictors or independent variables)\n",
    "# pr == *predicted* value (a.k.a. dependent variable)\n",
    "# Notice we don't want the median price included in our training data\n",
    "df_train, df_test, pr_train, pr_test = train_test_split(\n",
    "                gdf.drop(columns=['Median-2011']), pdf, test_size=0.2, random_state=44)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below you should see that the data has been split roughly on the basis of the `test_size` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Original data size: {gdf.shape[0]:,} x {gdf.shape[1]}\")\n",
    "print(f\"  Training data size: {df_train.shape[0]:,} x {df_train.shape[1]} ({(df_train.shape[0]/gdf.shape[0])*100:.0f}%)\")\n",
    "print(f\"  Testing data size:  {df_test.shape[0]:,} x {df_test.shape[1]} ({(df_test.shape[0]/gdf.shape[0])*100:.0f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also notice the indexes of each pair of data sets match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\", \".join([str(x) for x in df_train.index[:10]]))\n",
    "print(\", \".join([str(x) for x in pr_train.index[:10]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2.3: Plot Test/Train Data\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low, but important!\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "boros = gpd.read_file(os.path.join('data','geo','Boroughs.gpkg'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f,axes = plt.subplots(1,2, figsize=(12,5))\n",
    "df_train.plot(ax=???)\n",
    "df_test.plot(ax=???)\n",
    "boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n",
    "boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n",
    "axes[0].set_title('Training Data')\n",
    "axes[1].set_title('Testing Data');\n",
    "axes[0].set_ylim(150000,210000)\n",
    "axes[1].set_ylim(150000,210000)\n",
    "axes[0].set_xlim(500000,565000)\n",
    "axes[1].set_xlim(500000,565000)\n",
    "axes[1].set_yticks([]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3.1 Questions\n",
    "\n",
    "- Why might it be useful to produce a *map* of a test/train split? \n",
    "- Why might it matter *more* if you were dealing with user locations or behaviours?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3. Normalisation\n",
    "\n",
    "The developers of [SciKit-Learn](https://scikit-learn.org/) define [normalisation](https://scikit-learn.org/stable/modules/preprocessing.html#normalization) as \"scaling individual samples to have **unit norm**.\" There are a _lot_ of subtleties to this when you start dealing with 'sparse' data, but for the most part it's worthwhile to think of this as a rescaling of the raw data to have similar ranges in order achieve some kind of comparison. This is such a common problem that sklearn offers a range of such (re)scalers including: `MinMaxScaler`.\n",
    "\n",
    "Let's see what effect this has on the data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sets some handy 'keywords' to tweak the Seaborn plot\n",
    "kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n",
    "\n",
    "# Set the *hue order* so that all plots have the *same*\n",
    "# colour on the Subregion\n",
    "ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 3.1: Select Columns\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>\n",
    "\n",
    "One thing you'll need to explain is why I keep writing `df[cols+['Subregion']` and why I don't just add it to the `cols` variable at the start? Don't try to answer this now, get through the rest of Tasks 3 and 4 and see what you think."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['Tenure-Owned outright', 'Tenure-Owned with a mortgage or loan',\n",
    "        'Tenure-Social rented', 'Tenure-Private rented']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Answer*: one part of the answer is that it makes it easy to change the columns we select without having to remember to keep `Subregion`, but the more important reason is that it allows us to re-use *this* 'definition' of `cols` elsewhere throughout the rest of this practical without needing to remember to *remove* `Subregion`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr_raw  = df_train[cols+['Subregion']].copy() # train raw\n",
    "tst_raw = df_test[cols+['Subregion']].copy()  # test raw"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 3.2: Fit to Data\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate if you want to understand what <tt>reshape</tt> is doing.\n",
    "</div>\n",
    "\n",
    "Fit the training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Notice what this is doing! See if you can explain it clearly.\n",
    "scalers = [???.fit(???[x].values.reshape(-1,1)) for x in cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 3.3: Apply Transformations\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "Train:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr_normed = tr_raw.copy()\n",
    "\n",
    "for i, sc in enumerate(scalers):\n",
    "    # Ditto this -- can you explain what this code is doing\n",
    "    tr_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tst_normed = tst_raw.copy()\n",
    "\n",
    "for i, sc in enumerate(scalers):\n",
    "    tst_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: You don't <em>have</em> to fully understand the next section, but if you are able to get your head around it then this will seriously help you to prepare for the use of more advanced techniques in modelling and programming.\n",
    "</div>\n",
    "\n",
    "Check out the properties of `tst_normed` below. If you've understood what the `MinMaxScaler` is doing then you should be able to spot something unexpected in the transformed *test* outputs. If you've *really* understood this, you'll see why this result is problematic for *models*. *Hint*: one way to think of it is an issue of **extrapolation**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for c in cols:\n",
    "    print(f\"  Minimum: {tst_normed[c].min():.4f}\")\n",
    "    ???"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 3.4: Plot Distributions\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr_raw.columns     = [re.sub('(-|/)',\"\\n\",x) for x in tr_raw.columns.values]\n",
    "tst_raw.columns    = [re.sub('(-|/)',\"\\n\",x) for x in tst_raw.columns.values]\n",
    "tr_normed.columns  = [re.sub('(-|/)',\"\\n\",x) for x in tr_normed.columns.values]\n",
    "tst_normed.columns = [re.sub('(-|/)',\"\\n\",x) for x in tst_normed.columns.values]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(data=tr_raw, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(data=tr_normed, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4.1 Questions\n",
    "\n",
    "- Why do I keep writing `df[cols+['Subregion']`? Why I don't just add Subregions to the `cols` variable at the start?\n",
    "- What has changed between the two plots (of `tr_raw` and `tr_normed`)?\n",
    "- What is the _potential_ problem that the use of the transformer fitted on `tr_normed` to data from `tst_normed` might cause? *Hint*: this is why I asked you to investigate the data in the empty code cell above.\n",
    "- Can you explain what this is doing: `[MinMaxScaler().fit(df_train[x].values.reshape(-1,1)) for x in cols]`?\n",
    "- Can you explain what *this* is doing: `sc.transform(df_test[cols[i]].values.reshape(-1,1))`?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 4. Standardisation \n",
    "\n",
    "Standardisation is typically focussed on rescaling data to have a mean (or median) of 0 and standard deviation or IQR of 1. That these approaches are conceptually tied to the idea of symmetric, unimodal data such as that encountered in the standard normal distribution. Rather confusingly, many data scientists will use standardisation and normalisation largely interchangeably!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col = 'Vehicles-No cars or vans in hh'\n",
    "tr  = df_train[[col]].copy()\n",
    "tst = df_test[[col]].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 4.1: Z-Score Standardisation\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stsc = StandardScaler().fit(tr[col].values.reshape(-1,1))\n",
    "\n",
    "tr[f\"Z. {col}\"]  = stsc.transform(???)\n",
    "tst[f\"Z. {col}\"] = stsc.transform(???)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 4.2: Inter-Quartile Standardisation\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rs = ???(quantile_range=(25.0, 75.0)).fit(???)\n",
    "\n",
    "tr[f\"IQR. {col}\"] = rs.transform(???)\n",
    "tst[f\"IQR. {col}\"] = rs.transform(???)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 4.3: Plot Distributions\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: The point of these next plots is simply to show that *linear* transformations (which are 'reversible') is about changing things like the magnitude/scale of our data but doesn't fundamentally change relationships *between* observations.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(data=tr, x=f\"{col}\", y=f\"Z. {col}\", kind='kde'); # hex probably not the best choice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(data=tr, x=f\"{col}\", y=f\"IQR. {col}\", kind='kde'); # hex probably not the best choice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(data=tr, x=f\"Z. {col}\", y=f\"IQR. {col}\", kind='hex'); # hex probably not the best choice"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perhaps a little more useful..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.kdeplot(tr[f\"Z. {col}\"])\n",
    "sns.kdeplot(tr[f\"IQR. {col}\"], color='r', ax=ax)\n",
    "plt.legend(loc='upper right', labels=['Standard', 'Robust']) # title='Foo'\n",
    "ax.ticklabel_format(useOffset=False, style='plain')\n",
    "ax.set_xlabel(\"Standardised Value for No cars or vans in hh\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2.1 Questions?\n",
    "\n",
    "- Can you see the differences between these two rescalers? \n",
    "- Can you explain why you might want to choose one over the other?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 5. Non-Linear Transformations\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: Now *these* transformations are not directly revsersible because they change the actual relationships between observations in the data. We use these when we need to achieve particular 'behaviours' from our data in order to improve our model's 'predictive' ability. Even a simple regression is doing a kind of prediction ($y = \\beta X + \\epsilon$). You covered these concepts in <a href=\"https://jreades.github.io/fsds/sessions/week8.html#lectures\">this week's lectures</a>.\n",
    "</div>\n",
    "\n",
    "So transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more *tractable* for subsequent analysis. In other words, it's all about the _context_ of our data.\n",
    "\n",
    "[![How tall is tall?](http://img.youtube.com/vi/-VjcCr4uM6w/0.jpg)](http://www.youtube.com/watch?v=-VjcCr4uM6w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From above, we know the _Median Income_ data are _not_ normally distributed, but can we work out what distribution best represents _Median Income_? This can be done by comparing the shape of the histogram to the shapes of theoretical distributitions. For example:\n",
    "\n",
    "- the [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) distribution\n",
    "- the [exponential](https://en.wikipedia.org/wiki/Exponential_distribution) distribution\n",
    "- the [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution) distribution (for non-continuous data)\n",
    " \n",
    "From looking at those theoretical distributions, we might make an initial guess as to the type of distribution. There are actually _many_ other distributions encountered in real life data, but these ones are particuarly common. A wider view of this would be that [quantile and power transformations](https://scikit-learn.org/stable/modules/preprocessing.html#non-linear-transformation) are ways of preserving the rank of values but lose many of the other features of the relationships that might be preserved by, for instance, the standard scaler.\n",
    "\n",
    "In the case of Median Income, taking a log-transform of the data might make it _appear_ more normal: you do **not** say that a transformation _makes_ data more normal, you say either that 'it allows us to treat the data as normally distributed' or that 'the transformed data follows a log-normal distribution'."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 5.1: The Normal Distribution\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "Z-scores are often associated with the normal distribution because their interpretation implicitly assumes a normal distribution. Or to put it another way... You can always calculate z-scores for your data (it's just a formula applied to data points), but their _intuitive meaning_ is lost if your data don't have something like a normal distribution (or follow the [68-95-99.7 rule](https://en.wikipedia.org/wiki/68–95–99.7_rule)).\n",
    "\n",
    "But... what if our data are non-normal? Well, Just because data are non-normal doesn't mean z-scores can't be calculated (we already did that above); we just have to be careful what we do with them... and sometimes we should just avoid them entirely. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.1.1 Creating a Normal Distribution\n",
    "\n",
    "Below is a function to create that theoretical normal distribution. See if you can understand what's going and add comments to the code to explain what each line does. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def normal_from_dist(series): \n",
    "    mu = ???          # mean of our data\n",
    "    sd = ???          # standard deviation of our data\n",
    "    n  = ???          # count how many observations are in our data\n",
    "    s = np.random.normal(???, ???, ???)   #use the parameters of the data just calculated to generate n random numbers, drawn from a normal distributions \n",
    "    return s                   #return this set of random numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make it easier to understand what the function above is doing, let's use it! We'll use the function to plot both a distribution plot with both histogram and KDE for our data, and then add a _second_ overplot distplot to the same fig showing the theoretical normal distribution (in red). We'll do this in a loop for each of the three variables we want to examine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.1.2 Visual Comparisons\n",
    "\n",
    "Looking at the output, which of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation _most_ appropriate as measures of centrality and spread?\n",
    "\n",
    "*Also*, how would you determine the _meaning_ of some of the departures from the normal distribution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "selection = [x for x in df_train.columns.values if x.startswith('Composition')]\n",
    "\n",
    "for c in selection:\n",
    "    ax = sns.kdeplot(df_train[c])\n",
    "    sns.kdeplot(normal_from_dist(df_train[c]), color='r', fill=True, ax=ax)\n",
    "    plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title='Foo'\n",
    "    ax.ticklabel_format(useOffset=False, style='plain')\n",
    "    if ax.get_xlim()[1] > 999999:\n",
    "        plt.xticks(rotation=45)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.1.3 Questions\n",
    "\n",
    "- Which, if any, of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation _most_ appropriate as measures of centrality and spread?\n",
    "- How might you determine the _significance_ of some of the departures from the normal distribution?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 5.2: Logarithmic Transformations\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "To create a new series in the data frame containing the natural log of the original value it’s a similar process to what we've done before, but since pandas doesn't provide a log-transform operator (i.e. you can’t call `df['MedianIncome'].log()` ) we need to use the `numpy` package since pandas data series are just numpy arrays with some fancy window dressing that makes them even _more_ useful.\n",
    "\n",
    "Let's perform the transform then compare to the un-transformed data. Comment the code below to ensure that you understand what it is doing. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.2.1 Apply and Plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['Median-2012','Total Mean hh Income']\n",
    "\n",
    "for m in cols:\n",
    "    s  = df_train[m] # s == series\n",
    "    ts = ???.???(s)   # ts == transformed series\n",
    "    \n",
    "    ax = sns.kdeplot(s)\n",
    "    sns.kdeplot(normal_from_dist(s), color='r', fill=True, ax=ax)\n",
    "    plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title also an option\n",
    "    plt.title(\"Original Data\")\n",
    "    \n",
    "    ### USEFUL FORMATTING TRICKS ###\n",
    "    # This turns off scientific notation in the ticklabels\n",
    "    ax.ticklabel_format(useOffset=False, style='plain')\n",
    "    # Notice this snippet of code\n",
    "    ax.set_xlabel(ax.get_xlabel() + \" (Raw Distribution)\")\n",
    "    # Notice this little code snippet too\n",
    "    if ax.get_xlim()[1] > 999999:\n",
    "        plt.xticks(rotation=45)\n",
    "    \n",
    "    plt.show()\n",
    "    \n",
    "    ax = sns.kdeplot(ts)\n",
    "    sns.kdeplot(normal_from_dist(ts), color='r', fill=True, ax=ax)\n",
    "    plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n",
    "    ax.ticklabel_format(useOffset=False, style='plain')\n",
    "    ax.set_xlabel(ax.get_xlabel() + \" (Logged Distribution)\")\n",
    "    if ax.get_xlim()[1] > 999999:\n",
    "        plt.xticks(rotation=45)\n",
    "    plt.title(\"Log-Transformed Data\")\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hopefully, you can see that the transformed data do indeed look 'more normal'; the peak of the red and blue lines are closer together and the blue line at the lower extreme is also closer to the red line, but we can check this by seeing what has happened to the z-scores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 5.3: Power Transformations \n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['Median-2012','Total Mean hh Income']\n",
    "pt = ???(method='yeo-johnson')\n",
    "\n",
    "for m in cols:\n",
    "    s  = df_train[m] # s == series\n",
    "    ts = pt.fit_transform(s.values.reshape(-1,1))\n",
    "    print(f\"Using lambda (transform 'exponent') of {pt.lambdas_[0]:0.5f}\")\n",
    "    \n",
    "    ax = sns.kdeplot(ts.reshape(-1,))\n",
    "    \n",
    "    sns.kdeplot(normal_from_dist(???), color='r', fill=True, ax=ax)\n",
    "    plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n",
    "    ax.ticklabel_format(useOffset=False, style='plain')\n",
    "    ax.set_xlabel(m + \" (Transformed Distribution)\")\n",
    "    if ax.get_xlim()[1] > 999999: # <-- What does this do?\n",
    "        plt.xticks(rotation=45)\n",
    "    plt.title(\"Power-Transformed Data\")\n",
    "    plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 6. Principal Components Analysis\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#128279; Connections</b>: This is all about <em>dimensionality</em> and the different ways that we can reduce dimensionality in our data in order to improve our models' robustness and gain a better understanding of whether (or not) there is structure in our data. We'll start with linear decomposition and then look at non-linear decomposition, which we also demonstrated with UMAP last week.\n",
    "</div>\n",
    "\n",
    "Now we're going to ask the question: how can we represent our data using a smaller number of components that capture the variance in the original data. You should have covered PCA in Quantitative Methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional Reload\n",
    "\n",
    "Use this is your data gets messy..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet')).set_index('MSOA Code')\n",
    "print(gdf.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categoricals = ['Borough','Subregion']\n",
    "for c in categoricals:\n",
    "    gdf[c] = gdf[c].astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.1: Calculating Shares\n",
    "\n",
    "<div class=\"alert alert-block alert-danger\">\n",
    "    <b>Difficulty level</b>: Hard.\n",
    "</div>\n",
    "\n",
    "Sadly, there's no transformer to work this out for you automatically, but let's start by converting the raw population and household figures to shares so that our later dimensionality reduction steps aren't impacted by the size of the MSOA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf[['Age-All Ages','Households-All Households']].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.1.1 Specify Totals Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_pop = gdf['Age-All Ages']\n",
    "total_hh  = gdf['Households-All Households']\n",
    "total_vec = gdf['Vehicles-Sum of all cars or vans in the area']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.1.2 Specify Columns for Pop or HH Normalisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pop_cols  = ['Age-', 'Composition-', 'Qualifications-', 'Economic Activity-', 'White', 'Mixed/multiple',\n",
    "             'Asian/Asian British', 'Black/African', 'BAME', 'Other ethnic',\n",
    "             'Country of Birth-']\n",
    "hh_cols   = [???, ???, ???, 'Detached', 'Semi-detached', 'Terraced', 'Flat, ']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "popre = re.compile(r'^(?:' + \"|\".join(pop_cols) + r')')\n",
    "hhre  = re.compile(r'^(?:' + \"|\".join(???) + r')')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.1.3 Apply to Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr_gdf = gdf.copy()\n",
    "tr_gdf['Mean hh size'] = tr_gdf['Age-All Ages']/tr_gdf['Households-All Households']\n",
    "\n",
    "for c in gdf.columns:\n",
    "    print(c)\n",
    "    if popre.match(c):\n",
    "        print(\"  Normalising by total population.\")\n",
    "        tr_gdf[c] = gdf[c]/???\n",
    "    elif ???.match(???):\n",
    "        print(\"  Normalising by total households.\")\n",
    "        tr_gdf[c] = gdf[c]/???\n",
    "    elif c.startswith('Vehicles-') and not c.startswith('Vehicles-Cars per hh'):\n",
    "        print(\"  Normalising by total vehicles.\")\n",
    "        tr_gdf[c] = gdf[c]/???\n",
    "    else:\n",
    "        print(\"  Passing through.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.2: Removing Columns\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "To perform dimensionality we can only have numeric data. In theory, categorical data can be converted to numeric and retained, but there are two issues:\n",
    "\n",
    "1. Nominal data has no _innate_ order so we _can't_ convert > 2 categories to numbers and have to convert them to One-Hot Encoded values.\n",
    "2. A binary (i.e. One-Hot Encoded) variable will account for a _lot_ of variance in the data because it's only two values they are 0 and 1!\n",
    "\n",
    "So in practice, it's probably a good idea to drop categorical data if you're planning to use PCA."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.2.1 Drop Totals Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf = tr_gdf.drop(columns=['Age-All Ages', 'Households-All Households',\n",
    "                             'Vehicles-Sum of all cars or vans in the area'])\n",
    "pcadf = pcadf.set_index('MSOA Code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.2.2 Drop Non-Numeric Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf.select_dtypes(['category','object']).columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf.drop(columns=pcadf.select_dtypes(['category','object']).columns.to_list(), inplace=True)\n",
    "pcadf.drop(columns=['BNG_E','BNG_N','geometry', 'LONG', 'LAT','Shape__Are', 'Shape__Len'], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.3: Rescale & Reduce\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "In order to ensure that our results aren't dominated by a single scale (e.g. House Prices!) we need to rescale all of our data. You could easily try different scalers as well as a different parameters to see what effect this has on your results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.3.1 Robustly Rescale\n",
    "\n",
    "Set up the Robust Rescaler for inter-decile standardisation: 10th and 90th quantiles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rs = ???\n",
    "\n",
    "for c in pcadf.columns.values:\n",
    "    pcadf[c] = rs.fit_transform(pcadf[c].values.reshape(-1, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.3.2 PCA Reduce"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA \n",
    "\n",
    "pca = PCA(n_components=50, whiten=True) \n",
    "\n",
    "pca.fit(pcadf)\n",
    "\n",
    "explained_variance = pca.explained_variance_ratio_\n",
    "singular_values = pca.singular_values_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.3.3 Examine Explained Variance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(1,???)\n",
    "plt.plot(x, explained_variance)\n",
    "plt.ylabel('Share of Variance Explained')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(0, 20):\n",
    "    print(f\"Component {i:>2} accounts for {explained_variance[i]*100:>2.2f}% of variance\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get that Component 0 accounts for 31.35% of the variance and Component 19 accounts for 0.37%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6.3.4: How Many Components?\n",
    "\n",
    "There are a number of ways that we could set a threshold for dimensionality reduction: \n",
    "- The most common is to look for the 'knee' in the Explained Variance plot above. That would put us at about 5 retained components.\n",
    "- Another is to just keep all components contributing more than 1% of the variance. That would put us at about 10 components.\n",
    "- You can also ([I discovered](https://medium.com/@nikolay.oskolkov/hi-jon-reades-my-sincere-apologies-for-this-very-late-reply-444f57054d14)) look to shuffle the data and repeatedly perform PCA to build confidence intervals. I have not implemented this (yet).\n",
    "\n",
    "In order to _do_ anything with these components we need to somehow reattach them to the MSOAs. So that entails taking the transformed results (`X_train` and `X_test`) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kn = knee_locator.KneeLocator(x, explained_variance, \n",
    "                              curve='convex', direction='decreasing', \n",
    "                              interp_method='interp1d')\n",
    "print(f\"Knee detected at: {kn.knee}\")\n",
    "kn.plot_knee()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "keep_n_components = 7\n",
    "\n",
    "# If we weren't changing the number of components we\n",
    "# could re-use the pca object created above. \n",
    "pca = PCA(n_components=keep_n_components, whiten=True)\n",
    "\n",
    "X_train = pca.fit_transform(???)\n",
    "\n",
    "# Notice that we get the _same_ values out,\n",
    "# so this is a *deterministic* process that\n",
    "# is fully replicable (allowing for algorithmic\n",
    "# and programming language differences).\n",
    "print(f\"Total explained variance: {pca.explained_variance_ratio_.sum()*100:2.2f}%\")\n",
    "for i in range(0, keep_n_components):\n",
    "    print(f\"  Component {i:>2} accounts for {pca.explained_variance_ratio_[i]*100:>5.2f}% of variance\")\n",
    "\n",
    "# Notice...\n",
    "print(f\"X-train shape: {len(X_train)}\")\n",
    "print(f\"PCA df shape: {pcadf.shape[0]}\")\n",
    "# So each observation has a row in X_train and there is \n",
    "# 1 column for each component. This defines the mapping\n",
    "# of the original data space into the reduced one\n",
    "print(f\"Row 0 of X-train contains {len(X_train[0])} elements.\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.4: Components to Columns\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "You could actually do this more quickly (but less clearly) using `X_train.T` to transpose the matrix!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(0,keep_n_components):\n",
    "    s = pd.Series(X_train[:,???], index=pcadf.???)\n",
    "    pcadf[f\"Component {i+1}\"] = s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf.sample(3).iloc[:,-10:-4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.7: (Re)Attaching GeoData\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "msoas = gpd.read_file(os.path.join('data','geo','Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip'), driver='ESRI Shapefile')\n",
    "msoas = msoas.set_index('MSOA11CD')\n",
    "print(msoas.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "msoas.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcadf.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpcadf = pd.merge(msoas.set_index(['MSOA11CD'], drop=True), pcadf, left_index=True, right_index=True, how='inner')\n",
    "print(f\"Geo-PCA df has shape {gpcadf.shape[0]} x {gpcadf.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get `PCA df has shape 983 x 89`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpcadf['Borough'] = gpcadf.MSOA11NM.apply(???)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 6.8: Map the First _n_ Components\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>Difficulty level</b>: Moderate.\n",
    "</div>\n",
    "\n",
    "How would you automate this so that the loop creates one plot for each of the first 3 components? How do you interpret these?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for comp in [f\"Component {x}\" for x in range(1,3)]:\n",
    "    ax = gpcadf.plot(column=???, cmap='plasma', \n",
    "         scheme='FisherJenks', k=7, edgecolor='None', legend=True, figsize=(9,7));\n",
    "    boros.plot(ax=ax, edgecolor='w', facecolor='none', linewidth=1, alpha=0.7)\n",
    "    ax.set_title(f'PCA {comp}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your first component map should look something like this:\n",
    "    \n",
    "![PCA Component 1](https://github.com/jreades/fsds/raw/master/practicals/img/PCA-Component-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 7. UMAP\n",
    "\n",
    "UMAP is a non-linear dimensionality reduction technique. Technically, it's called *manifold* learning: imagine being able to roll a piece of paper up in more than just the 3rd dimension...). As a way to see if there is structure in your data this is a *much* better technique than one you might encounter in many tutorials: *t-SNE*. It has to do with how the two techniques 'learn' the manifold to use with your data. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 7.1: UMAP on Raw Data\n",
    "\n",
    "<div class=\"alert alert-block alert-danger\">\n",
    "    <b>Difficulty level</b>: Hard.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from umap import UMAP\n",
    "\n",
    "# You might want to experiment with all\n",
    "# 3 of these values -- it may make sense \n",
    "# to package a lot of this up into a function!\n",
    "keep_dims=2\n",
    "rs=42\n",
    "\n",
    "u = UMAP(\n",
    "    n_neighbors=25,\n",
    "    min_dist=0.01,\n",
    "    n_components=keep_dims,\n",
    "    random_state=rs)\n",
    "\n",
    "X_embedded = u.fit_transform(???)\n",
    "print(X_embedded.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 7.2: Write to Data Frame\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>\n",
    "\n",
    "Can probably also be solved using `X_embedded.T`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for ix in range(0,X_embedded.shape[1]):\n",
    "    print(ix)\n",
    "    s = pd.Series(X_embedded[:,???], index=pcadf.???)\n",
    "    gpcadf[f\"Dimension {ix+1}\"] = s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 7.3: Visualise!\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rddf = gpcadf.copy() # Reduced Dimension Data Frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.3.1 Simple Scatter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f,ax = plt.subplots(1,1,figsize=(8,6))\n",
    "sns.scatterplot(x=rddf[???], y=rddf[???], hue=rddf['Borough'], legend=False, ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.3.2 Seaborn Jointplot\n",
    "\n",
    "That is *suggestive* of there being struccture in the data, but with 983 data points and 33 colours it's hard to make sense of what the structure *might* imply. Let's try this again using the Subregion instead and taking advantage of the Seaborn visualisation library's `jointplot` (joint distribution plot):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rddf['Subregion'] = rddf.Borough.apply(lambda x: mapping[x])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sets some handy 'keywords' to tweak the Seaborn plot\n",
    "kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n",
    "# Set the *hue order* so that all plots have some colouring by Subregion\n",
    "ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.jointplot(data=rddf, x=???, y=???, height=8, \n",
    "                  hue=???, hue_order=ho, joint_kws=kwds)\n",
    "g.ax_joint.legend(loc='upper right', prop={'size': 8});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your jointplot should look like this:\n",
    "\n",
    "![UMAP Jointplot](https://github.com/jreades/fsds/raw/master/practicals/img/UMAP-Jointplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What do you make of this?\n",
    "\n",
    "Maybe let's give this one last go splitting the plot out by subregion so that we can see how these vary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for r in rddf.Subregion.unique():\n",
    "    g = sns.jointplot(data=rddf[rddf.Subregion==r], x='Dimension 1', y='Dimension 2', \n",
    "                  hue='Borough', joint_kws=kwds)\n",
    "    g.ax_joint.legend(loc='upper right', prop={'size': 8});\n",
    "    g.ax_joint.set_ylim(0,15)\n",
    "    g.ax_joint.set_xlim(0,15)\n",
    "    plt.suptitle(r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can't unfortunately do any clustering at this point to create groups from the data (that's next week!) so for now note that there are several large-ish groups (in terms of membership) and few small ones picked up by t-SNE. Alos note that there is strong evidence of some incipient structure: Inner East and West largely clump together, while Outher East and Outer South also seem to group together, with Outer West being more distinctive. If you look back at the PCA Components (especially \\#1) you might be able to speculate about some reasons for this! Please note: this is _only_ speculation at this time!\n",
    "\n",
    "Next week we'll also add the listings data back in as part of the picture!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 7.4: Map the _n_ Dimensions\n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>Difficulty level</b>: Low.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for comp in [f\"Dimension {x}\" for x in range(1,3)]:\n",
    "    f, ax = plt.subplots(1,1,figsize=(12,8))\n",
    "    rddf.plot(???);\n",
    "    boros.plot(edgecolor='w', facecolor='none', linewidth=1, alpha=0.7, ax=ax)\n",
    "    ax.set_title(f'UMAP {comp}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your first dimension map should look something like this:\n",
    "\n",
    "![UMAP Dimension 1](https://github.com/jreades/fsds/raw/master/practicals/img/UMAP-Dimension-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 7.5: And Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rddf.to_parquet(os.path.join('data','clean','Reduced_Dimension_Data.geoparquet'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.5.1 Questions\n",
    "\n",
    "- How would you compare/contrast PCA components with UMAP dimensions? Why do they not seem to show the same thing even though *both* seem to show *something*?\n",
    "- What might you do with the output of either the PCA or UMAP processes?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Credits!\n",
    "\n",
    "#### Contributors:\n",
    "The following individuals have contributed to these teaching materials: Jon Reades (j.reades@ucl.ac.uk).\n",
    "\n",
    "#### License\n",
    "These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n",
    "\n",
    "#### Potential Dependencies:\n",
    "This notebook may depend on the following libraries: pandas, geopandas, sklearn, matplotlib, seaborn"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}