{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
`. Hint: you've got a dictionary and you know how to use it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.1 Add Boroughs\n",
"\n",
"We first need to extract the borough names from one of the existing fields in the data frame... a *regex* that does *replacement* would be fastest and easiest: focus on what you *don't* need from the MSOA Name **string** and **replacing** that using a **regex**..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tidy['Borough'] = tidy['MSOA Name'].???\n",
"tidy.Borough.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should get:\n",
"```\n",
"array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',\n",
" 'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',\n",
" 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',\n",
" 'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',\n",
" 'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',\n",
" 'Lewisham', 'Merton', 'Newham', 'Redbridge',\n",
" 'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',\n",
" 'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.2 Map Boroughs to Subregions\n",
"\n",
"And now you need to understand how to *apply* the mapping ot the Borough field using a `lambda` function. It's fairly straightforward once you know the syntax: just a dictionary lookup. But as usual, you might want to first create a new cell and experiment with the output from the apply function before using it to write the `Subregion` field of the data frame..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mapping = {}\n",
"for b in ['Enfield','Waltham Forest','Redbridge','Barking and Dagenham','Havering','Greenwich','Bexley']:\n",
" mapping[b]='Outer East and North East'\n",
"for b in ['Haringey','Islington','Hackney','Tower Hamlets','Newham','Lambeth','Southwark','Lewisham']:\n",
" mapping[b]='Inner East'\n",
"for b in ['Bromley','Croydon','Sutton','Merton','Kingston upon Thames']:\n",
" mapping[b]='Outer South'\n",
"for b in ['Wandsworth','Kensington and Chelsea','Hammersmith and Fulham','Westminster','Camden','City of London']:\n",
" mapping[b]='Inner West'\n",
"for b in ['Richmond upon Thames','Hounslow','Ealing','Hillingdon','Brent','Harrow','Barnet']:\n",
" mapping[b]='Outer West and North West'\n",
"\n",
"tidy['Subregion'] = tidy.Borough.apply(???)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.3 And Save\n",
"\n",
"There's a little snipped of useful code to work out here: we need to check if the `clean` directory exists in the `data` directory; if we don't then the `tidy.to_parquet()` call will fail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if not os.path.???(os.path.join('data','clean')):\n",
" os.makedirs(os.path.join('data','clean'))\n",
"tidy.to_parquet(os.path.join('data','clean','MSOA_Atlas.parquet'))\n",
"print(\"Done.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.4 Questions\n",
"\n",
"- What are the advantages to `apply` and `lambda` functions over looping and named functions? \n",
"- When might you choose a named function over a lambda function?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 1.5: Merge Data & Geography\n",
"\n",
"\n",
" Difficulty level: Low, except for plotting.\n",
"\n",
"\n",
"\n",
" 🔗 Connections: We'll cover joins (of which a merge is just one type) in the final week's lectures, but between what you'd done in GIS and what we have here there should be enough here for you to start being able to make sense of how they work so that you don't have to wait until Week 10 to think about how this could help you with your Group Assessment.\n",
"\n",
"\n",
"First, we need to download the MSOA source file, which is a zipped archive of a Shapefile:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Oh look, we can read a Shapefile without needing to unzip it!\n",
"msoas = gpd.read_file(\n",
" cache_data('https://github.com/jreades/fsds/blob/master/data/src/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip?raw=true', \n",
" os.path.join('data','geo')), driver='ESRI Shapefile')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.5.1 Identifying Matching Columns\n",
"\n",
"Looking at the first few columns of each data frame, which one might allow us to link the two files together? You've done this in GIS. *Remember*: the column *names* don't need to match for us to use them in a join, it's the *values* that matter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Column names: {', '.join(tidy.columns.tolist()[:5])}\")\n",
"tidy.iloc[:3,:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Column names: {', '.join(msoas.columns.tolist()[:5])}\")\n",
"msoas.iloc[:3,:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.5.2 Merge\n",
"\n",
"One more thing: if you've got more than one choice I'd *always* go with a code over a name because one is intended for matching and other is not..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf = pd.merge(msoas, tidy, left_on=???, right_on=???, how='inner')\n",
"gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])\n",
"\n",
"print(f\"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should get `Final data frame has shape 983 x 86`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.5.3 Plot Choropleth\n",
"\n",
"Let's plot the median income in 2011 column using the `plasma` colour ramp... The rest is to show you how to customise a legend."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col = 'Median-2011'\n",
"fig = gdf.plot(column=???, cmap='???', \n",
" scheme='FisherJenks', k=7, edgecolor='None', \n",
" legend=True, legend_kwds={'frameon':False, 'fontsize':8},\n",
" figsize=(9,8));\n",
"plt.title(col.replace('-',' '));\n",
"\n",
"# Now to modify the legend: googling \"geopandas format legend\"\n",
"# brings me to: https://stackoverflow.com/a/56591102/4041902\n",
"leg = fig.get_legend()\n",
"leg._loc = 3\n",
"\n",
"for lbl in leg.get_texts():\n",
" label_text = lbl.get_text()\n",
" [low, hi] = label_text.split(', ')\n",
" new_text = f'£{float(low):,.0f} - £{float(hi):,.0f}'\n",
" lbl.set_text(new_text)\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.5.4 Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf.to_geoparquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.5.5 Questions\n",
"\n",
"- Try changing the colour scheme, classification scheme, and number of classes to see if you feel there's a *better* opeion than the one shown above... Copy the cell (click on anywhere outside the code and then hit `C` to copy. Then click on this cell *once*, and hit `V` to paste."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2. Splitting into Test & Train\n",
"\n",
"\n",
" 🔗 Connections: Here you will be using a standard approach in Machine Learningknown as test/train split, although we won't be doing any actual ML (😅). Here we are just going to use it to explore some issues raised by normalisation and standardisation in this week's lectures.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A standard approach to Machine Learning, and something that is becoming more widely used elsewhere, is the splitting of a large data into set into testing and training components. Typically, you would take 80-90% of your data to 'train' your algorithm and withold between 10-20% for validation ('testing'). An even 'stricter' approach, in the sense of trying to ensure the robustness of your model against outlier effects, is [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) such as [k-folds cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).\n",
"\n",
"Sci-Kit Learn is probably *the* most important reason Python has become the *de fact* language of data science. Test/train-split is used when to avoid over-fitting when we are trying to **predict** something; so here Sci-Kit Learn _expects_ that you'll have an `X` which is your **predictors** (the inputs to your model) and a `y` which is the thing you're **trying to predict** (because: $y = \\beta X + \\epsilon$). \n",
"\n",
"We're not building a model here (that's for Term 2!) so we'll just 'pretend' that we're trying to predict the price of a listing and will set that up as our `y` data set *so that* we can see how the choice of normalisation/standardisation technique affects the robustness of the model against 'new' data. Notice too that you can pass a data frame directly to Sci-Kit Learn and it will split it for you."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2.1: Reload\n",
"\n",
"\n",
" 💡 Tip: In future 'runs' of this notebook you can now just pick up here and skip all of Task 1.\n",
"\n",
"\n",
"On subsequent runs of this notebook you might just want to start here!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Notice this handy code: we check if the data is already\n",
"# in memory. And notice this is just a list comprehension\n",
"# to see what is locally loaded.\n",
"if 'gdf' not in locals():\n",
" gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))\n",
"print(gdf.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"categoricals = ['Borough','Subregion']\n",
"for c in categoricals:\n",
" gdf[c] = gdf[c].astype('category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2.2: Split\n",
"\n",
"\n",
" Difficulty level: Low!\n",
"\n",
"\n",
"For our purposes this is a little bit overkill as you could also use pandas' `sample(frac=0.2)` and the indexes, but it's useful to see how this works. You use test/train split to get **four** data sets out: the training data gives you two (predictors + target as separate data sets) and the testing data gives you two as well (predictors + target as separate data sets). These are sized accoridng to the `test_size` specfied in the `test_train_split` parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split \n",
"\n",
"pdf = gdf['Median-2011'].copy() # pdf for Median *P*rice b/c we need *something*\n",
"\n",
"# df == *data* frame (a.k.a. predictors or independent variables)\n",
"# pr == *predicted* value (a.k.a. dependent variable)\n",
"# Notice we don't want the median price included in our training data\n",
"df_train, df_test, pr_train, pr_test = train_test_split(\n",
" gdf.drop(columns=['Median-2011']), pdf, test_size=0.2, random_state=44)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below you should see that the data has been split roughly on the basis of the `test_size` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Original data size: {gdf.shape[0]:,} x {gdf.shape[1]}\")\n",
"print(f\" Training data size: {df_train.shape[0]:,} x {df_train.shape[1]} ({(df_train.shape[0]/gdf.shape[0])*100:.0f}%)\")\n",
"print(f\" Testing data size: {df_test.shape[0]:,} x {df_test.shape[1]} ({(df_test.shape[0]/gdf.shape[0])*100:.0f}%)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also notice the indexes of each pair of data sets match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\", \".join([str(x) for x in df_train.index[:10]]))\n",
"print(\", \".join([str(x) for x in pr_train.index[:10]]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2.3: Plot Test/Train Data\n",
"\n",
"\n",
" Difficulty level: Low, but important!\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"boros = gpd.read_file(os.path.join('data','geo','Boroughs.gpkg'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f,axes = plt.subplots(1,2, figsize=(12,5))\n",
"df_train.plot(ax=???)\n",
"df_test.plot(ax=???)\n",
"boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n",
"boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n",
"axes[0].set_title('Training Data')\n",
"axes[1].set_title('Testing Data');\n",
"axes[0].set_ylim(150000,210000)\n",
"axes[1].set_ylim(150000,210000)\n",
"axes[0].set_xlim(500000,565000)\n",
"axes[1].set_xlim(500000,565000)\n",
"axes[1].set_yticks([]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.3.1 Questions\n",
"\n",
"- Why might it be useful to produce a *map* of a test/train split? \n",
"- Why might it matter *more* if you were dealing with user locations or behaviours?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 3. Normalisation\n",
"\n",
"The developers of [SciKit-Learn](https://scikit-learn.org/) define [normalisation](https://scikit-learn.org/stable/modules/preprocessing.html#normalization) as \"scaling individual samples to have **unit norm**.\" There are a _lot_ of subtleties to this when you start dealing with 'sparse' data, but for the most part it's worthwhile to think of this as a rescaling of the raw data to have similar ranges in order achieve some kind of comparison. This is such a common problem that sklearn offers a range of such (re)scalers including: `MinMaxScaler`.\n",
"\n",
"Let's see what effect this has on the data!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sets some handy 'keywords' to tweak the Seaborn plot\n",
"kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n",
"\n",
"# Set the *hue order* so that all plots have the *same*\n",
"# colour on the Subregion\n",
"ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 3.1: Select Columns\n",
"\n",
"\n",
" Difficulty level: Low.\n",
"\n",
"\n",
"One thing you'll need to explain is why I keep writing `df[cols+['Subregion']` and why I don't just add it to the `cols` variable at the start? Don't try to answer this now, get through the rest of Tasks 3 and 4 and see what you think."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cols = ['Tenure-Owned outright', 'Tenure-Owned with a mortgage or loan',\n",
" 'Tenure-Social rented', 'Tenure-Private rented']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Answer*: one part of the answer is that it makes it easy to change the columns we select without having to remember to keep `Subregion`, but the more important reason is that it allows us to re-use *this* 'definition' of `cols` elsewhere throughout the rest of this practical without needing to remember to *remove* `Subregion`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tr_raw = df_train[cols+['Subregion']].copy() # train raw\n",
"tst_raw = df_test[cols+['Subregion']].copy() # test raw"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 3.2: Fit to Data\n",
"\n",
"\n",
" Difficulty level: Moderate if you want to understand what reshape is doing.\n",
"\n",
"\n",
"Fit the training data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Notice what this is doing! See if you can explain it clearly.\n",
"scalers = [???.fit(???[x].values.reshape(-1,1)) for x in cols]"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Task 3.3: Apply Transformations\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"Train:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tr_normed = tr_raw.copy()\n",
"\n",
"for i, sc in enumerate(scalers):\n",
" # Ditto this -- can you explain what this code is doing\n",
" tr_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tst_normed = tst_raw.copy()\n",
"\n",
"for i, sc in enumerate(scalers):\n",
" tst_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" 🔗 Connections: You don't have to fully understand the next section, but if you are able to get your head around it then this will seriously help you to prepare for the use of more advanced techniques in modelling and programming.\n",
"\n",
"\n",
"Check out the properties of `tst_normed` below. If you've understood what the `MinMaxScaler` is doing then you should be able to spot something unexpected in the transformed *test* outputs. If you've *really* understood this, you'll see why this result is problematic for *models*. *Hint*: one way to think of it is an issue of **extrapolation**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for c in cols:\n",
" print(f\" Minimum: {tst_normed[c].min():.4f}\")\n",
" ???"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 3.4: Plot Distributions\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tr_raw.columns = [re.sub('(-|/)',\"\\n\",x) for x in tr_raw.columns.values]\n",
"tst_raw.columns = [re.sub('(-|/)',\"\\n\",x) for x in tst_raw.columns.values]\n",
"tr_normed.columns = [re.sub('(-|/)',\"\\n\",x) for x in tr_normed.columns.values]\n",
"tst_normed.columns = [re.sub('(-|/)',\"\\n\",x) for x in tst_normed.columns.values]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(data=tr_raw, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(data=tr_normed, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.4.1 Questions\n",
"\n",
"- Why do I keep writing `df[cols+['Subregion']`? Why I don't just add Subregions to the `cols` variable at the start?\n",
"- What has changed between the two plots (of `tr_raw` and `tr_normed`)?\n",
"- What is the _potential_ problem that the use of the transformer fitted on `tr_normed` to data from `tst_normed` might cause? *Hint*: this is why I asked you to investigate the data in the empty code cell above.\n",
"- Can you explain what this is doing: `[MinMaxScaler().fit(df_train[x].values.reshape(-1,1)) for x in cols]`?\n",
"- Can you explain what *this* is doing: `sc.transform(df_test[cols[i]].values.reshape(-1,1))`?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 4. Standardisation \n",
"\n",
"Standardisation is typically focussed on rescaling data to have a mean (or median) of 0 and standard deviation or IQR of 1. That these approaches are conceptually tied to the idea of symmetric, unimodal data such as that encountered in the standard normal distribution. Rather confusingly, many data scientists will use standardisation and normalisation largely interchangeably!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col = 'Vehicles-No cars or vans in hh'\n",
"tr = df_train[[col]].copy()\n",
"tst = df_test[[col]].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 4.1: Z-Score Standardisation\n",
"\n",
"\n",
" Difficulty level: Low.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stsc = StandardScaler().fit(tr[col].values.reshape(-1,1))\n",
"\n",
"tr[f\"Z. {col}\"] = stsc.transform(???)\n",
"tst[f\"Z. {col}\"] = stsc.transform(???)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 4.2: Inter-Quartile Standardisation\n",
"\n",
"\n",
" Difficulty level: Low.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rs = ???(quantile_range=(25.0, 75.0)).fit(???)\n",
"\n",
"tr[f\"IQR. {col}\"] = rs.transform(???)\n",
"tst[f\"IQR. {col}\"] = rs.transform(???)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 4.3: Plot Distributions\n",
"\n",
"\n",
" 🔗 Connections: The point of these next plots is simply to show that *linear* transformations (which are 'reversible') is about changing things like the magnitude/scale of our data but doesn't fundamentally change relationships *between* observations.\n",
"\n",
"\n",
"\n",
" Difficulty level: Low.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.jointplot(data=tr, x=f\"{col}\", y=f\"Z. {col}\", kind='kde'); # hex probably not the best choice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.jointplot(data=tr, x=f\"{col}\", y=f\"IQR. {col}\", kind='kde'); # hex probably not the best choice"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.jointplot(data=tr, x=f\"Z. {col}\", y=f\"IQR. {col}\", kind='hex'); # hex probably not the best choice"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perhaps a little more useful..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ax = sns.kdeplot(tr[f\"Z. {col}\"])\n",
"sns.kdeplot(tr[f\"IQR. {col}\"], color='r', ax=ax)\n",
"plt.legend(loc='upper right', labels=['Standard', 'Robust']) # title='Foo'\n",
"ax.ticklabel_format(useOffset=False, style='plain')\n",
"ax.set_xlabel(\"Standardised Value for No cars or vans in hh\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.2.1 Questions?\n",
"\n",
"- Can you see the differences between these two rescalers? \n",
"- Can you explain why you might want to choose one over the other?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 5. Non-Linear Transformations\n",
"\n",
"\n",
" 🔗 Connections: Now *these* transformations are not directly revsersible because they change the actual relationships between observations in the data. We use these when we need to achieve particular 'behaviours' from our data in order to improve our model's 'predictive' ability. Even a simple regression is doing a kind of prediction ($y = \\beta X + \\epsilon$). You covered these concepts in this week's lectures.\n",
"\n",
"\n",
"So transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more *tractable* for subsequent analysis. In other words, it's all about the _context_ of our data.\n",
"\n",
"[![How tall is tall?](http://img.youtube.com/vi/-VjcCr4uM6w/0.jpg)](http://www.youtube.com/watch?v=-VjcCr4uM6w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From above, we know the _Median Income_ data are _not_ normally distributed, but can we work out what distribution best represents _Median Income_? This can be done by comparing the shape of the histogram to the shapes of theoretical distributitions. For example:\n",
"\n",
"- the [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) distribution\n",
"- the [exponential](https://en.wikipedia.org/wiki/Exponential_distribution) distribution\n",
"- the [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution) distribution (for non-continuous data)\n",
" \n",
"From looking at those theoretical distributions, we might make an initial guess as to the type of distribution. There are actually _many_ other distributions encountered in real life data, but these ones are particuarly common. A wider view of this would be that [quantile and power transformations](https://scikit-learn.org/stable/modules/preprocessing.html#non-linear-transformation) are ways of preserving the rank of values but lose many of the other features of the relationships that might be preserved by, for instance, the standard scaler.\n",
"\n",
"In the case of Median Income, taking a log-transform of the data might make it _appear_ more normal: you do **not** say that a transformation _makes_ data more normal, you say either that 'it allows us to treat the data as normally distributed' or that 'the transformed data follows a log-normal distribution'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 5.1: The Normal Distribution\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"Z-scores are often associated with the normal distribution because their interpretation implicitly assumes a normal distribution. Or to put it another way... You can always calculate z-scores for your data (it's just a formula applied to data points), but their _intuitive meaning_ is lost if your data don't have something like a normal distribution (or follow the [68-95-99.7 rule](https://en.wikipedia.org/wiki/68–95–99.7_rule)).\n",
"\n",
"But... what if our data are non-normal? Well, Just because data are non-normal doesn't mean z-scores can't be calculated (we already did that above); we just have to be careful what we do with them... and sometimes we should just avoid them entirely. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.1.1 Creating a Normal Distribution\n",
"\n",
"Below is a function to create that theoretical normal distribution. See if you can understand what's going and add comments to the code to explain what each line does. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normal_from_dist(series): \n",
" mu = ??? # mean of our data\n",
" sd = ??? # standard deviation of our data\n",
" n = ??? # count how many observations are in our data\n",
" s = np.random.normal(???, ???, ???) #use the parameters of the data just calculated to generate n random numbers, drawn from a normal distributions \n",
" return s #return this set of random numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make it easier to understand what the function above is doing, let's use it! We'll use the function to plot both a distribution plot with both histogram and KDE for our data, and then add a _second_ overplot distplot to the same fig showing the theoretical normal distribution (in red). We'll do this in a loop for each of the three variables we want to examine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.1.2 Visual Comparisons\n",
"\n",
"Looking at the output, which of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation _most_ appropriate as measures of centrality and spread?\n",
"\n",
"*Also*, how would you determine the _meaning_ of some of the departures from the normal distribution?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"selection = [x for x in df_train.columns.values if x.startswith('Composition')]\n",
"\n",
"for c in selection:\n",
" ax = sns.kdeplot(df_train[c])\n",
" sns.kdeplot(normal_from_dist(df_train[c]), color='r', fill=True, ax=ax)\n",
" plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title='Foo'\n",
" ax.ticklabel_format(useOffset=False, style='plain')\n",
" if ax.get_xlim()[1] > 999999:\n",
" plt.xticks(rotation=45)\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.1.3 Questions\n",
"\n",
"- Which, if any, of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation _most_ appropriate as measures of centrality and spread?\n",
"- How might you determine the _significance_ of some of the departures from the normal distribution?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 5.2: Logarithmic Transformations\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"To create a new series in the data frame containing the natural log of the original value it’s a similar process to what we've done before, but since pandas doesn't provide a log-transform operator (i.e. you can’t call `df['MedianIncome'].log()` ) we need to use the `numpy` package since pandas data series are just numpy arrays with some fancy window dressing that makes them even _more_ useful.\n",
"\n",
"Let's perform the transform then compare to the un-transformed data. Comment the code below to ensure that you understand what it is doing. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.2.1 Apply and Plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cols = ['Median-2012','Total Mean hh Income']\n",
"\n",
"for m in cols:\n",
" s = df_train[m] # s == series\n",
" ts = ???.???(s) # ts == transformed series\n",
" \n",
" ax = sns.kdeplot(s)\n",
" sns.kdeplot(normal_from_dist(s), color='r', fill=True, ax=ax)\n",
" plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title also an option\n",
" plt.title(\"Original Data\")\n",
" \n",
" ### USEFUL FORMATTING TRICKS ###\n",
" # This turns off scientific notation in the ticklabels\n",
" ax.ticklabel_format(useOffset=False, style='plain')\n",
" # Notice this snippet of code\n",
" ax.set_xlabel(ax.get_xlabel() + \" (Raw Distribution)\")\n",
" # Notice this little code snippet too\n",
" if ax.get_xlim()[1] > 999999:\n",
" plt.xticks(rotation=45)\n",
" \n",
" plt.show()\n",
" \n",
" ax = sns.kdeplot(ts)\n",
" sns.kdeplot(normal_from_dist(ts), color='r', fill=True, ax=ax)\n",
" plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n",
" ax.ticklabel_format(useOffset=False, style='plain')\n",
" ax.set_xlabel(ax.get_xlabel() + \" (Logged Distribution)\")\n",
" if ax.get_xlim()[1] > 999999:\n",
" plt.xticks(rotation=45)\n",
" plt.title(\"Log-Transformed Data\")\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hopefully, you can see that the transformed data do indeed look 'more normal'; the peak of the red and blue lines are closer together and the blue line at the lower extreme is also closer to the red line, but we can check this by seeing what has happened to the z-scores."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 5.3: Power Transformations \n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cols = ['Median-2012','Total Mean hh Income']\n",
"pt = ???(method='yeo-johnson')\n",
"\n",
"for m in cols:\n",
" s = df_train[m] # s == series\n",
" ts = pt.fit_transform(s.values.reshape(-1,1))\n",
" print(f\"Using lambda (transform 'exponent') of {pt.lambdas_[0]:0.5f}\")\n",
" \n",
" ax = sns.kdeplot(ts.reshape(-1,))\n",
" \n",
" sns.kdeplot(normal_from_dist(???), color='r', fill=True, ax=ax)\n",
" plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n",
" ax.ticklabel_format(useOffset=False, style='plain')\n",
" ax.set_xlabel(m + \" (Transformed Distribution)\")\n",
" if ax.get_xlim()[1] > 999999: # <-- What does this do?\n",
" plt.xticks(rotation=45)\n",
" plt.title(\"Power-Transformed Data\")\n",
" plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 6. Principal Components Analysis\n",
"\n",
"\n",
" 🔗 Connections: This is all about dimensionality and the different ways that we can reduce dimensionality in our data in order to improve our models' robustness and gain a better understanding of whether (or not) there is structure in our data. We'll start with linear decomposition and then look at non-linear decomposition, which we also demonstrated with UMAP last week.\n",
"\n",
"\n",
"Now we're going to ask the question: how can we represent our data using a smaller number of components that capture the variance in the original data. You should have covered PCA in Quantitative Methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional Reload\n",
"\n",
"Use this is your data gets messy..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet')).set_index('MSOA Code')\n",
"print(gdf.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"categoricals = ['Borough','Subregion']\n",
"for c in categoricals:\n",
" gdf[c] = gdf[c].astype('category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.1: Calculating Shares\n",
"\n",
"\n",
" Difficulty level: Hard.\n",
"\n",
"\n",
"Sadly, there's no transformer to work this out for you automatically, but let's start by converting the raw population and household figures to shares so that our later dimensionality reduction steps aren't impacted by the size of the MSOA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf[['Age-All Ages','Households-All Households']].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.1.1 Specify Totals Columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total_pop = gdf['Age-All Ages']\n",
"total_hh = gdf['Households-All Households']\n",
"total_vec = gdf['Vehicles-Sum of all cars or vans in the area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.1.2 Specify Columns for Pop or HH Normalisation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pop_cols = ['Age-', 'Composition-', 'Qualifications-', 'Economic Activity-', 'White', 'Mixed/multiple',\n",
" 'Asian/Asian British', 'Black/African', 'BAME', 'Other ethnic',\n",
" 'Country of Birth-']\n",
"hh_cols = [???, ???, ???, 'Detached', 'Semi-detached', 'Terraced', 'Flat, ']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"popre = re.compile(r'^(?:' + \"|\".join(pop_cols) + r')')\n",
"hhre = re.compile(r'^(?:' + \"|\".join(???) + r')')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.1.3 Apply to Columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tr_gdf = gdf.copy()\n",
"tr_gdf['Mean hh size'] = tr_gdf['Age-All Ages']/tr_gdf['Households-All Households']\n",
"\n",
"for c in gdf.columns:\n",
" print(c)\n",
" if popre.match(c):\n",
" print(\" Normalising by total population.\")\n",
" tr_gdf[c] = gdf[c]/???\n",
" elif ???.match(???):\n",
" print(\" Normalising by total households.\")\n",
" tr_gdf[c] = gdf[c]/???\n",
" elif c.startswith('Vehicles-') and not c.startswith('Vehicles-Cars per hh'):\n",
" print(\" Normalising by total vehicles.\")\n",
" tr_gdf[c] = gdf[c]/???\n",
" else:\n",
" print(\" Passing through.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.2: Removing Columns\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"To perform dimensionality we can only have numeric data. In theory, categorical data can be converted to numeric and retained, but there are two issues:\n",
"\n",
"1. Nominal data has no _innate_ order so we _can't_ convert > 2 categories to numbers and have to convert them to One-Hot Encoded values.\n",
"2. A binary (i.e. One-Hot Encoded) variable will account for a _lot_ of variance in the data because it's only two values they are 0 and 1!\n",
"\n",
"So in practice, it's probably a good idea to drop categorical data if you're planning to use PCA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.2.1 Drop Totals Columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf = tr_gdf.drop(columns=['Age-All Ages', 'Households-All Households',\n",
" 'Vehicles-Sum of all cars or vans in the area'])\n",
"pcadf = pcadf.set_index('MSOA Code')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.2.2 Drop Non-Numeric Columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf.select_dtypes(['category','object']).columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf.drop(columns=pcadf.select_dtypes(['category','object']).columns.to_list(), inplace=True)\n",
"pcadf.drop(columns=['BNG_E','BNG_N','geometry', 'LONG', 'LAT','Shape__Are', 'Shape__Len'], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.3: Rescale & Reduce\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"In order to ensure that our results aren't dominated by a single scale (e.g. House Prices!) we need to rescale all of our data. You could easily try different scalers as well as a different parameters to see what effect this has on your results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.3.1 Robustly Rescale\n",
"\n",
"Set up the Robust Rescaler for inter-decile standardisation: 10th and 90th quantiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rs = ???\n",
"\n",
"for c in pcadf.columns.values:\n",
" pcadf[c] = rs.fit_transform(pcadf[c].values.reshape(-1, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.3.2 PCA Reduce"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA \n",
"\n",
"pca = PCA(n_components=50, whiten=True) \n",
"\n",
"pca.fit(pcadf)\n",
"\n",
"explained_variance = pca.explained_variance_ratio_\n",
"singular_values = pca.singular_values_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.3.3 Examine Explained Variance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.arange(1,???)\n",
"plt.plot(x, explained_variance)\n",
"plt.ylabel('Share of Variance Explained')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in range(0, 20):\n",
" print(f\"Component {i:>2} accounts for {explained_variance[i]*100:>2.2f}% of variance\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should get that Component 0 accounts for 31.35% of the variance and Component 19 accounts for 0.37%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6.3.4: How Many Components?\n",
"\n",
"There are a number of ways that we could set a threshold for dimensionality reduction: \n",
"- The most common is to look for the 'knee' in the Explained Variance plot above. That would put us at about 5 retained components.\n",
"- Another is to just keep all components contributing more than 1% of the variance. That would put us at about 10 components.\n",
"- You can also ([I discovered](https://medium.com/@nikolay.oskolkov/hi-jon-reades-my-sincere-apologies-for-this-very-late-reply-444f57054d14)) look to shuffle the data and repeatedly perform PCA to build confidence intervals. I have not implemented this (yet).\n",
"\n",
"In order to _do_ anything with these components we need to somehow reattach them to the MSOAs. So that entails taking the transformed results (`X_train` and `X_test`) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kn = knee_locator.KneeLocator(x, explained_variance, \n",
" curve='convex', direction='decreasing', \n",
" interp_method='interp1d')\n",
"print(f\"Knee detected at: {kn.knee}\")\n",
"kn.plot_knee()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"keep_n_components = 7\n",
"\n",
"# If we weren't changing the number of components we\n",
"# could re-use the pca object created above. \n",
"pca = PCA(n_components=keep_n_components, whiten=True)\n",
"\n",
"X_train = pca.fit_transform(???)\n",
"\n",
"# Notice that we get the _same_ values out,\n",
"# so this is a *deterministic* process that\n",
"# is fully replicable (allowing for algorithmic\n",
"# and programming language differences).\n",
"print(f\"Total explained variance: {pca.explained_variance_ratio_.sum()*100:2.2f}%\")\n",
"for i in range(0, keep_n_components):\n",
" print(f\" Component {i:>2} accounts for {pca.explained_variance_ratio_[i]*100:>5.2f}% of variance\")\n",
"\n",
"# Notice...\n",
"print(f\"X-train shape: {len(X_train)}\")\n",
"print(f\"PCA df shape: {pcadf.shape[0]}\")\n",
"# So each observation has a row in X_train and there is \n",
"# 1 column for each component. This defines the mapping\n",
"# of the original data space into the reduced one\n",
"print(f\"Row 0 of X-train contains {len(X_train[0])} elements.\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.4: Components to Columns\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"You could actually do this more quickly (but less clearly) using `X_train.T` to transpose the matrix!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in range(0,keep_n_components):\n",
" s = pd.Series(X_train[:,???], index=pcadf.???)\n",
" pcadf[f\"Component {i+1}\"] = s"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf.sample(3).iloc[:,-10:-4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.7: (Re)Attaching GeoData\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msoas = gpd.read_file(os.path.join('data','geo','Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip'), driver='ESRI Shapefile')\n",
"msoas = msoas.set_index('MSOA11CD')\n",
"print(msoas.columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msoas.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pcadf.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gpcadf = pd.merge(msoas.set_index(['MSOA11CD'], drop=True), pcadf, left_index=True, right_index=True, how='inner')\n",
"print(f\"Geo-PCA df has shape {gpcadf.shape[0]} x {gpcadf.shape[1]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should get `PCA df has shape 983 x 89`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gpcadf['Borough'] = gpcadf.MSOA11NM.apply(???)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 6.8: Map the First _n_ Components\n",
"\n",
"\n",
" Difficulty level: Moderate.\n",
"\n",
"\n",
"How would you automate this so that the loop creates one plot for each of the first 3 components? How do you interpret these?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for comp in [f\"Component {x}\" for x in range(1,3)]:\n",
" ax = gpcadf.plot(column=???, cmap='plasma', \n",
" scheme='FisherJenks', k=7, edgecolor='None', legend=True, figsize=(9,7));\n",
" boros.plot(ax=ax, edgecolor='w', facecolor='none', linewidth=1, alpha=0.7)\n",
" ax.set_title(f'PCA {comp}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your first component map should look something like this:\n",
" \n",
"![PCA Component 1](https://github.com/jreades/fsds/raw/master/practicals/img/PCA-Component-1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 7. UMAP\n",
"\n",
"UMAP is a non-linear dimensionality reduction technique. Technically, it's called *manifold* learning: imagine being able to roll a piece of paper up in more than just the 3rd dimension...). As a way to see if there is structure in your data this is a *much* better technique than one you might encounter in many tutorials: *t-SNE*. It has to do with how the two techniques 'learn' the manifold to use with your data. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 7.1: UMAP on Raw Data\n",
"\n",
"\n",
" Difficulty level: Hard.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from umap import UMAP\n",
"\n",
"# You might want to experiment with all\n",
"# 3 of these values -- it may make sense \n",
"# to package a lot of this up into a function!\n",
"keep_dims=2\n",
"rs=42\n",
"\n",
"u = UMAP(\n",
" n_neighbors=25,\n",
" min_dist=0.01,\n",
" n_components=keep_dims,\n",
" random_state=rs)\n",
"\n",
"X_embedded = u.fit_transform(???)\n",
"print(X_embedded.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 7.2: Write to Data Frame\n",
"\n",
"\n",
" Difficulty level: Low.\n",
"\n",
"\n",
"Can probably also be solved using `X_embedded.T`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for ix in range(0,X_embedded.shape[1]):\n",
" print(ix)\n",
" s = pd.Series(X_embedded[:,???], index=pcadf.???)\n",
" gpcadf[f\"Dimension {ix+1}\"] = s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 7.3: Visualise!\n",
"\n",
"\n",
" Difficulty level: Low.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rddf = gpcadf.copy() # Reduced Dimension Data Frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7.3.1 Simple Scatter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f,ax = plt.subplots(1,1,figsize=(8,6))\n",
"sns.scatterplot(x=rddf[???], y=rddf[???], hue=rddf['Borough'], legend=False, ax=ax)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7.3.2 Seaborn Jointplot\n",
"\n",
"That is *suggestive* of there being struccture in the data, but with 983 data points and 33 colours it's hard to make sense of what the structure *might* imply. Let's try this again using the Subregion instead and taking advantage of the Seaborn visualisation library's `jointplot` (joint distribution plot):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rddf['Subregion'] = rddf.Borough.apply(lambda x: mapping[x])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sets some handy 'keywords' to tweak the Seaborn plot\n",
"kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n",
"# Set the *hue order* so that all plots have some colouring by Subregion\n",
"ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.jointplot(data=rddf, x=???, y=???, height=8, \n",
" hue=???, hue_order=ho, joint_kws=kwds)\n",
"g.ax_joint.legend(loc='upper right', prop={'size': 8});"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your jointplot should look like this:\n",
"\n",
"![UMAP Jointplot](https://github.com/jreades/fsds/raw/master/practicals/img/UMAP-Jointplot.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do you make of this?\n",
"\n",
"Maybe let's give this one last go splitting the plot out by subregion so that we can see how these vary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for r in rddf.Subregion.unique():\n",
" g = sns.jointplot(data=rddf[rddf.Subregion==r], x='Dimension 1', y='Dimension 2', \n",
" hue='Borough', joint_kws=kwds)\n",
" g.ax_joint.legend(loc='upper right', prop={'size': 8});\n",
" g.ax_joint.set_ylim(0,15)\n",
" g.ax_joint.set_xlim(0,15)\n",
" plt.suptitle(r)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can't unfortunately do any clustering at this point to create groups from the data (that's next week!) so for now note that there are several large-ish groups (in terms of membership) and few small ones picked up by t-SNE. Alos note that there is strong evidence of some incipient structure: Inner East and West largely clump together, while Outher East and Outer South also seem to group together, with Outer West being more distinctive. If you look back at the PCA Components (especially \\#1) you might be able to speculate about some reasons for this! Please note: this is _only_ speculation at this time!\n",
"\n",
"Next week we'll also add the listings data back in as part of the picture!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 7.4: Map the _n_ Dimensions\n",
"\n",
"\n",
" Difficulty level: Low.\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for comp in [f\"Dimension {x}\" for x in range(1,3)]:\n",
" f, ax = plt.subplots(1,1,figsize=(12,8))\n",
" rddf.plot(???);\n",
" boros.plot(edgecolor='w', facecolor='none', linewidth=1, alpha=0.7, ax=ax)\n",
" ax.set_title(f'UMAP {comp}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your first dimension map should look something like this:\n",
"\n",
"![UMAP Dimension 1](https://github.com/jreades/fsds/raw/master/practicals/img/UMAP-Dimension-1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 7.5: And Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rddf.to_parquet(os.path.join('data','clean','Reduced_Dimension_Data.geoparquet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 7.5.1 Questions\n",
"\n",
"- How would you compare/contrast PCA components with UMAP dimensions? Why do they not seem to show the same thing even though *both* seem to show *something*?\n",
"- What might you do with the output of either the PCA or UMAP processes?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Credits!\n",
"\n",
"#### Contributors:\n",
"The following individuals have contributed to these teaching materials: Jon Reades (j.reades@ucl.ac.uk).\n",
"\n",
"#### License\n",
"These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n",
"\n",
"#### Potential Dependencies:\n",
"This notebook may depend on the following libraries: pandas, geopandas, sklearn, matplotlib, seaborn"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}