{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "\n", "
\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "In my [second post on HDB resale flat prices](https://dataandstuff.wordpress.com/2017/09/09/resale-flats-and-clusters/), I attempted to price resale flats in Jurong West by creating clusters of flats. The big idea behind this methodology was the realisation that it is impossible to explicitly account for all qualitative reasons why home buyers would want to buy a flat in a specific area. It could be because of good schools, proximity to amenities like transport centers and malls, or the liveliness of the areas. Furthermore, each factor would have different importance to different people. Modeling this would be a nightmare. Hence, I chose just two features to develop a spatial representation of these preferences: (1) latitude and (2) longitude. In fact, I showed that clusters of nearby flats shared a statistically meaningful relationship with resale prices in those clusters. \n", " \n", "In that post, I computed clusters for Jurong West. This time, I apply that methodology to all 26 towns in the dataset. The objective is to develop a set of clusters for **each town** to be used as categorical features in our model of HDB resale flat prices." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import modules\n", "import gmaps\n", "import json\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn.apionly as sns\n", "from scipy.spatial import ConvexHull\n", "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import MinMaxScaler\n", "import tabulate\n", "import urllib.request as ur\n", "import warnings\n", "\n", "# Settings\n", "%matplotlib inline\n", "warnings.filterwarnings('ignore')\n", "\n", "# Colours\n", "def get_cols():\n", "\n", "    print('[Colours]:')\n", "    print('Orange: #ff9966')\n", "    print('Navy Blue: #133056')\n", "    print('Light Blue: #b1ceeb')\n", "    print('Green: #6fceb0')\n", "    print('Red: #f85b74')\n", "\n", "    return" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Modify settings\n", "mpl.rcParams['axes.grid'] = True\n", "mpl.rcParams['axes.grid.axis'] = 'y'\n", "mpl.rcParams['grid.color'] = '#e8e8e8'\n", "mpl.rcParams['axes.spines.right'] = False\n", "mpl.rcParams['axes.spines.top'] = False\n", "mpl.rcParams['xtick.color'] = '#494949'\n", "mpl.rcParams['xtick.labelsize'] = 12\n", "mpl.rcParams['ytick.color'] = '#494949'\n", "mpl.rcParams['ytick.labelsize'] = 12\n", "mpl.rcParams['axes.edgecolor'] = '#494949'\n", "mpl.rcParams['axes.labelsize'] = 15\n", "mpl.rcParams['axes.labelpad'] = 15\n", "mpl.rcParams['axes.labelcolor'] = '#494949'\n", "mpl.rcParams['axes.axisbelow'] = True\n", "mpl.rcParams['figure.titlesize'] = 20\n", "mpl.rcParams['figure.titleweight'] = 'bold'\n", "mpl.rcParams['font.family'] = 'sans-serif'\n", "mpl.rcParams['font.sans-serif'] = 'Raleway'\n", "mpl.rcParams['scatter.marker'] = 'h'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Geocoding\n", "ArcGIS defines geocoding as \"the process of transforming a description of a location—such as a pair of coordinates, an address, or a name of a place—to a location on the earth's surface\". In the HDB resale flat dataset, we are provided with a block number and a street name. Combining these two features gives us an address that can be converted into latitude and longitude. If we think of latitude and longitude as the *y* and *x* axes on a graph, then each flat is simply a point on the graph, and groups of flats can be identified easily. \n", " \n", "How do we collect this data? Easy: we get it [HERE](https://developer.here.com/). HERE provides a host of location services like interactive maps, geocoding, traffic, tracking and routing. Its clients include Bing, Samsung, Audi, and Grab. Yes, Singapore's Grab. What's amazing is that it provides users with an API (see the link above) with up to **250,000 free requests**. That is an insanely large number of requests, at least when compared to Google Places' 40,000 request limit. \n", " \n", "To begin, we load the HDB resale flat dataset and create two addresses: \n", " \n", "1. **Full Address:** To facilitate identification of the address. For example: *174 ANG MO KIO AVE 4*\n", "2. **Search Address:** This is the address that we are going to append to our query to HERE. For example: *174+ANG+MO+KIO+AVE+4+SINGAPORE*\n", " \n", "Note that I extracted the unique addresses from the full list of search addresses we created. This is to save time and stay within the query limit, because we need not geocode the same address more than once. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Read data\n", "hdb = pd.read_csv('resale-flat-prices-based-on-registration-date-from-jan-2015-onwards.csv')\n", "\n", "# Create addresses\n", "hdb['full_address'] = hdb.block + ' ' + hdb.street_name\n", "hdb['search_address'] = hdb.block + '+' + hdb.street_name.str.replace(' ', '+') + '+SINGAPORE'\n", "\n", "# Extract search addresses\n", "all_adds = hdb.search_address.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to configure two parameters: our individual app ID and app code. These can be found on your project page after you have created an account. We save these as variables:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Set parameters\n", "APP_ID = '[YOUR HERE APP ID]'\n", "APP_CODE = '[YOUR HERE APP CODE]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will loop through the unique list of addresses, use `urllib` to query the API, `json` to process the JSON response (into a Python dictionary), and a custom function to extract the data we need from the processed response. The function `get_loc` retrieves two sets of latitude and longitude: the display position and the navigation position. Based on some testing, I discovered that the **average** of these two sets of coordinates was more accurate than either of them individually. Hence, I chose to compute the average latitude and average longitude." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Define function to extract average location\n", "# Takes a dictionary object\n", "# Returns the average of display position and navigation position in a dictionary\n", "def get_loc(result):\n", "\n", "    # Output\n", "    output = dict()\n", "\n", "    if len(result['Response']['View']) > 0:\n", "\n", "        # Get display position lat/long\n", "        lat_dp = result['Response']['View'][0]['Result'][0]['Location']['DisplayPosition']['Latitude']\n", "        lon_dp = result['Response']['View'][0]['Result'][0]['Location']['DisplayPosition']['Longitude']\n", "\n", "        # Get navigation position lat/long\n", "        lat_np = result['Response']['View'][0]['Result'][0]['Location']['NavigationPosition'][0]['Latitude']\n", "        lon_np = result['Response']['View'][0]['Result'][0]['Location']['NavigationPosition'][0]['Longitude']\n", "\n", "        # Configure output\n", "        output['lat'] = (lat_dp + lat_np) / 2\n", "        output['lon'] = (lon_dp + lon_np) / 2\n", "\n", "    else:\n", "\n", "        # No match found: return missing coordinates\n", "        output['lat'] = np.nan\n", "        output['lon'] = np.nan\n", "\n", "    return output" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With the search addresses created, app ID and code configured, and the function defined, we are ready to geocode. Note that the output of `get_loc` for each search address is a dictionary, and all the dictionaries are saved into a list. This facilitates conversion into a Pandas dataframe. The code below takes approximately 5 minutes to run and uses about 8.5k queries of your HERE API limit. I've run the code and saved the results to a CSV file for quick loading." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Initialise results\n", "all_latlon = []\n", "\n", "# Loop through to get lat lon\n", "for i in range(len(all_adds)):\n", "\n", "    # Extract address\n", "    temp_add = all_adds[i]\n", "\n", "    # Configure URL\n", "    temp_url = 'https://geocoder.api.here.com/6.2/geocode.json' + \\\n", "        '?app_id=' + APP_ID + \\\n", "        '&app_code=' + APP_CODE + \\\n", "        '&searchtext=' + temp_add\n", "\n", "    # Pull data\n", "    temp_response = ur.urlopen(ur.Request(temp_url)).read()\n", "    temp_result = json.loads(temp_response)\n", "\n", "    # Process data\n", "    temp_latlon = get_loc(temp_result)\n", "\n", "    # Add address\n", "    temp_latlon['address'] = temp_add\n", "\n", "    # Append\n", "    all_latlon.append(temp_latlon)\n", "\n", "    # Update\n", "    print(str(i) + '. ', 'Getting data for: ' + str(temp_add))\n", "\n", "# Convert to data frame\n", "full_latlon = pd.DataFrame(all_latlon)\n", "\n", "# Save\n", "full_latlon.to_csv('latlon_data.csv', index = False)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here's a preview of the data:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>address</th><th>lat</th><th>lon</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>174+ANG+MO+KIO+AVE+4+SINGAPORE</td><td>1.375270</td><td>103.837640</td></tr>\n", "    <tr><th>1</th><td>541+ANG+MO+KIO+AVE+10+SINGAPORE</td><td>1.374025</td><td>103.855695</td></tr>\n", "    <tr><th>2</th><td>163+ANG+MO+KIO+AVE+4+SINGAPORE</td><td>1.373885</td><td>103.838110</td></tr>\n", "    <tr><th>3</th><td>446+ANG+MO+KIO+AVE+10+SINGAPORE</td><td>1.367855</td><td>103.855395</td></tr>\n", "    <tr><th>4</th><td>557+ANG+MO+KIO+AVE+10+SINGAPORE</td><td>1.371540</td><td>103.857790</td></tr>\n", "  </tbody>\n", "</table>" ], "text/plain": [ "                           address       lat         lon\n", "0   174+ANG+MO+KIO+AVE+4+SINGAPORE  1.375270  103.837640\n", "1  541+ANG+MO+KIO+AVE+10+SINGAPORE  1.374025  103.855695\n", "2   163+ANG+MO+KIO+AVE+4+SINGAPORE  1.373885  103.838110\n", "3  446+ANG+MO+KIO+AVE+10+SINGAPORE  1.367855  103.855395\n", "4  557+ANG+MO+KIO+AVE+10+SINGAPORE  1.371540  103.857790" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load data\n", "map_latlon = pd.read_csv('latlon_data.csv')\n", "\n", "# View\n", "map_latlon.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will use this data frame to map the search addresses in the main dataframe to a latitude and longitude pair. With that, we are done with geocoding." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Rename the address column to search_address and use it as the index\n", "map_latlon = map_latlon.rename(columns = {'address': 'search_address'})\n", "map_latlon = map_latlon.set_index('search_address')\n", "\n", "# Separate maps\n", "map_lat = map_latlon['lat']\n", "map_lon = map_latlon['lon']\n", "\n", "# Map\n", "hdb['lat'] = hdb.search_address.map(map_lat)\n", "hdb['lon'] = hdb.search_address.map(map_lon)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering\n", "Next, we develop clusters within each of the 26 towns using the resale flats' coordinates. Let's use Jurong West as an example." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Jurong West\n", "First, we extract the coordinates of flats in Jurong West and scale the coordinates using the `MinMaxScaler`. This rescales latitude and longitude to the range 0 to 1, so that both contribute on the same scale to the distance calculations in clustering regardless of their raw magnitudes (about 1 vs. 100)." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>lat</th><th>lon</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0.685821</td><td>0.624126</td></tr>\n", "    <tr><th>1</th><td>0.703477</td><td>0.616068</td></tr>\n", "    <tr><th>2</th><td>0.758060</td><td>0.652565</td></tr>\n", "    <tr><th>3</th><td>0.685821</td><td>0.624126</td></tr>\n", "    <tr><th>4</th><td>0.824356</td><td>0.927597</td></tr>\n", "  </tbody>\n", "</table>" ], "text/plain": [ "        lat       lon\n", "0  0.685821  0.624126\n", "1  0.703477  0.616068\n", "2  0.758060  0.652565\n", "3  0.685821  0.624126\n", "4  0.824356  0.927597" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For Jurong West\n", "dat_jw = hdb[['lat', 'lon']][hdb.town == 'JURONG WEST']\n", "dat_jw = dat_jw.reset_index(drop = True)\n", "\n", "# Normalise\n", "mmscale = MinMaxScaler()\n", "mmscale.fit(dat_jw)\n", "dat_jw_scaled = pd.DataFrame(mmscale.transform(dat_jw), columns = ['lat', 'lon'])\n", "dat_jw_scaled.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we apply K-Means clustering on the dataset. Testing out different values of *k* (2 to 22) and using the elbow method to decide on an optimal *k*, we get *k* = 7." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set up values of k\n", "all_k = np.arange(2, 23, 1)\n", "\n", "# Initialise results\n", "k_results = []\n", "\n", "# Loop through values of k\n", "for k in all_k:\n", "\n", "    # Set up kmeans\n", "    km1 = KMeans(n_clusters = k, random_state = 123)\n", "\n", "    # Fit data\n", "    km1.fit(dat_jw_scaled)\n", "\n", "    # Score data\n", "    k_results.append(km1.inertia_)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# CODE FOR CUSTOM GRAPHICS NOT INCLUDED" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Hence, we fit a K-Means model with *k* = 7 and assign each resale flat to a cluster." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Fitting 7 clusters:\n", "km_final = KMeans(n_clusters = 7, random_state = 123)\n", "km_final.fit(dat_jw_scaled)\n", "dat_jw['label'] = km_final.labels_" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We find some differentiation in resale prices across the clusters. For example, Jurong West Cluster 3 contains flats that are generally more expensive than Clusters 1, 2, 5, and 6." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Copy the subset so that adding a column does not write to a view of hdb\n", "dat_jw_full = hdb[hdb.town == 'JURONG WEST'].copy()\n", "dat_jw_full['label'] = km_final.labels_ + 1" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# CODE FOR CUSTOM GRAPHICS NOT INCLUDED" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Even though the clustering looks good statistically, it always makes sense to do a visual check, which can be done using the `gmaps` module. `gmaps` uses the Google Maps JavaScript API (Google provides a free API key) to generate an HTML Google Map. We plot the **unique** coordinates of flats, color-coded by label, to give us a map with 7 distinct zones. The map shows that 7 clusters is a good fit (or at least isn't terrible)." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3efd234e0deb49749c78aba49891a3ee", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Figure(layout=FigureLayout(height='420px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# GMAPS\n", "GOOGLE_API = '[YOUR GOOGLE API KEY HERE]'\n", "gmaps.configure(api_key = GOOGLE_API)\n", "\n", "# Configure labels and column names\n", "dat_jw_plot = dat_jw[['lat', 'lon', 'label']].copy()\n", "dat_jw_plot['label'] = dat_jw_plot['label'] + 1\n", "dat_jw_plot = dat_jw_plot.rename(columns = {'lat': 'latitude', 'lon': 'longitude'})\n", "\n", "# Remove duplicates\n", "dat_jw_plot = dat_jw_plot.drop_duplicates()\n", "latlon_jw = dat_jw_plot[['latitude', 'longitude']]\n", "\n", "# Create layers\n", "# Colour code:\n", "# - 1: Blue\n", "# - 2: Orange\n", "# - 3: Green\n", "# - 4: Red\n", "# - 5: Purple\n", "# - 6: Brown\n", "# - 7: Pink\n", "c1 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 1], fill_color = '#1f77b4', stroke_color='#1f77b4', scale = 2)\n", "c2 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 2], fill_color = '#ff7f0e', stroke_color='#ff7f0e', scale = 2)\n", "c3 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 3], fill_color = '#2ca02c', stroke_color='#2ca02c', scale = 2)\n", "c4 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 4], fill_color = '#d62728', stroke_color='#d62728', scale = 2)\n", "c5 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 5], fill_color = '#9467bd', stroke_color='#9467bd', scale = 2)\n", "c6 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 6], fill_color = '#8c564b', stroke_color='#8c564b', scale = 2)\n", "c7 = gmaps.symbol_layer(latlon_jw[dat_jw_plot.label == 7], fill_color = '#e377c2', stroke_color='#e377c2', scale = 2)\n", "\n", "# Create base map\n", "t1 = gmaps.figure()\n", "\n", "# Add layers\n", "t1.add_layer(c1)\n", "t1.add_layer(c2)\n", "t1.add_layer(c3)\n", "t1.add_layer(c4)\n", "t1.add_layer(c5)\n", "t1.add_layer(c6)\n", "t1.add_layer(c7)\n", "\n", "# Visualise\n", "t1" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Function to Plot Elbow Graph" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def clust_town(town):\n", "    # Extract town data\n", "    temp_dat = hdb[['lat', 'lon']][hdb.town == town]\n", "    temp_dat = temp_dat.reset_index(drop = True)\n", "\n", "    # Normalise\n", "    temp_mm = MinMaxScaler()\n", "    temp_mm.fit(temp_dat)\n", "    temp_dat_scaled = pd.DataFrame(temp_mm.transform(temp_dat), columns = ['lat', 'lon'])\n", "\n", "    # Set up values of k\n", "    temp_k = np.arange(2, 23, 1)\n", "\n", "    # Initialise results\n", "    temp_results = []\n", "\n", "    # Loop through values of k\n", "    for k in temp_k:\n", "\n", "        # Set up kmeans\n", "        temp_km = KMeans(n_clusters = k, random_state = 123)\n", "\n", "        # Fit data\n", "        temp_km.fit(temp_dat_scaled)\n", "\n", "        # Score data\n", "        temp_results.append(temp_km.inertia_)\n", "\n", "    # Plot\n", "    plt.plot(temp_k, temp_results)\n", "    plt.title(town)\n", "    plt.show()\n", "    " ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# NOT RUN\n", "# Perform clustering on each town using k = 2 to 22 to choose an optimal k\n", "# for t in hdb.town.value_counts().index:\n", "    \n", "#     clust_town(t)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Function to Plot Maps" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def clust_map(town):\n", "\n", "    # Extract town data\n", "    temp_dat = hdb[['lat', 'lon']][hdb.town == town]\n", "    temp_dat = temp_dat.reset_index(drop = True)\n", "\n", "    # Normalise\n", "    temp_mm = MinMaxScaler()\n", "    temp_mm.fit(temp_dat)\n", "    temp_dat_scaled = pd.DataFrame(temp_mm.transform(temp_dat), columns = ['lat', 'lon'])\n", "\n", "    # 
Get optimal clusters\n", "    opt_clust = disp_clust.loc[town, 'Clusters']\n", "\n", "    # Fit the optimal number of clusters for this town\n", "    temp_km = KMeans(n_clusters = opt_clust, random_state = 123)\n", "    temp_km.fit(temp_dat_scaled)\n", "\n", "    # Configure labels and column names\n", "    plot_labels = temp_dat[['lat', 'lon']].copy()\n", "    plot_labels['label'] = temp_km.labels_ + 1\n", "    plot_labels = plot_labels.rename(columns = {'lat': 'latitude', 'lon': 'longitude'})\n", "\n", "    # Remove duplicates\n", "    plot_labels = plot_labels.drop_duplicates()\n", "    plotdata = plot_labels[['latitude', 'longitude']]\n", "\n", "    # Configure colours\n", "    temp_colors = sns.color_palette().as_hex()\n", "\n", "    # Create base graph\n", "    out_graph = gmaps.figure()\n", "\n", "    # Add layers sequentially\n", "    for i in range(opt_clust):\n", "\n", "        # Add layer\n", "        out_graph.add_layer(\n", "            gmaps.symbol_layer(\n", "                plotdata[plot_labels.label == i + 1],\n", "                fill_color = temp_colors[i], stroke_color = temp_colors[i],\n", "                scale = 2\n", "            )\n", "        )\n", "\n", "    # Output\n", "    return out_graph\n", "\n", "# Function to plot all points\n", "def clust_map_all(town):\n", "\n", "    # Extract town data\n", "    temp_dat = hdb[['lat', 'lon']][hdb.town == town]\n", "    temp_dat = temp_dat.reset_index(drop = True)\n", "\n", "    # Remove duplicates\n", "    temp_dat = temp_dat.drop_duplicates()\n", "\n", "    # Create base graph\n", "    out_graph = gmaps.figure()\n", "\n", "    # Add layer\n", "    out_graph.add_layer(gmaps.symbol_layer(temp_dat, scale = 2))\n", "\n", "    # Output\n", "    return out_graph" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# NOT RUN\n", "# Run clust_map for all towns to check that the k values make sense\n", "# clust_map('[TOWN HERE]')" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Decide on clusters\n", "clust_results = [7, 5, 6, 7, 7, 4, 5, 6, 5, 5, 3, 6, 5, 7, 7, 6, 6, 5, 5, 8, 5, 5, 4, 2, 1, 2]\n", "\n", "# 
Create dataframe\n", "disp_clust = pd.DataFrame(\n", "    [hdb.town.value_counts().index, clust_results], index = ['Town', 'Clusters']\n", ").T.set_index('Town')\n", "\n", "# Create markdown table\n", "# print(tabulate.tabulate(disp_clust, tablefmt=\"pipe\", headers = 'keys'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate Optimal Clusters for All Towns\n", "Performing the same process of selecting an optimal *k* using the elbow and graphical methods, we obtained the following results: \n", " \n", "| Town | Clusters |\n", "|:----------------|-----------:|\n", "| JURONG WEST | 7 |\n", "| SENGKANG | 5 |\n", "| WOODLANDS | 6 |\n", "| TAMPINES | 7 |\n", "| BEDOK | 7 |\n", "| YISHUN | 4 |\n", "| PUNGGOL | 5 |\n", "| HOUGANG | 6 |\n", "| ANG MO KIO | 5 |\n", "| CHOA CHU KANG | 5 |\n", "| BUKIT BATOK | 3 |\n", "| BUKIT MERAH | 6 |\n", "| BUKIT PANJANG | 5 |\n", "| TOA PAYOH | 7 |\n", "| KALLANG/WHAMPOA | 7 |\n", "| PASIR RIS | 6 |\n", "| SEMBAWANG | 6 |\n", "| GEYLANG | 5 |\n", "| QUEENSTOWN | 5 |\n", "| CLEMENTI | 8 |\n", "| JURONG EAST | 5 |\n", "| SERANGOON | 5 |\n", "| BISHAN | 4 |\n", "| CENTRAL AREA | 2 |\n", "| MARINE PARADE | 1 |\n", "| BUKIT TIMAH | 2 |\n", " " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Attach Clusters to Dataset\n", "To obtain the cluster labels for each town, we fit a K-Means model to that town's scaled coordinates using its optimal *k*. This generates labels, which we then attach to the original dataset, town by town, under the feature `label`. To distinguish between the clusters of different towns, we generate a new feature `clust` that joins the town name and the cluster label (for example, *JURONG WEST_3*). That gives us a total of 134 clusters across Singapore." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Get list of towns\n", "all_towns = hdb.town.value_counts().index\n", "\n", "# Initialise label\n", "hdb['label'] = 0\n", "\n", "# Loop through\n", "for town in all_towns:\n", "\n", "    # Extract town data\n", "    temp_dat = hdb[['lat', 'lon']][hdb.town == town]\n", "    temp_dat = temp_dat.reset_index(drop = True)\n", "\n", "    # Normalise\n", "    temp_mm = MinMaxScaler()\n", "    temp_mm.fit(temp_dat)\n", "    temp_dat_scaled = pd.DataFrame(temp_mm.transform(temp_dat), columns = ['lat', 'lon'])\n", "\n", "    # Get optimal clusters\n", "    opt_clust = disp_clust.loc[town, 'Clusters']\n", "\n", "    # Fit optimal clusters:\n", "    temp_km = KMeans(n_clusters = opt_clust, random_state = 123)\n", "    temp_km.fit(temp_dat_scaled)\n", "\n", "    # Attach labels (use .loc to avoid chained assignment)\n", "    hdb.loc[hdb.town == town, 'label'] = temp_km.labels_ + 1\n", "\n", "# Attach town name to cluster label\n", "hdb['clust'] = hdb.town + '_' + hdb.label.astype('str')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here's the full map of all clusters in Singapore:" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Extract town data\n", "full_dat = hdb[['lat', 'lon', 'clust', 'town']]\n", "full_dat = full_dat.reset_index(drop = True)\n", "\n", "# Remove duplicates\n", "full_dat = full_dat.drop_duplicates()\n", "\n", "# Rename columns\n", "full_dat = full_dat.rename(columns = {'lat': 'latitude', 'lon': 'longitude'})\n", "\n", "# Sort\n", "full_dat = full_dat.sort_values(['longitude', 'latitude'])\n", "\n", "# Extract towns\n", "all_towns = list(full_dat.town.unique())\n", "\n", "# Extract cluster names\n", "all_clust = list(full_dat.clust.unique())\n", "\n", "# Configure colours\n", "all_colors = sns.color_palette().as_hex()\n", "\n", "# Configure town colors\n", "town_colors = all_colors * 3" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Create base graph\n", "out_graph = gmaps.figure()\n", "\n", "# Loop through towns\n", "for t in range(len(all_towns)):\n", "\n", "    # Get clusters\n", "    temp_clust = full_dat.clust[full_dat.town == all_towns[t]].unique()\n", "\n", "    # Town coordinates\n", "    temp_towndata = full_dat[['latitude', 'longitude']][full_dat.town == all_towns[t]].copy()\n", "\n", "    # Get town coords as (longitude, latitude) pairs\n", "    temp_towncoords = np.array([[n, m] for n, m in zip(temp_towndata.longitude, temp_towndata.latitude)])\n", "\n", "    # Calculate convex hull\n", "    temp_townhull = ConvexHull(temp_towncoords)\n", "\n", "    # Drawing: the polygon takes (latitude, longitude) pairs\n", "    temp_towndrawing = gmaps.drawing_layer(features=[\n", "        gmaps.Polygon(\n", "            [(n, m) for n, m in zip(temp_towncoords[temp_townhull.vertices, 1], temp_towncoords[temp_townhull.vertices, 0])],\n", "            fill_color = town_colors[t], fill_opacity = 0.3, stroke_color = 'black'\n", "        )\n", "    ], show_controls = False)\n", "\n", "    # Add drawing\n", "    out_graph.add_layer(temp_towndrawing)\n", "\n", "    # Add a point at the town center (median coordinates)\n", "    out_graph.add_layer(\n", "        gmaps.symbol_layer(\n", "            [(temp_towndata.latitude.median(), temp_towndata.longitude.median())],\n", "            fill_color = town_colors[t], stroke_color = town_colors[t],\n", "            scale = 1\n", "        )\n", "    )\n", "\n", "    for c in range(len(temp_clust)):\n", "\n", "        # Extract coordinates\n", "        temp_plotdata = full_dat[['latitude', 'longitude']][full_dat.clust == temp_clust[c]].copy()\n", "\n", "        # Get coords\n", "        temp_coords = np.array([[x, y] for x, y in zip(temp_plotdata.longitude, temp_plotdata.latitude)])\n", "\n", "        # Calculate convex hull\n", "        temp_hull = ConvexHull(temp_coords)\n", "\n", "        # Drawing\n", "        temp_drawing = gmaps.drawing_layer(features=[\n", "            gmaps.Polygon(\n", "                [(x, y) for x, y in zip(temp_coords[temp_hull.vertices, 1], temp_coords[temp_hull.vertices, 0])],\n", "                fill_color = all_colors[c], fill_opacity = 0.3, stroke_color = all_colors[c]\n", "            )\n", "        ], show_controls = False)\n", "\n", "        # Add drawing\n", "        out_graph.add_layer(temp_drawing)" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e43f243e39b04161882786fe0edee5d2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Figure(layout=FigureLayout(height='420px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "out_graph" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "In this post, I demonstrated how an address created from block numbers and street names in the HDB resale flat dataset could be used to generate new features. Geocoding was used to convert addresses into geographic coordinates, and coordinates were used to generate clusters within each town. This produced a total of 134 clusters across Singapore. Hopefully, these will be useful when we develop our machine learning model to predict resale flat prices." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }