{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Back in 2013, housing.com was exactly what I wanted - an easy to navigate, color-coded rental map of places. I could consume more information without having to trawl through tiring listing after listing (which was exactly what almost all other websites offered).\n", "\n", "(I loved them for putting out exactly what I had in mind and I also remember congratulating myself for having their vision! :=P)\n", "\n", "Fast forwarding to 2017, the same guy who was looking for renting places is now looking for a home to purchase. Alas, housing.com has gone back to being a decorative listing place (no surprise there). \n", "\n", "There was no way he was going to filter-search-trawl-and-repeat to get to a place. The road forked; it was going to be either - approach a real estate agent or scrape this data. He chose the latter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![old is gold](http://www.geekycube.com/wp-content/uploads/housing-maps.png)\n", " source: http://www.geekycube.com/wp-content/uploads/housing-maps.png " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scraping it!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Open \"housing.com\" in Google chrome while you are in the Network tab in the console.\n", "If you nose around long enough, you will notice this jewel of an API - https://search.housing.com/api/v1/buy/similar-properties?source=web&limit=4&flat_id=\n", "\n", "![view of the console](chrome_console.png)\n", "\n", "And, this is all you need. The readable naming convention seems to hint that it's their recommendation API. I tried a few combinations and settled down with a _limit=50_. What happened next was pretty simple as you must have guessed - you query this API for a *flat_id* and it would return a JSON response containing a few other flat IDs. You save this data somewhere and repeat the process with each of the new *flat_ids*. And, alas, you have a scraper. \n", "\n", "The scraper in all its uglyness is [here](www.github.com/fx86).\n", "\n", "This lazy script got me around 25 thousand listings in a span of a week." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Can you smell all that data ?!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "New Year 2017 resolution: When you have data lying around, visualize it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, so we went. I was happy with Pandas and its plotting extensions with a smattering of Seaborn colors schemes and aesthetics for the most part of it." ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The following are my default imports\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pd.options.display.mpl_style = False" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "%matplotlib inline\n", "sns.set(style=\"whitegrid\", color_codes=True)\n", "sns.set(font_scale=1)\n", "\n", "# I wanted to be able to read the long addresses so I increased the column width\n", "pd.set_option('display.max_colwidth', 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, that I had imported all the packages that I needed to begin with, I did what all of you would do, with the data." ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(58551, 227)\n" ] } ], "source": [ "df = pd.read_csv(\"housing.csv\", encoding=\"utf-8\")\n", "print df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set initially was 58 thousand rows long and 227 columns wide. This was because there was a lot of overlap; thanks in-part to the nested JSON structure and save duplicates listings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the script was humming along, I had pretty-printed it and found some columns of interest." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "columns = {\"completion_date\":\"completed_on\",\"ctr\":\"CTR\",\n", " \"date_added\":\"date_added\",\"date_added_in_seconds\":\"listing_date_seconds\",\n", " \"developer_name\":\"developer_name\",\"featured_type\":\"featured_type\",\n", " \"impressions\":\"impressions\",\"inventory_amenities_has_gas_pipeline\":\"has_gas\",\n", " \"inventory_amenities_has_gym\":\"has_gym\",\"inventory_amenities_has_lift\":\"has_lift\",\n", " \"inventory_amenities_has_parking\":\"has_parking\",\n", " \"inventory_amenities_has_servant_room\":\"has_servant_room\",\n", " \"inventory_amenities_has_swimming_pool\":\"has_pool\",\n", " \"inventory_amenities_is_gated_community\":\"has_gated_community\",\n", " \"inventory_canonical_url\":\"URL\",\n", " \"inventory_configs_0_apartment_type_id\":\"appt_type_id\",\n", " \"inventory_configs_0_area\":\"area\",\n", " \"inventory_configs_0_facing\":\"facing\",\n", " \"inventory_configs_0_is_available\":\"is_available\",\n", " \"inventory_configs_0_number_of_bedrooms\":\"bedrooms\",\n", " \"inventory_configs_0_number_of_toilets\":\"toilets\",\n", " \"inventory_configs_0_parking_count\":\"parking_count\",\n", " \"inventory_configs_0_per_sqft_rate\":\"sqft_rate\",\n", " \"inventory_configs_0_price\":\"price\",\n", " \"inventory_configs_0_property_type_id\":\"property_type\",\n", " \"is_blocked\":\"is_blocked\",\n", " \"is_featured\":\"is_featured\",\n", " \"is_uc_property\":\"is_uc_property\",\n", " \"polygons_hash_city_name\":\"city\",\n", " \"polygons_hash_housing_region_name\":\"housing_region\",\n", " \"polygons_hash_state_name\":\"state\",\n", " \"polygons_hash_sublocality_name\":\"sub_locality\",\n", " \"price_on_request\":\"price_on_request\",\n", " \"show_loan_option\":\"has_loan\",\n", " \"status\":\"status\",\n", " \"street_info\":\"street\",\n", " \"title\":\"title\",\"type\":\"sale_type\",\n", " \"updated_at\":\"updated_on\",\n", " \"_id\": \"id\",\n", " \"display_neighbourhood_0\": \"locality\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, I kept the necessary columns and renamed them. " ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(24482, 41)\n" ] } ], "source": [ "temp = df[columns.keys()]\n", "temp = temp.rename(columns=columns).drop_duplicates('id')\n", "print temp.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have reduced to 41 columns and 24 thousand properties now. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing values in data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parental advisory: if your kid comes home with a data set with a long list of columns, do not let them try the following visualization. It may cause dizziness on empty stomach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, the same viz helps to see the amount of null values that exist in each column." ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "df_temp = pd.DataFrame({col: {'Empty': temp[temp[col].isnull()].shape[0], 'Total': temp.shape[0]} for col in temp})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we made a new data frame, which kept a count of empty cells against non-empty cells. It looked like this:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Empty | \n", "Total | \n", "
---|---|---|
CTR | \n", "0 | \n", "24482 | \n", "
URL | \n", "0 | \n", "24482 | \n", "
appt_type_id | \n", "0 | \n", "24482 | \n", "
area | \n", "0 | \n", "24482 | \n", "
bedrooms | \n", "0 | \n", "24482 | \n", "
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"<Bokeh Notebook handle for In[200]>