{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](../figures/header.png \"Logo Title Text 1\")\n", "\n", "\n", "# A Deep Dive into Geospatial Analysis\n", "\n", "Many of the datasets that data scientists handle have a geospatial component, and that information is often useful, and sometimes critical, for understanding the problem at hand. Knowing how to work with spatial data is therefore a valuable skill for any data scientist. Even better, Python provides a rich toolset for working in this domain, and recent advances have greatly simplified and consolidated it.\n", "\n", "In this tutorial we will take a deep dive into geospatial analysis in Python, using tools like geopandas, shapely, and pysal to analyze a dataset of sample AirBnB locations in Boston, Massachusetts. The dataset is provided by [Kaggle](https://www.kaggle.com/airbnb-data/boston-airbnb-open-data) and comes originally from [Inside AirBnB](http://insideairbnb.com/get-the-data.html).\n", "\n", "This tutorial is targeted at folks who know a thing or two about data but haven't used Python's geospatial data tools just yet. It assumes a high level of familiarity with pandas. Some familiarity with scikit-learn, statsmodels, matplotlib, and seaborn is also helpful." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "# Show every column when displaying a DataFrame\n", "pd.set_option(\"display.max_columns\", None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "* [**Plotting Points**](#Plotting-Points)\n", "* [**Plotting Geometries**](#Plotting-Geometries)\n", "* [**Upconverting DataFrame to GeoDataFrame Objects**](#Upconverting-DataFrame-to-GeoDataFrame-Objects)\n", "* [**Plotting Geometries and Points**](#Plotting-Geometries-and-Points)\n", "* [**Spatial Weights**](#Spatial-Weights)\n", "* [**Spatial Lag**](#Spatial-Lag)\n", "* [**Spatial Clustering**](#Spatial-Clustering)\n", "* [**Spatial Regression**](#Spatial-Regression)\n", "* [**Conclusion**](#Conclusion)\n", "* [**Extra Credit**](#Extra-Credit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's load and examine our data. Since it comes in a simple CSV file, we can read it straight into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 353, "metadata": { "collapsed": true }, "outputs": [], "source": [ "listings = pd.read_csv(\"../input/listings.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
[Output residue: preview of the first five rows (index 0 to 4) of the listings DataFrame. Visible columns include space, description, experiences_offered, host_listings_count, and host_total_listings_count; cell values are truncated.]
This means that it's easy for us to, say, plot every BnB location on a map: "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(listings['longitude'], listings['latitude'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chances are you've done this before, and it's a perfectly adequate way to get started working with locations. In this plot we see...not much, really. If you're intimately familiar with the layout of Boston, you can probably make sense of some of these clusters. To me, not being from the city, they are totally mysterious.\n", "\n", "In other words, this plot is missing something important: **geospatial context**.\n", "\n", "Additionally, this display is **unprojected**: it's drawn in raw coordinates. The ground distance covered by one degree of longitude shrinks as you move away from the equator, while a degree of latitude stays roughly constant. At Boston's latitude (about 42°N), a degree of longitude covers only about three-quarters of the distance a degree of latitude does, so this naive plot can distort distances noticeably.\n", "\n", "We'll come back to the projection issue later; for now, there's an easy fix for both of these problems.\n", "\n", "Enter [mplleaflet](https://github.com/jwass/mplleaflet). mplleaflet is a tool that takes a matplotlib plot drawn in longitude/latitude coordinates and places it on top of a [leaflet](http://leafletjs.com/) slippy map. The best part is that it takes just one additional line of code: call `mplleaflet.display()` after generating your plot to drop it inline in your Jupyter notebook:" ] }, { "cell_type": "code", "execution_count": 526, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "