{ "metadata": { "name": "", "signature": "sha256:f08f221bc04c629a5c179e2646a4b63cb5d87381ff3ffa56b3a0ab1bbf513639" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Python Lab" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introduction to python for scientific computing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import pysal as ps\n", "import matplotlib.pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "`pandas`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data structures to read, interact, transform and write structured data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Read your data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "db = pd.read_csv('../workshop_data/Houston_pop00.csv')\n", "db.info()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Explore it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "db.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "db.tail()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "db.describe().T" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Advanced subsetting:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "downtown = db[db['dcbd'] < 15]\n", "downtown.info()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And many more operations on tabular (multi-)indexed data. Check the [documentation](http://pandas.pydata.org) for more info and tutorials." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "`numpy`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`numpy` and `scipy` are the foundational libraries for any kind of numeric computing in Python. Numpy offers the efficient matrix structure denominated `array` or `ndarray` (Numpy data array) as well as some basic statistical functions that may be applied to arrays.\n", "\n", "To whet your appetite, let's first create a simple array. You can do this from a pre-existing Python list, for example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "l = [1, 2, 3, 4]\n", "a = np.array(l)\n", "a" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first sight, `a` is not very different from `l`; however, under the hood, it provides much more efficient structures for data manipulation (including C-optimized functions and other performance enhancements). Arrays contain only one data type and may have several dimensions, opening up the door for very fancy matrix manipulation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print type(a[0])\n", "l += 'a'\n", "a = np.array(l)\n", "print type(a[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "l = [[1, 2], [3, 4], [5, 6]]\n", "a = np.array(l)\n", "print 'Array a has a dimension of: ', a.shape\n", "print a" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`numpy` supports operations betwee arrays, such as sumation, difference, multiplication and division:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = np.random.random((3, 2))\n", "b = np.random.random((2, 3))\n", "a, b" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sum (note the transpose for dimensionality alignment)\n", "a + b.T" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Difference (note the transpose for dimensionality alignment)\n", "a - b.T" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Matrix product\n", "c = np.dot(a, b)\n", "c" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Matrix division by a scalar\n", "c = a / 2.\n", "c" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scipy` is the sister library of Numpy and offers a wide arrange of statistical functions to operate on Numpy arrays. This provides much of the functionality that you find in the core packages of other statistical languages like R (in the r-base package) or Matlab.\n", "\n", "Besides the core of scipy, the project also includes additional packages called `scikits` that expand the main functionality in some particular way. Check out the [scikits website](http://scikits.appspot.com) to get a sense of what is covered.\n", "\n", "`pandas` heavily relies on `numpy` under the hood, and it inherits many of its capabilities. For example, we can operate on vectors as one would do on `numpy` arrays:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "db['pop_dens'] = db['POP00'] / db['ALAND']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "`matplotlib`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`matplotlib` is the main tool for static graphical display in Python. It provides 2D and 3D functionality to plot data in a static way. The library may not appear as very intuitive at first, but if you get over the learning curve, it is extremely flexible and it allows to tweak every aspect of a figure. Because of its focus on flexibility, the defaults may not be the prettiest, but with some work on them, Matplotlib can create beautiful figures that rival in quality with any other library for static plotting (such as R's `ggplot2`).\n", "\n", "Part of the basic functionality is wrapped around `pandas`, so that is a convenient way to get introduced to the library." ] }, { "cell_type": "code", "collapsed": false, "input": [ "db['pop_dens'].plot(kind='kde')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "db['dcbd'].hist(bins=50, color='k', alpha=0.5, grid=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "db.pop_dens = db['POP00'] / db['ALAND']\n", "db['pop_dens'].describe()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "`PySAL` introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `dbf` files IO:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dbf = ps.open('../workshop_data/houston_tract_pop_emp_wgs84.dbf')\n", "dbf.header" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame({'emp_dens': dbf.by_col('emp_dens'), \\\n", " 'pop_dens': dbf.by_col('pop_dens'), \\\n", " 'dcbd': dbf.by_col('dcbd'), \\\n", " 'downtown': dbf.by_col('downtown')})\n", "df.info()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Spatial Weights\n", "\n", "You can create them from a shapefile:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "w = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "w" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect and explore:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "w.n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "w[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "w.transform = 'R'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "w[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And save into a file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "f = ps.open('houston_tract_queen.gal', 'w')\n", "f.write(w)\n", "f.close()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`PySAL` has a submodule (`pysal.weights.Wsets`) that allows to combine different weights to obtain more sophisticated representations of your geography:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import HTML\n", "HTML('')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us create a `W` matrix that combines contiguity for the Houston tracts but excludes neighbors that are across the boundary of downtown (i.e. even if they are contiguous, they are not taken as neighbors if one is downtown and the other one is in the suburbs." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Downtown/suburbs weights\n", "dt_sb = ps.weights.block_weights(dbf.by_col('downtown'))\n", "dt_sb = ps.weights.regime_weights(dbf.by_col('downtown'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Queen example\n", "queen = ps.queen_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The matrix we want is the result of intersecting the two matrices we just created:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "w = ps.weights.Wsets.w_intersection(queen, dt_sb)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "w.transform = 'R'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find an example to create a spatial weights matrix that combines contiguity and a block structure [here](http://nbviewer.ipython.org/gist/darribas/847138dced15727f9fcf)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Spatial lag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "wy = ps.lag_spatial(w, df['emp_dens'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Basic choropleth mapping" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.contrib.viz import mapping as maps" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'\n", "maps.plot_choropleth(shp_link, df['downtown'], 'unique_values', \\\n", " figsize=(12, 12))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "shp_link = '../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp'\n", "maps.plot_choropleth(shp_link, df['emp_dens'], 'quantiles', figsize=(12, 12))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Global spatial autocorrelation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ESDA tools are contained in the `esda` module." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mi = ps.Moran(df['emp_dens'], w)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mi.I" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Local spatial autocorrelation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lmi = ps.Moran_Local(df['emp_dens'].values, w)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "lmi.p_sim" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "`spreg`" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**OLS**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_base = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_base.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_base.betas" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_base.std_err" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using White correction:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_white = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, \\\n", " robust='white', \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_white.std_err" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**Spatial diagnostics**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_sp_diag = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \\\n", " spat_diag=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_sp_diag.lm_error" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_sp_diag.lm_lag" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_sp_diag.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**OLS + spatial fixed effects**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not by default in `PySAL` but very straightforward with `pandas`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "fes = pd.get_dummies(df['downtown'], prefix='downtown')\n", "fes.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "x = df[['pop_dens', 'dcbd']].join(fes.drop('downtown_0', axis=1))\n", "ols_fe = ps.spreg.OLS(df['emp_dens'].values[:, None], x.values, w, \\\n", " name_x = list(x.columns), name_y='emp_dens', \\\n", " spat_diag=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_fe.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**OLS Regimes**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "downtown = df['downtown'].map({1: 'downtown', 0: 'suburbs'})" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_regimes = ps.spreg.OLS_Regimes(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, downtown, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \\\n", " spat_diag=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_regimes.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**OLS + WX**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df['w_pop_dens'] = ps.lag_spatial(w, df['pop_dens'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_wx = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'w_pop_dens', 'dcbd']].values, \\\n", " name_x = ['pop_dens', 'w_pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_wx.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "**Spatial lag**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Instrumental Variables (IV), as in Kelejian & Prucha (1999):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lag_iv = ps.spreg.GM_Lag(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print lag_iv.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Maximum Likelihood (ML):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lag_ml = ps.spreg.ML_Lag(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \\\n", " method='ord')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print lag_ml.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Spatial error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using GMM proposed by Arraiz et al. (2010):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ps.spreg.GM_Endog_Error_Het?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "error_gmm_arraiz = ps.spreg.GM_Error_Het(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print error_gmm_arraiz.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using GMM proposed by Drukker et al. (2010):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "error_gmm_drucker = ps.spreg.GM_Error_Hom(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print error_gmm_drucker.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Maximum Likelihood:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "error_ml = ps.spreg.ML_Error(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, w=w, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print error_ml.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying a Spatial-HAC correction to the VC matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "wk = ps.kernelW_from_shapefile('../../sdar_mini_repo/data/houston_tract_pop_emp_wgs84.shp', \\\n", " k=15,function='triangular', fixed=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ols_hac = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, \\\n", " name_x = ['pop_dens', 'dcbd'], name_y='emp_dens', \\\n", " robust='hac', gwk=wk)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print ols_hac.summary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Batch example: diagnostics for several weights matrices" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Import the LM tests method\n", "from pysal.spreg import LMtests\n", "#Specify the files for the weights we want to try as a list\n", "ws = [w, queen, dt_sb]\n", "#Run the OLS\n", "model = ps.spreg.OLS(df['emp_dens'].values[:, None], \\\n", " df[['pop_dens', 'dcbd']].values, \\\n", " spat_diag=False, nonspat_diag=False)\n", "#Setup the loop over the weights files\n", "for weights in ws:\n", " lms = LMtests(model, w)\n", " print '\\tLM error: %.4f\\t(%.4f)'%lms.lme\n", " print '\\tLM lag: %.4f\\t(%.4f)'%lms.lml\n", " print '\\tSARMA: %.4f\\t(%.4f)'%lms.sarma\n", " print '\\tRobust LM error: %.4f\\t(%.4f)'%lms.rlme\n", " print '\\tRobust LM lag: %.4f\\t(%.4f)'%lms.rlml\n", " print '----------------\\n'" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }