{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Leveraging Open-Source Python Packages for Data Analysis within the ArcGIS Environment (Direct Integration Strategy)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Using NumPy as the common denominator\n", "\n", "- Could use the ArcPy Data Access Module directly, but there are host of issues/information one must take into account:\n", " * How to deal with projections and other environment settings?\n", " * How Cursors affect the accounting of features?\n", " * How to deal with bad records/bad data and error handling?\n", " * How to honor/account for full field object control?\n", " * How do I create output features that correspond to my inputs?\n", " - Points are easy, what about Polygons and Polylines?\n", "- Spatial Statistics Data Object (SSDataObject)\n", " * Almost 30 Spatial Statistics Tools written in Python that ${\\bf{must}}$ behave like traditional GP Tools\n", " * Use SSDataObject and your code should adhere" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## The Data Analysis Python Modules\n", "\n", "- [PANDAS (Python Data Analysis Library)](http://pandas.pydata.org/)\n", " \n", "- [SciPy (Scientific Python)](http://www.scipy.org/)\n", "\n", "- [PySAL (Python Spatial Analysis Library)](https://geodacenter.asu.edu/pysal)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Basic Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import arcpy as ARCPY\n", "import numpy as NUM\n", "import SSDataObject as SSDO\n", "import scipy as SCIPY\n", "import pandas as PANDA\n", "import pysal as PYSAL" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Initialize Data Object and Query Attribute Fields" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FID OID\n", "POP1999 Integer\n", "POP1998 Integer\n", "POP1997 Integer\n", "POP1996 Integer\n", "POP1995 Integer\n", "POP1994 Integer\n", "POP1993 Integer\n", "POP1992 Integer\n", "POP1991 Integer\n", "POP1990 Integer\n", "POPDEN70 Double\n", "SOUTH Integer\n", "AMERI_ES Double\n", "MED_AGE_F Double\n", "MULT_RACE Double\n", "HAWN_PI Double\n", "PERCVACANT Double\n", "POP2009 Integer\n", "NEW_NAME String\n", "POP1988 Integer\n", "POP1989 Integer\n", "POP1984 Integer\n", "POP1985 Integer\n", "POP1986 Integer\n", "POP1987 Integer\n", "POP1980 Integer\n", "POP1981 Integer\n", "POP1982 Integer\n", "POP1983 Integer\n", "PCR1991 Double\n", "PCR1990 Double\n", "PCR1993 Double\n", "GROWTH Double\n", "PCR1995 Double\n", "PCR1994 Double\n", "PCR1997 Double\n", "PCR1996 Double\n", "PCR1999 Double\n", "PCR1998 Double\n", "POV1979 Integer\n", "NEW_FIPS String\n", "SHAPE_LENG Double\n", "SHAPE_AREA Double\n", "AGE_UNDER5 Double\n", "AGE_18_21 Double\n", "PCR1988 Double\n", "PCR1989 Double\n", "PCR1986 Double\n", "PCR1987 Double\n", "PCR1984 Double\n", "PCR1985 Double\n", "PCR1982 Double\n", "PCR1983 Double\n", "PCR1980 Double\n", "PCR1981 Double\n", "BIRTH1970 Integer\n", "HOUSEHOLDS Double\n", "PERCPOV69 Double\n", "POP2010 Integer\n", "CRUDEDEATH Double\n", "AGE_65_UP Double\n", "POV2009 Integer\n", "BEA_6 Integer\n", "AVG_SALE97 Double\n", "VACANT Double\n", "GR_69_02 Double\n", "SOCAL SmallInteger\n", "POP2000 Integer\n", "POP2001 Integer\n", "POP2002 Integer\n", "POP2003 Integer\n", "POP2004 Integer\n", "POP2005 Integer\n", "POP2006 Integer\n", "POP2007 Integer\n", "POP2008 Integer\n", "PCR2010 Double\n", "CRUDEBIRTH Double\n", "LOGPCR69 Double\n", "BEA_4 Double\n", "MYID Integer\n", "AGE_22_29 Double\n", "BEA_1 Integer\n", "BEA_2 Integer\n", "BEA_3 Integer\n", "FEMALES Double\n", "BEA_5 Double\n", "RENTER_OCC Double\n", "HISPANIC Double\n", "BEA_8 Integer\n", "PCR1971 Double\n", "PCR1973 Double\n", "PCR1972 Double\n", "SHAPE Geometry\n", "PCR1970 Double\n", "PCR1977 Double\n", "PCR1976 Double\n", "PCR1975 Double\n", "PCR1974 Double\n", "POINT_X Double\n", "PCR1979 Double\n", "PCR1978 Double\n", "ASIAN Double\n", "PERCNOHS Double\n", "POV1989 Integer\n", "STATE String\n", "STATEFIP String\n", "AGE_40_49 Double\n", "MED_AGE Double\n", "BEA_7 Integer\n", "POINT_Y Double\n", "OWNER_OCC Double\n", "PCR1969 Double\n", "BEA_NAME String\n", "POP1975 Integer\n", "POP1974 Integer\n", "POP1977 Integer\n", "POP1976 Integer\n", "POP1971 Integer\n", "POP1970 Integer\n", "POP1973 Integer\n", "POP1972 Integer\n", "POP1979 Integer\n", "POP1978 Integer\n", "BLACK Double\n", "POV1969 Integer\n", "TOTVACANT7 Integer\n", "TOTHOUSE70 Integer\n", "MALES Double\n", "PCR2006 Double\n", "PCR2007 Double\n", "PCR2004 Double\n", "PCR2005 Double\n", "PCR2002 Double\n", "PCR2003 Double\n", "PCR2000 Double\n", "PCR2001 Double\n", "MED_AGE_M Double\n", "PCR2008 Double\n", "PCR2009 Double\n", "POP1969 Integer\n", "OTHER Double\n", "WHITE Double\n", "AVE_HH_SZ Double\n", "AGE_30_39 Double\n", "TESTZVAR Double\n", "AGE_50_64 Double\n", "DEATH1970 Integer\n", "PCR1992 Double\n", "POV1999 Integer\n" ] } ], "source": [ "inputFC = r'../data/CA_Polygons.shp'\n", "ssdo = SSDO.SSDataObject(inputFC)\n", "for fieldName, fieldObject in ssdo.allFields.iteritems():\n", " print fieldName, fieldObject.type\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Select Fields to Read Into NumPy Arrays \n", "- The Unique ID Field (Object ID in this example) will keep track of the order of your features\n", " * You have no control over Object ID Fields. It is quick, assures \"uniqueness\", but can't assume they will not get \"scrambled\" during copies.\n", " * To assure full control I advocate the \"Add Field (LONG)\" --> \"Calculate Field (From Object ID)\" workflow.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "ssdo.obtainData(ssdo.oidName, ['GROWTH', 'PCR1970', 'POPDEN70', 'PERCNOHS'])\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 5.56947324e-01 2.68032831e-04 7.75864375e-03 2.38301421e-02\n", " 5.10823361e-03]\n" ] } ], "source": [ "popInfo = ssdo.fields['POPDEN70']\n", "popData = popInfo.data\n", "print popData[0:5]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Adding Results to Input/Output\n", "- Example: Adding a field of random standard normal values to your input/output" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Create a Dictionary of Candidate Fields" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import numpy.random as RAND\n", "ARCPY.env.overwriteOutput = True\n", "outArray = RAND.normal(0,1, (ssdo.numObs,))\n", "outDict = {}\n", "outField = SSDO.CandidateField('STDNORM', 'DOUBLE', outArray, alias = 'My Standard Normal Result')\n", "outDict[outField.name] = outField\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Add New Field to Input\n", "- Be Carefull!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "ssdo.addFields2FC(outDict)\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Copy Features, Selected Attribute Field(s), New Result Field(s) to Output" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import os as OS\n", "outputFC = OS.path.abspath(r'../data/testMyOutput.shp')\n", "ssdo.output2NewFC(outputFC, outDict, appendFields = ['GROWTH', 'PERCNOHS'])\n", "del ssdo" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Getting More Advanced - SciPy and PANDAS" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NEW_NAME PCR1975 PCR1980 PCR1985 PCR1990 PCR1995 PCR2000 PCR2005 \\\n", "158 Alameda 1.169255 1.195712 1.200988 1.165406 1.158115 1.307115 1.248997 \n", "159 Alpine 0.844546 0.906803 0.855655 0.924508 0.820581 0.949886 0.930033 \n", "160 Amador 0.991467 0.963228 0.921839 0.823639 0.815521 0.814954 0.864324 \n", "161 Butte 0.910668 0.898385 0.817796 0.794387 0.773955 0.763665 0.790418 \n", "162 Calaveras 0.941372 0.875469 0.891595 0.870938 0.806776 0.867385 0.880388 \n", "\n", " PCR2010 SOCAL \n", "158 1.206422 0 \n", "159 1.079837 0 \n", "160 0.886305 0 \n", "161 0.816018 0 \n", "162 0.877746 0 \n" ] } ], "source": [ "ssdo = SSDO.SSDataObject(inputFC)\n", "years = NUM.arange(1975, 2015, 5)\n", "fieldNames = ['PCR' + str(i) for i in years]\n", "fieldNamesAll = fieldNames + ['NEW_NAME', 'SOCAL']\n", "ssdo.obtainData(\"MYID\", fieldNamesAll)\n", "ids = [ssdo.order2Master[i] for i in xrange(ssdo.numObs)]\n", "convertDictDF = {}\n", "for fieldName, fieldObject in ssdo.fields.iteritems():\n", " convertDictDF[fieldName] = fieldObject.data\n", "df = PANDA.DataFrame(convertDictDF, index = ids)\n", "print df[0:5]\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Using GroupBy for Conditional Statistics\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Example: One Liner for Average Incomes Based on Southern/Non-Southern California" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " PCR1975 PCR1980 PCR1985 PCR1990 PCR1995 PCR2000 PCR2005 PCR2010\n", "SOCAL \n", "0 1.075988 1.063532 0.978680 0.946142 0.932030 0.970583 0.971438 0.977341\n", "1 1.076797 1.097469 1.051641 1.016739 0.952055 0.943493 0.986927 0.954825\n" ] } ], "source": [ "groups = df.groupby('SOCAL')\n", "print groups.mean()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Now the Median..." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " PCR1975 PCR1980 PCR1985 PCR1990 PCR1995 PCR2000 PCR2005 PCR2010\n", "SOCAL \n", "0 1.015875 1.002767 0.902323 0.871296 0.841908 0.837941 0.862593 0.887863\n", "1 1.071270 1.093269 1.076200 1.012586 0.960921 0.965993 1.015992 1.013383\n" ] } ], "source": [ "print groups.median()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Example: Calculating the Trend of Rolling Means" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "pcr = df.ix[:,1:9]\n", "rollMeans = NUM.apply_along_axis(PANDA.rolling_mean, 1, pcr, 4)\n", "timeInts = NUM.arange(0, 5)\n", "outArray = NUM.empty((ssdo.numObs, 5), float)\n", "for i in xrange(ssdo.numObs):\n", " outArray[i] = SCIPY.stats.linregress(timeInts, rollMeans[i,3:])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Write to Output (Same as Always...)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "outputFC = OS.path.abspath(r'../data/testMyRollingMeanInfo.shp')\n", "outFields = [ \"SLOPE\", \"INTERCEPT\", \"R_SQRAURED\", \"P_VALUE\", \"STD_ERR\" ]\n", "outDict = {}\n", "for fieldInd, fieldName in enumerate(outFields):\n", " outDict[fieldName] = SSDO.CandidateField(fieldName, \"DOUBLE\", outArray[:,fieldInd])\n", "ssdo.output2NewFC(outputFC, outDict, fieldOrder = outFields)\n", "del ssdo" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Even More Advanced: PySAL \n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Example: Max(p) Regional Clustering" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 [7, 11, 52, 44, 22, 51, 17, 5, 24, 16, 3, 46, 8, 45, 10, 4, 54, 57, 50, 21, 9]\n", "1 [1, 2, 33, 47, 56, 25, 13, 37, 27, 30, 28, 31]\n", "2 [36, 32]\n", "3 [41, 55, 29]\n", "4 [15, 23, 53, 34, 14, 49, 19, 38]\n", "5 [40, 0, 6, 42]\n", "6 [26, 39, 43]\n", "7 [18]\n", "8 [20, 48]\n", "9 [12, 35]\n" ] } ], "source": [ "ssdo = SSDO.SSDataObject(inputFC)\n", "ssdo.obtainData(ssdo.oidName, ['GROWTH', 'POP1970', 'PERCNOHS'])\n", "w = PYSAL.weights.knnW(ssdo.xyCoords, k=5)\n", "X = NUM.empty((ssdo.numObs,2), float)\n", "X[:,0] = ssdo.fields['GROWTH'].data\n", "X[:,1] = ssdo.fields['PERCNOHS'].data\n", "floorVal = 1000000.0\n", "floorVar = ssdo.fields['POP1970'].returnDouble()\n", "maxp = PYSAL.region.Maxp(w, X, floorVal, floor_variable = floorVar)\n", "outArray = NUM.empty((ssdo.numObs,), int)\n", "for regionID, orderIDs in enumerate(maxp.regions):\n", " outArray[orderIDs] = regionID\n", " print regionID, orderIDs\n", " " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "outputFC = OS.path.abspath(r'../data/testMaxPInfo.shp')\n", "outDict = {}\n", "outDict[\"REGIONID\"] = SSDO.CandidateField(\"REGIONID\", \"DOUBLE\", outArray)\n", "ssdo.output2NewFC(outputFC, outDict, appendFields = ['GROWTH', 'POP1970', 'PERCNOHS'])\n", "del ssdo" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }