{ "metadata": { "name": "", "signature": "sha256:b772c21023f29ec67033d3d8e1d03d05209899138ec0847e44c23c03907f6079" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Weather Analysis using MapReduce (Part 4)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "\n", "print 'pandas version: ',pd.__version__\n", "print 'numpy version:',np.__version__\n", "print 'sklearn version:',sk.__version__" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "pandas version: 0.13.1\n", "numpy version: 1.8.1\n", "sklearn version: 0.14.1\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "home_dir='/home/ubuntu/UCSD_BigData'\n", "sys.path.append(home_dir+'/utils')\n", "from find_waiting_flow import *\n", "from AWS_keypair_management import *" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4. Merging partitions using minimum description length (MDL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we computed the mean vector and covariance matrix for each partition and performed the complete PCA for a single partition. \n", "\n", "Our final task is to run the PCA for all partitions and merge them, using the minimum description length (MDL) as the merging criterion.\n", "\n", "The MDL criterion for merging two regions $1$ and $2$ into a new region $3$ is\n", "\n", "$$n_1\cdot k_1+(k_1+1)\cdot(2\times 365)+n_2\cdot k_2+(k_2+1)\cdot(2\times 365) > n_3\cdot k_3+(k_3+1)\cdot(2\times 365)$$\n", "\n", "where $k_i$ is the number of eigenvectors required to explain 99% of the variance in region $i$ and $n_i$ is the number of measurements in region $i$. 
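\n", "\n", "As a purely illustrative check of this inequality, assume hypothetical values (not taken from the weather data) $n_1=1000$, $k_1=5$, $n_2=800$, $k_2=4$ and, assuming the merged region simply pools the measurements, $n_3=n_1+n_2=1800$ with an assumed $k_3=6$:\n", "\n", "$$\underbrace{1000\cdot 5+6\cdot 730}_{9380}+\underbrace{800\cdot 4+5\cdot 730}_{6850}=16230 > 1800\cdot 6+7\cdot 730=15910$$\n", "\n", "Under these assumed numbers the merged description is shorter, so the criterion would favor merging regions $1$ and $2$.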
\n", "\n", "In order to do that, we will proceed as follows:\n", "\n", "1. Find neighbors for each partition\n", "2. Store neighbor information in a graph\n", "3. Merge partitions based on MDL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find neighbors for each partition ###" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "\n", "dfFinal = pickle.load(open('finalTable.pkl', 'rb'))\n", "\n", "dfBounds = dfFinal.ix[:,['partitionID','lon_min','lon_max','lat_min','lat_max']]\n", "dfBounds = dfBounds.drop_duplicates('partitionID').set_index('partitionID')\n", "dfBounds.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>lon_min</th><th>lon_max</th><th>lat_min</th><th>lat_max</th></tr>\n", "    <tr><th>partitionID</th><th></th><th></th><th></th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>7126</th><td>37.1499</td><td>47.7500</td><td>41.01655</td><td>42.05000</td></tr>\n", "    <tr><th>3565</th><td>47.7500</td><td>69.2335</td><td>40.63300</td><td>42.05000</td></tr>\n", "    <tr><th>7125</th><td>37.1499</td><td>47.7500</td><td>40.70850</td><td>41.01655</td></tr>\n", "    <tr><th>3563</th><td>47.7500</td><td>71.5250</td><td>39.43350</td><td>40.63300</td></tr>\n", "    <tr><th>3560</th><td>44.1250</td><td>47.7500</td><td>39.43350</td><td>40.70850</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>5 rows \u00d7 4 columns</p>\n", "</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " lon_min lon_max lat_min lat_max\n", "partitionID \n", "7126 37.1499 47.7500 41.01655 42.05000\n", "3565 47.7500 69.2335 40.63300 42.05000\n", "7125 37.1499 47.7500 40.70850 41.01655\n", "3563 47.7500 71.5250 39.43350 40.63300\n", "3560 44.1250 47.7500 39.43350 40.70850\n", "\n", "[5 rows x 4 columns]" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two partitions are neighbours if\n", "\n", "1. they share at least one common longitude or latitude as a border \n", "2. the shared borders overlap\n", "\n", "The following generator checks these two criterion for a given partition with respect to all other partitions and returns the neighbors." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_neighbors(partitionID):\n", " bounds0 = np.array(dfBounds.ix[partitionID,:])\n", " for p in dfBounds.index.values:\n", " bounds = np.array(dfBounds.ix[p,['lon_min','lon_max','lat_min','lat_max']])\n", " #Check whether the two partitions share a median\n", " sharedlon = any([i in bounds0[:2] for i in bounds[:2]])\n", " sharedlat = any([i in bounds0[2:] for i in bounds[2:]])\n", " if sharedlon:\n", " #Check whether the bounds of two partitions overlap\n", " if bounds[2]