{ "metadata": { "name": "", "signature": "sha256:b772c21023f29ec67033d3d8e1d03d05209899138ec0847e44c23c03907f6079" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Weather Analysis using MapReduce (Part 4)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "\n", "print 'pandas version: ',pd.__version__\n", "print 'numpy version:',np.__version__\n", "print 'sklearn version:',sk.__version__" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "pandas version: 0.13.1\n", "numpy version: 1.8.1\n", "sklearn version: 0.14.1\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "home_dir='/home/ubuntu/UCSD_BigData'\n", "sys.path.append(home_dir+'/utils')\n", "from find_waiting_flow import *\n", "from AWS_keypair_management import *" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4. Merging partitions using minimum description length (MDL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we computed the mean vector and covariance matrix for each partition and performed the complete PCA for a single partition.\n", "\n", "Our final task is to run the PCA for all partitions and merge them, using the minimum description length (MDL) as the criterion.\n", "\n", "The MDL criterion for merging two regions $1$ and $2$ into a new region $3$ is\n", "\n", "$$n_1\\cdot k_1+(k_1+1)\\cdot(2\\times 365)+n_2\\cdot k_2+(k_2+1)\\cdot(2\\times 365) > n_3\\cdot k_3+(k_3+1)\\cdot(2\\times 365)$$\n", "\n", "where $k_i$ is the number of eigenvectors required to explain 99% of the variance in region $i$ and $n_i$ is the number of measurements in region $i$. In other words, we merge when describing the two regions separately costs more than describing them jointly.
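\n", "\n", "As a quick sanity check of the criterion, here is a worked example with made-up numbers (they are assumptions for illustration only, not values from our data). Suppose $n_1=1000$, $k_1=5$, $n_2=800$, $k_2=4$, and the merged region has $n_3=1800$ and $k_3=6$. Then\n", "\n", "$$1000\\cdot 5+6\\cdot 730+800\\cdot 4+5\\cdot 730 = 16230 > 1800\\cdot 6+7\\cdot 730 = 15910$$\n", "\n", "so the criterion is satisfied for these hypothetical values. Had the merged region needed $k_3=7$ eigenvectors, the right-hand side would be $1800\\cdot 7+8\\cdot 730=18440$ and the two regions would stay separate.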
\n", "\n", "To do that, we proceed as follows:\n", "\n", "1. Find neighbors for each partition\n", "2. Store neighbor information in a graph\n", "3. Merge partitions based on MDL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find neighbors for each partition ###" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "\n", "# Load the pickled table and close the file handle afterwards\n", "with open('finalTable.pkl', 'rb') as f:\n", "    dfFinal = pickle.load(f)\n", "\n", "# Keep one row of bounding-box coordinates per partition\n", "dfBounds = dfFinal[['partitionID','lon_min','lon_max','lat_min','lat_max']]\n", "dfBounds = dfBounds.drop_duplicates('partitionID').set_index('partitionID')\n", "dfBounds.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n",
"<tr><th>partitionID</th><th>lon_min</th><th>lon_max</th><th>lat_min</th><th>lat_max</th></tr>\n",
"<tr><th>7126</th><td>37.1499</td><td>47.7500</td><td>41.01655</td><td>42.05000</td></tr>\n",
"<tr><th>3565</th><td>47.7500</td><td>69.2335</td><td>40.63300</td><td>42.05000</td></tr>\n",
"<tr><th>7125</th><td>37.1499</td><td>47.7500</td><td>40.70850</td><td>41.01655</td></tr>\n",
"<tr><th>3563</th><td>47.7500</td><td>71.5250</td><td>39.43350</td><td>40.63300</td></tr>\n",
"<tr><th>3560</th><td>44.1250</td><td>47.7500</td><td>39.43350</td><td>40.70850</td></tr>\n",
"</table>\n",
"<p>5 rows \u00d7 4 columns</p>\n", "