{
 "metadata": {
  "name": "",
  "signature": "sha256:b772c21023f29ec67033d3d8e1d03d05209899138ec0847e44c23c03907f6079"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Weather Analysis using MapReduce (Part 4)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import sklearn as sk\n",
      "\n",
      "print 'pandas version: ',pd.__version__\n",
      "print 'numpy version:',np.__version__\n",
      "print 'sklearn version:',sk.__version__"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "pandas version:  0.13.1\n",
        "numpy version: 1.8.1\n",
        "sklearn version: 0.14.1\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys\n",
      "home_dir='/home/ubuntu/UCSD_BigData'\n",
      "sys.path.append(home_dir+'/utils')\n",
      "from find_waiting_flow import *\n",
      "from AWS_keypair_management import *"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "4. Merging partitions using minimum description length (MDL)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the previous section, we computed the mean vector and covariance matrix for each partition and performed the complete PCA for a single partition. \n",
      "\n",
      "Our final task is to run the PCA for all partitions and merge them, using the minimum description length (MDL) as the criterion.\n",
      "\n",
      "The MDL criterion for merging two regions $1$ and $2$ into a new region $3$ is\n",
      "\n",
      "$$n_1\\cdot k_1+(k_1+1)\\cdot(2\\times 365)+n_2\\cdot k_2+(k_2+1)\\cdot(2\\times 365) > n_3\\cdot k_3+(k_3+1)\\cdot(2\\times 365)$$\n",
      "\n",
      "where $k_i$ is the number of eigenvectors required to explain 99% of the variance in region $i$ and $n_i$ is the number of measurements in region $i$. We merge when the inequality holds, that is, when describing the merged region costs less than describing the two regions separately.\n",
      "\n",
      "In order to do that, we will proceed as follows:\n",
      "\n",
      "1. Find neighbors for each partition\n",
      "2. Store neighbor information in a graph\n",
      "3. Merge partitions based on MDL"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Find neighbors for each partition ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pickle\n",
      "\n",
      "dfFinal = pickle.load(open('finalTable.pkl', 'rb'))\n",
      "\n",
      "dfBounds = dfFinal.ix[:,['partitionID','lon_min','lon_max','lat_min','lat_max']]\n",
      "dfBounds = dfBounds.drop_duplicates('partitionID').set_index('partitionID')\n",
      "dfBounds.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>lon_min</th>\n",
        "      <th>lon_max</th>\n",
        "      <th>lat_min</th>\n",
        "      <th>lat_max</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>partitionID</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>7126</th>\n",
        "      <td> 37.1499</td>\n",
        "      <td> 47.7500</td>\n",
        "      <td> 41.01655</td>\n",
        "      <td> 42.05000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3565</th>\n",
        "      <td> 47.7500</td>\n",
        "      <td> 69.2335</td>\n",
        "      <td> 40.63300</td>\n",
        "      <td> 42.05000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7125</th>\n",
        "      <td> 37.1499</td>\n",
        "      <td> 47.7500</td>\n",
        "      <td> 40.70850</td>\n",
        "      <td> 41.01655</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3563</th>\n",
        "      <td> 47.7500</td>\n",
        "      <td> 71.5250</td>\n",
        "      <td> 39.43350</td>\n",
        "      <td> 40.63300</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3560</th>\n",
        "      <td> 44.1250</td>\n",
        "      <td> 47.7500</td>\n",
        "      <td> 39.43350</td>\n",
        "      <td> 40.70850</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 4 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "             lon_min  lon_max   lat_min   lat_max\n",
        "partitionID                                      \n",
        "7126         37.1499  47.7500  41.01655  42.05000\n",
        "3565         47.7500  69.2335  40.63300  42.05000\n",
        "7125         37.1499  47.7500  40.70850  41.01655\n",
        "3563         47.7500  71.5250  39.43350  40.63300\n",
        "3560         44.1250  47.7500  39.43350  40.70850\n",
        "\n",
        "[5 rows x 4 columns]"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Two partitions are neighbors if\n",
      "\n",
      "1. they share at least one common longitude or latitude as a border \n",
      "2. their extents along that shared border overlap\n",
      "\n",
      "The following generator checks these two criteria for a given partition against all other partitions and yields its neighbors."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def find_neighbors(partitionID):\n",
      "    cols = ['lon_min','lon_max','lat_min','lat_max']\n",
      "    bounds0 = np.array(dfBounds.ix[partitionID,cols])\n",
      "    for p in dfBounds.index.values:\n",
      "        #A partition is not its own neighbor\n",
      "        if p == partitionID:\n",
      "            continue\n",
      "        bounds = np.array(dfBounds.ix[p,cols])\n",
      "        #Check whether the two partitions share a median as a border\n",
      "        sharedlon = any([i in bounds0[:2] for i in bounds[:2]])\n",
      "        sharedlat = any([i in bounds0[2:] for i in bounds[2:]])\n",
      "        #Check whether the latitude / longitude ranges overlap in more than a single point\n",
      "        latoverlap = bounds[2]<bounds0[3] and bounds0[2]<bounds[3]\n",
      "        lonoverlap = bounds[0]<bounds0[1] and bounds0[0]<bounds[1]\n",
      "        #Neighbors share a border in one dimension and overlap in the other\n",
      "        if (sharedlon and latoverlap) or (sharedlat and lonoverlap):\n",
      "            yield int(p)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For example, partition 3560 has the following neighbors:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "neighbours = list(find_neighbors(3560))\n",
      "print neighbours"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[3565, 7125, 3563]\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With the help of this generator, we build the neighbor graph as an adjacency list: a Python dictionary that maps each partition ID to the list of its neighbors."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "partitions = dfBounds.index.values\n",
      "print partitions"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[7126.0 3565.0 7125.0 ..., 3847.0 7693.0 3841.0]\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "\n",
      "g = dict()\n",
      "for i, p in enumerate(partitions):\n",
      "    #Print progress every 100 partitions\n",
      "    if i%100==0:\n",
      "        print i\n",
      "    g[int(p)] = list(find_neighbors(p))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0\n",
        "100"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "200"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "300"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "400"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "500"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "600"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "700"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "800"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "900"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1000"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1100"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1200"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1300"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1400"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1500"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1600"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1700"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1800"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1900"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "2000"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "2100"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "CPU times: user 1h 5min 31s, sys: 2.27 s, total: 1h 5min 33s"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Wall time: 1h 16min 6s\n"
       ]
      }
     ],
     "prompt_number": 107
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Save partition graph as pickle file\n",
      "with open('partition_graph.pkl','wb') as f:\n",
      "    pickle.dump(g, f, pickle.HIGHEST_PROTOCOL)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 108
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "g = pickle.load(open('partition_graph.pkl', 'rb'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print g.items()[90:100] "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(2135, [2129, 2124, 2118, 2137]), (2136, [2138, 2139, 2141]), (2137, [2135, 2124, 2126]), (2138, [2141, 2136]), (2139, [2141, 2136]), (2140, [2337, 2335, 2134, 2142]), (2141, [2138, 2139, 2136]), (2142, [2343, 2337, 2140, 2164]), (2143, [2078, 2121, 2076, 2102, 2100, 2145]), (2144, [2149, 2147, 2146])]\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Merging partitions ###"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us recall that the covariance matrix of an $M\\times N$ matrix $X$ can be computed via\n",
      "\n",
      "$$ Cov(X) = \\frac{1}{N}\\sum^N_{i=1} (\\boldsymbol{x_i}-\\boldsymbol{\\mu}) (\\boldsymbol{x_i}-\\boldsymbol{\\mu})^T \n",
      "= \\frac{1}{N} \\sum^N_{i=1} \\boldsymbol{x_i}\\boldsymbol{x_i}^T - \\boldsymbol{\\mu}\\boldsymbol{\\mu}^T $$\n",
      "\n",
      "where $\\boldsymbol{x_i}, i = 1,\\dots,N$ are the column vectors of matrix $X$ and \n",
      "\n",
      "$$\\boldsymbol{\\mu}=\\frac{1}{N}\\sum^N_{i=1} \\boldsymbol{x_i}$$\n",
      "\n",
      "is the mean vector of all column vectors of matrix $X$.\n",
      "\n",
      "Our goal now is to calculate the covariance matrix of a merged region given that we know the covariance matrices and mean vectors of the partitions to be merged.\n",
      "\n",
      "Suppose we are given two matrices $X$ and $Y$ with dimension $M\\times N_x$ and $M\\times N_y$, respectively. Furthermore, we know their mean vectors $\\boldsymbol{\\mu_x}, \\boldsymbol{\\mu_y}$, as well as their covariances $Cov(X)$ and $Cov(Y)$. Now, consider the concatenated matrix $Z = [X,Y]$ with dimension $M\\times N_z$, where $N_z=N_x+N_y$. In our case, $Z$ represents the measurements of the merged region.\n",
      "\n",
      "Rearranging the identity above gives $\\sum^{N_x}_{i=1} \\boldsymbol{x_i}\\boldsymbol{x_i}^T = N_x(Cov(X)+\\boldsymbol{\\mu_x}\\boldsymbol{\\mu_x}^T)$, and likewise for $Y$. The covariance of $Z$ can then be expressed as\n",
      "\n",
      "$$ Cov(Z) = \\frac{1}{N_z} \\sum^{N_z}_{i=1} \\boldsymbol{z_i}\\boldsymbol{z_i}^T - \\boldsymbol{\\mu_z}\\boldsymbol{\\mu_z}^T\n",
      "= \\frac{1}{N_x+N_y} [ \\sum^{N_x}_{i=1} \\boldsymbol{x_i}\\boldsymbol{x_i}^T + \\sum^{N_y}_{i=1} \\boldsymbol{y_i}\\boldsymbol{y_i}^T ] - \\boldsymbol{\\mu_z}\\boldsymbol{\\mu_z}^T $$\n",
      "$$ = \\frac{1}{N_x+N_y} [ N_x(Cov(X)+\\boldsymbol{\\mu_x}\\boldsymbol{\\mu_x}^T) + N_y(Cov(Y)+\\boldsymbol{\\mu_y}\\boldsymbol{\\mu_y}^T)] - \\boldsymbol{\\mu_z}\\boldsymbol{\\mu_z}^T $$\n",
      "\n",
      "where\n",
      "\n",
      "$$ \\boldsymbol{\\mu_z}=\\frac{N_x\\boldsymbol{\\mu_x}+N_y\\boldsymbol{\\mu_y}}{N_x+N_y} $$\n",
      "\n",
      "is the mean vector of matrix $Z$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dfCov = pickle.load(open('covTable.pkl', 'rb'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### To be continued... ###"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}