{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Split-Apply-Combine Pattern in Data Science and Python\n", "\n", "## Tobias Brandt\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Google trends chart\n", "\n", "![\"data science\" vs \"data analysis\"](img/data_science_vs_data_analysis.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Data Science\n", "\n", "According to https://en.wikipedia.org/wiki/Data_science:\n", "\n", "In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled **\"Statistics = Data Science?\"**[5] for his appointment to the H. C. Carver Professorship at the University of Michigan.[6] In this lecture, he characterized statistical work as a trilogy of **data collection**, **data modeling and analysis**, and **decision making**. In his conclusion, he initiated the modern, non-computer science, usage of the term \"data science\" and advocated that statistics be renamed data science and statisticians data scientists.[5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The Github Archive Dataset\n", "\n", "https://www.githubarchive.org/\n", "\n", "Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.\n", "\n", "GitHub provides 20+ event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client:\n", "\n", " * gzipped json files\n", " * yyyy-mm-dd-HH.json.gz" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import os\n", "import gzip\n", "import ujson as json\n", "\n", "directory = 'data/github_archive'\n", "filename = '2015-01-29-16.json.gz'\n", "\n", "path = os.path.join(directory, filename)\n", "with gzip.open(path) as f:\n", " events = [json.loads(line) for line in f]\n", "#print json.dumps(events[0], indent=4)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n",
    "{\n",
    "    \"payload\": {\n",
    "        \"master_branch\": \"master\", \n",
    "        \"ref_type\": \"branch\", \n",
    "        \"ref\": \"disable_dropdown\", \n",
    "        \"description\": \"OOI UI Source Code\", \n",
    "        \"pusher_type\": \"user\"\n",
    "    }, \n",
    "    \"created_at\": \"2015-01-29T16:00:00Z\", \n",
    "    \"actor\": {\n",
    "        \"url\": \"https://api.github.com/users/birdage\", \n",
    "        \"login\": \"birdage\", \n",
    "        \"avatar_url\": \"https://avatars.githubusercontent.com/u/547228?\", \n",
    "        \"id\": 547228, \n",
    "        \"gravatar_id\": \"\"\n",
    "    }, \n",
    "    \"id\": \"2545235518\", \n",
    "    \"repo\": {\n",
    "        \"url\": \"https://api.github.com/repos/birdage/ooi-ui\", \n",
    "        \"id\": 23796192, \n",
    "        \"name\": \"birdage/ooi-ui\"\n",
    "    }, \n",
    "    \"type\": \"CreateEvent\", \n",
    "    \"public\": true\n",
    "}\n",
    "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Typical Questions\n", "\n", " * How many Github repositories are created per hour/day/month?\n", " * To which repositories are the most commits are pushed per hour/day/month?\n", " * Which projects receive the most pull requests?\n", " * What are the most popular languages on Github?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example 1 - Number of Repositories Created" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "new_repo_count = 0\n", "for event in events:\n", " new_repo_count += \\\n", " 1 if event['type']==\"CreateEvent\" else 0" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3516\n" ] } ], "source": [ "print new_repo_count" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 2 - Number of commits pushed per repository" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "repo_commits = {}\n", "for event in events:\n", " if event['type']==\"PushEvent\":\n", " repo = event['repo']['name']\n", " commits = event['payload']['size']\n", " repo_commits[repo] = \\\n", " repo_commits.get(repo, 0) + commits " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "def print_top_items(dct, N=5):\n", " sorted_items = sorted(\n", " dct.iteritems(), key=lambda t: t[1], reverse=True)\n", " for key, value in sorted_items[:N]:\n", " print \"{:40} {}\".format(key, value)\n", "\n", "print_top_items(repo_commits)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Split-Apply-Combine Pattern\n", "\n", "## Hadley Wickham \n", "\n", "[Hadley Wickham, the man who revolutionized R](http://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/)\n", "\n", "*If you don’t spend much of your time coding in the open-source statistical programming language R, \n", "his name is likely not familiar to you -- but the statistician Hadley Wickham is, \n", "in his own words, “nerd famous.” The kind of famous where people at statistics conferences \n", "line up for selfies, ask him for autographs, and are generally in awe of him. 
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "HTML('')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", " * StackOverflow: [split-apply-combine tag](http://stackoverflow.com/tags/split-apply-combine/info)\n", " * Pandas documentation: [Group By: split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html)\n", " * PyTools documentation: [Split-apply-combine with groupby and reduceby](http://toolz.readthedocs.org/en/latest/streaming-analytics.html#split-apply-combine-with-groupby-and-reduceby)\n", " * Blaze documentation: [Split-Apply-Combine - Grouping](http://blaze.pydata.org/en/stable/split-apply-combine.html)\n", " * R plyr: [plyr: Tools for Splitting, Applying and Combining Data](https://cran.r-project.org/web/packages/plyr/index.html)\n", " * Julia documentation: [The Split-Apply-Combine Strategy](https://dataframesjl.readthedocs.org/en/latest/split_apply_combine.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## The Basic Pattern\n", "\n", " 1. **Split** the data by some **grouping variable**\n", " 2. **Apply** some function to each group **independently**\n", " 3. **Combine** the data into some output dataset" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " * The **apply** step is usually one of\n", " * **aggregate**\n", " * **transform**\n", " * or **filter**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example 2 - examined" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "repo_commits = {}\n", "for event in events:\n", " if event['type']==\"PushEvent\":\n", " repo = event['repo']['name']\n", " commits = event['payload']['size']\n", " repo_commits[repo] = \\\n", " repo_commits.get(repo, 0) + commits \n", "print_top_items(repo_commits)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "This\n", "\n", " * filters out only the \"PushEvent\"s\n", " * **splits** the dataset by *repository*\n", " * **sums** the commits for each group\n", " * **combines** the groups and their sums into a dictionary" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pandas - Python Data Analysis Library\n", "\n", "

\n", "\n", " * Provides high-performance, easy-to-use data structures and data analysis tools.\n", " * Provides core data structure **DataFrame**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### pandas.DataFrame\n", "\n", " * Basically in-memory database tables (or spreadsheets!)\n", " * Tabular data that allows for columns of different dtypes\n", " * Labeled rows and columns (index)\n", " * Hierarchical indexing allows for representing Panel data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
type_userrepocommits
created_at
2015-01-29 16:00:00+00:00CreateEventbirdagebirdage/ooi-uiNaN
2015-01-29 16:00:00+00:00PushEventArniRArniR/ArniR.github.io1
2015-01-29 16:00:00+00:00IssueCommentEventCrossEyeramda/ramdaNaN
2015-01-29 16:00:00+00:00PushEventyluoyuyluoyu/demo1
2015-01-29 16:00:00+00:00IssueCommentEventEJBQprmr/JetUMLNaN
\n", "
" ], "text/plain": [ " type_ user repo \\\n", "created_at \n", "2015-01-29 16:00:00+00:00 CreateEvent birdage birdage/ooi-ui \n", "2015-01-29 16:00:00+00:00 PushEvent ArniR ArniR/ArniR.github.io \n", "2015-01-29 16:00:00+00:00 IssueCommentEvent CrossEye ramda/ramda \n", "2015-01-29 16:00:00+00:00 PushEvent yluoyu yluoyu/demo \n", "2015-01-29 16:00:00+00:00 IssueCommentEvent EJBQ prmr/JetUML \n", "\n", " commits \n", "created_at \n", "2015-01-29 16:00:00+00:00 NaN \n", "2015-01-29 16:00:00+00:00 1 \n", "2015-01-29 16:00:00+00:00 NaN \n", "2015-01-29 16:00:00+00:00 1 \n", "2015-01-29 16:00:00+00:00 NaN " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from collections import namedtuple\n", "GithubEvent = namedtuple('GithubEvent', ['type_', 'user', 'repo', 'created_at', 'commits'])\n", "\n", "def make_record(event):\n", " return GithubEvent(\n", " event['type'], event['actor']['login'], \n", " event['repo']['name'], pd.Timestamp(event['created_at']),\n", " event['payload']['size'] if event['type']=='PushEvent' else np.nan\n", " )\n", "\n", "df = pd.DataFrame.from_records(\n", " (make_record(ev) for ev in events),\n", " columns=GithubEvent._fields, index='created_at')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example 1 (using Pandas) - Number of Repositories Created" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
type_userrepocommits
created_at
2015-01-29 16:00:00+00:00CreateEventbirdagebirdage/ooi-uiNaN
2015-01-29 16:00:02+00:00CreateEventfilipe-maiaLucas-Andrade/ProjectManager_FLMNaN
2015-01-29 16:00:02+00:00CreateEventfilipe-maiaLucas-Andrade/ProjectManager_FLMNaN
2015-01-29 16:00:02+00:00CreateEventfrewsxcvfrewsxcv/gargoyleNaN
2015-01-29 16:00:03+00:00CreateEventschnerebluevisiontec/GoogleShoppingApiNaN
\n", "
" ], "text/plain": [ " type_ user \\\n", "created_at \n", "2015-01-29 16:00:00+00:00 CreateEvent birdage \n", "2015-01-29 16:00:02+00:00 CreateEvent filipe-maia \n", "2015-01-29 16:00:02+00:00 CreateEvent filipe-maia \n", "2015-01-29 16:00:02+00:00 CreateEvent frewsxcv \n", "2015-01-29 16:00:03+00:00 CreateEvent schnere \n", "\n", " repo commits \n", "created_at \n", "2015-01-29 16:00:00+00:00 birdage/ooi-ui NaN \n", "2015-01-29 16:00:02+00:00 Lucas-Andrade/ProjectManager_FLM NaN \n", "2015-01-29 16:00:02+00:00 Lucas-Andrade/ProjectManager_FLM NaN \n", "2015-01-29 16:00:02+00:00 frewsxcv/gargoyle NaN \n", "2015-01-29 16:00:03+00:00 bluevisiontec/GoogleShoppingApi NaN " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.type_=='CreateEvent'].head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3516" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df.type_=='CreateEvent'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example 2 (using Pandas) - Number of commits pushed per repo" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "repo_commits = {}\n", "for event in events:\n", " if event['type']==\"PushEvent\":\n", " repo = event['repo']['name']\n", " commits = event['payload']['size']\n", " repo_commits[repo] = \\\n", " repo_commits.get(repo, 0) + commits \n", "print_top_items(repo_commits)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "repo\n", "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n", "Name: commits, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo_commits = df[df.type_=='PushEvent'].groupby('repo').commits.sum()\n", "repo_commits.sort(ascending=False)\n", "repo_commits.head(5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example 1 - revisited" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "type_\n", "PushEvent 15443\n", "IssueCommentEvent 3718\n", "CreateEvent 3516\n", "WatchEvent 2682\n", "PullRequestEvent 1891\n", "Name: repo, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_counts = df.groupby('type_').repo.count()\n", "event_counts.sort(ascending=False)\n", "event_counts.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Great for interactive work:\n", "\n", " * tab-completion!\n", " * inspect data with `df.head()` & `df.tail()`\n", " * quick overview of data ranges with `df.describe()`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, 
"source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Pandas currently only handles in-memory datasets!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### So not suitable for big data!\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# MapReduce\n", "\n", "*\"If you want to process Big Data, you need some MapReduce framework like one of the following\"*\n", "

\n", "\n", "\n", "

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "\n", "The key to these frameworks is adopting a **functional** [programming] mindset. In Python this means, think **iterators**!\n", "\n", "See [The Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/sicp/full-text/book/book.html)\n", "(the \"*Wizard book*\")\n", "\n", " * in particular [Chapter 2 Building Abstractions with Data](https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-13.html#%_chap_2) \n", " * and [Section 2.2.3 Sequences as Conventional Interfaces](https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-15.html#%_sec_2.2.3)\n", "\n", "Luckily, the Split-Apply-Combine pattern is well suited to this! " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 1 - revisited" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3516\n" ] } ], "source": [ "new_repo_count = 0\n", "for event in events:\n", " new_repo_count += \\\n", " 1 if event['type']==\"CreateEvent\" else 0\n", "print new_repo_count" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3516" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reduce(lambda x,y: x+y, \n", " map(lambda ev: 1 if ev['type']=='CreateEvent' else 0, \n", " events))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Would prefer to write\n", "\n", " events | map(...) 
| reduce(...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 1 - pipelined" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3516" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def datapipe(data, *transforms):\n", " for transform in transforms:\n", " data = transform(data)\n", " return data\n", "\n", "datapipe(\n", " events,\n", " lambda events: map(lambda ev: 1 if ev['type']=='CreateEvent' else 0, events),\n", " lambda counts: reduce(lambda x,y: x+y, counts)\n", " )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## PyToolz" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## Example 1 - pipeline using PyToolz" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3516" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from toolz.curried import pipe, map, reduce\n", "\n", "pipe(events,\n", " map(lambda ev: 1 if ev['type']=='CreateEvent' else 0),\n", " reduce(lambda x,y: x+y)\n", " )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 2 - pipelined with PyToolz" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "repo_commits = {}\n", "for event in events:\n", " if event['type']==\"PushEvent\":\n", " repo = event['repo']['name']\n", " commits = event['payload']['size']\n", " repo_commits[repo] = \\\n", " repo_commits.get(repo, 0) + commits \n", "print_top_items(repo_commits)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "from toolz.curried import filter, reduceby\n", "pipe(events,\n", " filter(lambda ev: ev['type']=='PushEvent'),\n", " reduceby(lambda ev: ev['repo']['name'],\n", " lambda commits, ev: commits+ev['payload']['size'],\n", " init=0),\n", " print_top_items\n", " )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Point of Learning Patterns\n", "\n", "From Cosma Shalizi's [Statistical Computing](http://www.stat.cmu.edu/~cshalizi/statcomp/13/lectures/12/lecture-12.pdf) course:\n", " \n", " * Distinguish between **what** you want to do and **how you want to do it**.\n", " * Focusing on **what** brings clarity to intentions.\n", " * **How** also matters, but can obscure the high level problem.\n", " \n", " Learn the pattern, recognize the pattern, love the pattern!\n", " \n", " Re-use *good* solutions!" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Iteration Considered Unhelpful\n", "\n", "Could always do the same thing with `for` loops, but those are\n", " \n", " * *verbose* - lots of \"how\" obscures the \"what\"\n", " * painful/error-prone bookkeeping (indices, placeholders, ...)\n", " * clumsy - hard to parallelize" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Out-of-core processing - toolz example" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n" ] } ], "source": [ "def count_commits(filename):\n", " import gzip\n", " import json\n", " from toolz.curried import pipe, filter, reduceby\n", " with gzip.open(filename) as f:\n", " repo_commits = pipe(\n", " map(json.loads, f),\n", " filter(lambda ev: ev['type']=='PushEvent'),\n", " reduceby(lambda ev: ev['repo']['name'],\n", " lambda commits, e: commits+e['payload']['size'],\n", " init=0)\n", " )\n", " return repo_commits\n", "print_top_items(count_commits(path))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "744\n" ] } ], "source": [ "import glob\n", "files = glob.glob('C:/ARGO/talks/split-apply-combine/data/github_archive/2015-01-*')\n", "print len(files)\n", "N = 24 #len(files) # 10" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sakai-mirror/melete 77016\n", "sakai-mirror/mneme 70128\n", "sakai-mirror/ambrosia 18480\n", "jsonn/pkgsrc 17629\n", "devhd/rulus 9890\n", "Wall time: 16.1 s\n" ] } ], "source": [ "%%time\n", "from toolz.curried import reduceby\n", "from __builtin__ import map as pmap\n", "repo_commits = \\\n", " pipe(pmap(count_commits, files[:N]),\n", " lambda lst: reduce(lambda out, dct: out + dct.items(), lst, []),\n", " reduceby(lambda t: t[0], lambda s,t: s+t[1], init=0)\n", " )\n", "print_top_items(repo_commits)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sakai-mirror/melete 77016\n", "sakai-mirror/mneme 70128\n", "sakai-mirror/ambrosia 18480\n", "jsonn/pkgsrc 17629\n", "devhd/rulus 9890\n", "Wall time: 5.4 s\n" ] } ], "source": [ "%%time\n", "# Remember to start the ipcluster!\n", "# ipcluster start -n 4\n", "from IPython.parallel import Client\n", "p = Client()[:]\n", "pmap = p.map_sync\n", "repo_commits = \\\n", " pipe(pmap(count_commits, files[:N]),\n", " lambda lst: reduce(lambda out, dct: out + dct.items(), lst, []),\n", " reduceby(lambda t: t[0], lambda s,t: s+t[1], init=0)\n", " )\n", "print_top_items(repo_commits)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# New tools\n", "\n", "## [Blaze](http://blaze.pydata.org/en/latest/) \n", "\n", "\n", "\n", "\n", "## [Dask](http://dask.pydata.org/en/latest/)\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { 
"slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 2 - using blaze (and pandas)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "repo\n", "eberhardt/moodle 3335\n", "sakai-mirror/melete 3209\n", "jfaris/phonegap-facebook-plugin 3201\n", "sakai-mirror/mneme 2922\n", "wolfe-pack/wolfe 2001\n", "Name: commits, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repo_commits = df[df.type_=='PushEvent'].groupby('repo').commits.sum()\n", "repo_commits.sort(ascending=False)\n", "repo_commits.head(5)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from blaze import Symbol, by\n", "event = Symbol('event', 'var * {created_at: datetime, type_: string, user: string, repo: string, commits: int}')\n", "push_events = event[event.type_=='PushEvent']\n", "repo_commits = by(push_events.repo, commits=push_events.commits.sum())\n", "top_repos = repo_commits.sort('commits', ascending=False).head(5)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " repo commits\n", "3906 eberhardt/moodle 3335\n", "7476 sakai-mirror/melete 3209\n", "5122 jfaris/phonegap-facebook-plugin 3201\n", "7477 sakai-mirror/mneme 2922\n", "8693 wolfe-pack/wolfe 2001\n" ] } ], "source": [ "from blaze import compute\n", "print compute(top_repos, df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## You can run the same **computation** on different backends!" 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Table('event', MetaData(bind=Engine(sqlite:///data/github_archive.sqlite)), Column('type_', Text(), table=), Column('user', Text(), table=), Column('repo', Text(), table=), Column('commits', Float(precision=53), table=), schema=None)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from odo import odo\n", "uri = 'sqlite:///data/github_archive.sqlite::event'\n", "odo(df, uri)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "NotImplementedError", "evalue": "Don't know how to compute:\nexpr: by(event[event.type_ == 'PushEvent'].repo, commits=sum(event[event.type_ == 'PushEvent'].commits)).sort('commits', ascending=False).head(5)\ndata: {event: type_ user repo \\\n0 CreateEvent birdage birdage/ooi-ui \n1 PushEvent ArniR ArniR/ArniR.github.io \n2 IssueCommentEvent CrossEye ramda/ramda \n3 PushEvent yluoyu yluoyu/demo \n4 IssueCommentEvent EJBQ prmr/JetUML \n5 PushEvent ThibaudL cinemaouvert/OCT \n6 WatchEvent ekmartin davecheney/golang-crosscompile \n7 WatchEvent davidsanfal docker-library/official-images \n8 PushEvent GET-TUDA-CHOPPA gamesbyangelina/whatareyoudoing \n9 CreateEvent filipe-maia Lucas-Andrade/ProjectManager_FLM \n10 PushEvent tomaszzielinski appsembler/launcher \n\n commits \n0 NaN \n1 1 \n2 NaN \n3 1 \n4 NaN \n5 1 \n6 NaN \n7 NaN \n8 1 \n9 NaN \n...}", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNotImplementedError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mblaze\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mData\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[0mdb\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mData\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0muri\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mcompute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtop_repos\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mC:\\Anaconda\\lib\\site-packages\\multipledispatch\\dispatcher.pyc\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, *args, **kwargs)\u001b[0m\n\u001b[0;32m 162\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_cache\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mtypes\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 163\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 164\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 165\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 166\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mMDNotImplementedError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Anaconda\\lib\\site-packages\\blaze\\compute\\core.pyc\u001b[0m in 
\u001b[0;36mcompute\u001b[1;34m(expr, o, **kwargs)\u001b[0m\n\u001b[0;32m 68\u001b[0m \u001b[0mts\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mset\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mx\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mexpr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_subterms\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mSymbol\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 69\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mts\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 70\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mcompute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexpr\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m{\u001b[0m\u001b[0mfirst\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mts\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mo\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 71\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 72\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Give compute dictionary input, got %s\"\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mstr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mo\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Anaconda\\lib\\site-packages\\multipledispatch\\dispatcher.pyc\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, *args, **kwargs)\u001b[0m\n\u001b[0;32m 162\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_cache\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mtypes\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 163\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 164\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 165\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 166\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mMDNotImplementedError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Anaconda\\lib\\site-packages\\blaze\\compute\\core.pyc\u001b[0m in \u001b[0;36mcompute\u001b[1;34m(expr, d, **kwargs)\u001b[0m\n\u001b[0;32m 470\u001b[0m \u001b[0md4\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0md3\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 471\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 472\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtop_then_bottom_then_top_again_etc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexpr3\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0md4\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 473\u001b[0m \u001b[1;32mif\u001b[0m 
\u001b[0mpost_compute_\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 474\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpost_compute_\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexpr3\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscope\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0md4\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32mC:\\Anaconda\\lib\\site-packages\\blaze\\compute\\core.pyc\u001b[0m in \u001b[0;36mtop_then_bottom_then_top_again_etc\u001b[1;34m(expr, scope, **kwargs)\u001b[0m\n\u001b[0;32m 189\u001b[0m raise NotImplementedError(\"Don't know how to compute:\\n\"\n\u001b[0;32m 190\u001b[0m \u001b[1;34m\"expr: %s\\n\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 191\u001b[1;33m \"data: %s\" % (expr3, scope4))\n\u001b[0m\u001b[0;32m 192\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 193\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mtop_then_bottom_then_top_again_etc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexpr3\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscope4\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mNotImplementedError\u001b[0m: Don't know how to compute:\nexpr: by(event[event.type_ == 'PushEvent'].repo, commits=sum(event[event.type_ == 'PushEvent'].commits)).sort('commits', ascending=False).head(5)\ndata: {event: type_ user repo \\\n0 CreateEvent birdage birdage/ooi-ui \n1 PushEvent ArniR ArniR/ArniR.github.io \n2 IssueCommentEvent CrossEye ramda/ramda \n3 PushEvent yluoyu yluoyu/demo \n4 IssueCommentEvent EJBQ prmr/JetUML \n5 PushEvent ThibaudL cinemaouvert/OCT \n6 WatchEvent ekmartin davecheney/golang-crosscompile \n7 WatchEvent davidsanfal docker-library/official-images \n8 PushEvent GET-TUDA-CHOPPA gamesbyangelina/whatareyoudoing \n9 CreateEvent filipe-maia Lucas-Andrade/ProjectManager_FLM \n10 PushEvent tomaszzielinski appsembler/launcher \n\n commits \n0 NaN \n1 1 \n2 NaN \n3 1 \n4 NaN \n5 1 \n6 NaN \n7 NaN \n8 1 \n9 NaN \n...}" ] } ], "source": [ "from blaze import Data\n", "db = Data(uri)\n", "compute(top_repos, db)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import os\n", "if os.path.exists('data/github_archive.sqlite'):\n", " os.remove('data/github_archive.sqlite')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Dask and Castra" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from castra import Castra\n", "castra = Castra('data/github_archive.castra',\n", " template=df, categories=categories)\n", "castra.extend_sequence(map(to_df, files), freq='1h')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import dask.dataframe as dd\n", "from dask.diagnostics import ProgressBar\n", "\n", "pbar = ProgressBar()\n", "pbar.register()\n", "\n", "df = dd.from_castra('data/github_archive.castra')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df.type.value_counts().nlargest(5).compute()" ] }, { 
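"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Example 2 once more, now out of core -- a sketch that assumes the grouped dask Series supports `nlargest` like its pandas counterpart:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Split-apply-combine on the on-disk dataset: top repositories by pushed commits.\n", "df[df.type=='PushEvent'].groupby('repo').commits.sum().nlargest(5).compute()" ] }, {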
"cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df[df.type=='PushEvent'].groupby('repo').commits.resample('h', how='count').compute()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## So ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# ... in Python!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Thank you!\n", "\n", "## We're hiring!\n", "\n", " * I'm [snth](http://github.com/snth) on github\n", " * The Jupyter Notebook is on github: [github.com/snth/split-apply-combine](http://github.com/snth/split-apply-combine)\n", " * You can view the slides on nbviewer: [slides](http://nbviewer.ipython.org/format/slides/github/snth/split-apply-combine/blob/master/The%20Split-Apply-Combine%20Pattern%20in%20Data%20Science%20and%20Python.ipynb#/)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }