{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Split-Apply-Combine Pattern in Data Science and Python\n",
"\n",
"## Tobias Brandt\n",
"\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"### Projected number of talk attendees\n",
"\n",
" * With \"Data Science\" in title \n",
" * 10 +/- 5\n",
" * vs without \"Data Science\" in title\n",
" * 20 +/- 5\n",
" * insert xkcd style chart"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Google trends chart\n",
"\n",
"![\"data science\" vs \"data analysis\"](img/data_science_vs_data_analysis.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Examples\n",
"\n",
"### Chinook Database\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 1 - Total Revenue\n",
"\n",
"Ever written SQL like this?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2328.599999999957,)\n"
]
}
],
"source": [
"db = \"split-apply-combine_resources/Chinook_Sqlite.sqlite\"\n",
"import sqlite3\n",
"conn = sqlite3.connect(db)\n",
"curs = conn.cursor()\n",
"\n",
"print curs.execute(\"SELECT SUM(UnitPrice*Quantity) FROM InvoiceLine;\").fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Or Python code like this?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2328.6\n"
]
}
],
"source": [
"total_revenue = 0\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity = row\n",
" total_revenue += UnitPrice*Quantity\n",
"print total_revenue"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 2 - Revenue per track\n",
"\n",
"Or how about some SQL with a GROUP BY?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3250, 3.98)\n",
"(3223, 3.98)\n",
"(3214, 3.98)\n"
]
}
],
"source": [
"sql = (\"SELECT TrackId, sum(UnitPrice*Quantity) \"\n",
" \"FROM InvoiceLine \"\n",
" \"GROUP BY TrackId \"\n",
" \"ORDER BY SUM(UnitPrice*Quantity) DESC, TrackId DESC limit 3;\")\n",
"for row in curs.execute(sql):\n",
" print row"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Or Python where you use a dictionary to group items?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3250 3.98\n",
"3223 3.98\n",
"3214 3.98\n"
]
}
],
"source": [
"track_revenues = {}\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity = row\n",
" invoice_revenue = UnitPrice*Quantity\n",
" track_revenues[TrackId] = track_revenues.get(TrackId, 0) + invoice_revenue\n",
" \n",
"for track, revenue in sorted(track_revenues.iteritems(),\n",
" key=lambda item: (item[1], item[0]), reverse=True)[:3]:\n",
" print track, revenue"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Split-Apply-Combine Pattern\n",
"\n",
"## Hadley Wickham \n",
"\n",
"[Hadley Wickham, the man who revolutionized R](http://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/)\n",
"\n",
"*If you don’t spend much of your time coding in the open-source statistical programming language R, \n",
"his name is likely not familiar to you -- but the statistician Hadley Wickham is, \n",
"in his own words, “nerd famous.” The kind of famous where people at statistics conferences \n",
"line up for selfies, ask him for autographs, and are generally in awe of him. \n",
"“It’s utterly utterly bizarre,” he admits. \n",
"“To be famous for writing R programs? It’s just crazy.”*"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import HTML\n",
"HTML('')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## The Basic Pattern\n",
"\n",
" 1. **Split** the data by some **grouping variable**\n",
" 2. **Apply** some function to each group **independently**\n",
" 3. **Combine** the data into some output dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Example 2 - revisited"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]\n"
]
}
],
"source": [
"track_revenues = {}\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity = row\n",
" invoice_revenue = UnitPrice*Quantity\n",
" track_revenues[TrackId] = track_revenues.get(TrackId, 0) + invoice_revenue\n",
" \n",
"print sorted(track_revenues.iteritems(), key=lambda item: (item[1], item[0]), reverse=True)[:3]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def calc_revenue(row):\n",
" InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity = row\n",
" return UnitPrice*Quantity"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]\n"
]
}
],
"source": [
"track_revenues = {}\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" track_revenues[row[2]] = track_revenues.get(row[2], 0) + calc_revenue(row)\n",
" \n",
"print sorted(track_revenues.iteritems(), key=lambda item: (item[1], item[0]), reverse=True)[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Pandas - Python Data Analysis Library\n",
"\n",
"
\n",
"\n",
" * Provides high-performance, easy-to-use data structures and data analysis tools.\n",
" * My default tool for interactive data analysis.\n",
" * Provides core data structures **Series**, **DataFrame** and **Panel** (although the latter are largely obviated by MultiIndexed DataFrames)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### pandas.Series\n",
"\n",
" * Basically \"labelled arrays\"\n",
" * Combines *dict* and *numpy.array* interfaces\n",
" * *numpy.array* performance"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"apples 1\n",
"potatoes 2\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"s1 = pd.Series(dict(apples=1, potatoes=2))\n",
"s1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"apples NaN\n",
"oranges NaN\n",
"potatoes 6\n",
"dtype: float64\n"
]
}
],
"source": [
"s2 = pd.Series(dict(oranges=3, potatoes=4))\n",
"print s1+s2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"apples 1\n",
"oranges 3\n",
"potatoes 6\n",
"dtype: float64\n"
]
}
],
"source": [
"print s1.add(s2, fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### pandas.DataFrame\n",
"\n",
" * Basically in-memory database tables\n",
" * Can have columns of different dtypes\n",
" * Indexed rows and columns\n",
" * Hierarchical indexing allows for representing Panel data (pandas.MultiIndex)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" s1 | \n",
" s2 | \n",
"
\n",
" \n",
" \n",
" \n",
" apples | \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
" oranges | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" potatoes | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" s1 s2\n",
"apples 1 NaN\n",
"oranges NaN 3\n",
"potatoes 2 4"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(dict(s1=s1, s2=s2))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index([u'apples', u'oranges', u'potatoes'], dtype='object')\n",
"Index([u's1', u's2'], dtype='object')\n"
]
}
],
"source": [
"print df.index\n",
"print df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Pandas Data Analysis"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" s1 | \n",
" s2 | \n",
"
\n",
" \n",
" \n",
" \n",
" apples | \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
" oranges | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" potatoes | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" s1 s2\n",
"apples 1 NaN\n",
"oranges NaN 3\n",
"potatoes 2 4"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"s1 3\n",
"s2 7\n",
"dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"apples 1\n",
"oranges 3\n",
"potatoes 6\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"apples s1 1\n",
"oranges s2 3\n",
"potatoes s1 2\n",
" s2 4\n",
"dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.stack()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MultiIndex(levels=[[u'apples', u'oranges', u'potatoes'], [u's1', u's2']],\n",
" labels=[[0, 1, 2, 2], [0, 1, 0, 1]])\n"
]
}
],
"source": [
"print df.stack().index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Split-Apply-Combine in Pandas\n",
"\n",
" * Uses **groupby** to\n",
" * **split** the data into groups based on some criteria\n",
" * **apply** a function on each group independently\n",
" * **combining** the results into a data structure"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" * The **apply** step is usually one of\n",
" * **aggregate**\n",
" * **transform**\n",
" * or **filter**"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Example 2 - using Pandas"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"df = pd.read_sql('SELECT * FROM InvoiceLine', conn)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TrackId\n",
"3250 3.98\n",
"3223 3.98\n",
"3177 3.98\n",
"Name: Revenue, dtype: float64\n"
]
}
],
"source": [
"df['Revenue'] = df['UnitPrice']*df['Quantity']\n",
"track_revenues = df.groupby('TrackId')['Revenue'].sum()\n",
"track_revenues.sort(ascending=False)\n",
"print track_revenues[:3]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Revenue\n",
"TrackId \n",
"3250 3.98\n",
"3223 3.98\n",
"3214 3.98\n"
]
}
],
"source": [
"print track_revenues.reset_index().sort(columns=['Revenue', 'TrackId'], ascending=False).set_index('TrackId')[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Great for interactive work:\n",
"\n",
" * tab-completion!\n",
" * `df.head()`, `df.tail()`\n",
" * `df.describe()`\n",
"\n",
"However ..."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Pandas currently only handles in-memory datasets!\n",
"\n",
"### Does my data look big in this? \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# MapReduce\n",
"\n",
" * If you want to process Big Data, you need some MapReduce framework like one of the following\n",
"\n",
"\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"\n",
"\n",
"The key to these frameworks is adopting a **functional** [programming] mindset. In Python this means, think **iterators**!\n",
"\n",
"See [The Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/sicp/full-text/book/book.html)\n",
"(the \"*Wizard book*\")\n",
"\n",
" * in particular [Chapter 2 Building Abstractions with Data](https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-13.html#%_chap_2) \n",
" * and [Section 2.2.3 Sequences as Conventional Interfaces](https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-15.html#%_sec_2.2.3)\n",
"\n",
"Luckily, the Split-Apply-Combine pattern is well suited to this! "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 1 - revisited"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2328.599999999957"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_revenue = 0\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity = row\n",
" total_revenue += UnitPrice*Quantity\n",
"total_revenue"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2328.599999999957"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reduce(lambda x,y: x+y, map(calc_revenue, curs.execute(\"SELECT * FROM InvoiceLine;\")))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2328.599999999957"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(calc_revenue(row) for row in curs.execute(\"SELECT * FROM InvoiceLine;\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"What about **group by** operations?\n",
"\n",
"There is an `itertools.groupby` function in the standard library.\n",
"\n",
"However\n",
"\n",
" * it requires the data to be sorted,\n",
" * returns iterables which are shared with the original iterable.\n",
"\n",
"Hence I find that I usually need to consult the [documentation](https://docs.python.org/2/library/itertools.html#itertools.groupby) to use it correctly.\n",
"\n",
"Use the `toolz` library rather!"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# PyToolz"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 2 - revisited"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]\n"
]
}
],
"source": [
"track_revenues = {}\n",
"for row in curs.execute(\"SELECT * FROM InvoiceLine;\"):\n",
" track_revenues[row[2]] = track_revenues.get(row[2], 0) + calc_revenue(row)\n",
" \n",
"print sorted(track_revenues.iteritems(), key=lambda item: (item[1], item[0]), reverse=True)[:3]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from toolz import groupby, valmap\n",
"sorted(valmap(sum,\n",
" valmap(lambda lst: map(calc_revenue, lst),\n",
" groupby(lambda row: row[2],\n",
" curs.execute(\"SELECT * FROM InvoiceLine\")))\n",
" ).iteritems()\n",
" , key=lambda t: (t[1], t[0]), reverse=True)[:3]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from toolz.curried import pipe, groupby, valmap, map, get\n",
"pipe(curs.execute(\"SELECT * FROM InvoiceLine\"),\n",
" groupby(get(2)),\n",
" valmap(map(calc_revenue)),\n",
" valmap(sum),\n",
" lambda track_revenues: sorted(track_revenues.iteritems(), key=lambda t: (t[1], t[0]), reverse=True)[:3]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[(3250, 3.98), (3223, 3.98), (3214, 3.98)]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from toolz.curried import reduceby\n",
"pipe(curs.execute(\"SELECT * FROM InvoiceLine\"),\n",
" reduceby(get(2),\n",
" lambda track_revenue, row: track_revenue + calc_revenue(row),\n",
" init=0\n",
" ),\n",
" lambda track_revenues: sorted(track_revenues.iteritems(), key=lambda t: (t[1], t[0]), reverse=True)[:3]\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## toolz example - multiprocessing"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"745 ['C:/ARGO/ARGO/notebooks/articles/github_archive\\\\2015-01-01-0.json.gz', 'C:/ARGO/ARGO/notebooks/articles/github_archive\\\\2015-01-01-1.json.gz', 'C:/ARGO/ARGO/notebooks/articles/github_archive\\\\2015-01-01-10.json.gz']\n"
]
}
],
"source": [
"import glob\n",
"files = glob.glob('C:/ARGO/ARGO/notebooks/articles/github_archive/*')\n",
"print len(files), files[:3]\n",
"N = len(files) # 10"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'ReleaseEvent': 24, u'PublicEvent': 2, u'PullRequestReviewCommentEvent': 85, u'ForkEvent': 213, u'MemberEvent': 16, u'PullRequestEvent': 315, u'IssueCommentEvent': 650, u'PushEvent': 4280, u'DeleteEvent': 141, u'CommitCommentEvent': 56, u'WatchEvent': 642, u'IssuesEvent': 373, u'CreateEvent': 815, u'GollumEvent': 90}\n"
]
}
],
"source": [
"def count_types(filename):\n",
" import gzip\n",
" import json\n",
" from collections import Counter\n",
" try:\n",
" with gzip.open(filename) as f:\n",
" return dict(Counter(json.loads(line)['type'] for line in f))\n",
" except Exception, e:\n",
" print \"Error in {!r}: {}\".format(filename, e)\n",
" return {}\n",
"print count_types(files[0])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from collections import Counter\n",
"def update_counts(total_counts, file_counts):\n",
" total_counts.update(file_counts)\n",
" return total_counts"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error in 'C:/ARGO/ARGO/notebooks/articles/github_archive\\\\2015-01-07-5.json.gz': CRC check failed 0xc9b8b241L != 0x3fdb8691L\n",
"Error in 'C:/ARGO/ARGO/notebooks/articles/github_archive\\\\tmp': [Errno 13] Permission denied: 'C:/ARGO/ARGO/notebooks/articles/github_archive\\\\tmp'\n",
"Counter({u'PushEvent': 7021744, u'CreateEvent': 1649671, u'IssueCommentEvent': 1321440, u'WatchEvent': 1319860, u'IssuesEvent': 692702, u'PullRequestEvent': 680308, u'ForkEvent': 490633, u'DeleteEvent': 256818, u'PullRequestReviewCommentEvent': 214188, u'GollumEvent': 150744, u'CommitCommentEvent': 96389, u'MemberEvent': 69718, u'ReleaseEvent': 44292, u'PublicEvent': 14596})\n",
"Wall time: 17min 54s\n"
]
}
],
"source": [
"%%time\n",
"pmap = map\n",
"print reduce(update_counts,\n",
" pmap(count_types, files[:N]),\n",
" Counter())"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({u'PushEvent': 7021744, u'CreateEvent': 1649671, u'IssueCommentEvent': 1321440, u'WatchEvent': 1319860, u'IssuesEvent': 692702, u'PullRequestEvent': 680308, u'ForkEvent': 490633, u'DeleteEvent': 256818, u'PullRequestReviewCommentEvent': 214188, u'GollumEvent': 150744, u'CommitCommentEvent': 96389, u'MemberEvent': 69718, u'ReleaseEvent': 44292, u'PublicEvent': 14596})\n",
"Wall time: 4min 14s\n"
]
}
],
"source": [
"%%time\n",
"from IPython.parallel import Client\n",
"p = Client()[:]\n",
"pmap = p.map_sync\n",
"print reduce(update_counts,\n",
" pmap(count_types, files[:N]),\n",
" Counter())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Next time\n",
"\n",
"## [Blaze](http://blaze.pydata.org/en/latest/) \n",
"\n",
"\n",
"\n",
"\n",
"## [Dask](http://dask.pydata.org/en/latest/)\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Thank you!\n",
"\n",
"## If this stuff interests you, let's chat!"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}