{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Map-Reduce and Apache Spark\n",
    "\n",
    "(C) 2015 [Steve Phelps](http://sphelps.net/)\n",
    "\n",
    "## Overview\n",
    "\n",
    "1. Recap of functional programming in Python\n",
    "2. Python's `map` and `reduce` functions\n",
    "3. Writing parallel code using `map`\n",
    "4. The Map-Reduce programming model\n",
    "5. Using Apache Spark with Python\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Reading\n",
    "\n",
    "- [Introduction to Parallel Computing](https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning), Blaise Barney, Lawrence Livermore National Laboratory.\n",
    "\n",
    "- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113.\n",
    "\n",
    "- [Spark Programming Guide](https://spark.apache.org/docs/latest/programming-guide.html)\n",
    "\n",
    "- Chapters 1 and 3 of Karau, H., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## History\n",
    "\n",
    "- The Map-Reduce programming model was popularised by Google (Dean and Ghemawat 2008).\n",
    "\n",
    "- The first popular open-source implementation was Apache Hadoop, first released in 2011.\n",
    "\n",
    "- Apache Spark was first released in 2014. \n",
    "\n",
    "- It was originally developed by [Matei Zaharia](http://people.csail.mit.edu/matei) as a class project, and later a PhD dissertation, at University of California, Berkeley.\n",
    "\n",
    "- In contrast to Hadoop, Apache Spark:\n",
    "\n",
    "    - is easy to install and configure.\n",
    "    - provides a much more natural *iterative* workflow \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Resilient distributed datasets\n",
    "\n",
    "- The fundamental abstraction of Apache Spark is a read-only, parallel, distributed, fault-tolerent collection called a resilient distributed datasets (RDD).\n",
    "\n",
    "- When working with Apache Spark we iteratively apply functions to every elelement of these collections in parallel to produce *new* RDDs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Functional programming\n",
    "\n",
    "Consider the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "def double_everything_in(data):\n",
    "    result = []\n",
    "    for i in data:\n",
    "        result.append(2 * i)\n",
    "    return result\n",
    "\n",
    "def quadruple_everything_in(data):\n",
    "    result = []\n",
    "    for i in data:\n",
    "        result.append(4 * i)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 8, 10]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "double_everything_in([1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[4, 8, 12, 16, 20]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quadruple_everything_in([1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "- The above code violates the [\"do not repeat yourself\"](https://en.wikipedia.org/wiki/Don't_repeat_yourself_) principle of good software engineering practice.\n",
    "\n",
    "- How can rewrite the code so that it avoids duplication?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def multiply_by_x_everything_in(x, data):\n",
    "    result = []\n",
    "    for i in data:\n",
    "        result.append(x * i)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 8, 10]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "multiply_by_x_everything_in(2, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[4, 8, 12, 16, 20]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "multiply_by_x_everything_in(4, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "- Now consider the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def squared(x):\n",
    "    return x*x\n",
    "\n",
    "def double(x):\n",
    "    return x*2\n",
    "\n",
    "def square_everything_in(data):\n",
    "    result = []\n",
    "    for i in data:\n",
    "        result.append(squared(i))\n",
    "    return result\n",
    "\n",
    "def double_everything_in(data):\n",
    "    result = []\n",
    "    for i in data:\n",
    "        result.append(double(i))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_everything_in([1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 8, 10]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "double_everything_in([1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "- The above code violates the [\"do not repeat yourself\"](https://en.wikipedia.org/wiki/Don't_repeat_yourself_) principle of good software engineering practice.\n",
    "\n",
    "- How can rewrite the code so that it avoids duplication?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Using functions as values\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "def apply_f_to_everything_in(f, data):\n",
    "    result = []\n",
    "    for x in data:\n",
    "        result.append(f(x))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apply_f_to_everything_in(squared, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 8, 10]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apply_f_to_everything_in(double, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Lambda expressions\n",
    "\n",
    "- We can use anonymous functions to save having to define a function each time we want to use map."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apply_f_to_everything_in(lambda x: x*x, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Python's `map` function\n",
    "\n",
    "- Python has a built-in function `map` which is much faster than our version.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(lambda x: x*x, [1, 2, 3, 4, 5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Implementing reduce\n",
    "\n",
    "- The `reduce` function is an example of a [fold](https://en.wikipedia.org/wiki/Fold_%28higher-order_function%29).\n",
    "\n",
    "- There are different ways we can fold data.\n",
    "\n",
    "- The following implements a *left* fold.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def foldl(f, data, z):\n",
    "    if (len(data) == 0):\n",
    "        print(z)\n",
    "        return z\n",
    "    else:\n",
    "        head = data[0]\n",
    "        tail = data[1:]\n",
    "        print(\"Folding\", head, \"with\", tail, \"using\", z)\n",
    "        partial_result = f(z, data[0])\n",
    "        print(\"Partial result is\", partial_result)\n",
    "        return foldl(f, tail, partial_result)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Folding 1 with [2, 3, 4, 5] using 0\n",
      "Partial result is 1\n",
      "Folding 2 with [3, 4, 5] using 1\n",
      "Partial result is 3\n",
      "Folding 3 with [4, 5] using 3\n",
      "Partial result is 6\n",
      "Folding 4 with [5] using 6\n",
      "Partial result is 10\n",
      "Folding 5 with [] using 10\n",
      "Partial result is 15\n",
      "15\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def add(x, y):\n",
    "    return x + y\n",
    "\n",
    "foldl(add, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Folding 1 with [2, 3, 4, 5] using 0\n",
      "Partial result is 1\n",
      "Folding 2 with [3, 4, 5] using 1\n",
      "Partial result is 3\n",
      "Folding 3 with [4, 5] using 3\n",
      "Partial result is 6\n",
      "Folding 4 with [5] using 6\n",
      "Partial result is 10\n",
      "Folding 5 with [] using 10\n",
      "Partial result is 15\n",
      "15\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foldl(lambda x, y: x + y, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Folding 1 with [2, 3, 4, 5] using 0\n",
      "Partial result is -1\n",
      "Folding 2 with [3, 4, 5] using -1\n",
      "Partial result is -3\n",
      "Folding 3 with [4, 5] using -3\n",
      "Partial result is -6\n",
      "Folding 4 with [5] using -6\n",
      "Partial result is -10\n",
      "Folding 5 with [] using -10\n",
      "Partial result is -15\n",
      "-15\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "-15"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foldl(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-15"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(((((0 - 1) - 2) - 3) - 4) - 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Subtraction is neither [commutative](https://en.wikipedia.org/wiki/Commutative_property) nor [associative](https://en.wikipedia.org/wiki/Associative_property), so the order in which apply the fold matters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(1 - (2 - (3 - (4 - (5 - 0)))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "def foldr(f, data, z):\n",
    "    if (len(data) == 0):\n",
    "        return z\n",
    "    else:\n",
    "        return f(data[0], foldr(f, data[1:], z))                "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Folding 1 with [2, 3, 4, 5] using 0\n",
      "Partial result is -1\n",
      "Folding 2 with [3, 4, 5] using -1\n",
      "Partial result is -3\n",
      "Folding 3 with [4, 5] using -3\n",
      "Partial result is -6\n",
      "Folding 4 with [5] using -6\n",
      "Partial result is -10\n",
      "Folding 5 with [] using -10\n",
      "Partial result is -15\n",
      "-15\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "-15"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foldl(lambda x, y: x - y,  [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foldr(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Python's `reduce` function.\n",
    "\n",
    "- Python's built-in `reduce` function is a *left* fold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from functools import reduce\n",
    "\n",
    "reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-15"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reduce(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Folding 1 with [2, 3, 4, 5] using 0\n",
      "Partial result is -1\n",
      "Folding 2 with [3, 4, 5] using -1\n",
      "Partial result is -3\n",
      "Folding 3 with [4, 5] using -3\n",
      "Partial result is -6\n",
      "Folding 4 with [5] using -6\n",
      "Partial result is -10\n",
      "Folding 5 with [] using -10\n",
      "Partial result is -15\n",
      "-15\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "-15"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foldl(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Functional programming and parallelism\n",
    "\n",
    "- Functional programming lends itself to [parallel programming](https://computing.llnl.gov/tutorials/parallel_comp/#Models).\n",
    "\n",
    "- The `map` function can easily be parallelised through [data-level parallelism](https://en.wikipedia.org/wiki/Data_parallelism),\n",
    "    - provided that the function we supply as an argument is *free from* [side-effects](https://en.wikipedia.org/wiki/Side_effect_%28computer_science%29)\n",
    "        - (which is why we avoid working with mutable data).\n",
    "\n",
    "- We can see this by rewriting it so:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def perform_computation(f, result, data, i):\n",
    "    print(\"Computing the \", i, \"th result...\")\n",
    "    # This could be scheduled on a different CPU\n",
    "    result[i] = f(data[i])\n",
    "\n",
    "def my_map(f, data):\n",
    "    result = [None] * len(data)\n",
    "    for i in range(len(data)):\n",
    "        perform_computation(f, result, data, i)\n",
    "    # Wait for other CPUs to finish, and then..\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing the  0 th result...\n",
      "Computing the  1 th result...\n",
      "Computing the  2 th result...\n",
      "Computing the  3 th result...\n",
      "Computing the  4 th result...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_map(lambda x: x * x, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## A multi-threaded `map` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "from threading import Thread\n",
    "\n",
    "def schedule_computation_threaded(f, result, data, threads, i):    \n",
    "    # Each function evaluation is scheduled on a different core.\n",
    "    def my_job(): \n",
    "        print(\"Processing data:\", data[i], \"... \")\n",
    "        result[i] = f(data[i])\n",
    "        print(\"Finished job #\", i)    \n",
    "        print(\"Result was\", result[i])\n",
    "    threads[i] = Thread(target=my_job)\n",
    "    \n",
    "def my_map_multithreaded(f, data):\n",
    "    n = len(data)\n",
    "    result = [None] * n\n",
    "    threads = [None] * n\n",
    "    print(\"Scheduling jobs.. \")\n",
    "    for i in range(n):\n",
    "        schedule_computation_threaded(f, result, data, threads, i)\n",
    "    print(\"Starting jobs.. \")\n",
    "    for i in range(n):\n",
    "        threads[i].start()\n",
    "    print(\"Waiting for jobs to finish.. \")\n",
    "    for i in range(n):\n",
    "        threads[i].join()\n",
    "    print(\"All done.\")\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scheduling jobs.. \n",
      "Starting jobs.. \n",
      "Processing data: 1 ... \n",
      "Finished job # 0\n",
      "Result was 1\n",
      "Processing data: 2 ... \n",
      "Finished job # 1\n",
      "Result was 4\n",
      "Processing data: 3 ... \n",
      "Finished job # 2\n",
      "Result was 9\n",
      "Processing data:Processing data: 5 ... \n",
      "Finished job # 4\n",
      "Result was 25\n",
      " Waiting for jobs to finish.. 4 ... \n",
      "Finished job # 3\n",
      "Result was 16\n",
      "\n",
      "All done.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_map_multithreaded(lambda x: x*x, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scheduling jobs.. \n",
      "Starting jobs.. \n",
      "Processing data:Processing data: 2 ... \n",
      " 1 ... \n",
      "Processing data: 3 ... \n",
      "Processing data: 4 ... \n",
      "Processing data:Waiting for jobs to finish.. \n",
      " 5 ... \n",
      "Finished job # 2\n",
      "Result was 9\n",
      "Finished job # 4\n",
      "Result was 25\n",
      "Finished job # 0\n",
      "Result was 1\n",
      "Finished job # 3\n",
      "Result was 16\n",
      "Finished job # 1\n",
      "Result was 4\n",
      "All done.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[1, 4, 9, 16, 25]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from numpy.random import uniform\n",
    "from time import sleep\n",
    "\n",
    "def a_function_which_takes_a_long_time(x):\n",
    "    sleep(uniform(2, 10))  # Simulate some long computation\n",
    "    return x*x\n",
    "\n",
    "my_map_multithreaded(a_function_which_takes_a_long_time, [1, 2, 3, 4, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Map Reduce\n",
    "\n",
    "- Map Reduce is a _programming model_ for scalable parallel processing.\n",
    "- Scalable here means that it can work on big data with very large compute clusters.\n",
    "- There are many implementations: e.g. Apache Hadoop and Apache Spark.\n",
    "- We can use Map-Reduce with any programming language:\n",
    "    - Hadoop is written in Java\n",
    "    - Spark is written in Scala, but has a Python interface.\n",
    "- *Functional programming* languages such as Python or Scala fit very well with the Map Reduce model:\n",
    "    - However, we don't *have* to use functional programming."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "- A MapReduce implementation will take care of the low-level functionality so that you don't have to worry about:\n",
    "    - load balancing\n",
    "    - network I/O\n",
    "    - network and disk transfer optimisation\n",
    "    - handling of machine failures\n",
    "    - serialization of data\n",
    "    - etc..\n",
    "- The model is designed to move the processing to where the data resides."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Typical steps in a Map Reduce Computation\n",
    "\n",
    "1. ETL a big data set.\n",
    "2. _Map_ operation: extract something you care about from each row\n",
    "3. \"Shuffle and Sort\": task/node allocation\n",
    "4. _Reduce_ operation: aggregate, summarise, filter or transform\n",
    "5. Write the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Callbacks for Map Reduce\n",
    "\n",
    "- The data set, and the state of each stage of the computation, is represented as a set of key-value pairs.\n",
    "\n",
    "- The programmer provides a map function:\n",
    "\n",
    "$\\operatorname{map}(k, v) \\rightarrow \\; \\left< k', v' \\right>*$  \n",
    "\n",
    "- and a reduce function:\n",
    "\n",
    "$\\operatorname{reduce}(k', \\left< k', v'\\right> *) \\rightarrow \\; \\left< k', v''\n",
    "\\right> *$\n",
    "\n",
    "- The $*$ refers to a *collection* of values.\n",
    "\n",
    "- These collections are *not* ordered."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Resilient Distributed Data\n",
    "\n",
    "- In a Map-Reduce computation these collections are resilient distributed data-sets (RDDs):\n",
    "    - The data is distributed across nodes in a cluster of computers.\n",
    "    - No data is lost if a single node fails.\n",
    "    - Data is typically stored in HBase tables, or HDFS files.\n",
    "    - The `map` and `reduce` functions can work in *parallel* across\n",
    "       different keys, or different elements of the collection.\n",
    "\n",
    "- The underlying framework (e.g. Hadoop or Apache Spark) allocates data and processing to different nodes, without any intervention from the programmer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Word Count Example\n",
    "\n",
    "- In this simple example, the input is a set of URLs, each record is a document.\n",
    "\n",
    "- Problem: compute how many times each word has occurred across data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Word Count: Map \n",
    "\n",
    "\n",
    "- The input to $\\operatorname{map}$ is a mapping:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Key: URL\n",
    "- Value: Contents of document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\left< document1, to \\; be \\; or \\; not \\; to \\; be \\right>$  \n",
    "    \n",
    "\n",
    "- In this example, our $\\operatorname{map}$ function will process a given URL, and produces a mapping:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Key: word\n",
    "- Value: 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- So our original data-set will be transformed to:\n",
    "  \n",
    "  $\\left< to, 1 \\right>$\n",
    "  $\\left< be, 1 \\right>$\n",
    "  $\\left< or, 1 \\right>$\n",
    "  $\\left< not, 1 \\right>$\n",
    "  $\\left< to, 1 \\right>$\n",
    "  $\\left< be, 1 \\right>$\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Word Count: Reduce\n",
    "\n",
    "\n",
    "- The reduce operation groups values according to their key, and then performs areduce on each key.\n",
    "\n",
    "- The collections are partitioned across different storage units, therefore.\n",
    "\n",
    "- Map-Reduce will fold the data in such a way that it minimises data-copying across the cluster.\n",
    "\n",
    "- Data in different partitions are reduced separately in parallel.\n",
    "\n",
    "- The final result is a reduce of the reduced data in each partition.\n",
    "\n",
    "- Therefore it is very important that our operator *is both commutative and associative*.\n",
    "\n",
    "- In our case the function is the `+` operator\n",
    "\n",
    "  $\\left< be, 2 \\right>$  \n",
    "  $\\left< not, 1 \\right>$  \n",
    "  $\\left< or, 1 \\right>$  \n",
    "  $\\left< to, 2 \\right>$  \n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Map and Reduce compared with Python\n",
    "\n",
    "- Notice that these functions are formulated differently from the standard Python functions of the same name.\n",
    "\n",
    "- The `reduce` function works with key-value *pairs*.\n",
    "\n",
    "- It would be more apt to call it something like `reduceByKey`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## MiniMapReduce\n",
    "\n",
    "- To illustrate how the Map-Reduce programming model works, we can implement our own Map-Reduce framework in Python.\n",
    "\n",
    "- This *illustrates* how a problem can be written in terms of `map` and `reduce` operations.\n",
    "\n",
    "- Note that these are illustrative functions; this is *not* how Hadoop or Apache Spark actually implement them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "##########################################################\n",
    "#\n",
    "#   MiniMapReduce\n",
    "#\n",
    "# A non-parallel, non-scalable Map-Reduce implementation\n",
    "##########################################################\n",
    "\n",
    "def groupByKey(data):\n",
    "    result = dict()\n",
    "    for key, value in data:\n",
    "        if key in result:\n",
    "            result[key].append(value)\n",
    "        else:\n",
    "            result[key] = [value]\n",
    "    return result\n",
    "        \n",
    "def reduceByKey(f, data):\n",
    "    key_values = groupByKey(data)\n",
    "    return list(map(lambda key: \n",
    "                   (key, reduce(f, key_values[key])), \n",
    "                       key_values))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Word-count using MiniMapReduce\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = list(map(lambda x: (x, 1), \"to be or not to be\".split()))\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groupByKey(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reduceByKey(lambda x, y: x + y, data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parallelising MiniMapReduce\n",
    "\n",
    "- We can easily turn our Map-Reduce implementation into a parallel, multi-threaded framework\n",
    "by using the `my_map_multithreaded` function we defined earlier.\n",
    "\n",
    "- This will allow us to perform map-reduce computations that exploit parallel processing using *multiple* cores on a *single* computer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reduceByKey_multithreaded(f, data):\n",
    "    key_values = groupByKey(data)\n",
    "    return my_map_multithreaded(\n",
    "        lambda key: (key, reduce(f, key_values[key])), [key for key in key_values])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scheduling jobs.. \n",
      "Starting jobs.. \n",
      "Processing data: to ... \n",
      "Finished job # 0\n",
      "Result was ('to', 2)\n",
      "Processing data: be ... \n",
      "Finished job # 1\n",
      "Result was ('be', 2)\n",
      "Processing data: or ... \n",
      "Finished job # 2\n",
      "Result was ('or', 1)\n",
      "Processing data: not ... \n",
      "Finished job # 3\n",
      "Result was ('not', 1)\n",
      "Waiting for jobs to finish.. \n",
      "All done.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reduceByKey_multithreaded(lambda x, y: x + y, data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Parallelising the reduce step\n",
    "\n",
    "- Provided that our operator is both associative and commutative we can\n",
    "also parallelise the reduce operation.\n",
    "\n",
    "- We partition the data into approximately equal subsets.\n",
    "\n",
    "- We then reduce each subset independently on a separate core.\n",
    "\n",
    "- The results can be combined in a final reduce step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Partitioning the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['a', 'b', 'c'], ['d', 'e', 'f', 'g']]"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def split_data(data, split_points):\n",
    "    partitions = []\n",
    "    n = 0\n",
    "    for i in split_points:\n",
    "        partitions.append(data[n:i])\n",
    "        n = i\n",
    "    partitions.append(data[n:])\n",
    "    return partitions\n",
    "\n",
    "data = ['a', 'b', 'c', 'd', 'e', 'f', 'g']\n",
    "partitioned_data = split_data(data, [3])\n",
    "partitioned_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reducing across partitions in parallel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'abcdefg'"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from threading import Thread\n",
    "\n",
    "def parallel_reduce(f, partitions):\n",
    "\n",
    "    n = len(partitions)\n",
    "    results = [None] * n\n",
    "    threads = [None] * n\n",
    "    \n",
    "    def job(i):\n",
    "        results[i] = reduce(f, partitions[i])\n",
    "\n",
    "    for i in range(n):\n",
    "        threads[i] = Thread(target = lambda: job(i))\n",
    "        threads[i].start()\n",
    "    \n",
    "    for i in range(n):\n",
    "        threads[i].join()\n",
    "    \n",
    "    return reduce(f, results)\n",
    "\n",
    "parallel_reduce(lambda x, y: x + y, partitioned_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Map-Reduce on a cluster of computers\n",
    "\n",
    "- The code we have written so far will *not* allow us to exploit parallelism from multiple computers in a [cluster](https://en.wikipedia.org/wiki/Computer_cluster).\n",
    "\n",
    "- Developing such a framework would be a very large software engineering project.\n",
    "\n",
    "- There are existing frameworks we can use:\n",
    "    - [Apache Hadoop](https://hadoop.apache.org/)\n",
    "    - [Apache Spark](https://spark.apache.org/)\n",
    "    \n",
    "- In this lecture we will cover Apache Spark."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apache Spark\n",
    "\n",
    "- Apache Spark provides an object-oriented library for processing data on the cluster.\n",
    "\n",
    "- It provides objects which represent resilient distributed datasets (RDDs).\n",
    "\n",
    "- RDDs behave a bit like Python collections (e.g. lists).\n",
    "\n",
    "- However:\n",
    "    - the underlying data is distributed across the nodes in the cluster, and\n",
    "    - the collections are *immutable*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Apache Spark and Map-Reduce\n",
    "\n",
    "- We process the data by using higher-order functions to map RDDs onto *new* RDDs. \n",
    "\n",
    "- Each instance of an RDD has at least two *methods* corresponding to the Map-Reduce workflow:\n",
    "    - `map`\n",
    "    - `reduceByKey`\n",
    "    \n",
    "- These methods work in the same way as the corresponding functions we defined earlier to work with the standard Python collections.  \n",
    "\n",
    "- There are also additional RDD methods in the Apache Spark API;\n",
    "    - Apache Spark is a *super-set* of Map-Reduce.\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Word-count in Apache Spark\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['to', 'be', 'or', 'not', 'to', 'be']"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words = \"to be or not to be\".split()\n",
    "words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### The `SparkContext` class\n",
    "\n",
    "- When working with Apache Spark we invoke methods on an object which is an instance of the `pyspark.context.SparkContext` context.\n",
    "\n",
    "- Typically, an instance of this object will be created automatically for you and assigned to the variable `sc`.\n",
    "\n",
    "- The `parallelize` method in `SparkContext` can be used to turn any ordinary Python collection into an RDD;\n",
    "    - normally we would create an RDD from a large file or an HBase table. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words_rdd = sc.parallelize(words)\n",
    "words_rdd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Mapping an RDD\n",
    "\n",
    "- Now when we invoke the `map` or `reduceByKey` methods on `my_rdd` we can set up a parallel processing computation across the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PythonRDD[1] at RDD at PythonRDD.scala:53"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_tuples_rdd = words_rdd.map(lambda x: (x, 1))\n",
    "word_tuples_rdd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Notice that we do not have a result yet.\n",
    "\n",
    "- The computation is not performed until we request the final result to be *collected*.\n",
    "\n",
    "- We do this by invoking the `collect()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_tuples_rdd.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Reducing an RDD\n",
    "\n",
    "- However, we require additional processing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PythonRDD[6] at RDD at PythonRDD.scala:53"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_counts_rdd = word_tuples_rdd.reduceByKey(lambda x, y: x + y)\n",
    "word_counts_rdd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now we request the final result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_counts = word_counts_rdd.collect()\n",
    "word_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Lazy evaluation \n",
    "\n",
    "- It is only when we invoke `collect()` that the processing is performed on the cluster.\n",
    "\n",
    "- Invoking `collect()` will cause both the `map` and `reduceByKey` operations to be performed.\n",
    "\n",
    "- If the resulting collection is very large then this can be an expensive operation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### The head of an RDD\n",
    "\n",
    "- The `take` method is similar to `collect`, but only returns the first $n$ elements.\n",
    "\n",
    "- This can be very useful for testing.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 2), ('be', 2)]"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_counts_rdd.take(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The complete word-count example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"to be or not to be\".split()\n",
    "rdd = sc.parallelize(text)\n",
    "counts = rdd.map(lambda word: (word, 1)) \\\n",
    "             .reduceByKey(lambda x, y: x + y)\n",
    "counts.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Additional RDD transformations\n",
    "\n",
    "- Apache Spark offers many more methods for operating on collections of tuples over and above the standard Map-Reduce framework:\n",
    "\n",
    "    - Sorting: `sortByKey`, `sortBy`, `takeOrdered`\n",
    "    - Mapping: `flatMap`\n",
    "    - Filtering: `filter`\n",
    "    - Counting: `count`\n",
    "    - Set-theoretic: `intersection`, `union`\n",
    "    - Many others: [see the Transformations section of the programming guide](https://spark.apache.org/docs/latest/programming-guide.html#transformations)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Creating an RDD from a text file\n",
    "\n",
    "- In the previous example, we created an RDD from a Python collection.\n",
    "\n",
    "- This is *not* typically how we would work with big data.\n",
    "\n",
    "- More commonly we would create an RDD corresponding to data in an\n",
    "HBase table, or an HDFS file.\n",
    "\n",
    "- The following example creates an RDD from a text file on the native filesystem (ext4);\n",
    "    - With bigger data, you would use an HDFS file, but the principle is the same.\n",
    "\n",
    "- Each element of the RDD corresponds to a single *line* of text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['TTGGCCATGCTGCCCACTCACCTAGAGCGCACAGCTGACACTGAGTCCTCTTCTGAACCTCATCCATGAA',\n",
       " 'CATATTTATGAAATCTTTCCTGGCCCCAAGTGGAAATGCCCCCTCATTTGGGTCCTCACTGAACCCCAGT',\n",
       " 'ACACAACTCTTTTGTACTACTCTATTATGCTGGGGTGTTTTTTTATTGTCTCACCTGATAAACCGTAAGC',\n",
       " 'CCCTTGAAGACAGCAACTCGTTTTTAAGCTCTTTATAACCCCAGAGCCTCGCACAGTACCTGGACCAGAT',\n",
       " 'TAAGGGGTACTTAACAGATGCTTAGTGAAGGAAGGAATGGATTTCTCACCTGGTTGCTTATCTTCTAGAC']"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "genome = sc.textFile('../../../data/genome.txt')\n",
    "genome.take(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Genome example\n",
    "\n",
    "- We will use this RDD to calculate the frequencies of sequences of five bases, and then sort the sequences into descending order ranked by their frequency.\n",
    "\n",
    "- First we will define some functions to split the bases into sequences of a certain size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "def group_characters(line, n=5):\n",
    "    result = ''\n",
    "    i = 0\n",
    "    for ch in line:\n",
    "        result = result + ch\n",
    "        i = i + 1\n",
    "        if (i % n) == 0:\n",
    "            yield result\n",
    "            result = ''\n",
    "\n",
    "def group_and_split(line):\n",
    "    return [sequence for sequence in group_characters(line)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['abcde', 'fghij', 'klmno']"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_and_split('abcdefghijklmno')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now we will transform the original text RDD into an RDD containing key-value pairs where the key is the sequence and the value is 1, as per the word-count example.\n",
    "\n",
    "- Notice that if we simply map each line of text, we will obtain multi-dimensional data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['TTGGC',\n",
       "  'CATGC',\n",
       "  'TGCCC',\n",
       "  'ACTCA',\n",
       "  'CCTAG',\n",
       "  'AGCGC',\n",
       "  'ACAGC',\n",
       "  'TGACA',\n",
       "  'CTGAG',\n",
       "  'TCCTC',\n",
       "  'TTCTG',\n",
       "  'AACCT',\n",
       "  'CATCC',\n",
       "  'ATGAA'],\n",
       " ['CATAT',\n",
       "  'TTATG',\n",
       "  'AAATC',\n",
       "  'TTTCC',\n",
       "  'TGGCC',\n",
       "  'CCAAG',\n",
       "  'TGGAA',\n",
       "  'ATGCC',\n",
       "  'CCCTC',\n",
       "  'ATTTG',\n",
       "  'GGTCC',\n",
       "  'TCACT',\n",
       "  'GAACC',\n",
       "  'CCAGT']]"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "genome.map(group_and_split).take(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Flattening an RDD using `flatMap`\n",
    "\n",
    "- We will need to flatten this data in order to turn it into a list of base-sequences.\n",
    "\n",
    "- We can use the `flatMap` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['TTGGC', 'CATGC', 'TGCCC']"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sequences = genome.flatMap(group_and_split)\n",
    "sequences.take(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('TTGGC', 587),\n",
       " ('CATGC', 647),\n",
       " ('TGCCC', 599),\n",
       " ('ACTCA', 775),\n",
       " ('TGACA', 831),\n",
       " ('TTCTG', 1257),\n",
       " ('AACCT', 726),\n",
       " ('TTATG', 819),\n",
       " ('AAATC', 996),\n",
       " ('TGGCC', 718)]"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts = \\\n",
    "    sequences.map(\n",
    "        lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)\n",
    "counts.take(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We want to rank each sequence according to its count.\n",
    "\n",
    "- Therefore the key (first element) of each tuple should be the count.\n",
    "\n",
    "- Thefefore we need to reverse the tuples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reverse_tuple(key_value_pair):\n",
    "    return (key_value_pair[1], key_value_pair[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(587, 'TTGGC'),\n",
       " (647, 'CATGC'),\n",
       " (599, 'TGCCC'),\n",
       " (775, 'ACTCA'),\n",
       " (831, 'TGACA'),\n",
       " (1257, 'TTCTG'),\n",
       " (726, 'AACCT'),\n",
       " (819, 'TTATG'),\n",
       " (996, 'AAATC'),\n",
       " (718, 'TGGCC')]"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sequences = counts.map(reverse_tuple)\n",
    "sequences.take(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sorting an RDD\n",
    "\n",
    "- Now we can sort the RDD in descending order of key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(37137, 'NNNNN'),\n",
       " (4653, 'AAAAA'),\n",
       " (4223, 'TTTTT'),\n",
       " (2788, 'AAAAT'),\n",
       " (2658, 'ATTTT'),\n",
       " (2283, 'AAATA'),\n",
       " (2276, 'TAAAA'),\n",
       " (2197, 'TTTTA'),\n",
       " (2196, 'TATTT'),\n",
       " (2185, 'AGAAA')]"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sequences_sorted = sequences.sortByKey(False)\n",
    "top_ten_sequences = sequences_sorted.take(10)\n",
    "top_ten_sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Calculating $\\pi$ using Spark\n",
    "\n",
    "- We can estimate an approximate value for $\\pi$ using the following Monte-Carlo method:\n",
    "\n",
    "\n",
    "1.    Inscribe a circle in a square\n",
    "2.    Randomly generate points in the square\n",
    "3.    Determine the number of points in the square that are also in the circle\n",
    "4.    Let $r$ be the number of points in the circle divided by the number of points in the square, then $\\pi \\approx 4 r$.\n",
    "    \n",
    "- Note that the more points generated, the better the approximation\n",
    "\n",
    "See [this tutorial](https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pi is approximately 3.141\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sample(p):\n",
    "    x, y = np.random.random(), np.random.random()\n",
    "    return 1 if x*x + y*y < 1 else 0\n",
    "\n",
    "NUM_SAMPLES = 5000000\n",
    "\n",
    "count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample) \\\n",
    "             .reduce(lambda a, b: a + b)\n",
    "r = float(count) / float(NUM_SAMPLES)\n",
    "print(\"Pi is approximately %.3f\" % (4.0 * r))"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}