{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Map-Reduce and Apache Spark\n", "\n", "(C) 2015 [Steve Phelps](http://sphelps.net/)\n", "\n", "## Overview\n", "\n", "1. Recap of functional programming in Python\n", "2. Python's `map` and `reduce` functions\n", "3. Writing parallel code using `map`\n", "4. The Map-Reduce programming model\n", "5. Using Apache Spark with Python\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Reading\n", "\n", "- [Introduction to Parallel Computing](https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning), Blaise Barney, Lawrence Livermore National Laboratory.\n", "\n", "- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113.\n", "\n", "- [Spark Programming Guide](https://spark.apache.org/docs/latest/programming-guide.html)\n", "\n", "- Chapters 1 and 3 of Karau, H., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## History\n", "\n", "- The Map-Reduce programming model was popularised by Google (Dean and Ghemawat 2008).\n", "\n", "- The first popular open-source implementation was Apache Hadoop, first released in 2011.\n", "\n", "- Apache Spark was first released in 2014. \n", "\n", "- It was originally developed by [Matei Zaharia](http://people.csail.mit.edu/matei) as a class project, and later a PhD dissertation, at University of California, Berkeley.\n", "\n", "- In contrast to Hadoop, Apache Spark:\n", "\n", " - is easy to install and configure.\n", " - provides a much more natural *iterative* workflow \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Resilient distributed datasets\n", "\n", "- The fundamental abstraction of Apache Spark is a read-only, parallel, distributed, fault-tolerent collection called a resilient distributed datasets (RDD).\n", "\n", "- When working with Apache Spark we iteratively apply functions to every elelement of these collections in parallel to produce *new* RDDs.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Functional programming\n", "\n", "Consider the following code:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def double_everything_in(data):\n", " result = []\n", " for i in data:\n", " result.append(2 * i)\n", " return result\n", "\n", "def quadruple_everything_in(data):\n", " result = []\n", " for i in data:\n", " result.append(4 * i)\n", " return result" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double_everything_in([1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[4, 8, 12, 16, 20]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quadruple_everything_in([1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The above code violates the [\"do not repeat yourself\"](https://en.wikipedia.org/wiki/Don't_repeat_yourself_) 
principle of good software engineering practice.\n", "\n", "- How can rewrite the code so that it avoids duplication?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def multiply_by_x_everything_in(x, data):\n", " result = []\n", " for i in data:\n", " result.append(x * i)\n", " return result" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multiply_by_x_everything_in(2, [1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[4, 8, 12, 16, 20]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multiply_by_x_everything_in(4, [1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Now consider the following code:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def squared(x):\n", " return x*x\n", "\n", "def double(x):\n", " return x*2\n", "\n", "def square_everything_in(data):\n", " result = []\n", " for i in data:\n", " result.append(squared(i))\n", " return result\n", "\n", "def double_everything_in(data):\n", " result = []\n", " for i in data:\n", " result.append(double(i))\n", " return result" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_everything_in([1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double_everything_in([1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The above code violates the [\"do not repeat yourself\"](https://en.wikipedia.org/wiki/Don't_repeat_yourself_) principle of good software engineering practice.\n", "\n", "- How can rewrite the code so that it avoids duplication?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Using functions as values\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def apply_f_to_everything_in(f, data):\n", " result = []\n", " for x in data:\n", " result.append(f(x))\n", " return result" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "apply_f_to_everything_in(squared, [1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "apply_f_to_everything_in(double, [1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Lambda expressions\n", "\n", "- We can use anonymous functions to save having to define a function each time we want to use map." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "apply_f_to_everything_in(lambda x: x*x, [1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Python's `map` function\n", "\n", "- Python has a built-in function `map` which is much faster than our version.\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(map(lambda x: x*x, [1, 2, 3, 4, 5]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Implementing reduce\n", "\n", "- The `reduce` function is an example of a [fold](https://en.wikipedia.org/wiki/Fold_%28higher-order_function%29).\n", "\n", "- There are different ways we can fold data.\n", "\n", "- The following implements a *left* fold.\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def foldl(f, data, z):\n", " if (len(data) == 0):\n", " print(z)\n", " return z\n", " else:\n", " head = data[0]\n", " tail = data[1:]\n", " print(\"Folding\", head, \"with\", tail, \"using\", z)\n", " partial_result = f(z, data[0])\n", " print(\"Partial result is\", partial_result)\n", " return foldl(f, tail, partial_result) " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Folding 1 with [2, 3, 4, 5] using 0\n", "Partial result is 1\n", "Folding 2 with [3, 4, 5] using 1\n", "Partial result is 3\n", "Folding 3 with [4, 5] using 3\n", "Partial result is 6\n", "Folding 4 with [5] using 6\n", "Partial result is 10\n", "Folding 5 with [] using 10\n", "Partial result is 15\n", "15\n" ] }, { "data": { "text/plain": [ "15" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add(x, y):\n", " return x + y\n", "\n", "foldl(add, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Folding 1 with [2, 3, 4, 5] using 0\n", "Partial result is 1\n", "Folding 2 with [3, 4, 5] using 1\n", "Partial result is 3\n", "Folding 3 with [4, 5] using 3\n", "Partial result is 6\n", "Folding 4 with [5] using 6\n", "Partial result is 10\n", "Folding 5 with [] using 10\n", "Partial result is 15\n", "15\n" ] }, { "data": { "text/plain": [ "15" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foldl(lambda x, y: x + y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Folding 1 with [2, 3, 4, 5] using 0\n", "Partial result is -1\n", "Folding 2 with [3, 4, 5] using -1\n", "Partial result is -3\n", "Folding 3 with [4, 5] using -3\n", "Partial result is -6\n", "Folding 4 with [5] using -6\n", "Partial result is -10\n", "Folding 5 with [] using -10\n", "Partial result is -15\n", "-15\n" ] }, { "data": { "text/plain": [ "-15" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foldl(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "-15" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(((((0 - 1) - 2) - 3) - 4) - 5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Subtraction is neither [commutative](https://en.wikipedia.org/wiki/Commutative_property) nor [associative](https://en.wikipedia.org/wiki/Associative_property), so the order in which apply the fold matters:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(1 - (2 - (3 - (4 - (5 - 0)))))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def foldr(f, data, z):\n", " if (len(data) == 0):\n", " return z\n", " else:\n", " return f(data[0], foldr(f, data[1:], z)) " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Folding 1 with [2, 3, 4, 5] using 0\n", "Partial result is -1\n", "Folding 2 with [3, 4, 5] using -1\n", "Partial result is -3\n", "Folding 3 with [4, 5] using -3\n", "Partial result is -6\n", "Folding 4 with [5] using -6\n", "Partial result is -10\n", "Folding 5 with [] using -10\n", "Partial result is -15\n", "-15\n" ] }, { "data": { "text/plain": [ "-15" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foldl(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foldr(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Python's `reduce` function.\n", "\n", "- Python's built-in `reduce` function is a *left* fold." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from functools import reduce\n", "\n", "reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-15" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reduce(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Folding 1 with [2, 3, 4, 5] using 0\n", "Partial result is -1\n", "Folding 2 with [3, 4, 5] using -1\n", "Partial result is -3\n", "Folding 3 with [4, 5] using -3\n", "Partial result is -6\n", "Folding 4 with [5] using -6\n", "Partial result is -10\n", "Folding 5 with [] using -10\n", "Partial result is -15\n", "-15\n" ] }, { "data": { "text/plain": [ "-15" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foldl(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Functional programming and parallelism\n", "\n", "- Functional programming lends itself to [parallel programming](https://computing.llnl.gov/tutorials/parallel_comp/#Models).\n", "\n", "- The `map` function can easily be parallelised through [data-level parallelism](https://en.wikipedia.org/wiki/Data_parallelism),\n", " - provided that the function we supply as an argument is *free from* [side-effects](https://en.wikipedia.org/wiki/Side_effect_%28computer_science%29)\n", " - (which is why we avoid working with mutable data).\n", "\n", "- We can see this by rewriting it so:\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def perform_computation(f, result, data, i):\n", " print(\"Computing the \", i, \"th result...\")\n", " # This could be scheduled on a different CPU\n", " result[i] = f(data[i])\n", "\n", "def my_map(f, data):\n", " result = [None] * len(data)\n", " for i in range(len(data)):\n", " perform_computation(f, result, data, i)\n", " # Wait for other CPUs to finish, and then..\n", " return result" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computing the 0 th result...\n", "Computing the 1 th result...\n", "Computing the 2 th result...\n", "Computing the 3 th result...\n", "Computing the 4 th result...\n" ] }, { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_map(lambda x: x * x, [1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A multi-threaded `map` function" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from threading import Thread\n", "\n", "def schedule_computation_threaded(f, result, data, threads, i): \n", " # Each function evaluation is scheduled on a different core.\n", " def my_job(): \n", " print(\"Processing data:\", data[i], \"... 
\")\n", " result[i] = f(data[i])\n", " print(\"Finished job #\", i) \n", " print(\"Result was\", result[i])\n", " threads[i] = Thread(target=my_job)\n", " \n", "def my_map_multithreaded(f, data):\n", " n = len(data)\n", " result = [None] * n\n", " threads = [None] * n\n", " print(\"Scheduling jobs.. \")\n", " for i in range(n):\n", " schedule_computation_threaded(f, result, data, threads, i)\n", " print(\"Starting jobs.. \")\n", " for i in range(n):\n", " threads[i].start()\n", " print(\"Waiting for jobs to finish.. \")\n", " for i in range(n):\n", " threads[i].join()\n", " print(\"All done.\")\n", " return result" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scheduling jobs.. \n", "Starting jobs.. \n", "Processing data: 1 ... \n", "Finished job # 0\n", "Result was 1\n", "Processing data: 2 ... \n", "Finished job # 1\n", "Result was 4\n", "Processing data: 3 ... \n", "Finished job # 2\n", "Result was 9\n", "Processing data:Processing data: 5 ... \n", "Finished job # 4\n", "Result was 25\n", " Waiting for jobs to finish.. 4 ... \n", "Finished job # 3\n", "Result was 16\n", "\n", "All done.\n" ] }, { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_map_multithreaded(lambda x: x*x, [1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scheduling jobs.. \n", "Starting jobs.. \n", "Processing data:Processing data: 2 ... \n", " 1 ... \n", "Processing data: 3 ... \n", "Processing data: 4 ... \n", "Processing data:Waiting for jobs to finish.. \n", " 5 ... \n", "Finished job # 2\n", "Result was 9\n", "Finished job # 4\n", "Result was 25\n", "Finished job # 0\n", "Result was 1\n", "Finished job # 3\n", "Result was 16\n", "Finished job # 1\n", "Result was 4\n", "All done.\n" ] }, { "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from numpy.random import uniform\n", "from time import sleep\n", "\n", "def a_function_which_takes_a_long_time(x):\n", " sleep(uniform(2, 10)) # Simulate some long computation\n", " return x*x\n", "\n", "my_map_multithreaded(a_function_which_takes_a_long_time, [1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Map Reduce\n", "\n", "- Map Reduce is a _programming model_ for scalable parallel processing.\n", "- Scalable here means that it can work on big data with very large compute clusters.\n", "- There are many implementations: e.g. Apache Hadoop and Apache Spark.\n", "- We can use Map-Reduce with any programming language:\n", " - Hadoop is written in Java\n", " - Spark is written in Scala, but has a Python interface.\n", "- *Functional programming* languages such as Python or Scala fit very well with the Map Reduce model:\n", " - However, we don't *have* to use functional programming." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- A MapReduce implementation will take care of the low-level functionality so that you don't have to worry about:\n", " - load balancing\n", " - network I/O\n", " - network and disk transfer optimisation\n", " - handling of machine failures\n", " - serialization of data\n", " - etc..\n", "- The model is designed to move the processing to where the data resides." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Typical steps in a Map Reduce Computation\n", "\n", "1. ETL a big data set.\n", "2. _Map_ operation: extract something you care about from each row\n", "3. \"Shuffle and Sort\": task/node allocation\n", "4. _Reduce_ operation: aggregate, summarise, filter or transform\n", "5. Write the results." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Callbacks for Map Reduce\n", "\n", "- The data set, and the state of each stage of the computation, is represented as a set of key-value pairs.\n", "\n", "- The programmer provides a map function:\n", "\n", "$\\operatorname{map}(k, v) \\rightarrow \\; \\left< k', v' \\right>*$ \n", "\n", "- and a reduce function:\n", "\n", "$\\operatorname{reduce}(k', \\left< k', v'\\right> *) \\rightarrow \\; \\left< k', v''\n", "\\right> *$\n", "\n", "- The $*$ refers to a *collection* of values.\n", "\n", "- These collections are *not* ordered." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Resilient Distributed Data\n", "\n", "- In a Map-Reduce computation these collections are resilient distributed data-sets (RDDs):\n", " - The data is distributed across nodes in a cluster of computers.\n", " - No data is lost if a single node fails.\n", " - Data is typically stored in HBase tables, or HDFS files.\n", " - The `map` and `reduce` functions can work in *parallel* across\n", " different keys, or different elements of the collection.\n", "\n", "- The underlying framework (e.g. Hadoop or Apache Spark) allocates data and processing to different nodes, without any intervention from the programmer." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word Count Example\n", "\n", "- In this simple example, the input is a set of URLs, each record is a document.\n", "\n", "- Problem: compute how many times each word has occurred across data set." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word Count: Map \n", "\n", "\n", "- The input to $\\operatorname{map}$ is a mapping:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Key: URL\n", "- Value: Contents of document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\left< document1, to \\; be \\; or \\; not \\; to \\; be \\right>$ \n", " \n", "\n", "- In this example, our $\\operatorname{map}$ function will process a given URL, and produces a mapping:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Key: word\n", "- Value: 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- So our original data-set will be transformed to:\n", " \n", " $\\left< to, 1 \\right>$\n", " $\\left< be, 1 \\right>$\n", " $\\left< or, 1 \\right>$\n", " $\\left< not, 1 \\right>$\n", " $\\left< to, 1 \\right>$\n", " $\\left< be, 1 \\right>$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word Count: Reduce\n", "\n", "\n", "- The reduce operation groups values according to their key, and then performs areduce on each key.\n", "\n", "- The collections are partitioned across different storage units, therefore.\n", "\n", "- Map-Reduce will fold the data in such a way that it minimises data-copying across the cluster.\n", "\n", "- Data in different partitions are reduced separately in parallel.\n", "\n", "- The final result is a reduce of the reduced data in each partition.\n", "\n", "- Therefore it is very important that our operator *is both commutative and associative*.\n", "\n", "- In our case the function is the `+` operator\n", "\n", " $\\left< be, 2 \\right>$ \n", " $\\left< not, 1 \\right>$ \n", " $\\left< or, 1 \\right>$ \n", " $\\left< to, 2 \\right>$ \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Map and Reduce compared with Python\n", "\n", "- Notice that these functions are formulated differently from the standard Python functions of the same name.\n", "\n", "- The `reduce` function works with key-value *pairs*.\n", "\n", "- It would be more apt to call it something like `reduceByKey`.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MiniMapReduce\n", "\n", "- To illustrate how the Map-Reduce programming model works, we can implement our own Map-Reduce framework in Python.\n", "\n", "- This *illustrates* how a problem can be written in terms of `map` and `reduce` operations.\n", "\n", "- Note that these are illustrative functions; this is *not* how Hadoop or Apache Spark actually implement them." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "##########################################################\n", "#\n", "# MiniMapReduce\n", "#\n", "# A non-parallel, non-scalable Map-Reduce implementation\n", "##########################################################\n", "\n", "def groupByKey(data):\n", " result = dict()\n", " for key, value in data:\n", " if key in result:\n", " result[key].append(value)\n", " else:\n", " result[key] = [value]\n", " return result\n", " \n", "def reduceByKey(f, data):\n", " key_values = groupByKey(data)\n", " return list(map(lambda key: \n", " (key, reduce(f, key_values[key])), \n", " key_values))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word-count using MiniMapReduce\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = list(map(lambda x: (x, 1), \"to be or not to be\".split()))\n", "data" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groupByKey(data)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reduceByKey(lambda x, y: x + y, data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallelising MiniMapReduce\n", "\n", "- We can easily turn our Map-Reduce implementation into a parallel, multi-threaded framework\n", "by using the `my_map_multithreaded` function we defined earlier.\n", "\n", "- This will allow us to perform map-reduce computations that exploit parallel processing using *multiple* cores on a *single* computer." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def reduceByKey_multithreaded(f, data):\n", " key_values = groupByKey(data)\n", " return my_map_multithreaded(\n", " lambda key: (key, reduce(f, key_values[key])), [key for key in key_values])" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scheduling jobs.. \n", "Starting jobs.. \n", "Processing data: to ... \n", "Finished job # 0\n", "Result was ('to', 2)\n", "Processing data: be ... \n", "Finished job # 1\n", "Result was ('be', 2)\n", "Processing data: or ... \n", "Finished job # 2\n", "Result was ('or', 1)\n", "Processing data: not ... \n", "Finished job # 3\n", "Result was ('not', 1)\n", "Waiting for jobs to finish.. 
\n", "All done.\n" ] }, { "data": { "text/plain": [ "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reduceByKey_multithreaded(lambda x, y: x + y, data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Parallelising the reduce step\n", "\n", "- Provided that our operator is both associative and commutative we can\n", "also parallelise the reduce operation.\n", "\n", "- We partition the data into approximately equal subsets.\n", "\n", "- We then reduce each subset independently on a separate core.\n", "\n", "- The results can be combined in a final reduce step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Partitioning the data" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['a', 'b', 'c'], ['d', 'e', 'f', 'g']]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def split_data(data, split_points):\n", " partitions = []\n", " n = 0\n", " for i in split_points:\n", " partitions.append(data[n:i])\n", " n = i\n", " partitions.append(data[n:])\n", " return partitions\n", "\n", "data = ['a', 'b', 'c', 'd', 'e', 'f', 'g']\n", "partitioned_data = split_data(data, [3])\n", "partitioned_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reducing across partitions in parallel" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abcdefg'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from threading import Thread\n", "\n", "def parallel_reduce(f, partitions):\n", "\n", " n = len(partitions)\n", " results = [None] * n\n", " threads = [None] * n\n", " \n", " def job(i):\n", " results[i] = reduce(f, partitions[i])\n", "\n", " for i in range(n):\n", " threads[i] = Thread(target = lambda: job(i))\n", " threads[i].start()\n", " \n", " for i in range(n):\n", " threads[i].join()\n", " \n", " return reduce(f, results)\n", "\n", "parallel_reduce(lambda x, y: x + y, partitioned_data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Map-Reduce on a cluster of computers\n", "\n", "- The code we have written so far will *not* allow us to exploit parallelism from multiple computers in a [cluster](https://en.wikipedia.org/wiki/Computer_cluster).\n", "\n", "- Developing such a framework would be a very large software engineering project.\n", "\n", "- There are existing frameworks we can use:\n", " - [Apache Hadoop](https://hadoop.apache.org/)\n", " - [Apache Spark](https://spark.apache.org/)\n", " \n", "- In this lecture we will cover Apache Spark." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apache Spark\n", "\n", "- Apache Spark provides an object-oriented library for processing data on the cluster.\n", "\n", "- It provides objects which represent resilient distributed datasets (RDDs).\n", "\n", "- RDDs behave a bit like Python collections (e.g. lists).\n", "\n", "- However:\n", " - the underlying data is distributed across the nodes in the cluster, and\n", " - the collections are *immutable*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Apache Spark and Map-Reduce\n", "\n", "- We process the data by using higher-order functions to map RDDs onto *new* RDDs. 
\n", "\n", "- Each instance of an RDD has at least two *methods* corresponding to the Map-Reduce workflow:\n", " - `map`\n", " - `reduceByKey`\n", " \n", "- These methods work in the same way as the corresponding functions we defined earlier to work with the standard Python collections. \n", "\n", "- There are also additional RDD methods in the Apache Spark API;\n", " - Apache Spark is a *super-set* of Map-Reduce.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word-count in Apache Spark\n", "\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "['to', 'be', 'or', 'not', 'to', 'be']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = \"to be or not to be\".split()\n", "words" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The `SparkContext` class\n", "\n", "- When working with Apache Spark we invoke methods on an object which is an instance of the `pyspark.context.SparkContext` context.\n", "\n", "- Typically, an instance of this object will be created automatically for you and assigned to the variable `sc`.\n", "\n", "- The `parallelize` method in `SparkContext` can be used to turn any ordinary Python collection into an RDD;\n", " - normally we would create an RDD from a large file or an HBase table. " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words_rdd = sc.parallelize(words)\n", "words_rdd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Mapping an RDD\n", "\n", "- Now when we invoke the `map` or `reduceByKey` methods on `my_rdd` we can set up a parallel processing computation across the cluster." 
] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PythonRDD[1] at RDD at PythonRDD.scala:53" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_tuples_rdd = words_rdd.map(lambda x: (x, 1))\n", "word_tuples_rdd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Notice that we do not have a result yet.\n", "\n", "- The computation is not performed until we request the final result to be *collected*.\n", "\n", "- We do this by invoking the `collect()` method:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_tuples_rdd.collect()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reducing an RDD\n", "\n", "- However, we require additional processing:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PythonRDD[6] at RDD at PythonRDD.scala:53" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_counts_rdd = word_tuples_rdd.reduceByKey(lambda x, y: x + y)\n", "word_counts_rdd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Now we request the final result:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_counts = word_counts_rdd.collect()\n", "word_counts" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Lazy evaluation \n", "\n", "- It is only when we invoke `collect()` that the processing is performed on the cluster.\n", "\n", "- Invoking `collect()` will cause both the `map` and `reduceByKey` operations to be performed.\n", "\n", "- If the resulting collection is very large then this can be an expensive operation.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The head of an RDD\n", "\n", "- The `take` method is similar to `collect`, but only returns the first $n$ elements.\n", "\n", "- This can be very useful for testing.\n" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('to', 2), ('be', 2)]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_counts_rdd.take(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The complete word-count example" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('to', 2), ('be', 2), ('or', 1), ('not', 1)]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"to be or not to be\".split()\n", "rdd = sc.parallelize(text)\n", "counts = rdd.map(lambda word: (word, 1)) \\\n", " .reduceByKey(lambda x, y: x + y)\n", "counts.collect()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Additional RDD transformations\n", "\n", "- Apache Spark offers many more methods for operating on collections of 
tuples over and above the standard Map-Reduce framework:\n", "\n", " - Sorting: `sortByKey`, `sortBy`, `takeOrdered`\n", " - Mapping: `flatMap`\n", " - Filtering: `filter`\n", " - Counting: `count`\n", " - Set-theoretic: `intersection`, `union`\n", " - Many others: [see the Transformations section of the programming guide](https://spark.apache.org/docs/latest/programming-guide.html#transformations)\n", " " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Creating an RDD from a text file\n", "\n", "- In the previous example, we created an RDD from a Python collection.\n", "\n", "- This is *not* typically how we would work with big data.\n", "\n", "- More commonly we would create an RDD corresponding to data in an\n", "HBase table, or an HDFS file.\n", "\n", "- The following example creates an RDD from a text file on the native filesystem (ext4);\n", " - With bigger data, you would use an HDFS file, but the principle is the same.\n", "\n", "- Each element of the RDD corresponds to a single *line* of text." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['TTGGCCATGCTGCCCACTCACCTAGAGCGCACAGCTGACACTGAGTCCTCTTCTGAACCTCATCCATGAA',\n", " 'CATATTTATGAAATCTTTCCTGGCCCCAAGTGGAAATGCCCCCTCATTTGGGTCCTCACTGAACCCCAGT',\n", " 'ACACAACTCTTTTGTACTACTCTATTATGCTGGGGTGTTTTTTTATTGTCTCACCTGATAAACCGTAAGC',\n", " 'CCCTTGAAGACAGCAACTCGTTTTTAAGCTCTTTATAACCCCAGAGCCTCGCACAGTACCTGGACCAGAT',\n", " 'TAAGGGGTACTTAACAGATGCTTAGTGAAGGAAGGAATGGATTTCTCACCTGGTTGCTTATCTTCTAGAC']" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genome = sc.textFile('../../../data/genome.txt')\n", "genome.take(5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Genome example\n", "\n", "- We will use this RDD to calculate the frequencies of sequences of five bases, and then sort the sequences into descending order ranked by their frequency.\n", "\n", "- First we will define some functions to split the bases into sequences of a certain size:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "def group_characters(line, n=5):\n", " result = ''\n", " i = 0\n", " for ch in line:\n", " result = result + ch\n", " i = i + 1\n", " if (i % n) == 0:\n", " yield result\n", " result = ''\n", "\n", "def group_and_split(line):\n", " return [sequence for sequence in group_characters(line)]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abcde', 'fghij', 'klmno']" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_and_split('abcdefghijklmno')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Now we will transform the original text RDD into an RDD containing key-value pairs where the key is the sequence and the value is 1, as per the word-count example.\n", "\n", "- Notice that if we simply map each line of text, we will obtain multi-dimensional data:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['TTGGC',\n", " 'CATGC',\n", " 'TGCCC',\n", " 'ACTCA',\n", " 'CCTAG',\n", " 'AGCGC',\n", " 'ACAGC',\n", " 'TGACA',\n", " 'CTGAG',\n", " 'TCCTC',\n", " 'TTCTG',\n", " 'AACCT',\n", " 'CATCC',\n", " 'ATGAA'],\n", " ['CATAT',\n", " 'TTATG',\n", " 'AAATC',\n", " 'TTTCC',\n", " 'TGGCC',\n", " 'CCAAG',\n", " 
'TGGAA',\n", " 'ATGCC',\n", " 'CCCTC',\n", " 'ATTTG',\n", " 'GGTCC',\n", " 'TCACT',\n", " 'GAACC',\n", " 'CCAGT']]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genome.map(group_and_split).take(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Flattening an RDD using `flatMap`\n", "\n", "- We will need to flatten this data in order to turn it into a list of base-sequences.\n", "\n", "- We can use the `flatMap` method:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['TTGGC', 'CATGC', 'TGCCC']" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequences = genome.flatMap(group_and_split)\n", "sequences.take(3)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('TTGGC', 587),\n", " ('CATGC', 647),\n", " ('TGCCC', 599),\n", " ('ACTCA', 775),\n", " ('TGACA', 831),\n", " ('TTCTG', 1257),\n", " ('AACCT', 726),\n", " ('TTATG', 819),\n", " ('AAATC', 996),\n", " ('TGGCC', 718)]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts = \\\n", " sequences.map(\n", " lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)\n", "counts.take(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We want to rank each sequence according to its count.\n", "\n", "- Therefore the key (first element) of each tuple should be the count.\n", "\n", "- Thefefore we need to reverse the tuples." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "def reverse_tuple(key_value_pair):\n", " return (key_value_pair[1], key_value_pair[0])" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(587, 'TTGGC'),\n", " (647, 'CATGC'),\n", " (599, 'TGCCC'),\n", " (775, 'ACTCA'),\n", " (831, 'TGACA'),\n", " (1257, 'TTCTG'),\n", " (726, 'AACCT'),\n", " (819, 'TTATG'),\n", " (996, 'AAATC'),\n", " (718, 'TGGCC')]" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequences = counts.map(reverse_tuple)\n", "sequences.take(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting an RDD\n", "\n", "- Now we can sort the RDD in descending order of key:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(37137, 'NNNNN'),\n", " (4653, 'AAAAA'),\n", " (4223, 'TTTTT'),\n", " (2788, 'AAAAT'),\n", " (2658, 'ATTTT'),\n", " (2283, 'AAATA'),\n", " (2276, 'TAAAA'),\n", " (2197, 'TTTTA'),\n", " (2196, 'TATTT'),\n", " (2185, 'AGAAA')]" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequences_sorted = sequences.sortByKey(False)\n", "top_ten_sequences = sequences_sorted.take(10)\n", "top_ten_sequences" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Calculating $\\pi$ using Spark\n", "\n", "- We can estimate an approximate value for $\\pi$ using the following Monte-Carlo method:\n", "\n", "\n", "1. Inscribe a circle in a square\n", "2. Randomly generate points in the square\n", "3. Determine the number of points in the square that are also in the circle\n", "4. 
Let $r$ be the number of points in the circle divided by the number of points in the square, then $\\pi \\approx 4 r$.\n", " \n", "- Note that the more points generated, the better the approximation\n", "\n", "See [this tutorial](https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pi is approximately 3.141\n" ] } ], "source": [ "import numpy as np\n", "\n", "def sample(p):\n", " x, y = np.random.random(), np.random.random()\n", " return 1 if x*x + y*y < 1 else 0\n", "\n", "NUM_SAMPLES = 5000000\n", "\n", "count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample) \\\n", " .reduce(lambda a, b: a + b)\n", "r = float(count) / float(NUM_SAMPLES)\n", "print(\"Pi is approximately %.3f\" % (4.0 * r))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }