{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#MapReduce" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will talk about large data and MapReduce, a popular framework for distributed computation.\n", "\n", "By the end of this section you will\n", "\n", "- Know the basic concepts of MapReduce.\n", "- Know how to design simple algorithms with MapReduce.\n", "- Know what tools Python offers to work with it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##What is Large Data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Can\u2019t fit into Excel\n", "- Increase Memory\n", "- Can\u2019t fit into R\n", "- Increase Memory\n", "- Can\u2019t fit into Memory\n", "- Increase Memory\n", "- Can\u2019t fit on a single disk\n", "- Distributed Filesystem: SAN, HDFS/DDFS, AWS: S3, Redshift, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##MapReduce" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*A framework that helps solve the problem of distributed computation over distributed data*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- A mass of data: records\n", "- Split/Map records into key-value pairs\n", "- Collect/Partition key-value pairs (optional sort)\n", "- Buckets are passed to the Reduce function\n", "- The result is returned" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }