{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/groda/big_data/blob/master/MapReduce_Primer_HelloWorld_bash.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a href=\"https://github.com/groda/big_data\"><div><img src=\"https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true\" align=right width=\"90\"></div></a>\n",
        "\n",
        "# MapReduce: A Primer with <code>Hello World!</code> in bash\n",
        "<br>\n",
        "<br>\n",
        "\n",
        "This tutorial serves as a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb), with the implementation carried out in the Bash scripting language requiring only a few lines of code.\n",
        "\n",
        "For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n",
        "\n",
        "> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n",
        "\n",
        "(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n",
        "\n",
        "We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n",
        "\n",
        "> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n",
        "\n",
        "MapReduce streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), thus eliminating the need for explicit specification of a mapper or reducer.\n",
        "\n",
        "Both input and output are standard files since Hadoop's default filesystem is the regular file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml)).\n"
      ],
      "metadata": {
        "id": "YCd6jCrqlSXw"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6j5zZwJMkc6C",
        "outputId": "4d77a268-69fe-4cc0-cede-8e78d7f087f2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "output20240324T1947:\n",
            "total 4\n",
            "-rw-r--r-- 1 root root 16 Mar 24 19:47 part-00000\n",
            "-rw-r--r-- 1 root root  0 Mar 24 19:47 _SUCCESS\n",
            "\n",
            "output20240324T1948:\n",
            "total 4\n",
            "-rw-r--r-- 1 root root 16 Mar 24 19:48 part-00000\n",
            "-rw-r--r-- 1 root root  0 Mar 24 19:48 _SUCCESS\n",
            "0\tHello, World!\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "2024-03-24 19:48:27,531 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n",
            "2024-03-24 19:48:27,701 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n",
            "2024-03-24 19:48:27,702 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n",
            "2024-03-24 19:48:27,727 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n",
            "2024-03-24 19:48:28,055 INFO mapred.FileInputFormat: Total input files to process : 1\n",
            "2024-03-24 19:48:28,082 INFO mapreduce.JobSubmitter: number of splits:1\n",
            "2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local723241263_0001\n",
            "2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Executing with tokens: []\n",
            "2024-03-24 19:48:28,686 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n",
            "2024-03-24 19:48:28,688 INFO mapreduce.Job: Running job: job_local723241263_0001\n",
            "2024-03-24 19:48:28,697 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n",
            "2024-03-24 19:48:28,700 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n",
            "2024-03-24 19:48:28,709 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
            "2024-03-24 19:48:28,713 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
            "2024-03-24 19:48:28,767 INFO mapred.LocalJobRunner: Waiting for map tasks\n",
            "2024-03-24 19:48:28,773 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_m_000000_0\n",
            "2024-03-24 19:48:28,822 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
            "2024-03-24 19:48:28,825 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
            "2024-03-24 19:48:28,855 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]\n",
            "2024-03-24 19:48:28,866 INFO mapred.MapTask: Processing split: file:/content/hello.txt:0+14\n",
            "2024-03-24 19:48:28,884 INFO mapred.MapTask: numReduceTasks: 1\n",
            "2024-03-24 19:48:28,969 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)\n",
            "2024-03-24 19:48:28,969 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100\n",
            "2024-03-24 19:48:28,969 INFO mapred.MapTask: soft limit at 83886080\n",
            "2024-03-24 19:48:28,969 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600\n",
            "2024-03-24 19:48:28,969 INFO mapred.MapTask: kvstart = 26214396; length = 6553600\n",
            "2024-03-24 19:48:28,976 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer\n",
            "2024-03-24 19:48:28,985 INFO mapred.LocalJobRunner: \n",
            "2024-03-24 19:48:28,985 INFO mapred.MapTask: Starting flush of map output\n",
            "2024-03-24 19:48:28,985 INFO mapred.MapTask: Spilling map output\n",
            "2024-03-24 19:48:28,985 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600\n",
            "2024-03-24 19:48:28,985 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600\n",
            "2024-03-24 19:48:28,993 INFO mapred.MapTask: Finished spill 0\n",
            "2024-03-24 19:48:29,009 INFO mapred.Task: Task:attempt_local723241263_0001_m_000000_0 is done. And is in the process of committing\n",
            "2024-03-24 19:48:29,015 INFO mapred.LocalJobRunner: file:/content/hello.txt:0+14\n",
            "2024-03-24 19:48:29,015 INFO mapred.Task: Task 'attempt_local723241263_0001_m_000000_0' done.\n",
            "2024-03-24 19:48:29,025 INFO mapred.Task: Final Counters for attempt_local723241263_0001_m_000000_0: Counters: 17\n",
            "\tFile System Counters\n",
            "\t\tFILE: Number of bytes read=141410\n",
            "\t\tFILE: Number of bytes written=776343\n",
            "\t\tFILE: Number of read operations=0\n",
            "\t\tFILE: Number of large read operations=0\n",
            "\t\tFILE: Number of write operations=0\n",
            "\tMap-Reduce Framework\n",
            "\t\tMap input records=1\n",
            "\t\tMap output records=1\n",
            "\t\tMap output bytes=22\n",
            "\t\tMap output materialized bytes=30\n",
            "\t\tInput split bytes=75\n",
            "\t\tCombine input records=0\n",
            "\t\tSpilled Records=1\n",
            "\t\tFailed Shuffles=0\n",
            "\t\tMerged Map outputs=0\n",
            "\t\tGC time elapsed (ms)=0\n",
            "\t\tTotal committed heap usage (bytes)=407896064\n",
            "\tFile Input Format Counters \n",
            "\t\tBytes Read=14\n",
            "2024-03-24 19:48:29,025 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_m_000000_0\n",
            "2024-03-24 19:48:29,026 INFO mapred.LocalJobRunner: map task executor complete.\n",
            "2024-03-24 19:48:29,031 INFO mapred.LocalJobRunner: Waiting for reduce tasks\n",
            "2024-03-24 19:48:29,035 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_r_000000_0\n",
            "2024-03-24 19:48:29,046 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
            "2024-03-24 19:48:29,047 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
            "2024-03-24 19:48:29,047 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]\n",
            "2024-03-24 19:48:29,054 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a68130c\n",
            "2024-03-24 19:48:29,056 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n",
            "2024-03-24 19:48:29,079 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10\n",
            "2024-03-24 19:48:29,095 INFO reduce.EventFetcher: attempt_local723241263_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events\n",
            "2024-03-24 19:48:29,158 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local723241263_0001_m_000000_0 decomp: 26 len: 30 to MEMORY\n",
            "2024-03-24 19:48:29,162 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local723241263_0001_m_000000_0\n",
            "2024-03-24 19:48:29,164 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26\n",
            "2024-03-24 19:48:29,168 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning\n",
            "2024-03-24 19:48:29,170 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
            "2024-03-24 19:48:29,170 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs\n",
            "2024-03-24 19:48:29,177 INFO mapred.Merger: Merging 1 sorted segments\n",
            "2024-03-24 19:48:29,178 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n",
            "2024-03-24 19:48:29,179 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit\n",
            "2024-03-24 19:48:29,180 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk\n",
            "2024-03-24 19:48:29,181 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce\n",
            "2024-03-24 19:48:29,181 INFO mapred.Merger: Merging 1 sorted segments\n",
            "2024-03-24 19:48:29,182 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n",
            "2024-03-24 19:48:29,183 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
            "2024-03-24 19:48:29,193 INFO mapred.Task: Task:attempt_local723241263_0001_r_000000_0 is done. And is in the process of committing\n",
            "2024-03-24 19:48:29,195 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
            "2024-03-24 19:48:29,195 INFO mapred.Task: Task attempt_local723241263_0001_r_000000_0 is allowed to commit now\n",
            "2024-03-24 19:48:29,197 INFO output.FileOutputCommitter: Saved output of task 'attempt_local723241263_0001_r_000000_0' to file:/content/output20240324T1948\n",
            "2024-03-24 19:48:29,198 INFO mapred.LocalJobRunner: reduce > reduce\n",
            "2024-03-24 19:48:29,198 INFO mapred.Task: Task 'attempt_local723241263_0001_r_000000_0' done.\n",
            "2024-03-24 19:48:29,199 INFO mapred.Task: Final Counters for attempt_local723241263_0001_r_000000_0: Counters: 24\n",
            "\tFile System Counters\n",
            "\t\tFILE: Number of bytes read=141502\n",
            "\t\tFILE: Number of bytes written=776401\n",
            "\t\tFILE: Number of read operations=0\n",
            "\t\tFILE: Number of large read operations=0\n",
            "\t\tFILE: Number of write operations=0\n",
            "\tMap-Reduce Framework\n",
            "\t\tCombine input records=0\n",
            "\t\tCombine output records=0\n",
            "\t\tReduce input groups=1\n",
            "\t\tReduce shuffle bytes=30\n",
            "\t\tReduce input records=1\n",
            "\t\tReduce output records=1\n",
            "\t\tSpilled Records=1\n",
            "\t\tShuffled Maps =1\n",
            "\t\tFailed Shuffles=0\n",
            "\t\tMerged Map outputs=1\n",
            "\t\tGC time elapsed (ms)=0\n",
            "\t\tTotal committed heap usage (bytes)=407896064\n",
            "\tShuffle Errors\n",
            "\t\tBAD_ID=0\n",
            "\t\tCONNECTION=0\n",
            "\t\tIO_ERROR=0\n",
            "\t\tWRONG_LENGTH=0\n",
            "\t\tWRONG_MAP=0\n",
            "\t\tWRONG_REDUCE=0\n",
            "\tFile Output Format Counters \n",
            "\t\tBytes Written=28\n",
            "2024-03-24 19:48:29,199 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_r_000000_0\n",
            "2024-03-24 19:48:29,200 INFO mapred.LocalJobRunner: reduce task executor complete.\n",
            "2024-03-24 19:48:29,694 INFO mapreduce.Job: Job job_local723241263_0001 running in uber mode : false\n",
            "2024-03-24 19:48:29,696 INFO mapreduce.Job:  map 100% reduce 100%\n",
            "2024-03-24 19:48:29,697 INFO mapreduce.Job: Job job_local723241263_0001 completed successfully\n",
            "2024-03-24 19:48:29,707 INFO mapreduce.Job: Counters: 30\n",
            "\tFile System Counters\n",
            "\t\tFILE: Number of bytes read=282912\n",
            "\t\tFILE: Number of bytes written=1552744\n",
            "\t\tFILE: Number of read operations=0\n",
            "\t\tFILE: Number of large read operations=0\n",
            "\t\tFILE: Number of write operations=0\n",
            "\tMap-Reduce Framework\n",
            "\t\tMap input records=1\n",
            "\t\tMap output records=1\n",
            "\t\tMap output bytes=22\n",
            "\t\tMap output materialized bytes=30\n",
            "\t\tInput split bytes=75\n",
            "\t\tCombine input records=0\n",
            "\t\tCombine output records=0\n",
            "\t\tReduce input groups=1\n",
            "\t\tReduce shuffle bytes=30\n",
            "\t\tReduce input records=1\n",
            "\t\tReduce output records=1\n",
            "\t\tSpilled Records=2\n",
            "\t\tShuffled Maps =1\n",
            "\t\tFailed Shuffles=0\n",
            "\t\tMerged Map outputs=1\n",
            "\t\tGC time elapsed (ms)=0\n",
            "\t\tTotal committed heap usage (bytes)=815792128\n",
            "\tShuffle Errors\n",
            "\t\tBAD_ID=0\n",
            "\t\tCONNECTION=0\n",
            "\t\tIO_ERROR=0\n",
            "\t\tWRONG_LENGTH=0\n",
            "\t\tWRONG_MAP=0\n",
            "\t\tWRONG_REDUCE=0\n",
            "\tFile Input Format Counters \n",
            "\t\tBytes Read=14\n",
            "\tFile Output Format Counters \n",
            "\t\tBytes Written=28\n",
            "2024-03-24 19:48:29,707 INFO streaming.StreamJob: Output directory: output20240324T1948\n"
          ]
        }
      ],
      "source": [
        "%%bash\n",
        "#set -x\n",
        "HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz\"\n",
        "wget --quiet --no-clobber $HADOOP_URL >/dev/null\n",
        "[ ! -d $(basename $HADOOP_URL .tar.gz) ] && tar -xzf $(basename $HADOOP_URL)\n",
        "HADOOP_HOME=$(pwd)'/'$(basename $HADOOP_URL .tar.gz)'/bin'\n",
        "PATH=$HADOOP_HOME:$PATH\n",
        "which java >/dev/null|| apt install -y openjdk-19-jre-headless\n",
        "export JAVA_HOME=$(realpath $(which java) | sed 's/\\/bin\\/java$//')\n",
        "echo -e \"Hello, World!\">hello.txt\n",
        "output_dir=\"output\"$(date +\"%Y%m%dT%H%M\")\n",
        "sleep 10\n",
        "mapred streaming -input hello.txt -output $output_dir\n",
        "ls -lR output*\n",
        "cat $output_dir/part-00000"
      ]
    }
  ]
}