{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "
\n", "\n", "# MapReduce: A Primer with Hello World! in bash\n", "
\n", "
\n", "\n", "This tutorial is a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb). Here the same example is implemented in Bash and takes only a few lines of code.\n", "\n", "For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n", "\n", "> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n", "\n", "(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n", "\n", "We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). Here \"streaming\" means that records are passed to the mapper and reducer through standard input and output; it is not to be confused with real-time streaming:\n", "\n", "> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. 
The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n", "\n", "MapReduce streaming defaults to [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), so no mapper or reducer needs to be specified. The job therefore passes its input through unchanged: each output record is an input line preceded by its key, the byte offset of that line in the input file.\n", "\n", "Both input and output are ordinary local files because Hadoop's default filesystem is the local file system, as specified by the `fs.defaultFS` property (default `file:///`) in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml).\n" ], "metadata": { "id": "YCd6jCrqlSXw" } }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6j5zZwJMkc6C", "outputId": "4d77a268-69fe-4cc0-cede-8e78d7f087f2" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "output20240324T1947:\n", "total 4\n", "-rw-r--r-- 1 root root 16 Mar 24 19:47 part-00000\n", "-rw-r--r-- 1 root root 0 Mar 24 19:47 _SUCCESS\n", "\n", "output20240324T1948:\n", "total 4\n", "-rw-r--r-- 1 root root 16 Mar 24 19:48 part-00000\n", "-rw-r--r-- 1 root root 0 Mar 24 19:48 _SUCCESS\n", "0\tHello, World!\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "2024-03-24 19:48:27,531 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n", "2024-03-24 19:48:27,701 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n", "2024-03-24 19:48:27,702 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n", "2024-03-24 19:48:27,727 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n", "2024-03-24 19:48:28,055 INFO mapred.FileInputFormat: Total input files to process : 1\n", "2024-03-24 19:48:28,082 INFO mapreduce.JobSubmitter: number of splits:1\n", "2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Submitting 
tokens for job: job_local723241263_0001\n", "2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Executing with tokens: []\n", "2024-03-24 19:48:28,686 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n", "2024-03-24 19:48:28,688 INFO mapreduce.Job: Running job: job_local723241263_0001\n", "2024-03-24 19:48:28,697 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n", "2024-03-24 19:48:28,700 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n", "2024-03-24 19:48:28,709 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n", "2024-03-24 19:48:28,713 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n", "2024-03-24 19:48:28,767 INFO mapred.LocalJobRunner: Waiting for map tasks\n", "2024-03-24 19:48:28,773 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_m_000000_0\n", "2024-03-24 19:48:28,822 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n", "2024-03-24 19:48:28,825 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n", "2024-03-24 19:48:28,855 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n", "2024-03-24 19:48:28,866 INFO mapred.MapTask: Processing split: file:/content/hello.txt:0+14\n", "2024-03-24 19:48:28,884 INFO mapred.MapTask: numReduceTasks: 1\n", "2024-03-24 19:48:28,969 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)\n", "2024-03-24 19:48:28,969 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100\n", "2024-03-24 19:48:28,969 INFO mapred.MapTask: soft limit at 83886080\n", "2024-03-24 19:48:28,969 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600\n", "2024-03-24 19:48:28,969 INFO mapred.MapTask: kvstart = 26214396; length = 6553600\n", "2024-03-24 19:48:28,976 INFO mapred.MapTask: Map 
output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer\n", "2024-03-24 19:48:28,985 INFO mapred.LocalJobRunner: \n", "2024-03-24 19:48:28,985 INFO mapred.MapTask: Starting flush of map output\n", "2024-03-24 19:48:28,985 INFO mapred.MapTask: Spilling map output\n", "2024-03-24 19:48:28,985 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600\n", "2024-03-24 19:48:28,985 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600\n", "2024-03-24 19:48:28,993 INFO mapred.MapTask: Finished spill 0\n", "2024-03-24 19:48:29,009 INFO mapred.Task: Task:attempt_local723241263_0001_m_000000_0 is done. And is in the process of committing\n", "2024-03-24 19:48:29,015 INFO mapred.LocalJobRunner: file:/content/hello.txt:0+14\n", "2024-03-24 19:48:29,015 INFO mapred.Task: Task 'attempt_local723241263_0001_m_000000_0' done.\n", "2024-03-24 19:48:29,025 INFO mapred.Task: Final Counters for attempt_local723241263_0001_m_000000_0: Counters: 17\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=141410\n", "\t\tFILE: Number of bytes written=776343\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\tMap-Reduce Framework\n", "\t\tMap input records=1\n", "\t\tMap output records=1\n", "\t\tMap output bytes=22\n", "\t\tMap output materialized bytes=30\n", "\t\tInput split bytes=75\n", "\t\tCombine input records=0\n", "\t\tSpilled Records=1\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=0\n", "\t\tGC time elapsed (ms)=0\n", "\t\tTotal committed heap usage (bytes)=407896064\n", "\tFile Input Format Counters \n", "\t\tBytes Read=14\n", "2024-03-24 19:48:29,025 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_m_000000_0\n", "2024-03-24 19:48:29,026 INFO mapred.LocalJobRunner: map task executor complete.\n", "2024-03-24 19:48:29,031 INFO mapred.LocalJobRunner: Waiting for reduce tasks\n", 
"2024-03-24 19:48:29,035 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_r_000000_0\n", "2024-03-24 19:48:29,046 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n", "2024-03-24 19:48:29,047 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n", "2024-03-24 19:48:29,047 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n", "2024-03-24 19:48:29,054 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a68130c\n", "2024-03-24 19:48:29,056 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n", "2024-03-24 19:48:29,079 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10\n", "2024-03-24 19:48:29,095 INFO reduce.EventFetcher: attempt_local723241263_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events\n", "2024-03-24 19:48:29,158 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local723241263_0001_m_000000_0 decomp: 26 len: 30 to MEMORY\n", "2024-03-24 19:48:29,162 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local723241263_0001_m_000000_0\n", "2024-03-24 19:48:29,164 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26\n", "2024-03-24 19:48:29,168 INFO reduce.EventFetcher: EventFetcher is interrupted.. 
Returning\n", "2024-03-24 19:48:29,170 INFO mapred.LocalJobRunner: 1 / 1 copied.\n", "2024-03-24 19:48:29,170 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs\n", "2024-03-24 19:48:29,177 INFO mapred.Merger: Merging 1 sorted segments\n", "2024-03-24 19:48:29,178 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n", "2024-03-24 19:48:29,179 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit\n", "2024-03-24 19:48:29,180 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk\n", "2024-03-24 19:48:29,181 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce\n", "2024-03-24 19:48:29,181 INFO mapred.Merger: Merging 1 sorted segments\n", "2024-03-24 19:48:29,182 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n", "2024-03-24 19:48:29,183 INFO mapred.LocalJobRunner: 1 / 1 copied.\n", "2024-03-24 19:48:29,193 INFO mapred.Task: Task:attempt_local723241263_0001_r_000000_0 is done. 
And is in the process of committing\n", "2024-03-24 19:48:29,195 INFO mapred.LocalJobRunner: 1 / 1 copied.\n", "2024-03-24 19:48:29,195 INFO mapred.Task: Task attempt_local723241263_0001_r_000000_0 is allowed to commit now\n", "2024-03-24 19:48:29,197 INFO output.FileOutputCommitter: Saved output of task 'attempt_local723241263_0001_r_000000_0' to file:/content/output20240324T1948\n", "2024-03-24 19:48:29,198 INFO mapred.LocalJobRunner: reduce > reduce\n", "2024-03-24 19:48:29,198 INFO mapred.Task: Task 'attempt_local723241263_0001_r_000000_0' done.\n", "2024-03-24 19:48:29,199 INFO mapred.Task: Final Counters for attempt_local723241263_0001_r_000000_0: Counters: 24\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=141502\n", "\t\tFILE: Number of bytes written=776401\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\tMap-Reduce Framework\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=1\n", "\t\tReduce shuffle bytes=30\n", "\t\tReduce input records=1\n", "\t\tReduce output records=1\n", "\t\tSpilled Records=1\n", "\t\tShuffled Maps =1\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=1\n", "\t\tGC time elapsed (ms)=0\n", "\t\tTotal committed heap usage (bytes)=407896064\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Output Format Counters \n", "\t\tBytes Written=28\n", "2024-03-24 19:48:29,199 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_r_000000_0\n", "2024-03-24 19:48:29,200 INFO mapred.LocalJobRunner: reduce task executor complete.\n", "2024-03-24 19:48:29,694 INFO mapreduce.Job: Job job_local723241263_0001 running in uber mode : false\n", "2024-03-24 19:48:29,696 INFO mapreduce.Job: map 100% reduce 100%\n", "2024-03-24 19:48:29,697 INFO mapreduce.Job: Job 
job_local723241263_0001 completed successfully\n", "2024-03-24 19:48:29,707 INFO mapreduce.Job: Counters: 30\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=282912\n", "\t\tFILE: Number of bytes written=1552744\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\tMap-Reduce Framework\n", "\t\tMap input records=1\n", "\t\tMap output records=1\n", "\t\tMap output bytes=22\n", "\t\tMap output materialized bytes=30\n", "\t\tInput split bytes=75\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=1\n", "\t\tReduce shuffle bytes=30\n", "\t\tReduce input records=1\n", "\t\tReduce output records=1\n", "\t\tSpilled Records=2\n", "\t\tShuffled Maps =1\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=1\n", "\t\tGC time elapsed (ms)=0\n", "\t\tTotal committed heap usage (bytes)=815792128\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Input Format Counters \n", "\t\tBytes Read=14\n", "\tFile Output Format Counters \n", "\t\tBytes Written=28\n", "2024-03-24 19:48:29,707 INFO streaming.StreamJob: Output directory: output20240324T1948\n" ] } ], "source": [ "%%bash\n", "#set -x\n", "HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz\"\n", "wget --quiet --no-clobber $HADOOP_URL >/dev/null\n", "[ ! 
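-d \"$(basename \"$HADOOP_URL\" .tar.gz)\" ] && tar -xzf \"$(basename \"$HADOOP_URL\")\"\n", "# HADOOP_HOME is the distribution root; its bin/ directory provides the mapred command\n", "HADOOP_HOME=\"$(pwd)/$(basename \"$HADOOP_URL\" .tar.gz)\"\n", "PATH=\"$HADOOP_HOME/bin:$PATH\"\n", "# Install a JRE only if no java executable is on the PATH\n", "which java >/dev/null || apt install -y openjdk-19-jre-headless\n", "export JAVA_HOME=\"$(realpath \"$(which java)\" | sed 's/\\/bin\\/java$//')\"\n", "echo \"Hello, World!\" > hello.txt\n", "# Timestamped output directory: the job fails if the output directory already exists\n", "output_dir=\"output$(date +\"%Y%m%dT%H%M\")\"\n", "sleep 10\n", "# No -mapper/-reducer given, so the streaming defaults (identity) are used\n", "mapred streaming -input hello.txt -output \"$output_dir\"\n", "ls -lR output*\n", "cat \"$output_dir/part-00000\"" ] } ] }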