{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"# MapReduce: A Primer with Hello World!
in bash\n",
"
\n",
"
\n",
"\n",
"This tutorial serves as a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb), with the implementation carried out in the Bash scripting language requiring only a few lines of code.\n",
"\n",
"For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n",
"\n",
"> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n",
"\n",
"(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n",
"\n",
"We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n",
"\n",
"> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n",
"\n",
"MapReduce streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), thus eliminating the need for explicit specification of a mapper or reducer.\n",
"\n",
"Both input and output are standard files since Hadoop's default filesystem is the regular file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml)).\n"
],
"metadata": {
"id": "YCd6jCrqlSXw"
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6j5zZwJMkc6C",
"outputId": "4d77a268-69fe-4cc0-cede-8e78d7f087f2"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"output20240324T1947:\n",
"total 4\n",
"-rw-r--r-- 1 root root 16 Mar 24 19:47 part-00000\n",
"-rw-r--r-- 1 root root 0 Mar 24 19:47 _SUCCESS\n",
"\n",
"output20240324T1948:\n",
"total 4\n",
"-rw-r--r-- 1 root root 16 Mar 24 19:48 part-00000\n",
"-rw-r--r-- 1 root root 0 Mar 24 19:48 _SUCCESS\n",
"0\tHello, World!\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"2024-03-24 19:48:27,531 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n",
"2024-03-24 19:48:27,701 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n",
"2024-03-24 19:48:27,702 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n",
"2024-03-24 19:48:27,727 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n",
"2024-03-24 19:48:28,055 INFO mapred.FileInputFormat: Total input files to process : 1\n",
"2024-03-24 19:48:28,082 INFO mapreduce.JobSubmitter: number of splits:1\n",
"2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local723241263_0001\n",
"2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Executing with tokens: []\n",
"2024-03-24 19:48:28,686 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n",
"2024-03-24 19:48:28,688 INFO mapreduce.Job: Running job: job_local723241263_0001\n",
"2024-03-24 19:48:28,697 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n",
"2024-03-24 19:48:28,700 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n",
"2024-03-24 19:48:28,709 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
"2024-03-24 19:48:28,713 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
"2024-03-24 19:48:28,767 INFO mapred.LocalJobRunner: Waiting for map tasks\n",
"2024-03-24 19:48:28,773 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_m_000000_0\n",
"2024-03-24 19:48:28,822 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
"2024-03-24 19:48:28,825 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
"2024-03-24 19:48:28,855 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n",
"2024-03-24 19:48:28,866 INFO mapred.MapTask: Processing split: file:/content/hello.txt:0+14\n",
"2024-03-24 19:48:28,884 INFO mapred.MapTask: numReduceTasks: 1\n",
"2024-03-24 19:48:28,969 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)\n",
"2024-03-24 19:48:28,969 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100\n",
"2024-03-24 19:48:28,969 INFO mapred.MapTask: soft limit at 83886080\n",
"2024-03-24 19:48:28,969 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600\n",
"2024-03-24 19:48:28,969 INFO mapred.MapTask: kvstart = 26214396; length = 6553600\n",
"2024-03-24 19:48:28,976 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer\n",
"2024-03-24 19:48:28,985 INFO mapred.LocalJobRunner: \n",
"2024-03-24 19:48:28,985 INFO mapred.MapTask: Starting flush of map output\n",
"2024-03-24 19:48:28,985 INFO mapred.MapTask: Spilling map output\n",
"2024-03-24 19:48:28,985 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600\n",
"2024-03-24 19:48:28,985 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600\n",
"2024-03-24 19:48:28,993 INFO mapred.MapTask: Finished spill 0\n",
"2024-03-24 19:48:29,009 INFO mapred.Task: Task:attempt_local723241263_0001_m_000000_0 is done. And is in the process of committing\n",
"2024-03-24 19:48:29,015 INFO mapred.LocalJobRunner: file:/content/hello.txt:0+14\n",
"2024-03-24 19:48:29,015 INFO mapred.Task: Task 'attempt_local723241263_0001_m_000000_0' done.\n",
"2024-03-24 19:48:29,025 INFO mapred.Task: Final Counters for attempt_local723241263_0001_m_000000_0: Counters: 17\n",
"\tFile System Counters\n",
"\t\tFILE: Number of bytes read=141410\n",
"\t\tFILE: Number of bytes written=776343\n",
"\t\tFILE: Number of read operations=0\n",
"\t\tFILE: Number of large read operations=0\n",
"\t\tFILE: Number of write operations=0\n",
"\tMap-Reduce Framework\n",
"\t\tMap input records=1\n",
"\t\tMap output records=1\n",
"\t\tMap output bytes=22\n",
"\t\tMap output materialized bytes=30\n",
"\t\tInput split bytes=75\n",
"\t\tCombine input records=0\n",
"\t\tSpilled Records=1\n",
"\t\tFailed Shuffles=0\n",
"\t\tMerged Map outputs=0\n",
"\t\tGC time elapsed (ms)=0\n",
"\t\tTotal committed heap usage (bytes)=407896064\n",
"\tFile Input Format Counters \n",
"\t\tBytes Read=14\n",
"2024-03-24 19:48:29,025 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_m_000000_0\n",
"2024-03-24 19:48:29,026 INFO mapred.LocalJobRunner: map task executor complete.\n",
"2024-03-24 19:48:29,031 INFO mapred.LocalJobRunner: Waiting for reduce tasks\n",
"2024-03-24 19:48:29,035 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_r_000000_0\n",
"2024-03-24 19:48:29,046 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n",
"2024-03-24 19:48:29,047 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n",
"2024-03-24 19:48:29,047 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n",
"2024-03-24 19:48:29,054 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a68130c\n",
"2024-03-24 19:48:29,056 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n",
"2024-03-24 19:48:29,079 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10\n",
"2024-03-24 19:48:29,095 INFO reduce.EventFetcher: attempt_local723241263_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events\n",
"2024-03-24 19:48:29,158 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local723241263_0001_m_000000_0 decomp: 26 len: 30 to MEMORY\n",
"2024-03-24 19:48:29,162 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local723241263_0001_m_000000_0\n",
"2024-03-24 19:48:29,164 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26\n",
"2024-03-24 19:48:29,168 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning\n",
"2024-03-24 19:48:29,170 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
"2024-03-24 19:48:29,170 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs\n",
"2024-03-24 19:48:29,177 INFO mapred.Merger: Merging 1 sorted segments\n",
"2024-03-24 19:48:29,178 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n",
"2024-03-24 19:48:29,179 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit\n",
"2024-03-24 19:48:29,180 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk\n",
"2024-03-24 19:48:29,181 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce\n",
"2024-03-24 19:48:29,181 INFO mapred.Merger: Merging 1 sorted segments\n",
"2024-03-24 19:48:29,182 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n",
"2024-03-24 19:48:29,183 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
"2024-03-24 19:48:29,193 INFO mapred.Task: Task:attempt_local723241263_0001_r_000000_0 is done. And is in the process of committing\n",
"2024-03-24 19:48:29,195 INFO mapred.LocalJobRunner: 1 / 1 copied.\n",
"2024-03-24 19:48:29,195 INFO mapred.Task: Task attempt_local723241263_0001_r_000000_0 is allowed to commit now\n",
"2024-03-24 19:48:29,197 INFO output.FileOutputCommitter: Saved output of task 'attempt_local723241263_0001_r_000000_0' to file:/content/output20240324T1948\n",
"2024-03-24 19:48:29,198 INFO mapred.LocalJobRunner: reduce > reduce\n",
"2024-03-24 19:48:29,198 INFO mapred.Task: Task 'attempt_local723241263_0001_r_000000_0' done.\n",
"2024-03-24 19:48:29,199 INFO mapred.Task: Final Counters for attempt_local723241263_0001_r_000000_0: Counters: 24\n",
"\tFile System Counters\n",
"\t\tFILE: Number of bytes read=141502\n",
"\t\tFILE: Number of bytes written=776401\n",
"\t\tFILE: Number of read operations=0\n",
"\t\tFILE: Number of large read operations=0\n",
"\t\tFILE: Number of write operations=0\n",
"\tMap-Reduce Framework\n",
"\t\tCombine input records=0\n",
"\t\tCombine output records=0\n",
"\t\tReduce input groups=1\n",
"\t\tReduce shuffle bytes=30\n",
"\t\tReduce input records=1\n",
"\t\tReduce output records=1\n",
"\t\tSpilled Records=1\n",
"\t\tShuffled Maps =1\n",
"\t\tFailed Shuffles=0\n",
"\t\tMerged Map outputs=1\n",
"\t\tGC time elapsed (ms)=0\n",
"\t\tTotal committed heap usage (bytes)=407896064\n",
"\tShuffle Errors\n",
"\t\tBAD_ID=0\n",
"\t\tCONNECTION=0\n",
"\t\tIO_ERROR=0\n",
"\t\tWRONG_LENGTH=0\n",
"\t\tWRONG_MAP=0\n",
"\t\tWRONG_REDUCE=0\n",
"\tFile Output Format Counters \n",
"\t\tBytes Written=28\n",
"2024-03-24 19:48:29,199 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_r_000000_0\n",
"2024-03-24 19:48:29,200 INFO mapred.LocalJobRunner: reduce task executor complete.\n",
"2024-03-24 19:48:29,694 INFO mapreduce.Job: Job job_local723241263_0001 running in uber mode : false\n",
"2024-03-24 19:48:29,696 INFO mapreduce.Job: map 100% reduce 100%\n",
"2024-03-24 19:48:29,697 INFO mapreduce.Job: Job job_local723241263_0001 completed successfully\n",
"2024-03-24 19:48:29,707 INFO mapreduce.Job: Counters: 30\n",
"\tFile System Counters\n",
"\t\tFILE: Number of bytes read=282912\n",
"\t\tFILE: Number of bytes written=1552744\n",
"\t\tFILE: Number of read operations=0\n",
"\t\tFILE: Number of large read operations=0\n",
"\t\tFILE: Number of write operations=0\n",
"\tMap-Reduce Framework\n",
"\t\tMap input records=1\n",
"\t\tMap output records=1\n",
"\t\tMap output bytes=22\n",
"\t\tMap output materialized bytes=30\n",
"\t\tInput split bytes=75\n",
"\t\tCombine input records=0\n",
"\t\tCombine output records=0\n",
"\t\tReduce input groups=1\n",
"\t\tReduce shuffle bytes=30\n",
"\t\tReduce input records=1\n",
"\t\tReduce output records=1\n",
"\t\tSpilled Records=2\n",
"\t\tShuffled Maps =1\n",
"\t\tFailed Shuffles=0\n",
"\t\tMerged Map outputs=1\n",
"\t\tGC time elapsed (ms)=0\n",
"\t\tTotal committed heap usage (bytes)=815792128\n",
"\tShuffle Errors\n",
"\t\tBAD_ID=0\n",
"\t\tCONNECTION=0\n",
"\t\tIO_ERROR=0\n",
"\t\tWRONG_LENGTH=0\n",
"\t\tWRONG_MAP=0\n",
"\t\tWRONG_REDUCE=0\n",
"\tFile Input Format Counters \n",
"\t\tBytes Read=14\n",
"\tFile Output Format Counters \n",
"\t\tBytes Written=28\n",
"2024-03-24 19:48:29,707 INFO streaming.StreamJob: Output directory: output20240324T1948\n"
]
}
],
"source": [
"%%bash\n",
"#set -x\n",
"HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz\"\n",
"wget --quiet --no-clobber $HADOOP_URL >/dev/null\n",
"[ ! -d $(basename $HADOOP_URL .tar.gz) ] && tar -xzf $(basename $HADOOP_URL)\n",
"HADOOP_HOME=$(pwd)'/'$(basename $HADOOP_URL .tar.gz)'/bin'\n",
"PATH=$HADOOP_HOME:$PATH\n",
"which java >/dev/null|| apt install -y openjdk-19-jre-headless\n",
"export JAVA_HOME=$(realpath $(which java) | sed 's/\\/bin\\/java$//')\n",
"echo -e \"Hello, World!\">hello.txt\n",
"output_dir=\"output\"$(date +\"%Y%m%dT%H%M\")\n",
"sleep 10\n",
"mapred streaming -input hello.txt -output $output_dir\n",
"ls -lR output*\n",
"cat $output_dir/part-00000"
]
}
]
}