<a href="https://colab.research.google.com/github/groda/big_data/blob/master/MapReduce_Primer_HelloWorld_bash.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# MapReduce: A Primer with <code>Hello World!</code> in bash
<br>
<br>

This tutorial is a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb). Here the same example is implemented in Bash and needs only a few lines of code.

For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:

> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞

(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))

We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:

> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞

MapReduce streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), thus eliminating the need for explicit specification of a mapper or reducer.
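
To make the effect of the identity defaults concrete, here is a minimal sketch in plain Bash that imitates what such a job does with a small text file. It assumes the default text input format, where each record's key is the byte offset of the line and the value is the line itself, and a single input split, so no sorting step is needed. The function name `identity_job` is hypothetical and not part of Hadoop.

```shell
#!/usr/bin/env bash
# Sketch: imitate an identity map + identity reduce over a text file.
# Each line is emitted as "<byte offset><TAB><line>".
identity_job() {
  local LC_ALL=C            # make ${#line} count bytes rather than characters
  local offset=0 line
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\t%s\n' "$offset" "$line"
    offset=$(( offset + ${#line} + 1 ))   # +1 for the newline terminator
  done < "$1"
}

demo_file=$(mktemp)
echo "Hello, World!" > "$demo_file"
identity_job "$demo_file"            # prints: 0<TAB>Hello, World!
identity_job "$demo_file" | wc -c    # 16 bytes: the 14-byte input plus "0" and a tab
rm -f "$demo_file"
```

The 16 bytes match the size that `ls` reports for `part-00000` in the output below.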

Both input and output are ordinary files, since Hadoop's default filesystem is the local file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml).
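
One way to check which filesystem is in effect is to query the configuration with `hdfs getconf`. The sketch below assumes Hadoop's `bin` directory is already on the `PATH`; the wrapper function `show_default_fs` is a hypothetical helper added only so the snippet fails gracefully when Hadoop is not installed. On an unconfigured installation the command should print `file:///`, the default listed in core-default.xml.

```shell
#!/usr/bin/env bash
# Sketch: print the effective value of fs.defaultFS, if the Hadoop CLI is available.
show_default_fs() {
  if command -v hdfs >/dev/null 2>&1; then
    hdfs getconf -confKey fs.defaultFS
  else
    echo "hdfs not found on PATH; add Hadoop's bin directory first" >&2
    return 1
  fi
}

show_default_fs || true
```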


In [2]:
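%%bash
#set -x
# Download and unpack the Hadoop distribution (both steps are skipped if already done)
HADOOP_URL="https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz"
HADOOP_DIR=$(basename "$HADOOP_URL" .tar.gz)
wget --quiet --no-clobber "$HADOOP_URL"
[ -d "$HADOOP_DIR" ] || tar -xzf "$(basename "$HADOOP_URL")"
# Put the Hadoop commands (mapred, hdfs, ...) on the PATH
PATH="$(pwd)/$HADOOP_DIR/bin:$PATH"
# Install Java only if it is missing, then point JAVA_HOME at the installation
which java >/dev/null || apt install -y openjdk-19-jre-headless
export JAVA_HOME=$(realpath "$(which java)" | sed 's/\/bin\/java$//')
# Create the input file
echo "Hello, World!" > hello.txt
# Timestamped output directory: the job fails if the output directory already exists
output_dir="output$(date +"%Y%m%dT%H%M")"
sleep 10
# No -mapper or -reducer option is given, so the identity defaults are used
mapred streaming -input hello.txt -output "$output_dir"
ls -lR output*
cat "$output_dir/part-00000"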

output20240324T1947:
total 4
-rw-r--r-- 1 root root 16 Mar 24 19:47 part-00000
-rw-r--r-- 1 root root  0 Mar 24 19:47 _SUCCESS

output20240324T1948:
total 4
-rw-r--r-- 1 root root 16 Mar 24 19:48 part-00000
-rw-r--r-- 1 root root  0 Mar 24 19:48 _SUCCESS
0	Hello, World!


2024-03-24 19:48:27,531 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-03-24 19:48:27,701 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-03-24 19:48:27,702 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2024-03-24 19:48:27,727 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2024-03-24 19:48:28,055 INFO mapred.FileInputFormat: Total input files to process : 1
2024-03-24 19:48:28,082 INFO mapreduce.JobSubmitter: number of splits:1
2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local723241263_0001
2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-03-24 19:48:28,686 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2024-03-24 19:48:28,688 INFO mapreduce.Job: Running job: job_local723241263_0001
2024-03-24 19:48:28,697 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2024-03-2