{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "YCd6jCrqlSXw"
},
"source": [
"\n",
"\n",
"# MapReduce: A Primer with Hello World!
in bash\n",
"
\n",
"
\n",
"\n",
"This tutorial serves as a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb), with the implementation carried out in the Bash scripting language requiring only a few lines of code.\n",
"\n",
"For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n",
"\n",
"> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n",
"\n",
"(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n",
"\n",
"We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n",
"\n",
"> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n",
"\n",
"MapReduce streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), thus eliminating the need for explicit specification of a mapper or reducer.\n",
"\n",
"Both input and output are standard files since Hadoop's default filesystem is the regular file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml)).\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-03-11T15:45:11.710635Z",
"iopub.status.busy": "2024-03-11T15:45:11.710437Z",
"iopub.status.idle": "2024-03-11T15:45:49.556232Z",
"shell.execute_reply": "2024-03-11T15:45:49.555410Z"
},
"id": "6j5zZwJMkc6C",
"outputId": "4fbcc033-7143-4bca-c635-1e02a0c55e1d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\tHello, World!\n"
]
}
],
"source": [
"%%bash\n",
"HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz\"\n",
"wget --quiet --no-clobber $HADOOP_URL >/dev/null\n",
"[ ! -d $(basename $HADOOP_URL .tar.gz) ] && tar -xzf $(basename $HADOOP_URL)\n",
"HADOOP_HOME=$(pwd)'/'$(basename $HADOOP_URL .tar.gz)'/bin'\n",
"PATH=$HADOOP_HOME:$PATH\n",
"which java >/dev/null|| apt install -y openjdk-19-jre-headless\n",
"export JAVA_HOME=$(realpath $(which java) | sed 's/\\/bin\\/java$//')\n",
"echo -e \"Hello, World!\">hello.txt\n",
"output_dir=\"output\"$(date +\"%Y%m%dT%H%M\")\n",
"mapred streaming -input hello.txt -output output_dir >log 2>&1\n",
"cat output_dir/part-00000"
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyPylzoU4zNysmX3h9I2aCT/",
"include_colab_link": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 0
}