{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "YCd6jCrqlSXw" }, "source": [ "
\n", "\n", "# MapReduce: A Primer with Hello World! in bash\n", "
\n", "
\n", "\n", "This tutorial serves as a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb), with the implementation carried out in the Bash scripting language requiring only a few lines of code.\n", "\n", "For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n", "\n", "> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n", "\n", "(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n", "\n", "We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n", "\n", "> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. 
The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n", "\n", "Hadoop streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), so there is no need to specify a mapper or a reducer explicitly.\n", "\n", "Both input and output are standard files, since Hadoop's default filesystem is the regular file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:11.710635Z", "iopub.status.busy": "2024-03-11T15:45:11.710437Z", "iopub.status.idle": "2024-03-11T15:45:49.556232Z", "shell.execute_reply": "2024-03-11T15:45:49.555410Z" }, "id": "6j5zZwJMkc6C", "outputId": "4fbcc033-7143-4bca-c635-1e02a0c55e1d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\tHello, World!\n" ] } ], "source": [ "%%bash\n", "# Download the stable Hadoop release (skipped if the archive is already present)\n", "HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz\"\n", "wget --quiet --no-clobber $HADOOP_URL >/dev/null\n", "# Unpack the archive unless the directory already exists\n", "[ ! 
-d $(basename $HADOOP_URL .tar.gz) ] && tar -xzf $(basename $HADOOP_URL)\n", "# Point HADOOP_HOME at the unpacked distribution and put its binaries on the PATH\n", "HADOOP_HOME=$(pwd)/$(basename $HADOOP_URL .tar.gz)\n", "PATH=$HADOOP_HOME/bin:$PATH\n", "# Install a JRE if none is available, then derive JAVA_HOME from the java binary\n", "which java >/dev/null || apt install -y openjdk-19-jre-headless\n", "export JAVA_HOME=$(realpath $(which java) | sed 's/\\/bin\\/java$//')\n", "# Create the input file\n", "echo \"Hello, World!\" > hello.txt\n", "# Run the streaming job with the default identity mapper and reducer\n", "output_dir=\"output\"$(date +\"%Y%m%dT%H%M\")\n", "mapred streaming -input hello.txt -output $output_dir >log 2>&1\n", "cat $output_dir/part-00000" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyPylzoU4zNysmX3h9I2aCT/", "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 0 }