{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "YCd6jCrqlSXw" }, "source": [ "
\n", "\n", "# MapReduce: A Primer with Hello World! in bash\n", "
\n", "
\n", "\n", "This tutorial serves as a companion to [MapReduce_Primer_HelloWorld.ipynb](https://github.com/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb), with the implementation carried out in the Bash scripting language requiring only a few lines of code.\n", "\n", "For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n", "\n", "> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n", "\n", "(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n", "\n", "We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n", "\n", "> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. 
The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n", "\n", "Hadoop streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), so there is no need to specify a mapper or a reducer explicitly.\n", "\n", "Both input and output are standard files, since Hadoop's default filesystem is the regular file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:11.710635Z", "iopub.status.busy": "2024-03-11T15:45:11.710437Z", "iopub.status.idle": "2024-03-11T15:45:49.556232Z", "shell.execute_reply": "2024-03-11T15:45:49.555410Z" }, "id": "6j5zZwJMkc6C", "outputId": "4fbcc033-7143-4bca-c635-1e02a0c55e1d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\tHello, World!\n" ] } ], "source": [ "%%bash\n", "# Download the stable Hadoop release (skipped if the archive is already present)\n", "HADOOP_URL=\"https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz\"\n", "wget --quiet --no-clobber $HADOOP_URL >/dev/null\n", "# Unpack the archive unless the directory already exists\n", "[ ! 
-d $(basename $HADOOP_URL .tar.gz) ] && tar -xzf $(basename $HADOOP_URL)\n", "# Point HADOOP_HOME at the unpacked distribution and put its binaries on the PATH\n", "HADOOP_HOME=$(pwd)/$(basename $HADOOP_URL .tar.gz)\n", "PATH=$HADOOP_HOME/bin:$PATH\n", "# Install a JRE if none is available, then derive JAVA_HOME from the java binary\n", "which java >/dev/null || apt install -y openjdk-19-jre-headless\n", "export JAVA_HOME=$(realpath $(which java) | sed 's/\\/bin\\/java$//')\n", "# Create the input file\n", "echo \"Hello, World!\" > hello.txt\n", "# Run the streaming job with the default identity mapper and reducer\n", "output_dir=\"output\"$(date +\"%Y%m%dT%H%M\")\n", "mapred streaming -input hello.txt -output $output_dir >log 2>&1\n", "cat $output_dir/part-00000" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyPylzoU4zNysmX3h9I2aCT/", "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 0 }