{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "GzbmlR27wh6e" }, "source": [ "
\n", "\n", "# MapReduce: A Primer with Hello World!\n", "
\n", "
\n", "\n", "For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:\n", "\n", "> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞\n", "\n", "(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))\n", "\n", "We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:\n", "\n", "> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞\n", "\n", "MapReduce streaming defaults to using [`IdentityMapper`](https://hadoop.apache.org/docs/stable/api/index.html) and [`IdentityReducer`](https://hadoop.apache.org/docs/stable/api/index.html), thus eliminating the need for explicit specification of a mapper or reducer. 
Finally, we show how to run a map-only job by setting `mapreduce.job.reduces` equal to $0$.\n", "\n", "Both input and output are standard files since Hadoop's default filesystem is the local file system, as specified by the `fs.defaultFS` property in [core-default.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uUbM5R0GwwYw" }, "source": [ "# Download core Hadoop" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:53.005612Z", "iopub.status.busy": "2024-03-11T15:45:53.005405Z", "iopub.status.idle": "2024-03-11T15:45:56.608648Z", "shell.execute_reply": "2024-03-11T15:45:56.607923Z" }, "id": "jDgQtQlzw8bL", "outputId": "829df74f-efd1-4484-a374-44fcf2c95b2f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not downloading, Hadoop folder hadoop-3.3.6 already exists\n" ] } ], "source": [ "HADOOP_URL = \"https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz\"\n", "\n", "import requests\n", "import os\n", "import tarfile\n", "\n", "def download_and_extract_targz(url):\n", "    filename = url.rsplit('/', 1)[-1]\n", "    HADOOP_HOME = filename[:-7]  # strip the '.tar.gz' suffix\n", "    # set HADOOP_HOME environment variable\n", "    os.environ['HADOOP_HOME'] = HADOOP_HOME\n", "    if os.path.isdir(HADOOP_HOME):\n", "        print(\"Not downloading, Hadoop folder {} already exists\".format(HADOOP_HOME))\n", "        return\n", "    # download only when the Hadoop folder is missing\n", "    response = requests.get(url)\n", "    if response.status_code == 200:\n", "        with open(filename, 'wb') as file:\n", "            file.write(response.content)\n", "        with tarfile.open(filename, 'r:gz') as tar_ref:\n", "            tar_ref.extractall(path='.')\n", "            # Get the names of all members (files and directories) in the archive\n", "            all_members = tar_ref.getnames()\n", "        # If there is a top-level directory, get its name\n", "        if all_members:\n", "            top_level_directory = all_members[0]\n", "            print(f\"Archive downloaded and extracted successfully. Contents saved at: {top_level_directory}\")\n", "    else:\n", "        print(f\"Failed to download archive. Status code: {response.status_code}\")\n", "\n", "\n", "download_and_extract_targz(HADOOP_URL)" ] }, { "cell_type": "markdown", "metadata": { "id": "3yvb5cw9xEbh" }, "source": [ "# Set environment variables" ] }, { "cell_type": "markdown", "metadata": { "id": "u6lkrz1dxIiO" }, "source": [ "## Set `HADOOP_HOME` and `PATH`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:56.640406Z", "iopub.status.busy": "2024-03-11T15:45:56.639824Z", "iopub.status.idle": "2024-03-11T15:45:56.644219Z", "shell.execute_reply": "2024-03-11T15:45:56.643590Z" }, "id": "s7maAwaFxBT_", "outputId": "cdcb2d0e-e387-452f-d601-c3828bffe4ab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HADOOP_HOME is hadoop-3.3.6\n", "PATH is hadoop-3.3.6/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin\n" ] } ], "source": [ "# HADOOP_HOME was set earlier when downloading Hadoop distribution\n", "print(\"HADOOP_HOME is {}\".format(os.environ['HADOOP_HOME']))\n", "\n", "os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])\n", "print(\"PATH is {}\".format(os.environ['PATH']))" ] }, { "cell_type": "markdown", "metadata": { "id": "4kzJ8cNoxPyK" }, "source": [ "## Set `JAVA_HOME`\n", "\n", "While Java is readily available on Google Colab, we consider the broader scenario of an Ubuntu machine. 
In this case, we install Java if it is not already available, specifically the `openjdk-19-jre-headless` package." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:56.646756Z", "iopub.status.busy": "2024-03-11T15:45:56.646422Z", "iopub.status.idle": "2024-03-11T15:45:56.653209Z", "shell.execute_reply": "2024-03-11T15:45:56.652596Z" }, "id": "SauFHVPOxL-Y", "outputId": "51c06892-7c62-460c-f16d-c8ffd51f2547" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Java is already installed: /usr/lib/jvm/temurin-11-jdk-amd64\n" ] } ], "source": [ "import shutil\n", "import subprocess\n", "\n", "# set variable JAVA_HOME (install Java if necessary)\n", "def is_java_installed():\n", "    java_path = shutil.which(\"java\")\n", "    if java_path is None:\n", "        # no java executable on the PATH\n", "        return None\n", "    os.environ['JAVA_HOME'] = os.path.realpath(java_path).split('/bin')[0]\n", "    return os.environ['JAVA_HOME']\n", "\n", "def install_java():\n", "    # Uncomment and modify the desired version\n", "    # java_version= 'openjdk-11-jre-headless'\n", "    # java_version= 'default-jre'\n", "    # java_version= 'openjdk-17-jre-headless'\n", "    # java_version= 'openjdk-18-jre-headless'\n", "    java_version= 'openjdk-19-jre-headless'\n", "\n", "    print(f\"Java not found. Installing {java_version} ... (this might take a while)\")\n", "    try:\n", "        cmd = f\"apt install -y {java_version}\"\n", "        subprocess_output = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)\n", "        stdout_result = subprocess_output.stdout\n", "        # Process the results as needed\n", "        print(\"Done installing Java {}\".format(java_version))\n", "        os.environ['JAVA_HOME'] = os.path.realpath(shutil.which(\"java\")).split('/bin')[0]\n", "        print(\"JAVA_HOME is {}\".format(os.environ['JAVA_HOME']))\n", "    except subprocess.CalledProcessError as e:\n", "        # Handle the error if the command returns a non-zero exit code\n", "        print(\"Command failed with return code {}\".format(e.returncode))\n", "        print(\"stdout: {}\".format(e.stdout))\n", "\n", "# Install Java if not available\n", "if is_java_installed():\n", "    print(\"Java is already installed: {}\".format(os.environ['JAVA_HOME']))\n", "else:\n", "    print(\"Installing Java\")\n", "    install_java()" ] }, { "cell_type": "markdown", "metadata": { "id": "6HFPVX84xbNd" }, "source": [ "# Run a MapReduce job with Hadoop streaming" ] }, { "cell_type": "markdown", "metadata": { "id": "_yVa55X1xmOb" }, "source": [ "## Create a file\n", "\n", "Write the string \"Hello, World!\" to a local file.

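The same file can also be written from Python. This is a minimal sketch, assuming the notebook's working directory is writable; the helper name `write_hello` is hypothetical and not part of the tutorial. It returns the file size, which should be 14 bytes (13 characters plus the trailing newline that `echo` also appends).

```python
from pathlib import Path

def write_hello(path='hello.txt'):
    # Hypothetical helper: mirrors the shell command that echoes the greeting into ./hello.txt
    target = Path(path)
    # newline='' disables newline translation, so exactly one LF byte is written
    with target.open('w', encoding='utf-8', newline='') as handle:
        print('Hello, World!', file=handle)  # print appends a newline, like echo
    return target.stat().st_size

size = write_hello()
print(size)
```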
**Note:** you will be writing to the file `./hello.txt` in your current directory (denoted by `./`)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-03-11T15:45:56.655904Z", "iopub.status.busy": "2024-03-11T15:45:56.655523Z", "iopub.status.idle": "2024-03-11T15:45:56.790282Z", "shell.execute_reply": "2024-03-11T15:45:56.789590Z" }, "id": "9Jz7mJkcxYxw" }, "outputs": [], "source": [ "!echo \"Hello, World!\">./hello.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "zSh_Kr5Bxvst" }, "source": [ "## Launch the MapReduce \"Hello, World!\" application\n", "\n", "Since the default filesystem is the local filesystem (as opposed to HDFS), we do not need to upload the local file `hello.txt` to HDFS.\n", "\n", "Run a MapReduce job with `/bin/cat` as the mapper and no explicit reducer; the default `IdentityReducer` then handles the reduce phase.\n", "\n", "**Note:** the first step, removing the output directory, is necessary because MapReduce by design refuses to overwrite an existing output directory." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:45:56.793334Z", "iopub.status.busy": "2024-03-11T15:45:56.793117Z", "iopub.status.idle": "2024-03-11T15:46:00.219770Z", "shell.execute_reply": "2024-03-11T15:46:00.219013Z" }, "id": "nb5JryK9xpPA", "outputId": "7ba38b06-d0ee-4606-974d-63d68835700e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "rm: `my_output': No such file or directory\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,688 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,770 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,770 INFO impl.MetricsSystemImpl: JobTracker metrics system 
started\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,783 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,931 INFO mapred.FileInputFormat: Total input files to process : 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:58,944 INFO mapreduce.JobSubmitter: number of splits:1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,069 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local370782050_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,070 INFO mapreduce.JobSubmitter: Executing with tokens: []\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,164 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,166 INFO mapreduce.Job: Running job: job_local370782050_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,172 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,175 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,180 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,181 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,212 INFO mapred.LocalJobRunner: Waiting for map tasks\n" ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "2024-03-11 15:45:59,217 INFO mapred.LocalJobRunner: Starting task: attempt_local370782050_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,237 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,237 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,253 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,258 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,265 INFO mapred.MapTask: numReduceTasks: 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,281 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,281 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,281 INFO mapred.MapTask: soft limit at 83886080\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,281 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,281 INFO mapred.MapTask: kvstart = 26214396; length = 6553600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,284 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,288 INFO streaming.PipeMapRed: PipeMapRed exec 
[/bin/cat]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,292 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,294 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,294 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,294 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,295 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,295 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,296 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,296 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,296 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,297 INFO Configuration.deprecation: mapred.tip.id is deprecated. 
Instead, use mapreduce.task.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,297 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,303 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,319 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,319 INFO streaming.PipeMapRed: Records R/W=1/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,320 INFO streaming.PipeMapRed: MRErrorThread done\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,321 INFO streaming.PipeMapRed: mapRedFinished\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,323 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,323 INFO mapred.MapTask: Starting flush of map output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,323 INFO mapred.MapTask: Spilling map output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,323 INFO mapred.MapTask: bufstart = 0; bufend = 15; bufvoid = 104857600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,323 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,328 INFO mapred.MapTask: Finished spill 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,338 INFO mapred.Task: Task:attempt_local370782050_0001_m_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,340 INFO mapred.LocalJobRunner: Records R/W=1/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,340 INFO mapred.Task: Task 'attempt_local370782050_0001_m_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,346 INFO mapred.Task: Final Counters for attempt_local370782050_0001_m_000000_0: Counters: 17\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=781694\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output bytes=15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output materialized bytes=23\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map 
outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,347 INFO mapred.LocalJobRunner: Finishing task: attempt_local370782050_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,350 INFO mapred.LocalJobRunner: map task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,355 INFO mapred.LocalJobRunner: Waiting for reduce tasks\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,355 INFO mapred.LocalJobRunner: Starting task: attempt_local370782050_0001_r_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,360 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,360 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,360 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,363 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@9ffba29\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,368 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 
15:45:59,380 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2933076736, maxSingleShuffleLimit=733269184, mergeThreshold=1935830784, ioSortFactor=10, memToMemMergeOutputsThreshold=10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,382 INFO reduce.EventFetcher: attempt_local370782050_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,407 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local370782050_0001_m_000000_0 decomp: 19 len: 23 to MEMORY\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,410 INFO reduce.InMemoryMapOutput: Read 19 bytes from map-output for attempt_local370782050_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,412 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 19, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->19\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,415 INFO reduce.EventFetcher: EventFetcher is interrupted.. 
Returning\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,416 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,416 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,421 INFO mapred.Merger: Merging 1 sorted segments\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,421 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3 bytes\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,422 INFO reduce.MergeManagerImpl: Merged 1 segments, 19 bytes to disk to satisfy reduce memory limit\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,435 INFO reduce.MergeManagerImpl: Merging 1 files, 23 bytes from disk\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,436 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,436 INFO mapred.Merger: Merging 1 sorted segments\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,438 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3 bytes\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,439 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,444 INFO mapred.Task: Task:attempt_local370782050_0001_r_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,445 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,445 INFO mapred.Task: Task attempt_local370782050_0001_r_000000_0 is allowed to commit now\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,447 INFO output.FileOutputCommitter: Saved output of task 'attempt_local370782050_0001_r_000000_0' to file:/home/runner/work/big_data/big_data/my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,447 INFO mapred.LocalJobRunner: reduce > reduce\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,447 INFO mapred.Task: Task 'attempt_local370782050_0001_r_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,448 INFO mapred.Task: Final Counters for attempt_local370782050_0001_r_000000_0: Counters: 24\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141515\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=781744\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine output records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce 
input groups=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce shuffle bytes=23\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tShuffled Maps =1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=12\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tShuffle Errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBAD_ID=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCONNECTION=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tIO_ERROR=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_LENGTH=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_MAP=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_REDUCE=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=27\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,448 INFO mapred.LocalJobRunner: Finishing task: attempt_local370782050_0001_r_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:45:59,448 INFO mapred.LocalJobRunner: reduce task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:00,171 INFO 
mapreduce.Job: Job job_local370782050_0001 running in uber mode : false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:00,172 INFO mapreduce.Job: map 100% reduce 100%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:00,173 INFO mapreduce.Job: Job job_local370782050_0001 completed successfully\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:00,178 INFO mapreduce.Job: Counters: 30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=282952\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=1563438\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output bytes=15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output materialized bytes=23\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine output records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input groups=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce shuffle 
bytes=23\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tShuffled Maps =1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=12\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=629145600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tShuffle Errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBAD_ID=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCONNECTION=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tIO_ERROR=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_LENGTH=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_MAP=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_REDUCE=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=27\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:00,179 INFO streaming.StreamJob: Output directory: my_output\n" ] } ], "source": [ "%%bash\n", "hdfs dfs -rm -r my_output\n", "\n", "mapred streaming \\\n", " -input hello.txt \\\n", " -output my_output \\\n", " -mapper '/bin/cat'" ] }, { "cell_type": 
"markdown", "metadata": { "id": "OB_fX9u5x55y" }, "source": [ "## Verify the result\n", "\n", "If the job executed successfully, an empty file named `_SUCCESS` is expected to be present in the output directory `my_output`.\n", "\n", "Verify the success of the MapReduce job by checking for the presence of the `_SUCCESS` file." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:00.222553Z", "iopub.status.busy": "2024-03-11T15:46:00.222179Z", "iopub.status.idle": "2024-03-11T15:46:01.136014Z", "shell.execute_reply": "2024-03-11T15:46:01.135356Z" }, "id": "bnvEvYDfx2g4", "outputId": "d8c3dd75-aca5-4165-aaeb-f57e4e377293" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Check if MapReduce job was successful\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "_SUCCESS exists!\n" ] } ], "source": [ "%%bash\n", "\n", "echo \"Check if MapReduce job was successful\"\n", "hdfs dfs -test -e my_output/_SUCCESS\n", "if [ $? -eq 0 ]; then\n", "\techo \"_SUCCESS exists!\"\n", "fi" ] }, { "cell_type": "markdown", "metadata": { "id": "BLMnBh44x_YR" }, "source": [ "**Note:** `hdfs dfs -ls` is the same as `ls` since the default filesystem is the local filesystem." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:01.138836Z", "iopub.status.busy": "2024-03-11T15:46:01.138564Z", "iopub.status.idle": "2024-03-11T15:46:02.243021Z", "shell.execute_reply": "2024-03-11T15:46:02.242350Z" }, "id": "ufAfmGUvx8jW", "outputId": "3ab8e401-6151-4241-c3b7-b951c71977c6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 2 items\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 runner docker 0 2024-03-11 15:45 my_output/_SUCCESS\r\n", "-rw-r--r-- 1 runner docker 15 2024-03-11 15:45 my_output/part-00000\r\n" ] } ], "source": [ "!hdfs dfs -ls my_output" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:02.246227Z", "iopub.status.busy": "2024-03-11T15:46:02.245769Z", "iopub.status.idle": "2024-03-11T15:46:02.382457Z", "shell.execute_reply": "2024-03-11T15:46:02.381789Z" }, "id": "ZnKSahPzyCAn", "outputId": "6e94b1ce-fea5-4e72-bbf7-a1c58b820a14" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 4\r\n", "-rw-r--r-- 1 runner docker 0 Mar 11 15:45 _SUCCESS\r\n", "-rw-r--r-- 1 runner docker 15 Mar 11 15:45 part-00000\r\n" ] } ], "source": [ "!ls -l my_output" ] }, { "cell_type": "markdown", "metadata": { "id": "v9LmpcaMyG23" }, "source": [ "The actual output of the MapReduce job is contained in the file `part-00000` in the output directory." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:02.385529Z", "iopub.status.busy": "2024-03-11T15:46:02.385260Z", "iopub.status.idle": "2024-03-11T15:46:02.521495Z", "shell.execute_reply": "2024-03-11T15:46:02.520819Z" }, "id": "eL-Clat5yD8I", "outputId": "d2340516-50fd-4945-bda1-4510f9a82885" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, World!\t\r\n" ] } ], "source": [ "!cat my_output/part-00000" ] }, { "cell_type": "markdown", "metadata": { "id": "AmpHr_HyyMnM" }, "source": [ "# MapReduce without specifying mapper or reducer\n", "\n", "In the previous example, we ran a MapReduce job without specifying a reducer.\n", "\n", "Since the only required options for `mapred streaming` are `-input` and `-output`, we can also run a MapReduce job without specifying a mapper. The usage message below marks every other option as optional. It is printed together with an error because `-h` is not a recognized option; the message itself lists `-help` and `-info` instead." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:02.524604Z", "iopub.status.busy": "2024-03-11T15:46:02.524347Z", "iopub.status.idle": "2024-03-11T15:46:03.310013Z", "shell.execute_reply": "2024-03-11T15:46:03.309201Z" }, "id": "ZPWL1AiXyJac", "outputId": "3dfc9d86-85b9-497d-bfba-97f890a3424b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-03-11 15:46:03,162 ERROR streaming.StreamJob: Unrecognized option: -h\r\n", "Usage: $HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar [options]\r\n", "Options:\r\n", " -input DFS input file(s) for the Map step.\r\n", " -output DFS output directory for the Reduce step.\r\n", " -mapper Optional. Command to be run as mapper.\r\n", " -combiner Optional. Command to be run as combiner.\r\n", " -reducer Optional. Command to be run as reducer.\r\n", " -file Optional. File/dir to be shipped in the Job jar file.\r\n", " Deprecated. 
Use generic option \"-files\" instead.\r\n", " -inputformat \r\n", " Optional. The input format class.\r\n", " -outputformat \r\n", " Optional. The output format class.\r\n", " -partitioner Optional. The partitioner class.\r\n", " -numReduceTasks Optional. Number of reduce tasks.\r\n", " -inputreader Optional. Input recordreader spec.\r\n", " -cmdenv = Optional. Pass env.var to streaming commands.\r\n", " -mapdebug Optional. To run this script when a map task fails.\r\n", " -reducedebug Optional. To run this script when a reduce task fails.\r\n", " -io Optional. Format to use for input to and output\r\n", " from mapper/reducer commands\r\n", " -lazyOutput Optional. Lazily create Output.\r\n", " -background Optional. Submit the job and don't wait till it completes.\r\n", " -verbose Optional. Print verbose output.\r\n", " -info Optional. Print detailed usage." ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", " -help Optional. Print help message.\r\n", "\r\n", "Generic options supported are:\r\n", "-conf specify an application configuration file\r\n", "-D define a value for a given property\r\n", "-fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.\r\n", "-jt specify a ResourceManager\r\n", "-files specify a comma-separated list of files to be copied to the map reduce cluster\r\n", "-libjars specify a comma-separated list of jar files to be included in the classpath\r\n", "-archives specify a comma-separated list of archives to be unarchived on the compute machines\r\n", "\r\n", "The general command line syntax is:\r\n", "command [genericOptions] [commandOptions]\r\n", "\r\n", "\r\n", "For more details about these options:\r\n", "Use $HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar -info\r\n", "\r\n", "Try -help for more information\r\n", "Streaming Command Failed!\r\n" ] } ], "source": [ "!mapred streaming -h" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:03.313182Z", "iopub.status.busy": "2024-03-11T15:46:03.312776Z", "iopub.status.idle": "2024-03-11T15:46:06.892907Z", "shell.execute_reply": "2024-03-11T15:46:06.892285Z" }, "id": "5H2MkIUPyQc2", "outputId": "9337fa36-87fd-4ec5-c846-98810959ea59" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:04,240 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Deleted my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,316 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,402 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,402 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,414 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,551 INFO mapred.FileInputFormat: Total input files to process : 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,565 INFO mapreduce.JobSubmitter: number of splits:1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,710 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1336272806_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,710 INFO mapreduce.JobSubmitter: Executing with tokens: []\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,839 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,840 INFO mapreduce.Job: Running job: job_local1336272806_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,846 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,848 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,853 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,853 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,881 INFO mapred.LocalJobRunner: Waiting for map tasks\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,885 INFO mapred.LocalJobRunner: Starting task: attempt_local1336272806_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,904 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,904 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,920 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,926 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "2024-03-11 15:46:05,933 INFO mapred.MapTask: numReduceTasks: 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,949 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,949 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,949 INFO mapred.MapTask: soft limit at 83886080\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,949 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,949 INFO mapred.MapTask: kvstart = 26214396; length = 6553600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,953 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,957 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,957 INFO mapred.MapTask: Starting flush of map output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,957 INFO mapred.MapTask: Spilling map output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,957 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,957 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,962 INFO mapred.MapTask: Finished spill 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,970 INFO mapred.Task: Task:attempt_local1336272806_0001_m_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,972 INFO mapred.LocalJobRunner: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,972 INFO mapred.Task: Task 'attempt_local1336272806_0001_m_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,976 INFO mapred.Task: Final Counters for attempt_local1336272806_0001_m_000000_0: Counters: 17\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=782672\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output bytes=22\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output materialized bytes=30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\t\tMerged Map outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=337641472\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,976 INFO mapred.LocalJobRunner: Finishing task: attempt_local1336272806_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,977 INFO mapred.LocalJobRunner: map task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,979 INFO mapred.LocalJobRunner: Waiting for reduce tasks\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,980 INFO mapred.LocalJobRunner: Starting task: attempt_local1336272806_0001_r_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,990 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,990 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:05,990 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,001 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@6bc81e34\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,007 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": 
"stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,021 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2933076736, maxSingleShuffleLimit=733269184, mergeThreshold=1935830784, ioSortFactor=10, memToMemMergeOutputsThreshold=10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,022 INFO reduce.EventFetcher: attempt_local1336272806_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,042 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1336272806_0001_m_000000_0 decomp: 26 len: 30 to MEMORY\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,044 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local1336272806_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,046 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,048 INFO reduce.EventFetcher: EventFetcher is interrupted.. 
Returning\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,049 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,049 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,053 INFO mapred.Merger: Merging 1 sorted segments\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,054 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,056 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,057 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,058 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,058 INFO mapred.Merger: Merging 1 sorted segments\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,060 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,062 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,067 INFO mapred.Task: Task:attempt_local1336272806_0001_r_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,068 INFO mapred.LocalJobRunner: 1 / 1 copied.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,068 INFO mapred.Task: Task attempt_local1336272806_0001_r_000000_0 is allowed to commit now\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,069 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1336272806_0001_r_000000_0' to file:/home/runner/work/big_data/big_data/my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,071 INFO mapred.LocalJobRunner: reduce > reduce\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,071 INFO mapred.Task: Task 'attempt_local1336272806_0001_r_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,072 INFO mapred.Task: Final Counters for attempt_local1336272806_0001_r_000000_0: Counters: 24\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141529\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=782730\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine output records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\t\tReduce input groups=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce shuffle bytes=30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tShuffled Maps =1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=337641472\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tShuffle Errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBAD_ID=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCONNECTION=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tIO_ERROR=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_LENGTH=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_MAP=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_REDUCE=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=28\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,073 INFO mapred.LocalJobRunner: Finishing task: attempt_local1336272806_0001_r_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,073 INFO mapred.LocalJobRunner: reduce task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 
15:46:06,845 INFO mapreduce.Job: Job job_local1336272806_0001 running in uber mode : false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,846 INFO mapreduce.Job: map 100% reduce 100%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,847 INFO mapreduce.Job: Job job_local1336272806_0001 completed successfully\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,852 INFO mapreduce.Job: Counters: 30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=282966\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=1565402\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output bytes=22\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output materialized bytes=30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine input records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCombine output records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input groups=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\t\tReduce shuffle bytes=30\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tReduce output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tShuffled Maps =1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=675282944\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tShuffle Errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBAD_ID=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tCONNECTION=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tIO_ERROR=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_LENGTH=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_MAP=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tWRONG_REDUCE=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=28\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:06,852 INFO streaming.StreamJob: Output directory: my_output\n" ] } ], "source": [ "%%bash\n", "hdfs dfs -rm -r my_output\n", "\n", "mapred streaming \\\n", " -input hello.txt \\\n", " -output my_output" ] }, { "cell_type": "markdown", 
"metadata": { "id": "v7Ks3e96yXuB" }, "source": [ "## Verify the result" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:06.895928Z", "iopub.status.busy": "2024-03-11T15:46:06.895487Z", "iopub.status.idle": "2024-03-11T15:46:07.822467Z", "shell.execute_reply": "2024-03-11T15:46:07.821703Z" }, "id": "cWAXvG0_yThc", "outputId": "74f791c4-3aec-4f40-f4bb-72db1117c6b3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Check if MapReduce job was successful\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "_SUCCESS exists!\n" ] } ], "source": [ "%%bash\n", "\n", "echo \"Check if MapReduce job was successful\"\n", "hdfs dfs -test -e my_output/_SUCCESS\n", "if [ $? -eq 0 ]; then\n", "\techo \"_SUCCESS exists!\"\n", "fi" ] }, { "cell_type": "markdown", "metadata": { "id": "t40GgJ2Hya9P" }, "source": [ "Show output" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:07.825343Z", "iopub.status.busy": "2024-03-11T15:46:07.825011Z", "iopub.status.idle": "2024-03-11T15:46:07.960321Z", "shell.execute_reply": "2024-03-11T15:46:07.959722Z" }, "id": "I5APWEgoyaRS", "outputId": "dfa0295f-e20a-4f5e-d674-72500bb959c0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\tHello, World!\r\n" ] } ], "source": [ "!cat my_output/part-00000" ] }, { "cell_type": "markdown", "metadata": { "id": "mzfaMVKqyjpC" }, "source": [ "What happened here is that not having defined any mapper or reducer, the \"Identity\" mapper ([IdentityMapper](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/IdentityMapper.html)) and reducer ([IdentityReducer](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/IdentityReducer.html)) were used by default (see [Streaming command 
options](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options))." ] }, { "cell_type": "markdown", "metadata": { "id": "lzIuWv7Myndc" }, "source": [ "# Run a map-only MapReduce job\n", "\n", "Omitting the mapper and reducer from the job submission does not skip the map and reduce steps; MapReduce simply runs the identity mapper and reducer instead.\n", "\n", "To run a MapReduce job _without_ a reducer, pass the generic option\n", "\n", "`-D mapreduce.job.reduces=0`\n", "\n", "before the streaming options (see [specifying map-only jobs](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Specifying_Map-Only_Jobs))." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:07.963339Z", "iopub.status.busy": "2024-03-11T15:46:07.962912Z", "iopub.status.idle": "2024-03-11T15:46:11.455675Z", "shell.execute_reply": "2024-03-11T15:46:11.455021Z" }, "id": "OdwKWyVRye27", "outputId": "38091c9f-72d8-4b20-fed7-29e5769be908" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:08,861 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. 
Instead, use dfs.bytes-per-checksum\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Deleted my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:09,908 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:09,985 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:09,986 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:09,997 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,132 INFO mapred.FileInputFormat: Total input files to process : 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,145 INFO mapreduce.JobSubmitter: number of splits:1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,293 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1539286825_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,293 INFO mapreduce.JobSubmitter: Executing with tokens: []\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,409 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,411 INFO mapreduce.Job: Running job: job_local1539286825_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,417 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,420 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n" ] 
}, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,426 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,426 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,457 INFO mapred.LocalJobRunner: Waiting for map tasks\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,460 INFO mapred.LocalJobRunner: Starting task: attempt_local1539286825_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,477 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,478 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,497 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,504 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,511 INFO mapred.MapTask: numReduceTasks: 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,520 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,532 INFO mapred.Task: Task:attempt_local1539286825_0001_m_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,533 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,533 INFO mapred.Task: Task attempt_local1539286825_0001_m_000000_0 is allowed to commit now\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,535 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1539286825_0001_m_000000_0' to file:/home/runner/work/big_data/big_data/my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,536 INFO mapred.LocalJobRunner: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,536 INFO mapred.Task: Task 'attempt_local1539286825_0001_m_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,543 INFO mapred.Task: Final Counters for attempt_local1539286825_0001_m_000000_0: Counters: 15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=782636\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=28\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,543 INFO mapred.LocalJobRunner: Finishing task: attempt_local1539286825_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:10,544 INFO mapred.LocalJobRunner: map task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:11,420 INFO mapreduce.Job: Job job_local1539286825_0001 running in uber mode : false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:11,422 INFO mapreduce.Job: map 100% reduce 0%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:11,423 INFO mapreduce.Job: Job job_local1539286825_0001 completed successfully\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:11,427 INFO mapreduce.Job: Counters: 15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "\t\tFILE: Number of bytes written=782636\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=28\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:11,427 INFO streaming.StreamJob: Output directory: my_output\n" ] } ], "source": [ "%%bash\n", "hdfs dfs -rm -r my_output\n", "\n", "mapred streaming \\\n", " -D mapreduce.job.reduces=0 \\\n", " -input hello.txt \\\n", " -output my_output" ] }, { "cell_type": "markdown", "metadata": { "id": "QZIE9yXOyyHJ" }, "source": [ "## Verify the result" ] }, { "cell_type": "code", 
"execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:11.458655Z", "iopub.status.busy": "2024-03-11T15:46:11.458209Z", "iopub.status.idle": "2024-03-11T15:46:12.450827Z", "shell.execute_reply": "2024-03-11T15:46:12.450185Z" }, "id": "7Dt3tUI0yu5e", "outputId": "b227027b-eb0f-46d1-f08d-d0d7992c5539" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\tHello, World!\r\n" ] } ], "source": [ "!hdfs dfs -test -e my_output/_SUCCESS && cat my_output/part-00000" ] }, { "cell_type": "markdown", "metadata": { "id": "hUGEUv99y3cM" }, "source": [ "## Why a map-only application?\n", "\n", "The advantage of a map-only job is that the sorting and shuffling phases are skipped, so if you do not need that remember to specify `-D mapreduce.job.reduces=0 `.\n", "\n", "On the other hand, a MapReduce job even with the default `IdentityReducer` will deliver sorted results because the data passed from the mapper to the reducer always gets sorted.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FhVVFEdKzGcI" }, "source": [ "# Improved version of the MapReduce \"Hello, World!\" application\n", "\n", "Taking into account the previous considerations, here's a more efficient version of the 'Hello, World!' application that bypasses the shuffling and sorting step." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-03-11T15:46:12.453624Z", "iopub.status.busy": "2024-03-11T15:46:12.453404Z", "iopub.status.idle": "2024-03-11T15:46:15.963289Z", "shell.execute_reply": "2024-03-11T15:46:15.962625Z" }, "id": "jLgMXX2jy0vC", "outputId": "c0b66feb-bf48-42cd-ac38-909dfb638c2e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:13,342 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. 
Instead, use dfs.bytes-per-checksum\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Deleted my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,418 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,499 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,500 INFO impl.MetricsSystemImpl: JobTracker metrics system started\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,510 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,625 INFO mapred.FileInputFormat: Total input files to process : 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,639 INFO mapreduce.JobSubmitter: number of splits:1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,784 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1621052802_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,784 INFO mapreduce.JobSubmitter: Executing with tokens: []\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,906 INFO mapreduce.Job: The url to track the job: http://localhost:8080/\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,910 INFO mapreduce.Job: Running job: job_local1621052802_0001\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,916 INFO mapred.LocalJobRunner: OutputCommitter set in config null\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,921 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter\n" ] 
}, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,927 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,927 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,958 INFO mapred.LocalJobRunner: Waiting for map tasks\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,962 INFO mapred.LocalJobRunner: Starting task: attempt_local1621052802_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,981 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:14,983 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,002 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,008 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,017 INFO mapred.MapTask: numReduceTasks: 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,024 INFO streaming.PipeMapRed: PipeMapRed exec [/bin/cat]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,029 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. 
Instead, use mapreduce.task.output.dir\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,032 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,032 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,033 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,035 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,036 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,038 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,039 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,040 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,041 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,042 INFO Configuration.deprecation: mapred.skip.on is deprecated. 
Instead, use mapreduce.job.skiprecords\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,043 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,056 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,058 INFO streaming.PipeMapRed: MRErrorThread done\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,061 INFO streaming.PipeMapRed: Records R/W=1/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,063 INFO streaming.PipeMapRed: mapRedFinished\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,065 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,070 INFO mapred.Task: Task:attempt_local1621052802_0001_m_000000_0 is done. 
And is in the process of committing\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,073 INFO mapred.LocalJobRunner: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,073 INFO mapred.Task: Task attempt_local1621052802_0001_m_000000_0 is allowed to commit now\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,076 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1621052802_0001_m_000000_0' to file:/home/runner/work/big_data/big_data/my_output\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,077 INFO mapred.LocalJobRunner: Records R/W=1/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,078 INFO mapred.Task: Task 'attempt_local1621052802_0001_m_000000_0' done.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,084 INFO mapred.Task: Final Counters for attempt_local1621052802_0001_m_000000_0: Counters: 15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes written=785612\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split 
bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=27\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,084 INFO mapred.LocalJobRunner: Finishing task: attempt_local1621052802_0001_m_000000_0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,085 INFO mapred.LocalJobRunner: map task executor complete.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,914 INFO mapreduce.Job: Job job_local1621052802_0001 running in uber mode : false\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,916 INFO mapreduce.Job: map 100% reduce 0%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,918 INFO mapreduce.Job: Job job_local1621052802_0001 completed successfully\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,922 INFO mapreduce.Job: Counters: 15\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile System Counters\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes read=141437\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of bytes 
written=785612\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of large read operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFILE: Number of write operations=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tMap-Reduce Framework\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap input records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMap output records=1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tInput split bytes=102\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tSpilled Records=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tFailed Shuffles=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tMerged Map outputs=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tGC time elapsed (ms)=0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tTotal committed heap usage (bytes)=314572800\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Input Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Read=14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tFile Output Format Counters \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t\tBytes Written=27\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-03-11 15:46:15,922 INFO streaming.StreamJob: Output directory: my_output\n" ] } ], "source": [ "%%bash\n", "hdfs dfs -rm -r my_output\n", "\n", "mapred streaming \\\n", " -D mapreduce.job.reduces=0 \\\n", " -input hello.txt \\\n", " -output my_output \\\n", " -mapper '/bin/cat'" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { 
"iopub.execute_input": "2024-03-11T15:46:15.966342Z", "iopub.status.busy": "2024-03-11T15:46:15.965770Z", "iopub.status.idle": "2024-03-11T15:46:16.962515Z", "shell.execute_reply": "2024-03-11T15:46:16.961893Z" }, "id": "Sa1UDPr6zKKw", "outputId": "3e1e2461-3181-4ab1-82dd-5931eebd38eb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, World!\t\r\n" ] } ], "source": [ "!hdfs dfs -test -e my_output/_SUCCESS && cat my_output/part-00000" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyOOpT0soTGaqyxm/vlb8BfU", "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 0 }