{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "hyHna6wYs42H" }, "source": [ "
\n", "# A simple MapReduce job with mrjob\n", "\n", "`mrjob` is a **Python framework** that simplifies writing and running multi-step **MapReduce jobs** and **hybrid Spark jobs** entirely in Python. It allows developers to test jobs locally and then execute them seamlessly across different backends like **Hadoop, YARN, or AWS EMR** with minimal code changes. (**Note:** While $\\text{mrjob}$ is robust and was famously developed and used extensively by Yelp, the project has not been actively maintained or updated in recent years.).\n", "\n", "In this notebook, we'll start with a basic wordcount example to demonstrate its core functionality.\n", "\n", "Find the official $\\text{mrjob}$ documentation here: [https://mrjob.readthedocs.io/en/latest/](https://mrjob.readthedocs.io/en/latest/)" ] }, { "cell_type": "code", "source": [ "!pip install mrjob" ], "metadata": { "id": "JBhcJ_vxyObq", "outputId": "41f11977-7133-4307-c6c1-f4314830e66d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: mrjob in /usr/local/lib/python3.12/dist-packages (0.7.4)\n", "Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.12/dist-packages (from mrjob) (6.0.3)\n" ] } ] }, { "cell_type": "markdown", "source": [ "Let us check if there are any examples that come with the `mrjob` distribution." ], "metadata": { "id": "FjlHvEyVydfe" } }, { "cell_type": "code", "source": [ "!find /usr -name \"*examples*\" |grep mrjob" ], "metadata": { "id": "QtlHwXwEySPD", "outputId": "bd903740-970f-49c9-ab24-01e1edbdfd97", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "/usr/local/lib/python3.12/dist-packages/mrjob/examples\n" ] } ] }, { "cell_type": "code", "source": [ "!pip show mrjob" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BTokmPInF9Ev", "outputId": "8bfa6cd8-e161-4b17-8498-5108a9df193d" }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Name: mrjob\n", "Version: 0.7.4\n", "Summary: Python MapReduce framework\n", "Home-page: http://github.com/Yelp/mrjob\n", "Author: David Marin\n", "Author-email: dm@davidmarin.org\n", "License: Apache\n", "Location: /usr/local/lib/python3.12/dist-packages\n", "Requires: PyYAML\n", "Required-by: \n" ] } ] }, { "cell_type": "markdown", "source": [ "Here's the list of examples:" ], "metadata": { "id": "DLyGstg2yk3l" } }, { "cell_type": "code", "source": [ "!ls $(pip show mrjob | grep Location | cut -d ' ' -f 2)/mrjob/examples" ], "metadata": { "id": "tJUj12vAylCk", "outputId": "84b47e1e-dd27-4ad0-f695-80b80ee29d53", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "docs-to-classify\t mr_phone_to_url.py\n", "__init__.py\t\t mr_sparkaboom.py\n", "mr_boom.py\t\t mr_spark_most_used_word.py\n", "mr_count_lines_by_file.py mr_spark_wordcount.py\n", "mr_count_lines_right.py mr_spark_wordcount_script.py\n", "mr_count_lines_wrong.py mr_text_classifier.py\n", "mr_grep.py\t\t mr_wc.py\n", "mr_jar_step_example.py\t mr_word_freq_count.py\n", "mr_log_sampler.py\t mr_words_containing_u_freq_count.py\n", "mr_most_used_word.py\t nicknack-1.0.1.jar\n", "mr_next_word_stats.py\t __pycache__\n", "mr_nick_nack_input_format.py spark_wordcount_script.py\n", "mr_nick_nack.py\t\t stop_words.txt\n" ] } ] }, { "cell_type": 
"markdown", "source": [ "`mr_wc.py` must be the classic \"word count\" example." ], "metadata": { "id": "ca-2NlLPy7FX" } }, { "cell_type": "code", "source": [ "!cat $(pip show mrjob | grep Location | cut -d ' ' -f 2)/mrjob/examples/mr_wc.py" ], "metadata": { "id": "_MPypLJnyzQZ", "outputId": "2e70e1be-450a-402e-e14e-98017aa44d7d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "# Copyright 2009-2010 Yelp\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "\n", "\"\"\"An implementation of wc as an MRJob.\n", "\n", "This is meant as an example of why mapper_final is useful.\"\"\"\n", "from mrjob.job import MRJob\n", "\n", "\n", "class MRWordCountUtility(MRJob):\n", "\n", " def __init__(self, *args, **kwargs):\n", " super(MRWordCountUtility, self).__init__(*args, **kwargs)\n", " self.chars = 0\n", " self.words = 0\n", " self.lines = 0\n", "\n", " def mapper(self, _, line):\n", " # Don't actually yield anything for each line. Instead, collect them\n", " # and yield the sums when all lines have been processed. The results\n", " # will be collected by the reducer.\n", " self.chars += len(line) + 1 # +1 for newline\n", " self.words += sum(1 for word in line.split() if word.strip())\n", " self.lines += 1\n", "\n", " def mapper_final(self):\n", " yield('chars', self.chars)\n", " yield('words', self.words)\n", " yield('lines', self.lines)\n", "\n", " def reducer(self, key, values):\n", " yield(key, sum(values))\n", "\n", "\n", "if __name__ == '__main__':\n", " MRWordCountUtility.run()\n" ] } ] }, { "cell_type": "markdown", "source": [ "Let us create a symbolic link to `/usr/local/lib/.../dist-packages/mrjob/examples` so that we don't need to type long paths (and the folder is visible in the left pane)." ], "metadata": { "id": "wWvXygBjzWbl" } }, { "cell_type": "code", "source": [ "!ln -s $(pip show mrjob | grep Location | cut -d ' ' -f 2)/mrjob/examples examples" ], "metadata": { "id": "ntSu0qgmzgkF" }, "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "We are going to need some text data to run the wordcount example. It is common for Hadoop distributions to provide some toy data together with example scripts. And in fact, also `mrjob` includes some data in the folder `docs-to-classify` (subfolder of `examples`). Thi will do it for our wordcount demonstration." 
], "metadata": { "id": "xdvSx6SkzqAj" } }, { "cell_type": "code", "source": [ "!ls -lh examples/docs-to-classify" ], "metadata": { "id": "ov3QzMzCzpWJ", "outputId": "6ef1de27-4d6b-43eb-b2c3-849d06da9a66", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "total 88K\n", "-rw-r--r-- 1 root root 9.4K Feb 7 17:11 american_feuillage-whitman-america.txt\n", "-rw-r--r-- 1 root root 933 Feb 7 17:11 as_i_ponderd_in_silence-whitman.txt\n", "-rw-r--r-- 1 root root 1.2K Feb 7 17:11 buckingham_palace-milne-not_america.txt\n", "-rw-r--r-- 1 root root 20K Feb 7 17:11 chants_democratic-whitman-america.txt\n", "-rw-r--r-- 1 root root 288 Feb 7 17:11 corner_of_the_street-milne-not_whitman.txt\n", "-rw-r--r-- 1 root root 154 Feb 7 17:11 happiness-milne.txt\n", "-rw-r--r-- 1 root root 1.5K Feb 7 17:11 in_cabind_ships_at_sea-whitman.txt\n", "-rw-r--r-- 1 root root 326 Feb 7 17:11 lines_and_squares-milne-animals.txt\n", "-rw-r--r-- 1 root root 432 Feb 7 17:11 ones_self_i_sing-whitman.txt\n", "-rw-r--r-- 1 root root 1.2K Feb 7 17:11 puppy_and_i-milne-animals.txt\n", "-rw-r--r-- 1 root root 415 Feb 7 17:11 the_christening-milne-animals.txt\n", "-rw-r--r-- 1 root root 869 Feb 7 17:11 the_four_friends-milne-animals.txt\n", "-rw-r--r-- 1 root root 500 Feb 7 17:11 to_a_historian-whitman.txt\n", "-rw-r--r-- 1 root root 193 Feb 7 17:11 to_foreign_lands-whitman-america.txt\n", "-rw-r--r-- 1 root root 895 Feb 7 17:11 to_thee_old_cause-whitman.txt\n", "-rw-r--r-- 1 root root 226 Feb 7 17:11 twinkletoes-milne.txt\n" ] } ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "0WSY1wUcs42O", "outputId": "24dbd6cb-e083-42d5-d517-653508ff41e1", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\"lines\"\t660\n", "\"chars\"\t37967\n", "\"words\"\t6371\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "No configs found; falling back on auto-configuration\n", "No configs specified for inline runner\n", "Creating temp directory /tmp/mr_wc.root.20260207.172720.278294\n", "Running step 1 of 1...\n", "job output is in /tmp/mr_wc.root.20260207.172720.278294/output\n", "Streaming final output from /tmp/mr_wc.root.20260207.172720.278294/output...\n", "Removing temp directory /tmp/mr_wc.root.20260207.172720.278294...\n" ] } ], "source": [ "%%bash\n", "\n", "DATA=examples/docs-to-classify\n", "\n", "python examples/mr_wc.py $DATA" ] }, { "cell_type": "markdown", "source": [ "We can verify if the result is correct by concatenating all files in `examples/docs-to-classify` and counting lines/words/characters with the customary shell command `wc`." ], "metadata": { "id": "_a0Tgldl1951" } }, { "cell_type": "code", "source": [ "!cat examples/docs-to-classify/* |wc" ], "metadata": { "id": "4QKl1xclwCtj", "outputId": "ea617440-4cb2-4071-b39d-13f2138cef0d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " 660 6371 38444\n" ] } ] }, { "cell_type": "markdown", "source": [ "The number of lines and words is the same but `wc` returns a different number of characters ($38444$ vs. the $37967$ from the `mrjob` example)." 
], "metadata": { "id": "IWPsqJM82ykm" } }, { "cell_type": "code", "source": [ "38444-37967" ], "metadata": { "id": "8DsTxmVC3ScJ", "outputId": "e50a4adb-8ecc-4996-acca-0486b859c614", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "477" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "source": [ "!python examples/mr_wc.py examples/docs-to-classify/american_feuillage-whitman-america.txt 2>/dev/null |grep chars" ], "metadata": { "id": "T8Als_X74Ks1", "outputId": "a0e52ff2-ddbb-4e72-b3a4-a9f755665ebc", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 11, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\"chars\"\t9409\n" ] } ] }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "for f in examples/docs-to-classify/*\n", "do\n", " wc -c $f\n", " DATA=$f\n", " python examples/mr_wc.py $DATA 2>/dev/null |grep chars\n", "done\n" ], "metadata": { "id": "sTpsgC0m3czJ", "outputId": "3fcb6278-7d83-4d7b-b50d-0664ab89ea5d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 12, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "9555 examples/docs-to-classify/american_feuillage-whitman-america.txt\n", "\"chars\"\t9409\n", "933 examples/docs-to-classify/as_i_ponderd_in_silence-whitman.txt\n", "\"chars\"\t925\n", "1151 examples/docs-to-classify/buckingham_palace-milne-not_america.txt\n", "\"chars\"\t1085\n", "19857 examples/docs-to-classify/chants_democratic-whitman-america.txt\n", "\"chars\"\t19716\n", "288 examples/docs-to-classify/corner_of_the_street-milne-not_whitman.txt\n", "\"chars\"\t280\n", "154 examples/docs-to-classify/happiness-milne.txt\n", "\"chars\"\t152\n", "1496 examples/docs-to-classify/in_cabind_ships_at_sea-whitman.txt\n", "\"chars\"\t1482\n", "326 examples/docs-to-classify/lines_and_squares-milne-animals.txt\n", "\"chars\"\t318\n", "432 examples/docs-to-classify/ones_self_i_sing-whitman.txt\n", "\"chars\"\t428\n", "1154 examples/docs-to-classify/puppy_and_i-milne-animals.txt\n", "\"chars\"\t1092\n", "415 examples/docs-to-classify/the_christening-milne-animals.txt\n", "\"chars\"\t403\n", "869 examples/docs-to-classify/the_four_friends-milne-animals.txt\n", "\"chars\"\t867\n", "500 examples/docs-to-classify/to_a_historian-whitman.txt\n", "\"chars\"\t500\n", "193 examples/docs-to-classify/to_foreign_lands-whitman-america.txt\n", "\"chars\"\t191\n", "895 examples/docs-to-classify/to_thee_old_cause-whitman.txt\n", "\"chars\"\t893\n", "226 examples/docs-to-classify/twinkletoes-milne.txt\n", "\"chars\"\t226\n" ] } ] }, { "cell_type": "markdown", "source": [ "`mrjob` appears to consistently return a smaller number of characters. Let us open the smallest file and count the characters manually to understand what's going on.\n", "\n", "The smallest file is `examples/docs-to-classify/happiness-milne.txt` with $152$ characters according to `mrjob` and $154$ according to `wc`." 
], "metadata": { "id": "lEGKUCcu477o" } }, { "cell_type": "code", "source": [ "!cat examples/docs-to-classify/happiness-milne.txt" ], "metadata": { "id": "jRIQ4Nei3c5v", "outputId": "59d97d21-c1d1-4fdc-e192-21dad62afc93", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 13, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "John had\n", "Great Big\n", "Waterproof\n", "Boots on;\n", "John had a\n", "Great Big\n", "Waterproof\n", "Hat;\n", "John had a\n", "Great Big\n", "Waterproof\n", "Mackintosh –\n", "And that\n", "(Said John)\n", "Is\n", "That.\n" ] } ] }, { "cell_type": "markdown", "source": [ "Two hours later ... every time I count the characters I get a different number 🤔" ], "metadata": { "id": "2hgvNBRf6dnA" } }, { "cell_type": "markdown", "source": [ "Let's try using `wc`: if the result of `wc -c` is greater than the result of `wc -m`, the file contains multi-byte characters." ], "metadata": { "id": "HHg39VDF69Ef" } }, { "cell_type": "code", "source": [ "!wc -m examples/docs-to-classify/happiness-milne.txt" ], "metadata": { "id": "WoePPoub6kdD", "outputId": "fe6a54b4-0c49-46c5-8936-8419cf97e301", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 14, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "152 examples/docs-to-classify/happiness-milne.txt\n" ] } ] }, { "cell_type": "code", "source": [ "!wc -c examples/docs-to-classify/happiness-milne.txt" ], "metadata": { "id": "NhHZox6j6v2L", "outputId": "bf3eb818-caca-4036-ea89-559275e765f1", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 15, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "154 examples/docs-to-classify/happiness-milne.txt\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "AAMQoJwzs42L" }, "source": [ "OK, so our `mrjob` script is counting multi-byte characters as multiple characters. 
, "\n", "\n", "Let us verify this across all the files by comparing `wc -m` with the `mrjob` output:" ] }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "for f in examples/docs-to-classify/*\n", "do\n", " wc -m $f\n", " DATA=$f\n", " python examples/mr_wc.py $DATA 2>/dev/null |grep chars\n", "done" ], "metadata": { "id": "vgZ7p68T7stq", "outputId": "b6e40717-7a18-4079-9813-4b48ba61c5dd", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 16, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "9409 examples/docs-to-classify/american_feuillage-whitman-america.txt\n", "\"chars\"\t9409\n", "925 examples/docs-to-classify/as_i_ponderd_in_silence-whitman.txt\n", "\"chars\"\t925\n", "1085 examples/docs-to-classify/buckingham_palace-milne-not_america.txt\n", "\"chars\"\t1085\n", "19716 examples/docs-to-classify/chants_democratic-whitman-america.txt\n", "\"chars\"\t19716\n", "280 examples/docs-to-classify/corner_of_the_street-milne-not_whitman.txt\n", "\"chars\"\t280\n", "152 examples/docs-to-classify/happiness-milne.txt\n", "\"chars\"\t152\n", "1482 examples/docs-to-classify/in_cabind_ships_at_sea-whitman.txt\n", "\"chars\"\t1482\n", "318 examples/docs-to-classify/lines_and_squares-milne-animals.txt\n", "\"chars\"\t318\n", "428 examples/docs-to-classify/ones_self_i_sing-whitman.txt\n", "\"chars\"\t428\n", "1092 examples/docs-to-classify/puppy_and_i-milne-animals.txt\n", "\"chars\"\t1092\n", "403 examples/docs-to-classify/the_christening-milne-animals.txt\n", "\"chars\"\t403\n", "867 examples/docs-to-classify/the_four_friends-milne-animals.txt\n", "\"chars\"\t867\n", "500 examples/docs-to-classify/to_a_historian-whitman.txt\n", "\"chars\"\t500\n", "191 examples/docs-to-classify/to_foreign_lands-whitman-america.txt\n", "\"chars\"\t191\n", "893 examples/docs-to-classify/to_thee_old_cause-whitman.txt\n", "\"chars\"\t893\n", "226 examples/docs-to-classify/twinkletoes-milne.txt\n", "\"chars\"\t226\n" ] } ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "lk55j4zks42L", "outputId": "98035567-8dc5-460a-a66a-ebe871709d72", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "37967\n" ] } ], "source": [ "!cat examples/docs-to-classify/* |wc -m" ] }, { "cell_type": "markdown", "metadata": { "id": "vpc1DS1Qs42N" }, "source": [ "✅ $37967$ is the same result as the output of the `mrjob` script!" ] }, { "cell_type": "markdown", "source": [ "Note that this job ran only **locally**.\n", "\n", "To see an example of a job running on a cluster, please check the tutorial below:\n", "\n", "[Getting started with `mrjob`](https://github.com/groda/big_data/blob/master/getting_started_with_mrjob.ipynb)" ], "metadata": { "id": "lZGbeeBq939w" } }, { "cell_type": "markdown", "source": [ "## Context: `mrjob` Maintenance Status\n", "\n", "`mrjob` was originally developed and open-sourced by **Yelp**, the well-known business review platform. 
Yelp created and heavily relied on `mrjob` as their primary framework for running analytical jobs across their large Hadoop clusters.\n", "\n", "While `mrjob` is a robust and widely used tool that simplifies the development and deployment of Python-based MapReduce and Spark jobs, the project's **active maintenance has slowed significantly in recent years**.\n", "\n", "Here's why this is relevant:\n", "\n", "* **Yelp's Transition:** Like many tech companies, Yelp has likely evolved its data infrastructure, shifting toward newer technologies (such as pure Spark, Flink, or cloud-native solutions) that offer better performance or integration with modern cloud platforms. This reduces the immediate need for them to invest resources in updating the `mrjob` core library.\n", "* **Feature Stagnation:** The codebase generally receives fewer updates, bug fixes, and new features compared to actively maintained frameworks. Users may find that support for the **very latest versions of Hadoop, Spark, or Python** can lag behind.\n", "* **Stability vs. Modernity:** Despite the lack of recent updates, `mrjob` remains stable and perfectly functional for environments using compatible versions of Hadoop and Spark. It serves as a strong, proven framework for those who value its **simplicity and unified Python interface** over the bleeding edge of data technology.\n" ], "metadata": { "id": "HsZIEDzkuBcb" } }, { "cell_type": "markdown", "source": [ "> 💡 Reviving an open-source project like `mrjob` would be a valuable contribution to the open-source world. The current repository can be found at https://github.com/Yelp/mrjob." ], "metadata": { "id": "226MPEqzHQB3" } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "colab": { "provenance": [], "include_colab_link": true } }, "nbformat": 4, "nbformat_minor": 0 }