{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "Vvgkiyb9MnxG" }, "source": [ "\n", "# N-grams with PySpark\n", "\n", "\n", "In this notebook we are going to extract _n-grams_ from a text file using PySpark.\n", "\n", "N-grams in Natural Language Processing refer to sequences of tokens (or words) used to calculate the probability of a word given its preceding context. In a bigram model, the prediction for the next word relies on the single preceding word, while in a trigram model, it considers the two preceding words, and so forth.\n", "\n", "We will utilize the PySpark distribution provided by `pip` with its integrated Spark engine running as a single Java virtual machine in _pseudo-distributed mode_.\n", "\n", "More questions on PySpark are answered in [PySpark on Google Colab](https://github.com/groda/big_data/blob/master/PySpark_On_Google_Colab.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "id": "6Es_sqr_zkEv" }, "source": [ "# Install PySpark" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-13T09:43:23.120979Z", "iopub.status.busy": "2024-10-13T09:43:23.120771Z", "iopub.status.idle": "2024-10-13T09:43:23.970992Z", "shell.execute_reply": "2024-10-13T09:43:23.970131Z" }, "id": "Vb65ztyHzktk", "outputId": "54ceaf85-8972-47f7-b05d-388134bb3012" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyspark in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (3.4.0)\r\n", "Requirement already satisfied: py4j==0.10.9.7 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pyspark) (0.10.9.7)\r\n" ] } ], "source": [ "!pip install pyspark" ] }, { "cell_type": "markdown", "metadata": { "id": "cN-_-qhg1f5Y" }, "source": [ "# Download a text file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-13T09:43:23.974042Z", "iopub.status.busy": "2024-10-13T09:43:23.973775Z", "iopub.status.idle": "2024-10-13T09:43:24.514012Z", "shell.execute_reply": "2024-10-13T09:43:24.513277Z" }, "id": "hRaJU8n61ZtM", "outputId": "2e45baed-4269-4e25-f30f-b40777d94b0b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-10-13 09:43:23-- https://www.gutenberg.org/cache/epub/996/pg996.txt\r\n", "Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\r\n", "Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "HTTP request sent, awaiting response... 
" ] }, { "name": "stdout", "output_type": "stream", "text": [ "200 OK\r\n", "Length: 2391721 (2.3M) [text/plain]\r\n", "Saving to: ‘don_quixote.txt’\r\n", "\r\n", "\r", "don_quixote.txt 0%[ ] 0 --.-KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "don_quixote.txt 66%[============> ] 1.52M 6.99MB/s \r", "don_quixote.txt 100%[===================>] 2.28M 10.3MB/s in 0.2s \r\n", "\r\n", "2024-10-13 09:43:24 (10.3 MB/s) - ‘don_quixote.txt’ saved [2391721/2391721]\r\n", "\r\n" ] } ], "source": [ "FILENAME = 'don_quixote.txt'\n", "!wget --no-clobber https://www.gutenberg.org/cache/epub/996/pg996.txt -O {FILENAME}" ] }, { "cell_type": "markdown", "metadata": { "id": "KJBFfbwD26gZ" }, "source": [ "# Spark Web UI\n", "\n", "The Spark Web UI\n", "\n", "> “*provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.*” (https://spark.apache.org/docs/3.5.0/web-ui.html)\n", "\n", "See more in the official [Spark Documentation](https://spark.apache.org/docs/3.5.0/web-ui.html).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wehFp8H83xaM" }, "source": [ "## Set `SPARK_HOME`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-13T09:43:24.516243Z", "iopub.status.busy": "2024-10-13T09:43:24.516028Z", "iopub.status.idle": "2024-10-13T09:43:24.541857Z", "shell.execute_reply": "2024-10-13T09:43:24.541131Z" }, "id": "8H1zb3ot3wX7", "outputId": "7c020a4c-5223-4aa9-f50d-d574f0f66b6f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Output of find_spark_home.py: /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pyspark\n", "\n" ] } ], "source": [ "import subprocess\n", "import os\n", "# Run the script and capture its output\n", "result = subprocess.run([\"find_spark_home.py\"], capture_output=True, text=True)\n", "\n", "# Print or use the captured output\n", "print(\"Output of find_spark_home.py:\", result.stdout)\n", "\n", "# set SPARK_HOME environment variable\n", "os.environ['SPARK_HOME'] = result.stdout.strip()" ] }, { "cell_type": "markdown", "metadata": { "id": "pGiK0pR6hBSw" }, "source": [ "## Import `output` library if in Colab\n", "\n", "The Web UI can be viewed in a separate windows or tab thanks to Colab's `output` library.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-10-13T09:43:24.544162Z", "iopub.status.busy": "2024-10-13T09:43:24.543767Z", "iopub.status.idle": "2024-10-13T09:43:24.547200Z", "shell.execute_reply": "2024-10-13T09:43:24.546643Z" }, "id": "BBSd3MIShROn" }, "outputs": [], "source": [ "# true if running on Google Colab\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " from google.colab import output" ] }, { "cell_type": "markdown", "metadata": { "id": "08oMOPq33KTL" }, "source": [ "### Spark UI" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "execution": { "iopub.execute_input": "2024-10-13T09:43:24.549192Z", "iopub.status.busy": "2024-10-13T09:43:24.548815Z", "iopub.status.idle": "2024-10-13T09:43:24.551657Z", "shell.execute_reply": "2024-10-13T09:43:24.551112Z" }, "id": "Zgb8PhxD28ao", "outputId": "7ad48528-3903-4ee2-f4b0-b46c4054f25d" }, "outputs": [], "source": [ "if IN_COLAB:\n", " output.serve_kernel_port_as_window(4040, 
path='/jobs/index.html')" ] }, { "cell_type": "markdown", "metadata": { "id": "J1P5kGgnzbk4" }, "source": [ "# Create a Spark context" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-10-13T09:43:24.553791Z", "iopub.status.busy": "2024-10-13T09:43:24.553432Z", "iopub.status.idle": "2024-10-13T09:43:27.567107Z", "shell.execute_reply": "2024-10-13T09:43:27.566318Z" }, "id": "jQmeEZ0bzbk6" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "24/10/13 09:43:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] } ], "source": [ "from pyspark import SparkContext\n", "\n", "sc = SparkContext(\n", "    appName=\"Ngrams with pyspark \" + FILENAME\n", ")" ] },
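{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, optional sanity check (a minimal sketch, not part of the original flow), we can run a small job on the new context before inspecting it below: `sc.textFile` reads the downloaded book into an RDD and `count` triggers an actual computation, which will also show up in the Web UI. `sc.uiWebUrl` reports the address of that UI." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (illustrative): run a small job on the new context.\n", "rdd = sc.textFile(FILENAME)              # lazily read the text file into an RDD\n", "print('number of lines:', rdd.count())   # count() forces the computation\n", "print('Spark Web UI at:', sc.uiWebUrl)   # address of the monitoring UI" ] },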
{ "cell_type": "markdown", "metadata": { "id": "l580GAyJzbk7" }, "source": [ "## View Spark context" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 196 }, "execution": { "iopub.execute_input": "2024-10-13T09:43:27.569960Z", "iopub.status.busy": "2024-10-13T09:43:27.569453Z", "iopub.status.idle": "2024-10-13T09:43:27.578277Z", "shell.execute_reply": "2024-10-13T09:43:27.577595Z" }, "id": "DVbC0MBszbk7", "outputId": "447a0e1a-fdb3-46ed-f000-4da0e4e62210" }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<p><b>SparkContext</b></p>\n", "<dl>\n", "<dt>Version</dt><dd><code>v3.4.0</code></dd>\n", "<dt>Master</dt><dd><code>local[*]</code></dd>\n", "<dt>AppName</dt><dd><code>Ngrams with pyspark don_quixote.txt