{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Apache Spark\n", "\n", "## Install\n", "\n", "1. Install the Java JDK from\n", " [Oracle](https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html).\n", "1. Run `pip3 install pyspark` to get Spark and its Python bindings.\n", "1. Set an environment variable, e.g. in your `~/.zshrc` file:\n", "\n", "```\n", "PYSPARK_PYTHON=\"python3\"\n", "export PYSPARK_PYTHON\n", "```\n", "\n", "Start a new terminal and run `pyspark`.\n", "\n", "Then start reading the\n", "[quick-start guide](https://spark.apache.org/docs/latest/quick-start.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use here in Jupyter\n", "\n", "We try to get something meaningful done with the words of the BHSA, right here in this notebook.\n", "\n", "First we explode the `g_word_utf8` feature into a plain file with one node and its value per line:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.27 s, sys: 42.6 ms, total: 1.32 s\n", "Wall time: 1.32 s\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "from tf.convert.tf import explode\n", "\n", "explode('~/github/etcbc/bhsa/tf/c/g_word_utf8.tf', 'explode/out')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\tבְּ\n", "2\tרֵאשִׁ֖ית\n", "3\tבָּרָ֣א\n", "4\tאֱלֹהִ֑ים\n", "5\tאֵ֥ת\n", "6\tהַ\n", "7\tשָּׁמַ֖יִם\n", "8\tוְ\n", "9\tאֵ֥ת\n", "10\tהָ\n", "11\tאָֽרֶץ\n", "12\tוְ\n", "13\tהָ\n", "14\tאָ֗רֶץ\n", "15\tהָיְתָ֥ה\n" ] } ], "source": [ "!head -n 15 explode/out/g_word_utf8.tf " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426570\tבִּ\n", "426571\tירוּשָׁלִַ֖ם\n", "426572\tאֲשֶׁ֣ר\n", "426573\tבִּֽ\n", 
"426574\tיהוּדָ֑ה\n", "426575\tמִֽי\n", "426576\tבָכֶ֣ם\n", "426577\tמִ\n", "426578\tכָּל\n", "426579\tעַמֹּ֗ו\n", "426580\tיְהוָ֧ה\n", "426581\tאֱלֹהָ֛יו\n", "426582\tעִמֹּ֖ו\n", "426583\tוְ\n", "426584\tיָֽעַל\n" ] } ], "source": [ "!tail -n 15 explode/out/g_word_utf8.tf " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brilliant." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Spark\n", "\n", "Spark *just works* in the notebook!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.3 ms, sys: 12 ms, total: 27.3 ms\n", "Wall time: 3.18 s\n" ] } ], "source": [ "%%time\n", "\n", "conf = SparkConf().setAppName(\"bhsa\").setMaster(\"local\")\n", "sc = SparkContext(conf=conf)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "lines = sc.textFile(\"explode/out/g_word_utf8.tf\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "426584" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines.count()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1\\tבְּ'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines.first()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "pairs = lines.map(lambda s: tuple(reversed(s.split(\"\\t\"))))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('בְּ', '1')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs.first()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Due to the hebrew you do not see that '1' is the second element:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'בְּ'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs.first()[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want the `1` as integer, or rather, as a tuple of one integer (becomes clear later)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def makePair(s):\n", " (node, value) = s.split(\"\\t\")\n", " return (value, (int(node),))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 604 µs, sys: 506 µs, total: 1.11 ms\n", "Wall time: 784 µs\n" ] } ], "source": [ "%%time\n", "\n", "pairs = lines.map(makePair)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "בְּ\n", "(1,)\n", "CPU times: user 5.41 ms, sys: 2.03 ms, total: 7.43 ms\n", "Wall time: 51.6 ms\n" ] } ], "source": [ "%%time\n", "\n", "firstPair = pairs.first()\n", "print(firstPair[0])\n", "print(firstPair[1])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('בְּ', (1,)),\n", " ('רֵאשִׁ֖ית', (2,)),\n", " ('בָּרָ֣א', (3,)),\n", " ('אֱלֹהִ֑ים', (4,)),\n", " ('אֵ֥ת', (5,)),\n", " ('הַ', (6,)),\n", " ('שָּׁמַ֖יִם', (7,)),\n", " ('וְ', (8,)),\n", " ('אֵ֥ת', (9,)),\n", " ('הָ', (10,))]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs.take(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we get the occurrences of each distinct word form:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def add(occs, occ):\n", " return occs + occ" ] }, { "cell_type": "code", 
"execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.24 ms, sys: 2.01 ms, total: 9.24 ms\n", "Wall time: 25.2 ms\n" ] } ], "source": [ "%%time\n", "\n", "occs = pairs.reduceByKey(add)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`occs` should be the nodes that all have `בְּ` as their `g_word_utf8` value." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.2 ms, sys: 871 µs, total: 6.07 ms\n", "Wall time: 7.23 s\n" ] }, { "data": { "text/plain": [ "'בְּ'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "bOccs = occs.first()\n", "nodes = bOccs[1]\n", "word = bOccs[0]\n", "word" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6423" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(nodes)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 84, 500, 540, 542, 735, 737, 804, 820, 852)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes[0:10]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(426282,\n", " 426354,\n", " 426370,\n", " 426385,\n", " 426403,\n", " 426419,\n", " 426495,\n", " 426525,\n", " 426538,\n", " 426543)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the very, very beginning, but here you can see how you can get at the feature data in a completely\n", "different way.\n", "\n", "As a check, we do it in Text-Fabric:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { 
"data": { "text/html": [ "TF-app: ~/text-fabric-data/annotation/app-bhsa/code" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/bhsa/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/phono/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/parallels/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.12 s, sys: 652 ms, total: 5.77 s\n", "Wall time: 6.43 s\n" ] } ], "source": [ "%%time\n", "\n", "from tf.app import use\n", "\n", "A = use('bhsa', hoist=globals(), silent='deep')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'בְּ'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word = F.g_word_utf8.v(1)\n", "word" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 140 ms, sys: 1.75 ms, total: 142 ms\n", "Wall time: 141 ms\n" ] } ], "source": [ "%%time\n", "\n", "nodes = [n for n in F.otype.s('word') if F.g_word_utf8.v(n) == word]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6423" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(nodes)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 84, 500, 540, 542, 735, 737, 804, 820, 852]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"nodes[0:10]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[426282,\n", " 426354,\n", " 426370,\n", " 426385,\n", " 426403,\n", " 426419,\n", " 426495,\n", " 426525,\n", " 426538,\n", " 426543]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Performance\n", "\n", "In Spark, loading the system takes more than 3 seconds,\n", "although the processor is not very busy during that time.\n", "\n", "Later, the cell that does `occs.first()` takes 7 seconds. \n", "But after that, the `occs` are cached.\n", "\n", "In Text-Fabric, loading the features takes slightly less than 7 seconds,\n", "although we load many more features than just `g_word_utf8`!\n", "After that, all features are cached.\n", "\n", "But, although in this case Text-Fabric wins, it might very well be that if you really start\n", "crunching numbers, Spark outperforms Text-Fabric in a devastating way.\n", "\n", "We'll see." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }