{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Apache Spark\n",
"\n",
"## Install\n",
"\n",
"1. Install the Java JDK from\n",
" [Oracle](https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html).\n",
"1. `pip3 install pyspark` to get the Spark and the Python bindings\n",
"1. Set an environment variable. E.g., in your `~/.zshrc` file:\n",
"\n",
"```\n",
"PYSPARK_PYTHON=\"python3\"\n",
"export PYSPARK_PYTHON\n",
"```\n",
"\n",
"Start a new terminal and run `pyspark`.\n",
"\n",
"Now it is time to start reading in the\n",
"[quick-start-guide](https://spark.apache.org/docs/latest/quick-start.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use here in Jupyter\n",
"\n",
"We try to get a meaningful thing done with the words of the BHSA, here, in this notebook.\n",
"\n",
"We explode the `g_word_utf8` feature:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.27 s, sys: 42.6 ms, total: 1.32 s\n",
"Wall time: 1.32 s\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"from tf.convert.tf import explode\n",
"\n",
"explode('~/github/etcbc/bhsa/tf/c/g_word_utf8.tf', 'explode/out')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\tבְּ\n",
"2\tרֵאשִׁ֖ית\n",
"3\tבָּרָ֣א\n",
"4\tאֱלֹהִ֑ים\n",
"5\tאֵ֥ת\n",
"6\tהַ\n",
"7\tשָּׁמַ֖יִם\n",
"8\tוְ\n",
"9\tאֵ֥ת\n",
"10\tהָ\n",
"11\tאָֽרֶץ\n",
"12\tוְ\n",
"13\tהָ\n",
"14\tאָ֗רֶץ\n",
"15\tהָיְתָ֥ה\n"
]
}
],
"source": [
"!head -n 15 explode/out/g_word_utf8.tf "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"426570\tבִּ\n",
"426571\tירוּשָׁלִַ֖ם\n",
"426572\tאֲשֶׁ֣ר\n",
"426573\tבִּֽ\n",
"426574\tיהוּדָ֑ה\n",
"426575\tמִֽי\n",
"426576\tבָכֶ֣ם\n",
"426577\tמִ\n",
"426578\tכָּל\n",
"426579\tעַמֹּ֗ו\n",
"426580\tיְהוָ֧ה\n",
"426581\tאֱלֹהָ֛יו\n",
"426582\tעִמֹּ֖ו\n",
"426583\tוְ\n",
"426584\tיָֽעַל\n"
]
}
],
"source": [
"!tail -n 15 explode/out/g_word_utf8.tf "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Brilliant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spark\n",
"\n",
"Spark *just works* in the notebook!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from pyspark import SparkConf, SparkContext"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.3 ms, sys: 12 ms, total: 27.3 ms\n",
"Wall time: 3.18 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"conf = SparkConf().setAppName(\"bhsa\").setMaster(\"local\")\n",
"sc = SparkContext(conf=conf)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"lines = sc.textFile(\"explode/out/g_word_utf8.tf\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"426584"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lines.count()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1\\tבְּ'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lines.first()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"pairs = lines.map(lambda s: tuple(reversed(s.split(\"\\t\"))))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('בְּ', '1')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pairs.first()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Due to the hebrew you do not see that '1' is the second element:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'בְּ'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pairs.first()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want the `1` as integer, or rather, as a tuple of one integer (becomes clear later)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def makePair(s):\n",
" (node, value) = s.split(\"\\t\")\n",
" return (value, (int(node),))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 604 µs, sys: 506 µs, total: 1.11 ms\n",
"Wall time: 784 µs\n"
]
}
],
"source": [
"%%time\n",
"\n",
"pairs = lines.map(makePair)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"בְּ\n",
"(1,)\n",
"CPU times: user 5.41 ms, sys: 2.03 ms, total: 7.43 ms\n",
"Wall time: 51.6 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"firstPair = pairs.first()\n",
"print(firstPair[0])\n",
"print(firstPair[1])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('בְּ', (1,)),\n",
" ('רֵאשִׁ֖ית', (2,)),\n",
" ('בָּרָ֣א', (3,)),\n",
" ('אֱלֹהִ֑ים', (4,)),\n",
" ('אֵ֥ת', (5,)),\n",
" ('הַ', (6,)),\n",
" ('שָּׁמַ֖יִם', (7,)),\n",
" ('וְ', (8,)),\n",
" ('אֵ֥ת', (9,)),\n",
" ('הָ', (10,))]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pairs.take(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we get the occurrences of each distinct word form:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def add(occs, occ):\n",
" return occs + occ"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.24 ms, sys: 2.01 ms, total: 9.24 ms\n",
"Wall time: 25.2 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"occs = pairs.reduceByKey(add)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`occs` should be the nodes that all have `בְּ` as their `g_word_utf8` value."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.2 ms, sys: 871 µs, total: 6.07 ms\n",
"Wall time: 7.23 s\n"
]
},
{
"data": {
"text/plain": [
"'בְּ'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"bOccs = occs.first()\n",
"nodes = bOccs[1]\n",
"word = bOccs[0]\n",
"word"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6423"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(nodes)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 84, 500, 540, 542, 735, 737, 804, 820, 852)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nodes[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(426282,\n",
" 426354,\n",
" 426370,\n",
" 426385,\n",
" 426403,\n",
" 426419,\n",
" 426495,\n",
" 426525,\n",
" 426538,\n",
" 426543)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nodes[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the very, very beginning, but here you can see how you can get at the feature data in a completely\n",
"different way.\n",
"\n",
"As a check, we do it in Text-Fabric:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"TF-app: ~/text-fabric-data/annotation/app-bhsa/code"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/bhsa/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/phono/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/parallels/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.12 s, sys: 652 ms, total: 5.77 s\n",
"Wall time: 6.43 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"from tf.app import use\n",
"\n",
"A = use('bhsa', hoist=globals(), silent='deep')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'בְּ'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word = F.g_word_utf8.v(1)\n",
"word"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 140 ms, sys: 1.75 ms, total: 142 ms\n",
"Wall time: 141 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"nodes = [n for n in F.otype.s('word') if F.g_word_utf8.v(n) == word]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6423"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(nodes)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1, 84, 500, 540, 542, 735, 737, 804, 820, 852]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nodes[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[426282,\n",
" 426354,\n",
" 426370,\n",
" 426385,\n",
" 426403,\n",
" 426419,\n",
" 426495,\n",
" 426525,\n",
" 426538,\n",
" 426543]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nodes[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance\n",
"\n",
"In Spark, loading the system takes more than 3 seconds,\n",
"although the processor is not very busy during that time.\n",
"\n",
"Later, the cell that does `occs.first()` takes 7 seconds. \n",
"But after that, the `occs` are cached.\n",
"\n",
"In Text-Fabric, loading the features takes slightly less than 7 seconds,\n",
"although we load many more features than just `g_word_utf8`!\n",
"After that, all features are cached.\n",
"\n",
"But, although in this case Text-Fabric wins, it might very well be that if you really start\n",
"crunching numbers, Spark outperforms Text-Fabric in a devastating way.\n",
"\n",
"We'll see."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}