{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome\n", "-------\n", "\n", "\n", "Welcome to the Apache Spark tutorial notebooks.\n", "\n", "This very simple notebook is designed to test that your environment is set up correctly.\n", "\n", "Please `Run All` cells. \n", "\n", "The notebook should run without errors and you should see a histogram plot at the end.\n", "\n", "(You can also check the expected output [here](https://piotrszul.github.io/spark-tutorial/notebooks/0.1_Welcome.html))\n", "\n", "\n", "#### Let's go\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that there are some input data available:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Project Gutenberg EBook of The Prince, by Nicolo Machiavelli\r\n", "\r\n", "This eBook is for the use of anyone anywhere at no cost and with\r\n", "almost no restrictions whatsoever. You may copy it, give it away or\r\n", "re-use it under the terms of the Project Gutenberg License included\r\n", "with this eBook or online at www.gutenberg.org\r\n", "\r\n", "\r\n", "Title: The Prince\r\n", "\r\n" ] } ], "source": [ "%%sh\n", "\n", "# All the test data sets are located in the `data` directory.\n", "# You can preview them using Unix commands such as `cat`, `head`, `tail`, `ls`, etc. 
\n", "# in `shell` cells marked with the `%%sh` magic, e.g.: \n", "\n", "head -n 10 data/prince_by_machiavelli.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that Spark is available and which version we are using (it should be 2.1+):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "u'2.1.0'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `spark` is the main entry point for all Spark-related operations.\n", "# It is an instance of SparkSession and pyspark automatically creates one for you.\n", "# Another one is `sc`, an instance of SparkContext, which is used for the low-level RDD API.\n", "\n", "spark.version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to run a simple `Spark` program to compute the number of occurrences of words in Machiavelli's \"Prince\", and display the ten most frequent ones:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(3108, u'the'), (2107, u'to'), (1935, u'and'), (1802, u'of'), (993, u'in'), (920, u'he'), (779, u'a'), (745, u'that'), (640, u'his'), (585, u'it')]\n" ] } ], "source": [ "import operator\n", "import re\n", "\n", "\n", "# Here we use the Spark RDD API to split a text file into individual words, \n", "# to count the number of occurrences of each word and to take the 10 most frequent words.\n", "\n", "wordCountRDD = sc.textFile('data/prince_by_machiavelli.txt') \\\n", " .flatMap(lambda line: re.split(r'[^a-z\\-\\']+', line.lower())) \\\n", " .filter(lambda word: len(word) > 0 ) \\\n", " .map(lambda word: (word, 1)) \\\n", " .reduceByKey(operator.add)\n", "\n", "# `take()` function takes the first n elements of an RDD \n", "# and returns them in a python `list` object, \n", " \n", "top10Words = wordCountRDD \\\n", " .map(lambda kv: (kv[1], kv[0])) \\\n", " .sortByKey(False) \\\n", " .take(10)\n", " \n", " 
\n", "# which can then be printed out\n", "print(top10Words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spark SQL is a higher-level API for structured data. The data are represented in `data frames` - table-like objects with columns and rows, conceptually similar to `pandas` or `R` data frames.\n", "\n", "Let's use Spark SQL to display a table with the 10 least frequent words:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>word</th>\n", " <th>count</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>secondly</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>surrounding</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>consolidated</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>comparatively</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>chill</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>prospering</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>calculate</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>attracted</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>similarity</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>popoli</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "DataFrame[word: string, count: bigint]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# A data frame can be created from an RDD;\n", "# schema defines the names (and types) of columns.\n", "\n", "wordCountDF = spark.createDataFrame(wordCountRDD, schema = ['word', 'count'])\n", "\n", "# the expression below just means: sort by column `count` and take the first ten elements\n", 
"\n", "bottom10Words = wordCountDF.sort('count').limit(10)\n", "\n", "# `display` function can be used to display data frames (and also all other sorts of objects)\n", "display(bottom10Words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the results to a csv file.\n", "\n", "For the tutorial all the output files are saved in the `output` directory:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# data frames can be saved in many common 'table' formats, for example `csv`.\n", "# the `mode='overwrite'` tells Spark to overwite the output file is it exists\n", "\n", "wordCountDF.write.csv('output/prince-word-count.csv', mode='overwrite', header=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's preview the output:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 120\n", "-rw-r--r-- 1 szu004 staff 0 10 Jul 09:01 _SUCCESS\n", "-rw-r--r-- 1 szu004 staff 28071 10 Jul 09:01 part-00000-37d5fa68-916f-4243-99f5-4615ed30faf0.csv\n", "-rw-r--r-- 1 szu004 staff 28953 10 Jul 09:01 part-00001-37d5fa68-916f-4243-99f5-4615ed30faf0.csv\n", "\n", "Content:\n", "word,count\n", "pardon,2\n", "dissolution,2\n", "desirable,2\n", "papacy,1\n", "four,14\n", "demanded,1\n", "protest,1\n", "thirst,1\n", "consists,2\n" ] } ], "source": [ "%%sh \n", "\n", "# Same as with the input data sets, we can use the `%%sh` cells to preview the \n", "# files produced to the `output` directory.\n", "\n", "# Please note that output we have produced above is actually a directory:\n", "\n", "ls -l output/prince-word-count.csv\n", "\n", "# The `part-*` files inside contain the actual data.\n", "\n", "echo\n", "echo \"Content:\"\n", "\n", "head -n 10 output/prince-word-count.csv/part-00000-*.csv " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we can use python `matplotlib` to 
visualise the result.\n", "\n", "Let's plot the histogram of the distribution of word counts:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAECCAYAAAAYfWtSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAADy5JREFUeJzt3W+IHdd5x/Hvs3UsKsfNHxfJVFvLLiKoUWvclO6LJqUL\naSM5oVFrByoVk9YQXEJrB/qiSmhgZegLF1NwiJMWGkXEBaH+gdZW61C7hDWYkEQ4UZUmVqXSWpYU\na+tS0zoYG8d++mJmr682Wnvu7szOHZ/vBxbde3bvnefO3b0/zTln5kRmIkkq00zfBUiS+mMISFLB\nDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUsM5CICI2R8TxiPhgV9uQJK1Pl0cCB4C/6vD5JUnr\n1CgEIuJQRCxFxMkV7Xsi4lREnI6IA2PtvwJ8F3gWiFYrliS1JppcOygi3gd8H3ggM2+s22aA08D7\nge8Bx4F9mXkqIv4Y2AzsAl7IzN/oqH5J0jpc0eSHMvPxiNi+onkOOJOZZwEi4iiwFziVmZ+u2z4K\n/HeL9UqSWtQoBFaxDTg3dv88VTCMZOYDqz04Irx8qSStQWa21s3e6xTRzBzs18LCQu81WH//dVj/\n8L6GXHtm+/93Xk8IXACuG7s/W7c1dvDgQRYXF9dRgiSVYXFxkYMHD7b+vJOEQHDpTJ/jwI6I2B4R\nVwL7gIcm2fjBgweZn5+f5CGSVKT5+fn+QiAijgBfBd4VEU9HxO2Z+QpwJ/AI8B3gaGY+2XqFU2ro\n4WX9/bL+/gy59i40miLayYYjcmFhgfn5ed8USXoDi4uLLC4ucvfdd5MtDgz3GgJ9bVuShioiWg0B\nLyAnSQXrNQScHSRJzXQ1O8juIEkaELuDJEmtMQQkqWCOCUjSALwpxwSeeOKJNT12y5YtzM7OtlyR\nJE2/tscEeg2Bq6/+GSLeMuEjX2FmZonnnrvYSV2SNM3eVCEAS8CWCR/5PJs2/QQvvvh8F2VJ0lRr\nOwTWs55AC+4FPgTM91uGJE255ctGtM0jAUkaEM8TkCS1xhCQpIIZApJUMENAkgrm7CBJGgBnB404\nO0hSuZwdJElqjSEgSQUzBCSpYIaAJBXMEJCkgjlFVJIGwCmiI04RlVQup4hKklpjCEhSwQwBSSqY\nISBJBTMEJKlghoAkFcwQkKSCGQKSVDDPGJakAfCM4RHPGJZULs8YliS1xhCQpIIZApJUMENAkgpm\nCEhSwQwBSSqYISBJBTMEJKlghoAkFayTy0ZExE7gE8A1wFcy88+72I4kaX06ORLIzFOZ+XHgN4Ff\n7GIbkqT1axQCEXEoIpYi4uSK9j0RcSoiTkfEgRXf+zXgH4CH2ytXktSmpkcCh4Hd4w0RMQPcX7fv\nAvbX3UAAZOaxzPwQcFtLtUqSWtZoTCAzH4+I7Sua54AzmXkWICKOAnuBUxHxy8AtwCbgH1usV5LU\novUMDG8Dzo3dP08VDGTmY8Bjb/wU9wJX1bfncV0BSbpUV+sILGu8nkB9JHAsM2+s798K7M7MO+r7\ntwFzmXlXw+dzPQFJmtA0rSdwAbhu7P5s3SZJGohJuoOi/lp2HNhRHyE8A+wD9k+2eZeXlKQmel1e\nMiKOUH1SX0PVh7OQmYcj4mbgPqojikOZeU/jDdsdJEkTa7s7yDWGJWlA2g6BTi4b0ZzdQZLURK/d\nQV3wSECSJjdNs4MkSQNnd5AkDYDdQSN2B0kql91BkqTWGAKSVDDHBCRpABwTGHFMQFK5HBOQJLXG\nEJCkgj
kmIEkD4JjAiGMCksrlmIAkqTWGgCQVzBCQpIIZApJUMGcHSdIAODtoxNlBksrl7CBJUmsM\nAUkqmCEgSQUzBCSpYIaAJBXMKaKSNABOER1xiqikcjlFVJLUGkNAkgpmCEhSwQwBSSqYISBJBTME\nJKlghoAkFcyTxSRpADxZbMSTxSSVy5PFJEmtMQQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwB\nSSqYISBJBTMEJKlgnV07KCL2Ul0Y6Grgi5n5aFfbkiStTWchkJkPAg9GxNuprhRnCEjSlGncHRQR\nhyJiKSJOrmjfExGnIuJ0RBy4zEM/DXxuvYVKkto3yZjAYWD3eENEzAD31+27gP0RsXPs+/cAD2fm\niRZqlSS1rHEIZObjwHMrmueAM5l5NjNfBo4CewEi4k7g/cBHIuKOluqVJLVovWMC24BzY/fPUwUD\nmflZ4LOv//B7gavq2/O4uIwkXaqrxWSWTbSoTERsB45l5o31/VuB3Zl5R33/NmAuM+9q8FwuKiNJ\nE5q2RWUuANeN3Z+t2yRJAzBpd1DUX8uOAzvqI4RngH3A/uZP5xrDktRE72sMR8QRqk/ra6j6cRYy\n83BE3AzcR3VUcSgz72n4fHYHSdKE2u4OcqF5SRqQtkOgszOGm7E7SJKa6L07qPUNeyQgSRObttlB\nkqQBsztIkgbA7qARu4MklcvuIElSawwBSSqYYwKSNACOCYw4JiCpXI4JSJJaYwhIUsEcE5CkAXBM\nYMQxAUnlckxAktQaQ0CSCmYISFLBHBiWpAFwYHjEgWFJ5XJgWJLUGkNAkgpmCEhSwQwBSSqYISBJ\nBXOKqCQNgFNER5wiKqlcThGVJLXGEJCkghkCklQwQ0CSCjbIEHjppR8QEWv6uvba6/suX5KmRs9T\nRNfqRWBts5qWllobVJekwRvkkYAkqR2GgCQVzDOGJWkAPGN45Hngx1jrmAAEfb1mSVovzxiWJLXG\nEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIJ1EgIRcUNEfCEi/rqL55ck\ntaOTEMjM/8zMj3Xx3JKk9jQKgYg4FBFLEXFyRfueiDgVEacj4kA3JUqSutL0SOAwsHu8ISJmgPvr\n9l3A/ojYueJxU7iM1yaXppSkWqMQyMzHgedWNM8BZzLzbGa+DBwF9gJExDsj4s+Am6bvCOElqstQ\nT/61tHS2j4IlqTPrWVRmG3Bu7P55qmAgM/8H+PgbP8W9wFX17XlcXEaSLtXVYjLLGi8qExHbgWOZ\neWN9/1Zgd2beUd+/DZjLzLsaPl9vi8q4II2koZqmRWUuANeN3Z+t2yRJAzFJd1Bw6UDvcWBHfYTw\nDLAP2D/Z5l1jWJKa6HWN4Yg4QvVJfQ1VH85CZh6OiJuB+6iOKA5l5j2NN2x3kCRNrO3uoEZHApn5\nW6u0fxn48to375GAJDXR65FAFzwSkKTJTdPAsCRp4AwBSSrYek4Wa4FjApLUhGMCI44JSCqXYwKS\npNbYHSRJA2B30IjdQZLKZXeQJKk1hoAkFcwQkKSCOTA8kWppyrXYunU7Fy8+1W45korhwPBIvwPD\nDipL6pMDw5Kk1hgCklQwQ0CSCubAsCQNgAPDIw4MSyqXA8OSpNYYApJUMENAkgpmCEhSwQwBSSqY\nU0QlaQCcIjriFFFJ5XKKqCSpNYaAJBXMEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkF84zhDbOJ\niLWd3zEzs5lXX31hTY/dunU7Fy8+tabHSpoenjE8Mtwzhj1TWdJ6ecawJKk1hoAkFcwQkKSCGQKS\nVDBDQJIKZghIUsEMAUkqmCEgSQUzBCSpYIaAJBWsk2sHRcRm4PPAS8BjmXmki+1IktanqyOBW4C/\nyczfBT7c0TZ6tth3AevSxYWoNpL192vI9Q+59i40CoGIOBQRSxFxckX7
nog4FRGnI+LA2LdmgXP1\n7VdaqnXKLPZdwLoM/Q/B+vs15PqHXHsXmh4JHAZ2jzdExAxwf92+C9gfETvrb5+jCgKoLoEpSZpC\njUIgMx8HnlvRPAecycyzmfkycBTYW3/v74CPRMTngGNtFStJalfj9QQiYjtwLDNvrO/fCuzOzDvq\n+7cBc5l5V8Pn8yL3krQGba4n0NvKYm2+CEnS2qxndtAF4Lqx+7N1myRpICYJgeDSQd7jwI6I2B4R\nVwL7gIfaLE6S1K2mU0SPAF8F3hURT0fE7Zn5CnAn8AjwHeBoZj7ZXamSpNZl5oZ/AXuAU8Bp4EAf\nNTSs8yngX4BvAd+o295BFXz/BvwT8Laxn/8UcAZ4EvhAD/UeApaAk2NtE9cLvAc4Wb8/9/VY+wJw\nHvhm/bVnGmuvtzsLfIXqP0TfBu4a2P5fWf+dQ3kPgE3A1+u/028DCwPb96vVvyH7fkP+QFa84Bng\n34HtwFuAE8DOja6jYa3/AbxjRdufAH9Y3z4A3FPffnf9Jl4BXF+/xtjget8H3MSlH6QT11v/Qv5C\nffthqllgfdS+APzBZX72p6ep9npb1wI31bffWn/w7BzQ/l+t/kG8B8Dm+t8fAb5GNYV9EPv+derf\nkH3fxwXkXu/8gmkT/HCX2V7gS/XtLwG/Xt/+MFWX2A8y8ymqlJ7biCKX5eXP55io3oi4Frg6M4/X\nP/fA2GM6s0rtcPmTDfcyRbUDZObFzDxR3/4+1f/QZhnO/r9c/dvqb0/9e5CZL9Q3N1F9OCYD2few\nav2wAfu+jxDYxmuXlIDqcGfbKj/btwQejYjjEfGxum1rZi5B9YcDbKnbV76uC0zH69oyYb3bqN6T\nZX2/P78fESci4gsR8ba6baprj4jrqY5qvsbkvy+9v4ax+r9eN039exARMxHxLeAi8Gj9QTiYfb9K\n/bAB+95LSb++92bme4APAr8XEb/Eawm9bGgnvQ2p3s8DP5WZN1H9cfxpz/W8oYh4K/C3wCfq/1EP\n6vflMvUP4j3IzFcz8+eojr7mImIXA9r3l6n/3WzQvu8jBAZzfkFmPlP/+yzw91TdO0sRsRWgPvz6\nr/rHLwA/OfbwaXldk9Y7Na8jM5/NunMT+Ate616bytoj4gqqD9C/zMwH6+bB7P/L1T+09yAz/4/q\n6o57GNC+XzZe/0bt+z5CYBDnF0TE5vp/RUTEVcAHqEbuHwJ+p/6x3waW/9gfAvZFxJURcQOwA/jG\nhhZdWXk+x0T11ofN/xsRcxERwEfHHtO1S2qv/3CX3QL8a317GmsH+CLw3cz8zFjbkPb/D9U/hPcg\nIn58uaskIn4U+FWqMY1B7PtV6j+1Yft+I0a+LzO6vYdq9sEZ4JN91NCgxhuoZi4tT9v6ZN3+TuCf\n6/ofAd4+9phPUY3U9zVF9AjwParFfJ4GbqeaJjdRvcDP16/5DPCZHmt/gGq62wmqI7Gt01h7vd33\nUl02ffl35pv17/nEvy897f/V6p/69wD42breE3Wtf1S3D2Xfr1b/huz7xheQkyS9+TgwLEkFMwQk\nqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwf4fE5ESiGBr+scAAAAASUVORK5CYII=\n", "text/plain": [ "<matplotlib.figure.Figure at 0x105e30c10>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# we can convert (small) Spark data frames to `pandas`\n", "wordCountPDF = wordCountDF.toPandas()\n", "\n", "# and then use pyplot (plt) to display the results\n", "# Please 
note that we call `plt.close()` first - this is needed for Databricks \n", "# to start a new plot.\n", "\n", "plt.close()\n", "plt.hist(wordCountPDF['count'], bins = 20, log = True)\n", "plt.show()\n", "display()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now play around by modifying some pieces of the code.\n", "\n", "When you are done and you are running off the local machine, remember **to close the notebook with `File/Close and Halt`**" ] } ], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pyspark" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }