{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome\n", "-------\n", "\n", "\n", "Welcome to the Apache Spark tutorial notebooks.\n", "\n", "This very simple notebook is designed to test that your environment is set up correctly.\n", "\n", "Please `Run All` cells. \n", "\n", "The notebook should run without errors and you should see a histogram plot at the end.\n", "\n", "(You can also check the expected output [here](https://piotrszul.github.io/spark-tutorial/notebooks/0.1_Welcome.html))\n", "\n", "\n", "#### Let's go\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that some input data are available:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Project Gutenberg EBook of The Prince, by Nicolo Machiavelli\r\n", "\r\n", "This eBook is for the use of anyone anywhere at no cost and with\r\n", "almost no restrictions whatsoever. You may copy it, give it away or\r\n", "re-use it under the terms of the Project Gutenberg License included\r\n", "with this eBook or online at www.gutenberg.org\r\n", "\r\n", "\r\n", "Title: The Prince\r\n", "\r\n" ] } ], "source": [ "%%sh\n", "\n", "# All the test data sets are located in the `data` directory.\n", "# You can preview them using Unix commands such as `cat`, `head`, `tail`, `ls`, etc. 
\n", "# in `shell` cells marked with the `%%sh` magic, e.g.: \n", "\n", "head -n 10 data/prince_by_machiavelli.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check whether Spark is available and which version we are using (should be 2.1+):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "u'2.1.0'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `spark` is the main entry point for all Spark-related operations.\n", "# It is an instance of SparkSession, and pyspark automatically creates one for you.\n", "# Another one is `sc`, an instance of SparkContext, which is used for the low-level RDD API.\n", "\n", "spark.version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to run a simple `Spark` program to compute the number of occurrences of words in Machiavelli's \"Prince\", and display the ten most frequent ones:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(3108, u'the'), (2107, u'to'), (1935, u'and'), (1802, u'of'), (993, u'in'), (920, u'he'), (779, u'a'), (745, u'that'), (640, u'his'), (585, u'it')]\n" ] } ], "source": [ "import operator\n", "import re\n", "\n", "\n", "# Here we use the Spark RDD API to split a text file into individual words, \n", "# count the number of occurrences of each word and take the 10 most frequent words.\n", "\n", "wordCountRDD = sc.textFile('data/prince_by_machiavelli.txt') \\\n", " .flatMap(lambda line: re.split(r'[^a-z\\-\\']+', line.lower())) \\\n", " .filter(lambda word: len(word) > 0 ) \\\n", " .map(lambda word: (word, 1)) \\\n", " .reduceByKey(operator.add)\n", "\n", "# The `take()` function takes the first n elements of an RDD \n", "# and returns them in a Python `list` object, \n", " \n", "top10Words = wordCountRDD \\\n", " .map(lambda kv: (kv[1], kv[0])) \\\n", " .sortByKey(False) \\\n", " .take(10)\n", " \n", " 
\n", "# which can then be printed out\n", "print(top10Words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spark SQL is a higher-level API for structured data. The data are represented in `data frames`: table-like objects with columns and rows, conceptually similar to `pandas` or `R` data frames.\n", "\n", "Let's use Spark SQL to display a table with the 10 least frequent words:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
|   | word | count |
|---|---|---|
| 0 | secondly | 1 |
| 1 | surrounding | 1 |
| 2 | consolidated | 1 |
| 3 | comparatively | 1 |
| 4 | chill | 1 |
| 5 | prospering | 1 |
| 6 | calculate | 1 |
| 7 | attracted | 1 |
| 8 | similarity | 1 |
| 9 | popoli | 1 |