{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 1 - Hello Spark\n", "This lab will introduce you to Apache Spark. It will be written in Python and run in IBM's Data Science Experience environment through a Jupyter notebook. While you work, it will be valuable to reference the [Apache Spark Documentation](http://spark.apache.org/docs/latest/programming-guide.html). Since it is Python, be careful of whitespace!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Working with Spark Context\n", "### Step 1.1 - Invoke the spark context: version will return the working version of Apache Spark

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
The Spark context is created automatically in this Jupyter notebook environment. It is available as the variable: sc
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
    sc.version
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional\n", "

\n", "
\n", "
\n", "
Jupyter notebooks have command completion, which can be invoked via the TAB key.
Type:
    sc.<TAB>
to see all the possible options within the Spark context
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'1.6.0'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Step 1 - Check spark version\n", "sc.version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Working with Resilient Distributed Datasets (RDD)\n", "\n", "### Step 2.1 - Create an RDD with numbers 1 to 10\n", "\n", "RDDs are the basic abstraction unit in Spark. An RDD represents an immutable, partitioned, fault-tolerant collection of elements that can be operated on in parallel.
\n", "There are three ways to create an RDD: parallelizing an existing collection, referencing a dataset in an external storage system that offers a Hadoop InputFormat, or transforming an existing RDD.
\n", "
\n", "Create an iterable or collection in your program with numbers 1 to 10 and then invoke the Spark Context's (sc) parallelize() method on it.
\n", "\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Type:
\n", "    \n", "x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

\n", "Or we can try to be a little clever by typing:
\n", "    \n", "x = range(1, 11)\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    \n", "x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
\n", "     x_nbr_rdd = sc.parallelize(x)\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
An optional parameter to parallelize is the number of partitions to cut the dataset into. Spark will run one task for each partition. Typically you want 2-4 partitions for each CPU. Normally, Spark will set it automatically, but you can control this by specifying it manually as a second parameter to the parallelize method.

\n", "You can obtain the number of partitions by calling <RDD>.getNumPartitions()
\n", "Try experimenting with different partition counts -- including ones higher than the number of values. To see how the values are distributed, use:

\n", "\n", "    def f(iterator):
\n", "        count = 0
\n", "        for value in iterator:
\n", "            count = count + 1
\n", "        yield count
\n", "    x_nbr_rdd.mapPartitions(f).collect()
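As a rough plain-Python model of how parallelize() slices a collection into partitions (a sketch only -- Spark's actual slicing logic differs in its details), picture each partition receiving a near-equal contiguous share:

```python
def model_partitions(values, num_partitions):
    # Simplified model: slice the collection into num_partitions
    # near-equal chunks, roughly as sc.parallelize(values, num_partitions)
    # would distribute them.
    n = len(values)
    chunks = []
    for i in range(num_partitions):
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        chunks.append(values[start:end])
    return chunks

# With more partitions than values, some partitions are simply empty.
sizes = [len(c) for c in model_partitions(list(range(1, 11)), 4)]
```

Note how every element lands in exactly one chunk, which is why Spark can run one independent task per partition.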

\n", "
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.1 - Create RDD of numbers 1-10\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.2 - Return the first element

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the first() method on the RDD to return the first element in an RDD. You could also use the take() method with a parameter of 1. first() and take(1) are equivalent. Both will take the first element in the RDD's 0th partition.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    x_nbr_rdd.first()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.2 - Return first element\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.3 - Return an array of the first five elements

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the take() method
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    x_nbr_rdd.take(5)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
How would you get the 5th-7th elements? take() only accepts one parameter so take(5,7) will not work.
\n", "
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.3 - Return an array of the first five elements\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.4 - Perform a map transformation to increment each element of the array by 1. The map function creates a new RDD by applying the function provided in the argument to each element. For more information go to [Transformations](http://spark.apache.org/docs/latest/programming-guide.html#transformations)

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the map(func) function on the RDD. Map invokes function func on each element of the RDD. You can also use a inline (or lambda) function. The syntax for a lambda function is:
\n", "    \n", "lambda <var>: <myCode>\n", "
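Since the lambda is just an ordinary Python function, you can check its behavior with Python's built-in map() before handing it to an RDD (plain Python here, no Spark needed):

```python
increment = lambda x: x + 1

# Python's built-in map applies the function to each element,
# much as RDD.map does across the elements of each partition.
result = list(map(increment, [1, 2, 3, 4, 5]))
# result is [2, 3, 4, 5, 6]
```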
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    x_nbr_rdd_2 = x_nbr_rdd.map(lambda x: x+1)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
Write a function which increments the value by 1 and pass that function to map()
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.4 - Write your map function\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.5 - Note that there was no result for step 2.4. Why was this? Take a look at all the elements of the new RDD.
\n", "Type:
\n", "     x_nbr_rdd_2.collect() " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.5 - Check out the elements of the new RDD. Warning: Be careful with this in real life! Collect returns everything!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.6 - Create a new RDD with one string \"Hello Spark\" and print it by getting the first element.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Create a variable with the String \"Hello Spark\" and turn it into an RDD with the parallelize() function. Remember that parallelize() is invoked from the Spark context!
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "     y = \"Hello Spark\"
\n", "     y_str_rdd = sc.parallelize(y)
\n", "     y_str_rdd.first()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
Why did getting the first element only print 'H' instead of \"Hello Spark\"? What does collect() do? Is there a way to have the first element be the full string instead of an individual character?
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.6 - Create a string y, then turn it into an RDD\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.7 - Create a third RDD with the following strings and extract the first line.\n", "    IBM Data Science Experience is built for enterprise-scale deployment.
\n", "    Manage your data, your analytical assets, and your projects in a secured cloud environment.
\n", "    When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use an array -- [] -- to contain all three strings. Don't forget to enclose them in quotes!
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
     z = [ \"IBM Data Science Experience is built for enterprise-scale deployment.\", \"Manage your data, your analytical assets, and your projects in a secured cloud environment.\", \"When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data.\" ]
\n", "     z_str_rdd = sc.parallelize(z)
\n", "     z_str_rdd.first() \n", "
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.7 - Create String RDD with many lines / entries, Extract first line\n", "z = [ \"IBM Data Science Experience is built for enterprise-scale deployment.\", \"Manage your data, your analytical assets, and your projects in a secured cloud environment.\", \"When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data.\" ]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.8 - Count the number of entries in this RDD\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint\n", "

\n", "
\n", "
\n", "
Type:
\n", "     z_str_rdd.count()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.8 - Count the number of entries in the RDD\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.9 - Inspect the elements of this RDD by collecting all the values\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint\n", "

\n", "
\n", "
\n", "
Type:
\n", "    z_str_rdd.collect()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.9 - Show all the entries in the RDD\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.10 - Split all the entries in the RDD on the spaces. Then print it out. Pay careful attention to the new format.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
To split on spaces, use the split() function.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Since you want to run on every line, use map() on the RDD and write a lambda function to call split()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", "    z_str_rdd_split = z_str_rdd.map(lambda line: line.split(\" \"))
\n", "    z_str_rdd_split.collect()

\n", "Question: Is there any difference between split(\" \") and split()?
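A quick plain-Python check of the question above: split(" ") splits on every single space (producing empty strings for runs of spaces), while split() with no argument splits on any run of whitespace and drops the empties:

```python
s = "Hello  Spark"        # note the double space

with_arg = s.split(" ")   # keeps the empty string between the spaces
no_arg = s.split()        # collapses runs of whitespace
```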
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "#Step 2.10 - Perform a map transformation to split all entries in the RDD\n", "#Check out the entries in the new RDD\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.11 - Explore a new transformation: flatMap\n", "
\n", "We want to count the words in all the lines, but currently they are split by line. We need to 'flatten' the line return values into one object.
\n", "flatMap will \"flatten\" all the elements of an RDD element into 0 or more output terms.
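The difference is easy to see in plain Python: map-style splitting yields one list per line, while flatMap-style splitting yields one flat list of words (a sketch of the semantics only, not Spark itself):

```python
lines = ["IBM Data Science", "Hello Spark"]

# map: one output element (a list of words) per input line
mapped = [line.split(" ") for line in lines]
# [['IBM', 'Data', 'Science'], ['Hello', 'Spark']]

# flatMap: every word becomes its own element in one flat sequence
flattened = [word for line in lines for word in line.split(" ")]
# ['IBM', 'Data', 'Science', 'Hello', 'Spark']
```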

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
flatmap() parameters work the same way as in map()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "     z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split(\" \"))
\n", "     z_str_rdd_split_flatmap.collect()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
Use the replace() and lower() methods to remove all commas and periods then make everything lower-case
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "#Step 2.11 - Learn the difference between two transformations: map and flatMap.\n", "\n", "\n", "#What do you notice? How are the outputs of 2.10 and 2.11 different?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.12 - Augment each entry in the previous RDD with the number \"1\" to create pairs or tuples. The first element of the tuple will be the word and the second elements of the tuple will be the digit \"1\". This is a common step in performing a count as we need values to sum.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Maps don't always have to perform calculations, they can just echo values as well. Simply echo the value and a 1
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
We need to create tuples which are values enclosed in parenthesis, so you'll need to enclose the value, 1 in parens. For example: (x, 1)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", "     countWords = z_str_rdd_split_flatmap.map(lambda word:(word,1))
\n", "     countWords.collect()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 2.12 - Create pairs or tuple RDD and print it.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2.13 Now we have above what is known as a [Pair RDD](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions). Each entry in the RDD has a KEY and a VALUE.
\n", "The KEY is the word (IBM, Data, the, ...) and the VALUE is the number \"1\". \n", "We can now AGGREGATE this RDD by summing up all the values BY KEY

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
We want to sum all values by key in the key-value pairs. The generic function to do this is reduceByKey(func):
\n", "     When called on a dataset of (K [Key], V [Value]) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.

Which means func(v1, v2) runs across all values for a specific key. Think of v1 as the running result (seeded with the first value seen for that key) and v2 as each subsequent value with the same key. With each iterated value, v1 is updated.
\n", " Use a lambda function to sum up the values just as you wrote for map()
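The semantics of reduceByKey can be modeled in a few lines of plain Python (a sketch of the behavior only -- Spark actually performs this per partition and then merges results across the cluster):

```python
def reduce_by_key(pairs, func):
    # Model of reduceByKey: the first value seen for a key seeds the
    # running result; func folds each later value for that key into it.
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return sorted(result.items())

counts = reduce_by_key([("the", 1), ("data", 1), ("the", 1)],
                       lambda x, y: x + y)
# counts is [('data', 1), ('the', 2)]
```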
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    countWords2 = countWords.reduceByKey(lambda x,y: x+y)
\n", "    countWords2.collect()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
Sort the results by the count. You could call sortByKey() on the result, but it works on the key....
\n", " Also, while the function used in map() has only one parameter, when working with Pair RDDs that parameter is a tuple of two values....\n", "
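One way to approach this optional exercise, sketched in plain Python: swap each (word, count) pair so the count becomes the key, which is exactly what a map() before sortByKey() would accomplish (illustrative data, not the Spark code itself):

```python
pairs = [("the", 2), ("data", 1), ("Spark", 3)]

# Swap to (count, word) so that sorting by "key" sorts by count.
swapped = [(count, word) for (word, count) in pairs]
by_count = sorted(swapped, reverse=True)
# [(3, 'Spark'), (2, 'the'), (1, 'data')]
```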
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "#Step 2.13 - Check out the results of the aggregation\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Reading a file and counting words\n", "### Step 3.1 - Read the Apache Spark README.md file from Github. The ! allows you to embed file system commands\n", "
\n", "We remove README.md in case there was an updated version -- but also for another reason you will discover in Lab 2

\n", "Type:
\n", "\n", "    !rm README.md* -f
\n", "    !wget https://raw.githubusercontent.com/apache/spark/master/README.md
\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 3.1 - Pull data file into workbench\n", "!rm README.md* -f\n", "!wget https://raw.githubusercontent.com/apache/spark/master/README.md\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3.2 - Create an RDD by reading from the local filesystem and count the number of lines Here is the [textfile()](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=textfile#pyspark.SparkContext.textFile) documentation.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
README.md has been loaded into local storage so there is no path needed. textFile() returns an RDD -- you do not have to parallelize the result.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    textfile_rdd = sc.textFile(\"README.md\")
\n", "    textfile_rdd.count()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
By default, textFile() uses UTF-8 format. Read the file as UNICODE.
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 3.2 - Create RDD from data file\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3.3 - Filter out lines that contain \"Spark\". This will be achieved using the [filter](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=filter#pyspark.RDD.filter) transformation. Python allows us to use the 'in' syntax to search strings.
\n", "We will also take a look at the first line in the newly filtered RDD.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
filter(), just like map() can take a lambda function as its input
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", "    Spark_lines = textfile_rdd.filter(lambda line: \"Spark\" in line)
\n", "    Spark_lines.first()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
There are 28 lines which contain the word \"Spark\". Find all lines which contain it when case-insensitive
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 3.3 - Filter for only lines with word Spark\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3.4 - Print the number of Spark lines in this filtered RDD out of the total number and print the result as a concatenated string.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
The print() statement prints to the console. (Note: be careful on a cluster because a print on a distributed machine will not be seen). You can cast integers to string by using the str() method.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Strings can be concatenated together with the + sign. You can mark a statement as spanning multiple lines by putting a \\ at the end of the line.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", "    print \"The file README.md has \" + str(Spark_lines.count()) + \\
\n", "        \" of \" + str(textfile_rdd.count()) + \\
\n", "        \" lines with the word Spark in it.\"
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 3.4 - count the number of lines\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3.5 - Now count the number of times the word Spark appears in the original text, not just the number of lines that contain it.\n", "Looking back at previous exercises, you will need to:
\n", "    1 - Execute a flatMap transformation on the original textfile_rdd and split on white space.
\n", "    2 - Filter, keeping only the tokens that contain the word Spark
\n", "    3 - Count all instances
\n", "    4 - Print the total count
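The four steps above can be sketched in plain Python first (the sample lines are hypothetical; in the lab you would use flatMap, filter, and count on the RDD instead):

```python
lines = ["Apache Spark is a fast engine",
         "Spark runs on Hadoop",
         "It is highly scalable"]      # hypothetical sample lines

# 1 - split every line on whitespace and flatten into words
words = [word for line in lines for word in line.split()]
# 2 - keep only the tokens that contain "Spark"
spark_words = [word for word in words if "Spark" in word]
# 3 - count them
spark_count = len(spark_words)
# 4 - print the total
print(spark_count)
```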

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
str not in string is how to filter out
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
flatMapRDD = textfile_rdd.flatMap(lambda line: line.split())
\n", " flatMapRDDFilter = flatMapRDD.filter(lambda word: \"Spark\" in word)
\n", " flatMapRDDFilterCount = flatMapRDDFilter.count()
\n", " print flatMapRDDFilterCount
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Optional Advanced\n", "

\n", "
\n", "
\n", "
Put the entire statement on one line and make the filter case-insensitive.
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 3.5 - Count the number of instances of tokens starting with \"Spark\"\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Perform analysis on a data file\n", "This part is a little more open ended and there are a few ways to complete it. Scroll up to previous examples for some guidance. You will download a data file, transform the data, and then average the prices. The data file will be a sample of tech stock prices over six days.
\n", "\n", "Data Location: https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv
\n", "The data file is a CSV file.
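As a plain-Python warm-up for this step (using a made-up row in the same shape as the sample shown below -- a sketch, not the full Spark solution), parsing one CSV line and averaging its prices might look like:

```python
# A made-up line in the same shape as the lab's CSV rows:
line = '"IBM","159.72","159.39","158.88","159.53","159.55","160.35"'

# Split on commas, then strip whitespace and the surrounding quotes.
fields = [f.strip().strip('"') for f in line.split(",")]
symbol = fields[0]
prices = [float(p) for p in fields[1:]]
average = sum(prices) / len(prices)
```

In the lab itself, the same parsing would go inside a map() over stockPrices_RDD, one line per element.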
\n", "Here is a sample of the file:
\n", "      \"IBM\",\"159.720001\" ,\"159.399994\" ,\"158.880005\",\"159.539993\", \"159.550003\", \"160.350006\"" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "#Step 4.1 - Delete the file if it exists, download a new copy and load it into an RDD\n", "!rm StockPrices.csv -f\n", "!wget https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv\n", " \n", "stockPrices_RDD = sc.textFile(\"StockPrices.csv\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 4.2 - Transform the data to extract the stock ticker symbol and the prices.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Step 4.3 - Compute the averages and print them for each symbol.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2 with Spark 1.6", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }