', 1)` for each word element in the RDD.\n",
"We can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"wordPairs = wordsRDD.map(lambda s: (s, 1))\n",
"print wordPairs.collect()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Pair RDDs (1f)\n",
"Test.assertEquals(wordPairs.collect(),\n",
" [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],\n",
" 'incorrect value for wordPairs')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Pair RDDs (1f)\n",
"Test.assertEquals(wordPairs.collect(),\n",
" [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],\n",
" 'incorrect value for wordPairs')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ** Part 2: Counting with pair RDDs **"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.\n",
" \n",
"A naive approach would be to `collect()` all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (2a) `groupByKey()` approach **\n",
" \n",
"An approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.\n",
" \n",
"There are two problems with using `groupByKey()`:\n",
" + The operation requires a lot of data movement to move all the values into the appropriate partitions.\n",
" + The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.\n",
" \n",
"Use `groupByKey()` to generate a pair RDD of type `('word', iterator)`."
]
},
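{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the memory concern concrete, here is a small plain-Python sketch of what grouping by key implies. It is only an analogy, not Spark code: the `pairs` list is sample data copied from this lab, and `grouped` is a hypothetical name.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"# Sample (word, 1) pairs, matching the small example used in this lab.\n",
"pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]\n",
"\n",
"# Collect every value for a key into one list, as grouping by key implies.\n",
"grouped = defaultdict(list)\n",
"for key, value in pairs:\n",
"    grouped[key].append(value)\n",
"\n",
"# Print each word with the full list of values gathered for it.\n",
"for key in sorted(grouped):\n",
"    print('{0}: {1}'.format(key, grouped[key]))\n",
"```\n",
"\n",
"Every occurrence of a word adds one more entry to that word's list, so a very common word produces a very long list."
]
},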
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"rat: [1, 1]\n",
"elephant: [1]\n",
"cat: [1, 1]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"# Note that groupByKey requires no parameters\n",
"wordsGrouped = wordPairs.groupByKey()\n",
"for key, value in wordsGrouped.collect():\n",
" print '{0}: {1}'.format(key, list(value))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST groupByKey() approach (2a)\n",
"Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),\n",
" [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],\n",
" 'incorrect value for wordsGrouped')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST groupByKey() approach (2a)\n",
"Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),\n",
" [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],\n",
" 'incorrect value for wordsGrouped')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (2b) Use `groupByKey()` to obtain the counts **\n",
" \n",
"Using the `groupByKey()` transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator.\n",
" \n",
"Now sum the iterator using a `map()` transformation. The result should be a pair RDD consisting of (word, count) pairs."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[('rat', 2), ('elephant', 1), ('cat', 2)]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"wordCountsGrouped = wordsGrouped.map(lambda (k, v): (k, sum(v)))\n",
"print wordCountsGrouped.collect()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Use groupByKey() to obtain the counts (2b)\n",
"Test.assertEquals(sorted(wordCountsGrouped.collect()),\n",
" [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCountsGrouped')\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Use groupByKey() to obtain the counts (2b)\n",
"Test.assertEquals(sorted(wordCountsGrouped.collect()),\n",
" [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCountsGrouped')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (2c) Counting using `reduceByKey` **\n",
" \n",
"A better approach is to start from the pair RDD and then use the [reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets."
]
},
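{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough mental model only, the two phases described above can be sketched in plain Python. This is not Spark's implementation: the `partitions` layout and the helper names `combine_locally` and `merge_partials` are assumptions made up for illustration.\n",
"\n",
"```python\n",
"from operator import add\n",
"\n",
"# Hypothetical split of the (word, 1) pairs into two partitions.\n",
"partitions = [[('cat', 1), ('elephant', 1), ('rat', 1)],\n",
"              [('rat', 1), ('cat', 1)]]\n",
"\n",
"def combine_locally(pairs, func):\n",
"    # Phase 1: reduce the values for each key inside a single partition.\n",
"    local = {}\n",
"    for key, value in pairs:\n",
"        local[key] = func(local[key], value) if key in local else value\n",
"    return local\n",
"\n",
"def merge_partials(partials, func):\n",
"    # Phase 2: merge the small per-partition results across partitions.\n",
"    merged = {}\n",
"    for partial in partials:\n",
"        for key, value in partial.items():\n",
"            merged[key] = func(merged[key], value) if key in merged else value\n",
"    return merged\n",
"\n",
"partials = [combine_locally(p, add) for p in partitions]\n",
"counts = merge_partials(partials, add)\n",
"print(sorted(counts.items()))  # [('cat', 2), ('elephant', 1), ('rat', 2)]\n",
"```\n",
"\n",
"Only one partial result per key leaves each partition, which is why far less data has to move than when every value is shipped to a single list."
]
},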
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[('rat', 2), ('elephant', 1), ('cat', 2)]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)\n",
"print wordCounts.collect()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Counting using reduceByKey (2c)\n",
"Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCounts')"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Counting using reduceByKey (2c)\n",
"Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCounts')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (2d) All together **\n",
" \n",
"The expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[('rat', 2), ('elephant', 1), ('cat', 2)]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"wordCountsCollected = (wordsRDD\n",
" .map(lambda s: (s, 1))\n",
" .reduceByKey(lambda a, b : a + b)\n",
" .collect())\n",
"print wordCountsCollected"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST All together (2d)\n",
"Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCountsCollected')\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST All together (2d)\n",
"Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect value for wordCountsCollected')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ** Part 3: Finding unique words and a mean value **"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (3a) Unique words **\n",
" \n",
"Calculate the number of unique words in `wordsRDD`. You can use other RDDs that you have already created to make this easier."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"3\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## ANSWER\n",
"uniqueWords = wordCounts.count()\n",
"print uniqueWords"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Unique words (3a)\n",
"Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Unique words (3a)\n",
"Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (3b) Mean using `reduce` **\n",
" \n",
"Find the mean number of words per unique word in `wordCounts`.\n",
" \n",
"Use a `reduce()` action to sum the counts in `wordCounts` and then divide by the number of unique words. First `map()` the pair RDD `wordCounts`, which consists of (key, value) pairs, to an RDD of values."
]
},
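{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arithmetic can be checked in plain Python before running it on an RDD. This is a sketch with `functools.reduce` standing in for the RDD `reduce()` action; the list `word_counts` simply copies the three pairs this lab produces, and the variable names are hypothetical.\n",
"\n",
"```python\n",
"from functools import reduce\n",
"from operator import add\n",
"\n",
"# The same (word, count) pairs that wordCounts holds in this lab.\n",
"word_counts = [('cat', 2), ('elephant', 1), ('rat', 2)]\n",
"\n",
"values = [count for _, count in word_counts]  # keep only the values\n",
"total = reduce(add, values)                   # 2 + 1 + 2\n",
"mean = total / float(len(word_counts))        # float() avoids integer division in Python 2\n",
"\n",
"print(total)           # 5\n",
"print(round(mean, 2))  # 1.67\n",
"```\n",
"\n",
"Because `add` is associative and commutative, the order in which the values are combined does not change the total."
]
},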
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"5\n",
"1.67\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"from operator import add\n",
"totalCount = (wordCounts\n",
" .map(lambda (k, v): v)\n",
" .reduce(add))\n",
"average = totalCount / float(wordCounts.count())\n",
"print totalCount\n",
"print round(average, 2)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Mean using reduce (3b)\n",
"Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Mean using reduce (3b)\n",
"Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ** Part 4: Apply word count to a file **"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we will finish developing our word count application. We'll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4a) `wordCount` function **\n",
" \n",
"First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this lab. This function should take in an RDD that is a list of words like `wordsRDD` and return a pair RDD that has all of the words and their associated counts."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[('rat', 2), ('elephant', 1), ('cat', 2)]\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"def wordCount(wordListRDD):\n",
" \"\"\"Creates a pair RDD with word counts from an RDD of words.\n",
"\n",
" Args:\n",
" wordListRDD (RDD of str): An RDD consisting of words.\n",
"\n",
" Returns:\n",
" RDD of (str, int): An RDD consisting of (word, count) tuples.\n",
" \"\"\"\n",
" return (wordListRDD\n",
" .map(lambda s: (s, 1))\n",
" .reduceByKey(add))\n",
"print wordCount(wordsRDD).collect()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST wordCount function (4a)\n",
"Test.assertEquals(sorted(wordCount(wordsRDD).collect()),\n",
" [('cat', 2), ('elephant', 1), ('rat', 2)],\n",
" 'incorrect definition for wordCount function')"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST wordCount function (4a)\n",
"privateWordsRDD = sc.parallelize(['cat', 'cat', 'cat', 'rat', 'elephant', 'elephant'])\n",
"Test.assertEquals(sorted(wordCount(privateWordsRDD).collect()),\n",
" [('cat', 3), ('elephant', 2), ('rat', 1)],\n",
" 'incorrect definition for wordCount function')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4b) Capitalization and punctuation **\n",
" \n",
"Real world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:\n",
" + Words should be counted independent of their capitialization (e.g., Spark and spark should be counted as the same word).\n",
" + All punctuation should be removed.\n",
" + Any leading or trailing spaces on a line should be removed.\n",
" \n",
"Define the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful.\n",
"If you are unfamiliar with regular expressions, you may want to review [this tutorial](https://developers.google.com/edu/python/regular-expressions) from Google. Also, [this website](https://regex101.com/#python) is a great resource for debugging your regular expression."
]
},
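{
"cell_type": "markdown",
"metadata": {},
"source": [
"The order of the cleaning steps matters. The sketch below, which assumes a character class that keeps only letters, digits, and spaces, compares stripping spaces before and after the punctuation is removed. The names `raw`, `pattern`, `strip_first`, and `strip_last` are hypothetical.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"raw = ' * Remove punctuation then spaces * '\n",
"pattern = '[^A-Za-z0-9 ]'  # anything that is not a letter, digit, or space\n",
"\n",
"# Stripping first leaves the spaces that sat next to the punctuation.\n",
"strip_first = re.sub(pattern, '', raw.strip()).lower()\n",
"\n",
"# Stripping last removes them, which is the behaviour the lab asks for.\n",
"strip_last = re.sub(pattern, '', raw).lower().strip()\n",
"\n",
"print(repr(strip_first))  # ' remove punctuation then spaces '\n",
"print(repr(strip_last))   # 'remove punctuation then spaces'\n",
"```\n",
"\n",
"Listing the allowed characters explicitly also removes underscores, which a word-character class would keep."
]
},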
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"hi you\n",
"no underscore\n",
"remove punctuation then spaces\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"import re\n",
"def removePunctuation(text):\n",
" \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n",
"\n",
" Note:\n",
" Only spaces, letters, and numbers should be retained. Other characters should should be\n",
" eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n",
" punctuation is removed.\n",
"\n",
" Args:\n",
" text (str): A string.\n",
"\n",
" Returns:\n",
" str: The cleaned up string.\n",
" \"\"\"\n",
" return re.sub(r'[^A-Za-z0-9 ]', '', text).lower().strip()\n",
"print removePunctuation('Hi, you!')\n",
"print removePunctuation(' No under_score!')\n",
"print removePunctuation(' * Remove punctuation then spaces * ')"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Capitalization and punctuation (4b)\n",
"Test.assertEquals(removePunctuation(\" The Elephant's 4 cats. \"),\n",
" 'the elephants 4 cats',\n",
" 'incorrect definition for removePunctuation function')"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Capitalization and punctuation (4b)\n",
"Test.assertEquals(removePunctuation(\" Hi, It's possible I'm cheating. \"),\n",
" 'hi its possible im cheating',\n",
" 'incorrect definition for removePunctuation function')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4c) Load a text file **\n",
" \n",
"For the next part of this lab, we will use the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We also apply the recently defined `removePunctuation()` function using a `map()` transformation to strip out the punctuation and change all text to lower case. Since the file is large we use `take(15)`, so that we only print 15 lines."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"0: 1609\n",
"1: \n",
"2: the sonnets\n",
"3: \n",
"4: by william shakespeare\n",
"5: \n",
"6: \n",
"7: \n",
"8: 1\n",
"9: from fairest creatures we desire increase\n",
"10: that thereby beautys rose might never die\n",
"11: but as the riper should by time decease\n",
"12: his tender heir might bear his memory\n",
"13: but thou contracted to thine own bright eyes\n",
"14: feedst thy lights flame with selfsubstantial fuel\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Just run this code\n",
"import os.path\n",
"baseDir = os.path.join('databricks-datasets')\n",
"inputPath = os.path.join('cs100', 'lab1', 'data-001', 'shakespeare.txt')\n",
"fileName = os.path.join(baseDir, inputPath)\n",
"\n",
"shakespeareRDD = (sc\n",
" .textFile(fileName, 8)\n",
" .map(removePunctuation))\n",
"print '\\n'.join(shakespeareRDD\n",
" .zipWithIndex() # to (line, lineNum)\n",
" .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'\n",
" .take(15))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4d) Words from lines **\n",
" \n",
"Before we can use the `wordcount()` function, we have to address two issues with the format of the RDD:\n",
" + The first issue is that that we need to split each line by its spaces. ** Performed in (4d). **\n",
" + The second issue is we need to filter out empty lines. ** Performed in (4e). **\n",
" \n",
"Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string [split()](https://docs.python.org/2/library/string.html#string.split) function. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be.\n",
" \n",
"> Note:\n",
 * Do not use">
"> * Do not use the default implementation of `split()`, but pass in a separator value. For example, to split `line` by commas you would use `line.split(',')`."
]
},
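{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plain-Python analogy may help when choosing the transformation. The sketch below uses list comprehensions to imitate what `map()` and `flatMap()` would produce when each line is split on a single space. The sample `lines` and the names `mapped` and `flat` are made up for illustration.\n",
"\n",
"```python\n",
"# Hypothetical cleaned lines: one blank line and one double space.\n",
"lines = ['the sonnets', '', 'by  william']\n",
"\n",
"# Like map(): one output element per input line, so a list of lists.\n",
"mapped = [line.split(' ') for line in lines]\n",
"\n",
"# Like flatMap(): the per-line lists are flattened into one list of words.\n",
"flat = [word for line in lines for word in line.split(' ')]\n",
"\n",
"print(mapped)  # [['the', 'sonnets'], [''], ['by', '', 'william']]\n",
"print(flat)    # ['the', 'sonnets', '', 'by', '', 'william']\n",
"```\n",
"\n",
"The empty strings in `flat` come from the blank line and the repeated space, which is why a filtering step is still needed afterwards."
]
},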
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']\n",
"927631\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split(' '))\n",
"shakespeareWordCount = shakespeareWordsRDD.count()\n",
"print shakespeareWordsRDD.top(5)\n",
"print shakespeareWordCount"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Words from lines (4d)\n",
"# This test allows for leading spaces to be removed either before or after\n",
"# punctuation is removed.\n",
"Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n",
" 'incorrect value for shakespeareWordCount')\n",
"Test.assertEquals(shakespeareWordsRDD.top(5),\n",
" [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],\n",
" 'incorrect value for shakespeareWordsRDD')"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Words from lines (4d)\n",
"# This test allows for leading spaces to be removed either before or after\n",
"# punctuation is removed.\n",
"Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n",
" 'incorrect value for shakespeareWordCount')\n",
"Test.assertEquals(shakespeareWordsRDD.map(lambda x: len(x)).sum(), 3697209,\n",
" 'incorrect value for shakespeareWordsRDD')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4e) Remove empty elements **\n",
" \n",
"The next step is to filter out the empty elements. Remove all entries where the word is `''`."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"882996\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"shakeWordsRDD = shakespeareWordsRDD.filter(lambda x: x != '')\n",
"shakeWordCount = shakeWordsRDD.count()\n",
"print shakeWordCount"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Remove empty elements (4e)\n",
"Test.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Remove empty elements (4e)\n",
"Test.assertEquals(shakeWordsRDD\n",
" .filter(lambda x: x == '')\n",
" .count(), 0, 'incorrect value for shakeWordsRDD')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** (4f) Count the words **\n",
" \n",
"We now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n",
" \n",
"You'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results.\n",
"Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts."
]
},
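{
"cell_type": "markdown",
"metadata": {},
"source": [
"The custom sort can be tried out locally first. The sketch below uses `heapq.nsmallest` from the standard library as a stand-in for `takeOrdered()`, with a key that negates the count so that the largest counts come first. The short `counts` list reuses four values from this lab's results, and `top2` is a hypothetical name.\n",
"\n",
"```python\n",
"import heapq\n",
"\n",
"# A few (word, count) pairs in arbitrary order.\n",
"counts = [('me', 7769), ('the', 27361), ('with', 7771), ('and', 26028)]\n",
"\n",
"# Ascending order of the negated count is descending order of the count.\n",
"top2 = heapq.nsmallest(2, counts, key=lambda pair: -pair[1])\n",
"print(top2)  # [('the', 27361), ('and', 26028)]\n",
"```\n",
"\n",
"The same key function, `lambda x: -x[1]`, can be passed to `takeOrdered()` because it also returns elements in ascending order of the key."
]
},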
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"the: 27361\n",
"and: 26028\n",
"i: 20681\n",
"to: 19150\n",
"of: 17463\n",
"a: 14593\n",
"you: 13615\n",
"my: 12481\n",
"in: 10956\n",
"that: 10890\n",
"is: 9134\n",
"not: 8497\n",
"with: 7771\n",
"me: 7769\n",
"it: 7678\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ANSWER\n",
"top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, key=lambda x: -x[1])\n",
"print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# TEST Count the words (4f)\n",
"Test.assertEquals(top15WordsAndCounts,\n",
" [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n",
" (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n",
" (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n",
" 'incorrect value for top15WordsAndCounts')"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"1 test passed.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PRIVATE_TEST Count the words (4f)\n",
"Test.assertEquals(top15WordsAndCounts,\n",
" [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n",
" (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n",
" (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n",
" 'incorrect value for top15WordsAndCounts')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"name": "z2.word_count_solution",
"notebookId": 3572859401357127
},
"nbformat": 4,
"nbformat_minor": 1
}