{"paragraphs":[{"text":"%md\n## Sentiment Analysis with Spark\nThis module will teach you how to build sentiment analysis algorithms with Apache Spark. We will be doing data transformation using Scala and Apache Spark 2, and we will be classifying tweets as happy or sad using a Gradient Boosting algorithm. Although we're focusing on sentiment analysis, Gradient Boosting is a versatile technique that can be applied to many classification problems. You should be able to reuse this code to classify text in many other ways, such as spam or not spam, news or not news, provided you can create enough labeled examples with which to train the model.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Sentiment Analysis with Spark

\n

This module will teach you how to build sentiment analysis algorithms with Apache Spark. We will be doing data transformation using Scala and Apache Spark 2, and we will be classifying tweets as happy or sad using a Gradient Boosting algorithm. Although we’re focusing on sentiment analysis, Gradient Boosting is a versatile technique that can be applied to many classification problems. You should be able to reuse this code to classify text in many other ways, such as spam or not spam, news or not news, provided you can create enough labeled examples with which to train the model.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684252_1002031524","id":"20170314-235415_746413739","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:2561"},{"text":"%md\n### Configuration\n\nBefore starting this model you should make sure HDFS and Spark2 are started. ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Configuration

\n

Before starting this module, make sure HDFS and Spark2 are started.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684253_1001646776","id":"20170316-002654_1627110001","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2562"},{"text":"%md\n### Download Tweets\n\nGradient Boosting is a supervised machine learning algorithm, which means we will have to provide it with many examples of statements that are labeled as happy or sad. In an ideal world we would prefer to have a large dataset where a group of experts hand-labeled each statement as happy or sad. Since we don't have that dataset we can improvise by streaming tweets that contain the words \"happy\" or \"sad\", and use the presence of these words as our labels. This isn't perfect: a few sentences like \"I'm not happy\" will end up being incorrectly labeled as happy. If you wanted more accurate labeled data, you could use a part of speech tagger like Stanford NLP or SyntaxNet, which would let you make sure the word \"happy\" is always describing \"I\" or \"I'm\" and the word \"not\" isn't applied to \"happy\". However, this basic labeling will be good enough to train a working model. \n\nIf you've followed the first sentiment analysis tutorial you've learned how to use Nifi to stream live tweets to your local computer and HDFS storage. If you've followed this tutorial you can stream your own tweets by configuring the GetTwitter processor to filter on \"happy\" and \"sad\". If you're running on the sandbox and want to process a large amount of tweets, you may also want to raise the amount of memory available to YARN and Spark2. You can do that by modifying the setting “Memory allocated for all YARN containers on a node” to > 4G for YARN and spark_daemon_memory to > 4G for Spark2.\n\nOtherwise, you can run the next cell to download pre-packaged tweets.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Download Tweets

\n

Gradient Boosting is a supervised machine learning algorithm, which means we will have to provide it with many examples of statements that are labeled as happy or sad. In an ideal world we would prefer a large dataset where a group of experts hand-labeled each statement as happy or sad. Since we don’t have that dataset, we can improvise by streaming tweets that contain the words “happy” or “sad” and using the presence of these words as our labels. This isn’t perfect: a few sentences like “I’m not happy” will end up being incorrectly labeled as happy. If you wanted more accurate labeled data, you could use a part-of-speech tagger like Stanford NLP or SyntaxNet, which would let you make sure the word “happy” is actually describing “I” or “I’m” and that “not” isn’t applied to “happy”. However, this basic labeling will be good enough to train a working model.
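To make the negation caveat concrete, here is a minimal Scala sketch (not part of the original pipeline; the helper name and rule are ours) that flags tweets where the keyword is directly negated. It is far cruder than a real part-of-speech tagger, but it shows the idea:

// Hypothetical helper: flag tweets where the sentiment word is directly negated.
// A part-of-speech tagger (Stanford NLP, SyntaxNet) would handle many more cases.
def isDirectlyNegated(msg: String, keyword: String): Boolean = {
  val words = msg.toLowerCase.split("\\s+")
  words.sliding(2).exists {
    case Array(prev, w) => w.contains(keyword) && (prev == "not" || prev.endsWith("n't"))
    case _              => false
  }
}

isDirectlyNegated("I'm not happy today", "happy")    // true  -> could be skipped when labeling
isDirectlyNegated("so happy it's Friday", "happy")   // false -> kept and labeled happy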

\n

If you’ve followed the first sentiment analysis tutorial, you’ve learned how to use NiFi to stream live tweets to your local computer and HDFS storage. If you’ve followed that tutorial, you can stream your own tweets by configuring the GetTwitter processor to filter on “happy” and “sad”. If you’re running on the sandbox and want to process a large number of tweets, you may also want to raise the amount of memory available to YARN and Spark2. You can do that by setting “Memory allocated for all YARN containers on a node” to > 4G for YARN and spark_daemon_memory to > 4G for Spark2.

\n

Otherwise, you can run the next cell to download pre-packaged tweets.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684253_1001646776","id":"20170315-185911_269819862","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2563"},{"text":"%sh\n\nmkdir /tmp/tweets\nrm -rf /tmp/tweets/*\ncd /tmp/tweets\nwget -O /tmp/tweets/tweets.zip https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/hdp-2.6/sentiment-analysis-with-apache-spark/assets/tweets.zip\nunzip /tmp/tweets/tweets.zip\nrm /tmp/tweets/tweets.zip\n\n# Remove existing (if any) copy of data from HDFS. You could do this with Ambari file view.\nhdfs dfs -mkdir /tmp/tweets_staging/\nhdfs dfs -rmr -f /tmp/tweets_staging/* -skipTrash\n\n# Move downloaded JSON file from local storage to HDFS\nhdfs dfs -put /tmp/tweets/* /tmp/tweets_staging\n\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/sh","results":{},"enabled":true,"editorSetting":{"language":"sh","editOnDblClick":false}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684254_1002801022","id":"20170315-205558_857231102","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2564"},{"text":"%md\n### Load into Spark\n\nLets load the tweets into Spark SQL and take a look at them.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Load into Spark

\n

Let’s load the tweets into Spark SQL and take a look at them.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684254_1002801022","id":"20170315-212655_1153565828","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2565"},{"text":"import org.apache.spark._\nimport org.apache.spark.rdd._\nimport org.apache.spark.SparkContext._\nimport org.apache.spark.mllib.feature.HashingTF\nimport org.apache.spark.{SparkConf, SparkContext}\nimport org.apache.spark.mllib.regression.LabeledPoint\nimport org.apache.spark.mllib.tree.GradientBoostedTrees\nimport org.apache.spark.mllib.tree.configuration.BoostingStrategy\nimport org.apache.spark._\nimport org.apache.spark.rdd._\nimport org.apache.spark.SparkContext._\nimport scala.util.{Success, Try}\n\n val sqlContext = new org.apache.spark.sql.SQLContext(sc)\n\n var tweetDF = sqlContext.read.json(\"hdfs:///tmp/tweets_staging/*\")\n // tweetDF.printSchema()\n tweetDF.show()\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684255_1002416273","id":"20170314-222922_910635069","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2566"},{"text":"%md\n### Clean Records\n\nWe want to remove any tweet that doesn't contain \"happy\" or \"sad\". We've also chosen to select an equal number of happy and sad tweets to prevent bias in the model. Since we've loaded our data into a Spark DataFrame, we can use SQL-like statements to transform and select our data.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Clean Records

\n

We want to remove any tweet that doesn’t contain “happy” or “sad”. We’ve also chosen to select an equal number of happy and sad tweets to prevent bias in the model. Since we’ve loaded our data into a Spark DataFrame, we can use SQL-like statements to transform and select our data.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684255_1002416273","id":"20170315-214825_1815274191","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2567"},{"text":"\nvar messages = tweetDF.select(\"msg\")\nprintln(\"Total messages: \" + messages.count())\n\nvar happyMessages = messages.filter(messages(\"msg\").contains(\"happy\"))\nval countHappy = happyMessages.count()\nprintln(\"Number of happy messages: \" + countHappy)\n\nvar unhappyMessages = messages.filter(messages(\"msg\").contains(\" sad\"))\nval countUnhappy = unhappyMessages.count()\nprintln(\"Unhappy Messages: \" + countUnhappy)\n\nval smallest = Math.min(countHappy, countUnhappy).toInt\n\n//Create a dataset with equal parts happy and unhappy messages\nvar tweets = happyMessages.limit(smallest).unionAll(unhappyMessages.limit(smallest))\n \n ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684256_988180564","id":"20170314-222925_680225414","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2568"},{"text":"%md\n### Label Data\n\nNow label each happy tweet as 1 and unhappy tweets as 0. In order to prevent our model from cheating, we're going to remove the words happy and sad from the tweets. This will force it to infer whether the user is happy or sad by the presence of other words. \n\nFinally, we also split each tweet into a collection of words. For convenience we convert the Spark Dataframe to an RDD which lets you easily transform data using the map function.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Label Data

\n

Now we label each happy tweet as 1 and each unhappy tweet as 0. In order to prevent our model from cheating, we’re going to remove the words happy and sad from the tweets. This forces it to infer whether the user is happy or sad from the presence of other words.

\n

Finally, we also split each tweet into a collection of words. For convenience we convert the Spark DataFrame to an RDD, which lets us easily transform the data using the map function.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684256_988180564","id":"20170315-215800_466665578","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2569"},{"text":"val messagesRDD = tweets.rdd\n//We use scala's Try to filter out tweets that couldn't be parsed\nval goodBadRecords = messagesRDD.map(\n row =>{\n Try{\n val msg = row(0).toString.toLowerCase()\n var isHappy:Int = 0\n if(msg.contains(\" sad\")){\n isHappy = 0\n }else if(msg.contains(\"happy\")){\n isHappy = 1\n }\n var msgSanitized = msg.replaceAll(\"happy\", \"\")\n msgSanitized = msgSanitized.replaceAll(\"sad\",\"\")\n //Return a tuple\n (isHappy, msgSanitized.split(\" \").toSeq)\n }\n }\n)\n\n//We use this syntax to filter out exceptions\nval exceptions = goodBadRecords.filter(_.isFailure)\nprintln(\"total records with exceptions: \" + exceptions.count())\nexceptions.take(10).foreach(x => println(x.failed))\nvar labeledTweets = goodBadRecords.filter((_.isSuccess)).map(_.get)\nprintln(\"total records with successes: \" + labeledTweets.count())\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684257_987795815","id":"20170314-234521_264925171","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2570"},{"text":"%md\n\n\n\nWe now have a collection of tuples of the form (Int, Seq[String]), where a 1 for the first term indicates happy and 0 indicates sad. The second term is a sequence of words, including emojis.\n\nLet's take a look.\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

We now have a collection of tuples of the form (Int, Seq[String]), where a 1 for the first term indicates happy and 0 indicates sad. The second term is a sequence of words, including emojis.

\n

Let’s take a look.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684257_987795815","id":"20170315-220802_1466536053","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2571"},{"text":"labeledTweets.take(10).foreach(x => println(x))\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala","editOnDblClick":false}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684257_987795815","id":"20170315-221309_2084520780","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2572"},{"text":"%md\n### Transform Data\n\nGradient Boosting expects as input a vector (feature array) of fixed length, so we need a way to convert our tweets into some numeric vector that represents that tweet. A standard way to do this is to use the hashing trick, in which we hash each word and index it into a fixed-length array. What we get back is an array that represents the count of each word in the tweet. This approach is called the bag of words model, which means we are representing each sentence or document as a collection of discrete words and ignore grammar or the order in which words appear in a sentence. An alternative approach to bag of words would be to use an algorithm like Doc2Vec or Latent Semantic Indexing, which would use machine learning to build a vector representations of tweets.\n\nIn Spark we're using HashingTF for feature hashing. Note that we're using an array of size 2000. Since this is smaller than the size of the vocabulary we'll encounter on Twitter, it means two words with different meaning can be hashed to the same location in the array. Although it would seem this would be an issue, in practice this preserves enough information that the model still works. This is actually one of the strengths of feature hashing, that it allows you to represent a large or growing vocabulary in a fixed amount of space.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Transform Data

\n

Gradient Boosting expects as input a fixed-length vector (feature array), so we need a way to convert each tweet into a numeric vector that represents it. A standard way to do this is the hashing trick, in which we hash each word and index it into a fixed-length array. What we get back is an array that represents the count of each word in the tweet. This approach is called the bag-of-words model: we represent each sentence or document as a collection of discrete words and ignore grammar and the order in which words appear in a sentence. An alternative to bag of words would be an algorithm like Doc2Vec or Latent Semantic Indexing, which would use machine learning to build vector representations of tweets.

\n

In Spark we’re using HashingTF for feature hashing. Note that we’re using an array of size 2000. Since this is smaller than the size of the vocabulary we’ll encounter on Twitter, two words with different meanings can be hashed to the same location in the array. Although this might seem like a problem, in practice it preserves enough information that the model still works. This is actually one of the strengths of feature hashing: it lets you represent a large or growing vocabulary in a fixed amount of space.
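To see the hashing trick in action before we apply it to the tweets, here is a small illustrative sketch (not part of the original notebook); the exact slot numbers depend on the hash function, so treat the printed indices as examples only:

import org.apache.spark.mllib.feature.HashingTF

val demoTF = new HashingTF(2000)               // same fixed vector size we use below
val demoVec = demoTF.transform(Seq("spark", "is", "fun", "fun"))

println(demoTF.indexOf("spark"))               // the slot (0..1999) this word hashes to
println(demoTF.indexOf("fun"))                 // another slot; unrelated words can collide
println(demoVec)                               // sparse vector of slot -> term count; "fun" counts as 2.0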

\n
"}]},"apps":[],"jobName":"paragraph_1493277684258_988950062","id":"20170315-222108_1329422295","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2573"},{"text":" val hashingTF = new HashingTF(2000)\n\n //Map the input strings to a tuple of labeled point + input text\n val input_labeled = labeledTweets.map(\n t => (t._1, hashingTF.transform(t._2)))\n .map(x => new LabeledPoint((x._1).toDouble, x._2))\n\n ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684258_988950062","id":"20170315-221527_265576053","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2574"},{"text":"%md\nLet's check out the hashed vectors.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Let’s check out the hashed vectors.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684259_988565313","id":"20170315-225630_2078720791","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2575"},{"text":"input_labeled.take(10).foreach(println)","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684259_988565313","id":"20170315-225826_221402586","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2576"},{"text":"%md\nAs you can see, we've converted each tweet into a vector of integers. This will work great for a machine learning model, but we want to preserve some tweets in a form we can read.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":true,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684259_988565313","id":"20170315-225921_1191286770","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2577"},{"text":"//We're keeping the raw text for inspection later\nvar sample = labeledTweets.take(1000).map(\n t => (t._1, hashingTF.transform(t._2), t._2))\n .map(x => (new LabeledPoint((x._1).toDouble, x._2), x._3))","dateUpdated":"2017-04-27T07:21:24+0000","config":{"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684261_986256820","id":"20170315-225625_1635512137","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2578"},{"text":"%md\n### Split into Training and Validation Sets\n\nWhen training any machine learning model you want to separate your data into a training set and a validation set. The training set is what you actually use to build the model, whereas the validation set is used to evaluate the model's performance afterwards on data that it has never encountered before. This is extremely important, because a model can have very high accuracy when evaluating training data but fail spectacularly when it encounters data it hasn't seen before.\n\nThis situation is called ***overfitting***. A good predictive model will build a generalized representation of your data in a way that reflects real things going on in your problem domain, and this generalization gives it predictive power. A model that overfits will instead try to predict the exact answer for each piece of your input data, and in doing so it will fail to generalize. The way we know a model is overfitting is when it has high accuracy on the training dataset but poor or no accuracy when tested against the validation set. This is why it's important to always test your model against a validation set. \n\n#### Fixing overfitting: \n\nA little overfitting is usually expected and can often be ignored. If you see that your validation accuracy is very low compared to your training accuracy, you can fix this overfitting by either increasing the size of your training data or by decreasing the number of parameters in your model. 
By decreasing the number of parameters you decrease the model's ability to memorize large numbers of patterns. This forces it to build a model of your data in general, which makes it represent your problem domain instead of just memorizing your training data. ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Split into Training and Validation Sets

\n

When training any machine learning model you want to separate your data into a training set and a validation set. The training set is what you actually use to build the model, whereas the validation set is used to evaluate the model’s performance afterwards on data that it has never encountered before. This is extremely important, because a model can have very high accuracy when evaluating training data but fail spectacularly when it encounters data it hasn’t seen before.

\n

This situation is called overfitting. A good predictive model will build a generalized representation of your data in a way that reflects real things going on in your problem domain, and this generalization gives it predictive power. A model that overfits will instead try to predict the exact answer for each piece of your input data, and in doing so it will fail to generalize. The way we know a model is overfitting is when it has high accuracy on the training dataset but poor or no accuracy when tested against the validation set. This is why it’s important to always test your model against a validation set.

\n

Fixing overfitting:

\n

A little overfitting is usually expected and can often be ignored. If you see that your validation accuracy is very low compared to your training accuracy, you can fix this overfitting by either increasing the size of your training data or by decreasing the number of parameters in your model. By decreasing the number of parameters you decrease the model’s ability to memorize large numbers of patterns. This forces it to build a model of your data in general, which makes it represent your problem domain instead of just memorizing your training data.
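As a quick rule-of-thumb check once training and validation accuracy have been computed in the evaluation step below (the 0.10 threshold is our own assumption, not a standard value), something like this sketch can flag a suspicious gap:

// Sketch only: trainAccuracy and validationAccuracy are fractions of correct predictions.
def overfitWarning(trainAccuracy: Double, validationAccuracy: Double): Unit = {
  val gap = trainAccuracy - validationAccuracy
  if (gap > 0.10)
    println(f"Train/validation gap of $gap%.2f - likely overfitting; add data or shrink the model")
  else
    println(f"Train/validation gap of $gap%.2f - looks acceptable")
}

overfitWarning(0.92, 0.68)   // example values only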

\n
"}]},"apps":[],"jobName":"paragraph_1493277684262_987411066","id":"20170315-230331_1819827530","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2579"},{"text":"\n // Split the data into training and validation sets (30% held out for validation testing)\n val splits = input_labeled.randomSplit(Array(0.7, 0.3))\n val (trainingData, validationData) = (splits(0), splits(1))","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684262_987411066","id":"20170315-220757_1010718454","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2580"},{"text":"%md\n### Build the Model\n\nWe're using a Gradient Boosting model. The reason we chose Gradient Boosting for classification over some other model is because it's easy to use (doesn't require tons of parameter tuning), and it tends to have a high classification accuracy. For this reason it is frequently used in machine learning competitions. \n\nThe tuning parameters we're using here are:\n-number of iterations (passes over the data)\n-Max Depth of each decision tree\n\nIn practice when building machine learning models you usually have to test different settings and combinations of tuning parameters until you find one that works best. For this reason it's usually best to first train the model on a subset of data or with a small number of iterations. This lets you quickly experiment with different tuning parameter combinations.\n\nThis step may take a few minutes on a sandbox VM. If you're running on a sandbox and it's taking more than five minutes you may want to stop the process and decrease the number of iterations. ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Build the Model

\n

We’re using a Gradient Boosting model. We chose Gradient Boosting for classification over some other model because it’s easy to use (it doesn’t require extensive parameter tuning) and it tends to have high classification accuracy. For this reason it is frequently used in machine learning competitions.

\n

The tuning parameters we’re using here are:
- number of iterations (passes over the data)
- max depth of each decision tree

\n

In practice when building machine learning models you usually have to test different settings and combinations of tuning parameters until you find one that works best. For this reason it’s usually best to first train the model on a subset of data or with a small number of iterations. This lets you quickly experiment with different tuning parameter combinations.

\n

This step may take a few minutes on a sandbox VM. If you’re running on a sandbox and it’s taking more than five minutes you may want to stop the process and decrease the number of iterations.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684263_987026317","id":"20170315-233003_2105288198","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2581"},{"text":"\n val boostingStrategy = BoostingStrategy.defaultParams(\"Classification\")\n boostingStrategy.setNumIterations(20) //number of passes over our training data\n boostingStrategy.treeStrategy.setNumClasses(2) //We have two output classes: happy and sad\n boostingStrategy.treeStrategy.setMaxDepth(5) \n //Depth of each tree. Higher numbers mean more parameters, which can cause overfitting.\n //Lower numbers create a simpler model, which can be more accurate. \n //In practice you have to tweak this number to find the best value.\n\n val model = GradientBoostedTrees.train(trainingData, boostingStrategy)","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684263_987026317","id":"20170315-232951_1402100499","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2582"},{"text":"%md\n### Evaluate Model\n\nLet's evaluate the model to see how it performed against our training and test set.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Evaluate Model

\n

Let’s evaluate the model to see how it performed against our training and validation sets.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684263_987026317","id":"20170315-234102_792111976","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2583"},{"text":"// Evaluate model on test instances and compute test error\nvar labelAndPredsTrain = trainingData.map { point =>\n val prediction = model.predict(point.features)\n Tuple2(point.label, prediction)\n}\n\nvar labelAndPredsValid = validationData.map { point =>\n val prediction = model.predict(point.features)\n Tuple2(point.label, prediction)\n}\n\n//Since Spark has done the heavy lifting already, lets pull the results back to the driver machine.\n//Calling collect() will bring the results to a single machine (the driver) and will convert it to a Scala array.\n\n//Start with the Training Set\nval results = labelAndPredsTrain.collect()\n\nvar happyTotal = 0\nvar unhappyTotal = 0\nvar happyCorrect = 0\nvar unhappyCorrect = 0\nresults.foreach(\n r => {\n if (r._1 == 1) {\n happyTotal += 1\n } else if (r._1 == 0) {\n unhappyTotal += 1\n }\n if (r._1 == 1 && r._2 ==1) {\n happyCorrect += 1\n } else if (r._1 == 0 && r._2 == 0) {\n unhappyCorrect += 1\n }\n }\n)\nprintln(\"unhappy messages in Training Set: \" + unhappyTotal + \" happy messages: \" + happyTotal)\nprintln(\"happy % correct: \" + happyCorrect.toDouble/happyTotal)\nprintln(\"unhappy % correct: \" + unhappyCorrect.toDouble/unhappyTotal)\n\nval testErr = labelAndPredsTrain.filter(r => r._1 != r._2).count.toDouble / trainingData.count()\nprintln(\"Test Error Training Set: \" + testErr)\n\n\n\n//Compute error for validation Set\nval results = labelAndPredsValid.collect()\n\nvar happyTotal = 0\nvar unhappyTotal = 0\nvar happyCorrect = 0\nvar unhappyCorrect = 0\nresults.foreach(\n r => {\n if (r._1 == 1) {\n happyTotal += 1\n } else if (r._1 == 0) {\n unhappyTotal += 1\n }\n if (r._1 == 1 && r._2 ==1) {\n happyCorrect += 1\n } else if (r._1 == 0 && r._2 == 0) {\n unhappyCorrect += 1\n }\n }\n)\nprintln(\"unhappy messages in Validation Set: \" + unhappyTotal + \" happy messages: \" + happyTotal)\nprintln(\"happy % correct: \" + happyCorrect.toDouble/happyTotal)\nprintln(\"unhappy % correct: \" + unhappyCorrect.toDouble/unhappyTotal)\n\nval testErr = labelAndPredsValid.filter(r => r._1 != r._2).count.toDouble / validationData.count()\nprintln(\"Test Error Validation Set: \" + testErr)\n\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684264_985102573","id":"20170315-233951_1248887406","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2584"},{"text":"%md\n\nThe results show that the model is very good at detecting unhappy messages (90% accuracy), and significantly less adept at identifying happy messages (65% accuracy). To improve this we could provide the model more examples of happy messages to learn from. \n\nAlso note that our training accuracy is slightly higher than our validation accuracy. This is an example of slightly overfitting the training data. Since the training accuracy is only slightly higher than the validation accuracy, this is normal and not something we should concerned about. 
However, if the validation accuracy was significantly worse than the training accuracy it would mean the model had grossly overfit its training data. In that situation, you would want to either increase the amount of data available for training or decrease the number of parameters (the complexity) of the model. \n\nNow let's inspect individual tweets and see how the model interpreted them. This can often provide some insight into what the model is doing right and wrong.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

The results show that the model is very good at detecting unhappy messages (90% accuracy), and significantly less adept at identifying happy messages (65% accuracy). To improve this we could provide the model with more examples of happy messages to learn from.

\n

Also note that our training accuracy is slightly higher than our validation accuracy. This is an example of slightly overfitting the training data. Since the training accuracy is only slightly higher than the validation accuracy, this is normal and not something we should be concerned about. However, if the validation accuracy were significantly worse than the training accuracy, it would mean the model had grossly overfit its training data. In that situation, you would want to either increase the amount of data available for training or decrease the number of parameters (the complexity) of the model.
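For a more detailed breakdown than the raw counts above, MLlib’s MulticlassMetrics can build a confusion matrix and per-class precision/recall from the same label/prediction pairs; a short sketch (not part of the original tutorial) reusing labelAndPredsValid from the evaluation cell:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// MulticlassMetrics expects (prediction, label) pairs, so swap the tuple order
val metrics = new MulticlassMetrics(labelAndPredsValid.map { case (label, pred) => (pred, label) })

println(metrics.confusionMatrix)            // rows = actual class, columns = predicted class
println("precision(happy) = " + metrics.precision(1.0))
println("recall(happy)    = " + metrics.recall(1.0))
println("precision(sad)   = " + metrics.precision(0.0))
println("recall(sad)      = " + metrics.recall(0.0))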

\n

Now let’s inspect individual tweets and see how the model interpreted them. This can often provide some insight into what the model is doing right and wrong.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684264_985102573","id":"20170320-190228_1141669795","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2585"},{"text":"//Print some examples and how they scored\nval predictions = sample.map { point =>\n val prediction = model.predict(point._1.features)\n (point._1.label, prediction, point._2)\n}\n\n//The first entry is the true label. 1 is happy, 0 is unhappy. \n//The second entry is the prediction.\npredictions.take(100).foreach(x => println(\"label: \" + x._1 + \" prediction: \" + x._2 + \" text: \" + x._3.mkString(\" \")))","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684265_984717824","id":"20170315-235355_1294226754","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2586"},{"text":"%md\n\nOnce you've trained your first model, you should go back and tweak the model parameters to see if you can increase model accuracy. In this case, try tweaking the depth of each tree and the number of iterations over the training data. You could also let the model see a greater percentage of happy tweets than unhappy tweets to see if that improves prediction accuracy for happy tweets.\n","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Once you’ve trained your first model, you should go back and tweak the model parameters to see if you can increase model accuracy. In this case, try tweaking the depth of each tree and the number of iterations over the training data. You could also let the model see a greater percentage of happy tweets than unhappy tweets to see if that improves prediction accuracy for happy tweets.
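One way to run that experimentation (a sketch of our own, not the tutorial’s prescribed workflow) is a small grid search over tree depth and iteration count, scoring each candidate on the validation set; keep the grid small, since every combination trains a full model:

// Assumes trainingData and validationData from the earlier split are still in scope.
for (depth <- Seq(3, 5); iterations <- Seq(10, 20)) {
  val strategy = BoostingStrategy.defaultParams("Classification")
  strategy.setNumIterations(iterations)
  strategy.treeStrategy.setNumClasses(2)
  strategy.treeStrategy.setMaxDepth(depth)

  val candidate = GradientBoostedTrees.train(trainingData, strategy)
  val validationError = validationData
    .map(p => if (candidate.predict(p.features) != p.label) 1.0 else 0.0)
    .mean()
  println(s"depth=$depth iterations=$iterations validation error=$validationError")
}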

\n
"}]},"apps":[],"jobName":"paragraph_1493277684265_984717824","id":"20170316-002506_341852523","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2587"},{"text":"%md\n### Exporting the Model\n\nOnce your model is as accurate as you can make it, you can export it for production use. Models trained with Spark can be easily loaded back into a Spark Streaming workflow for use in production.","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

Exporting the Model

\n

Once your model is as accurate as you can make it, you can export it for production use. Models trained with Spark can be easily loaded back into a Spark Streaming workflow for use in production.
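For example, a later job could reload the saved model and score new text, as long as it hashes the input exactly as we did at training time; a minimal sketch, assuming the save path used in the next cell:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Path must match wherever model.save() wrote the model (see the next cell).
val reloaded = GradientBoostedTreesModel.load(sc, "hdfs:///tmp/tweets/RandomForestModel")

// Features must be hashed with the same settings as at training time (2000 slots here).
val tf = new HashingTF(2000)
val score = reloaded.predict(tf.transform("what a wonderful day with friends".split(" ").toSeq))
println(if (score == 1.0) "predicted: happy" else "predicted: sad")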

\n
"}]},"apps":[],"jobName":"paragraph_1493277684266_985872071","id":"20170315-235739_1251499727","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2588"},{"text":"model.save(sc, \"hdfs:///tmp/tweets/RandomForestModel\")","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684266_985872071","id":"20170315-235824_1541028619","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2589"},{"text":"%md\nYou've now seen how to build a sentiment analysis model. The techniques you've seen here can be applied to other text classification models besides sentiment analysis. Try analyzing other keywords besides happy and sad and see what results you get. ","dateUpdated":"2017-04-27T07:21:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n

You’ve now seen how to build a sentiment analysis model. The techniques you’ve seen here can be applied to other text classification models besides sentiment analysis. Try analyzing other keywords besides happy and sad and see what results you get.

\n
"}]},"apps":[],"jobName":"paragraph_1493277684267_985487322","id":"20170315-235646_797080385","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2590"},{"text":"\nprintln(model.predict(hashingTF.transform(\"To this cute little happy sunshine who never fails to bright up my day with his sweet lovely smiles \".split(\" \").toSeq)))","dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala","editOnDblClick":false}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684267_985487322","id":"20170316-000532_1106625181","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2591"},{"dateUpdated":"2017-04-27T07:21:24+0000","config":{"colWidth":12,"editorMode":"ace/mode/scala","results":{},"enabled":true,"editorSetting":{"language":"scala"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1493277684268_983563577","id":"20170320-195111_1459375677","dateCreated":"2017-04-27T07:21:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:2592"}],"name":"Sentiment Analysis Spark","id":"2CG8Q9R8K","angularObjects":{"2CFY9KRTF:shared_process":[],"2C8A4SZ9T_livy2:shared_process":[],"2CF23SSUS:shared_process":[],"2CEE2CWEY:shared_process":[],"2CGMF23WM:shared_process":[],"2CFGJU6ZR:shared_process":[],"2C4U48MY3_spark2:shared_process":[],"2CH3M43DS:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}