{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download this notebook from [https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/notebooks/lab-sentiment_analysis.ipynb](https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/notebooks/lab-sentiment_analysis.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Spark ML: An application to Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Spark ML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In previous versions of Spark, most Machine Learning funcionality was provided through RDD (Resilient Distributed Datasets). However, to improve performance and communicability of results, Spark developers ported the ML functionality to work almost exclusively with DataFrames. Future releases of Spark will not update the support of ML with RDDs.\n",
    "\n",
    "In this modern Spark ML approach, there are _Estimators_ and _Transformers_. Estimators have some parameters that need to be fit into the data. After fitting, Estimators return Transformers. Tranformers can be applied to dataframes, taking one (or several) columns as input and creating (or several) columns as output.\n",
    "\n",
    "A _Pipeline_ combines several _Tranformers_ with a final _Estimator_. The _Pipeline_, therefore, can be fit to the data because the final step of the process (the _Estimator_) is fit to the data. The result of the fitting is a pipelined _Transformer_ that takes an input dataframe through all the stages of the Pipeline.\n",
    "\n",
    "There is a third type of functionality that allows to select features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, for analyzing text, a typical pipelined estimator is as follows:\n",
    "\n",
    "<img src=\"http://spark.apache.org/docs/latest/img/ml-Pipeline.png\" alt=\"ML Pipeline\" style=\"width: 100%;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After fitting, the Pipeline becomes a transformer:\n",
    "\n",
    "<img src=\"http://spark.apache.org/docs/latest/img/ml-PipelineModel.png\" alt=\"ML Model\" style=\"width: 100%;\"/>\n",
    "(Images from http://spark.apache.org/docs/latest/ml-pipeline.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Importantly, transformers can be saved and exchanged with other data scientists, improving reproducibility."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading packages and connecting to Spark cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import findspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "findspark.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pyspark\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "conf = pyspark.SparkConf().\\\n",
    "    setAppName('sentiment-analysis').\\\n",
    "    setMaster('local[*]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SQLContext, HiveContext\n",
    "sc = pyspark.SparkContext(conf=conf)\n",
    "sqlContext = HiveContext(sc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# dataframe functions\n",
    "from pyspark.sql import functions as fn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `DataFrame` is a relatively new addition to Spark that stores a distributed dataset of structured columns. It is very similar to an R dataframe or a RDBS table. All columns are of the same type. A `DataFrame` can be constructed out of a variety of sources, such as a database, CSV files, JSON files, or a Parquet file (columnar storage). The preferred method for storing dataframes is Parquet due to its speed and compression ratio."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Manipulating a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a dataframe from a RDD using the `sqlContext`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create a RDDs\n",
    "documents_rdd = sc.parallelize([\n",
    "        [1, 'cats are cute', 0],\n",
    "        [2, 'dogs are playfull', 0],\n",
    "        [3, 'lions are big', 1],\n",
    "        [4, 'cars are fast', 1]])\n",
    "users_rdd = sc.parallelize([\n",
    "        [0, 'Alice', 20],\n",
    "        [1, 'Bob', 23],\n",
    "        [2, 'Charles', 32]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the previous RDDs, we can call the `toDF` method and specify the name of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "documents_df = documents_rdd.toDF(['doc_id', 'text', 'user_id'])\n",
    "users_df = users_rdd.toDF(['user_id', 'name', 'age'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spark will automatically try to guess the column types. We can take a look at those types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- doc_id: long (nullable = true)\n",
      " |-- text: string (nullable = true)\n",
      " |-- user_id: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "documents_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- user_id: long (nullable = true)\n",
      " |-- name: string (nullable = true)\n",
      " |-- age: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to SQL, we can apply a function to a column or several columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import functions as fn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame[avg(age): double]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# compute the average age of users\n",
    "user_age_df = users_df.select(fn.avg('age'))\n",
    "user_age_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the function is not evaluated until an _action_ (e.g., `take`, `show`, `collect`) is taken"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------+\n",
      "|avg(age)|\n",
      "+--------+\n",
      "|    25.0|\n",
      "+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "user_age_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can cross (e.g., join) two dataframes _ala_ SQL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-----+---+------+-----------------+\n",
      "|user_id| name|age|doc_id|             text|\n",
      "+-------+-----+---+------+-----------------+\n",
      "|      0|Alice| 20|     1|    cats are cute|\n",
      "|      0|Alice| 20|     2|dogs are playfull|\n",
      "|      1|  Bob| 23|     3|    lions are big|\n",
      "|      1|  Bob| 23|     4|    cars are fast|\n",
      "+-------+-----+---+------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.join(documents_df, on='user_id').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also do outer joins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------+---+------+-----------------+\n",
      "|user_id|   name|age|doc_id|             text|\n",
      "+-------+-------+---+------+-----------------+\n",
      "|      0|  Alice| 20|     1|    cats are cute|\n",
      "|      0|  Alice| 20|     2|dogs are playfull|\n",
      "|      1|    Bob| 23|     3|    lions are big|\n",
      "|      1|    Bob| 23|     4|    cars are fast|\n",
      "|      2|Charles| 32|  null|             null|\n",
      "+-------+-------+---+------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.join(documents_df, on='user_id', how='left').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can apply group functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------+-----------+\n",
      "|user_id|   name|count(text)|\n",
      "+-------+-------+-----------+\n",
      "|      0|  Alice|          2|\n",
      "|      1|    Bob|          2|\n",
      "|      2|Charles|          0|\n",
      "+-------+-------+-----------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.join(documents_df, 'user_id', how='left').\\\n",
    "    groupby('user_id', 'name').\\\n",
    "    agg(fn.count('text')).\\\n",
    "    show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can change the name of computed columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------+------+\n",
      "|user_id|   name|n_pets|\n",
      "+-------+-------+------+\n",
      "|      0|  Alice|     2|\n",
      "|      1|    Bob|     2|\n",
      "|      2|Charles|     0|\n",
      "+-------+-------+------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.join(documents_df, 'user_id', how='left').\\\n",
    "    groupby('user_id', 'name').\\\n",
    "    agg(fn.count('text').alias('n_pets')).\\\n",
    "    show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------+---+-----------+\n",
      "|user_id|   name|age|name_length|\n",
      "+-------+-------+---+-----------+\n",
      "|      0|  Alice| 20|          5|\n",
      "|      1|    Bob| 23|          3|\n",
      "|      2|Charles| 32|          7|\n",
      "+-------+-------+---+-----------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "users_df.withColumn('name_length', fn.length('name')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many, many types of functions. E.g., [see here](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transformers and Estimators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several ways of transforming the data from raw input to something that can be analyzed with a statistical model.\n",
    "\n",
    "Some examples of such transformers are displayed below:\n",
    "\n",
    "#### Tokenizer\n",
    "\n",
    "Suppose that we want to split the words or _tokens_ of a document. This is what `Tokenizer` does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import Tokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Almost all transfomers and estimator require you to specificy the input column of the dataframe and the output column that will be added to the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# the tokenizer object\n",
    "tokenizer = Tokenizer().setInputCol('text').setOutputCol('words')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now transform the dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+-----------------+-------+--------------------+\n",
      "|doc_id|             text|user_id|               words|\n",
      "+------+-----------------+-------+--------------------+\n",
      "|     1|    cats are cute|      0|   [cats, are, cute]|\n",
      "|     2|dogs are playfull|      0|[dogs, are, playf...|\n",
      "|     3|    lions are big|      1|   [lions, are, big]|\n",
      "|     4|    cars are fast|      1|   [cars, are, fast]|\n",
      "+------+-----------------+-------+--------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "tokenizer.transform(documents_df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### CountVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This transformer counts how many times a word appears in a list and produces a vector with such counts. This is very useful for text analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import CountVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `CountVectorizer` is different from a `Tokenizer` because it needs to learn how many different tokens there are in the input column. With that number, it will output vectors with consistent dimensions. Therefore, `CountVectorizer` is an `Estimator` that, when fitted, returns a `Transformer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "count_vectorizer_estimator = CountVectorizer().setInputCol('words').setOutputCol('features')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to user the words column that  generated by the `tokenizer` transformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "count_vectorizer_transformer = count_vectorizer_estimator.fit(tokenizer.transform(documents_df))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which results in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+-----------------+-------+---------------------+-------------------------+\n",
      "|doc_id|text             |user_id|words                |features                 |\n",
      "+------+-----------------+-------+---------------------+-------------------------+\n",
      "|1     |cats are cute    |0      |[cats, are, cute]    |(9,[0,3,5],[1.0,1.0,1.0])|\n",
      "|2     |dogs are playfull|0      |[dogs, are, playfull]|(9,[0,2,7],[1.0,1.0,1.0])|\n",
      "|3     |lions are big    |1      |[lions, are, big]    |(9,[0,1,6],[1.0,1.0,1.0])|\n",
      "|4     |cars are fast    |1      |[cars, are, fast]    |(9,[0,4,8],[1.0,1.0,1.0])|\n",
      "+------+-----------------+-------+---------------------+-------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "count_vectorizer_transformer.transform(tokenizer.transform(documents_df)).show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column `features` is a sparse vector representation. For example, for the first document, we have three features present: 0, 3, and 5. By looking at the vocabulary learned by `count_vectorizer_transformer`, we can know which words those feature indices refer to:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['are', 'big', 'playfull', 'cute', 'fast', 'cats', 'lions', 'dogs', 'cars']"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# list of words in the vocabulary\n",
    "count_vectorizer_transformer.vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['are', 'cute', 'cats'], \n",
       "      dtype='<U8')"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.array(count_vectorizer_transformer.vocabulary)[[0, 3, 5]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipelines\n",
    "\n",
    "Sometimes, we have long preprocessing steps that take raw data and transform it through several stages. As explained before, these complex transformations can be captured by Pipelines.\n",
    "\n",
    "Pipelines are always estimators, even when they contain several transformers. After a pipeline is `fit` to the data, the pipeline becomes an transformer.\n",
    "\n",
    "We will now define a pipeline that takes the raw `text` column and produces the `features` column previously explained"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pipeline_cv_estimator = Pipeline(stages=[tokenizer, count_vectorizer_estimator])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pipeline_cv_transformer = pipeline_cv_estimator.fit(documents_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+-----------------+-------+--------------------+--------------------+\n",
      "|doc_id|             text|user_id|               words|            features|\n",
      "+------+-----------------+-------+--------------------+--------------------+\n",
      "|     1|    cats are cute|      0|   [cats, are, cute]|(9,[0,3,5],[1.0,1...|\n",
      "|     2|dogs are playfull|      0|[dogs, are, playf...|(9,[0,2,7],[1.0,1...|\n",
      "|     3|    lions are big|      1|   [lions, are, big]|(9,[0,1,6],[1.0,1...|\n",
      "|     4|    cars are fast|      1|   [cars, are, fast]|(9,[0,4,8],[1.0,1...|\n",
      "+------+-----------------+-------+--------------------+--------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "pipeline_cv_transformer.transform(documents_df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In more complex scenarios, you can even chain Pipeline transformers. We will see this case in the actual use case below.\n",
    "\n",
    "For a more detail explanation of Pipelines, Estimators, and Transformers, [see here](http://spark.apache.org/docs/latest/ml-pipeline.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the review, sentiment, and tweet datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2016-10-25 01:17:09--  https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/imdb_reviews_preprocessed.parquet.zip?raw=true\n",
      "Resolving github.com... 192.30.253.112, 192.30.253.113\n",
      "Connecting to github.com|192.30.253.112|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/imdb_reviews_preprocessed.parquet.zip [following]\n",
      "--2016-10-25 01:17:10--  https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/imdb_reviews_preprocessed.parquet.zip\n",
      "Reusing existing connection to github.com:443.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/imdb_reviews_preprocessed.parquet.zip [following]\n",
      "--2016-10-25 01:17:10--  https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/imdb_reviews_preprocessed.parquet.zip\n",
      "Resolving raw.githubusercontent.com... 151.101.44.133\n",
      "Connecting to raw.githubusercontent.com|151.101.44.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 13717398 (13M) [application/octet-stream]\n",
      "Saving to: ‘imdb_reviews_preprocessed.parquet.zip’\n",
      "\n",
      "imdb_reviews_prepro 100%[===================>]  13.08M  4.10MB/s    in 3.2s    \n",
      "\n",
      "2016-10-25 01:17:14 (4.10 MB/s) - ‘imdb_reviews_preprocessed.parquet.zip’ saved [13717398/13717398]\n",
      "\n",
      "Archive:  imdb_reviews_preprocessed.parquet.zip\n",
      "   creating: imdb_reviews_preprocessed.parquet/\n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00000-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00001-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00002-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00003-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00004-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00005-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00006-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00007-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00008-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/.part-r-00009-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet.crc  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/_common_metadata  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/_metadata  \n",
      " extracting: imdb_reviews_preprocessed.parquet/_SUCCESS  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00000-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00001-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00002-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00003-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00004-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00005-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00006-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00007-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00008-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n",
      "  inflating: imdb_reviews_preprocessed.parquet/part-r-00009-d6d1fcf6-a8d0-4996-aec5-ca0f47be35f2.gz.parquet  \n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/imdb_reviews_preprocessed.parquet.zip?raw=true -O imdb_reviews_preprocessed.parquet.zip && unzip imdb_reviews_preprocessed.parquet.zip && rm imdb_reviews_preprocessed.parquet.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2016-10-25 01:17:14--  https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/sentiments.parquet.zip?raw=true\n",
      "Resolving github.com... 192.30.253.112, 192.30.253.113\n",
      "Connecting to github.com|192.30.253.112|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/sentiments.parquet.zip [following]\n",
      "--2016-10-25 01:17:14--  https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/sentiments.parquet.zip\n",
      "Reusing existing connection to github.com:443.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/sentiments.parquet.zip [following]\n",
      "--2016-10-25 01:17:14--  https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/sentiments.parquet.zip\n",
      "Resolving raw.githubusercontent.com... 151.101.44.133\n",
      "Connecting to raw.githubusercontent.com|151.101.44.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 38387 (37K) [application/zip]\n",
      "Saving to: ‘sentiments.parquet.zip’\n",
      "\n",
      "sentiments.parquet. 100%[===================>]  37.49K  --.-KB/s    in 0.06s   \n",
      "\n",
      "2016-10-25 01:17:14 (658 KB/s) - ‘sentiments.parquet.zip’ saved [38387/38387]\n",
      "\n",
      "Archive:  sentiments.parquet.zip\n",
      "   creating: sentiments.parquet/\n",
      "  inflating: sentiments.parquet/.part-r-00000-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00001-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00002-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00003-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00004-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00005-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00006-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/.part-r-00007-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet.crc  \n",
      "  inflating: sentiments.parquet/_common_metadata  \n",
      "  inflating: sentiments.parquet/_metadata  \n",
      " extracting: sentiments.parquet/_SUCCESS  \n",
      "  inflating: sentiments.parquet/part-r-00000-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00001-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00002-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00003-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00004-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00005-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00006-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n",
      "  inflating: sentiments.parquet/part-r-00007-e719650f-4cd0-4bf6-b325-d485724c78e8.gz.parquet  \n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/sentiments.parquet.zip?raw=true -O sentiments.parquet.zip && unzip sentiments.parquet.zip && rm sentiments.parquet.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2016-10-25 01:17:15--  https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/tweets.parquet.zip?raw=true\n",
      "Resolving github.com... 192.30.253.112, 192.30.253.113\n",
      "Connecting to github.com|192.30.253.112|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/tweets.parquet.zip [following]\n",
      "--2016-10-25 01:17:15--  https://github.com/daniel-acuna/python_data_science_intro/raw/master/data/tweets.parquet.zip\n",
      "Reusing existing connection to github.com:443.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/tweets.parquet.zip [following]\n",
      "--2016-10-25 01:17:15--  https://raw.githubusercontent.com/daniel-acuna/python_data_science_intro/master/data/tweets.parquet.zip\n",
      "Resolving raw.githubusercontent.com... 151.101.44.133\n",
      "Connecting to raw.githubusercontent.com|151.101.44.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 136483 (133K) [application/zip]\n",
      "Saving to: ‘tweets.parquet.zip’\n",
      "\n",
      "tweets.parquet.zip  100%[===================>] 133.28K  --.-KB/s    in 0.1s    \n",
      "\n",
      "2016-10-25 01:17:15 (944 KB/s) - ‘tweets.parquet.zip’ saved [136483/136483]\n",
      "\n",
      "Archive:  tweets.parquet.zip\n",
      "   creating: tweets.parquet/\n",
      "  inflating: tweets.parquet/.part-r-00000-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00001-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00002-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00003-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00004-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00005-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00006-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00007-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00008-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00009-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00010-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00011-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00012-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00013-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00014-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/.part-r-00015-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet.crc  \n",
      "  inflating: tweets.parquet/_common_metadata  \n",
      "  inflating: tweets.parquet/_metadata  \n",
      " extracting: tweets.parquet/_SUCCESS  \n",
      "  inflating: tweets.parquet/part-r-00000-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00001-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00002-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00003-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00004-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00005-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00006-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00007-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00008-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00009-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00010-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00011-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00012-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00013-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00014-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n",
      "  inflating: tweets.parquet/part-r-00015-308756a7-7af0-4b11-9119-6ce7cdecce7e.gz.parquet  \n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/daniel-acuna/python_data_science_intro/blob/master/data/tweets.parquet.zip?raw=true -O tweets.parquet.zip && unzip tweets.parquet.zip && rm tweets.parquet.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load sentiment data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sentiments_df = sqlContext.read.parquet('sentiments.parquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- word: string (nullable = true)\n",
      " |-- sentiment: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sentiments_df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The schema is very simple: for each word, we have whether it is positive (+1) or negative (-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+---------+\n",
      "|     word|sentiment|\n",
      "+---------+---------+\n",
      "|       a+|        1|\n",
      "|   abound|        1|\n",
      "|  abounds|        1|\n",
      "|abundance|        1|\n",
      "| abundant|        1|\n",
      "+---------+---------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# a sample of positive words\n",
    "sentiments_df.where(fn.col('sentiment') == 1).show(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+---------+\n",
      "|      word|sentiment|\n",
      "+----------+---------+\n",
      "|   2-faced|       -1|\n",
      "|   2-faces|       -1|\n",
      "|  abnormal|       -1|\n",
      "|   abolish|       -1|\n",
      "|abominable|       -1|\n",
      "+----------+---------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# a sample of negative words\n",
    "sentiments_df.where(fn.col('sentiment') == -1).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets see how many of each category we have"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------+\n",
      "|sentiment|count(1)|\n",
      "+---------+--------+\n",
      "|       -1|    4783|\n",
      "|        1|    2006|\n",
      "+---------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sentiments_df.groupBy('sentiment').agg(fn.count('*')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have almost two times the number of negative words!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A simple approach to sentiment analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple approach for sentiment analysis is to simple count the number of positive and negative words in a text and then compute the average sentiment. Assuming that positive words are +1 and negative words are -1, we can classify a text as positive if the average sentiment is greater than zero and negative otherwise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To test our approach, we will use a sample of [IMDB](http://www.imdb.com/) reviews that were tagged as positive and negative.\n",
    "\n",
    "Let's load them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "imdb_reviews_df = sqlContext.read.parquet('imdb_reviews_preprocessed.parquet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at a positive review"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(id='pos_10008', review='You know, Robin Williams, God bless him, is constantly shooting himself in the foot lately with all these dumb comedies he has done this decade (with perhaps the exception of \"Death To Smoochy\", which bombed when it came out but is now a cult classic). The dramas he has made lately have been fantastic, especially \"Insomnia\" and \"One Hour Photo\". \"The Night Listener\", despite mediocre reviews and a quick DVD release, is among his best work, period.<br /><br />This is a very chilling story, even though it doesn\\'t include a serial killer or anyone that physically dangerous for that matter. The concept of the film is based on an actual case of fraud that still has yet to be officially confirmed. In high school, I read an autobiography by a child named Anthony Godby Johnson, who suffered horrific abuse and eventually contracted AIDS as a result. I was moved by the story until I read reports online that Johnson may not actually exist. When I saw this movie, the confused feelings that Robin Williams so brilliantly portrayed resurfaced in my mind.<br /><br />Toni Collette probably gives her best dramatic performance too as the ultimately sociopathic \"caretaker\". Her role was a far cry from those she had in movies like \"Little Miss Sunshine\". There were even times she looked into the camera where I thought she was staring right at me. It takes a good actress to play that sort of role, and it\\'s this understated (yet well reviewed) role that makes Toni Collette probably one of the best actresses of this generation not to have even been nominated for an Academy Award (as of 2008). It\\'s incredible that there is at least one woman in this world who is like this, and it\\'s scary too.<br /><br />This is a good, dark film that I highly recommend. Be prepared to be unsettled, though, because this movie leaves you with a strange feeling at the end.', score=1.0)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imdb_reviews_df.where(fn.col('score') == 1).first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And a negative one"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(id='neg_10008', review='The film is bad. There is no other way to say it. The story is weak and outdated, especially for this country. I don\\'t think most people know what a \"walker\" is or will really care. I felt as if I was watching a movie from the 70\\'s. The subject was just not believable for the year 2007, even being set in DC. I think this rang true for everyone else who watched it too as the applause were low and quick at the end. Most didn\\'t stay for the Q&A either.<br /><br />I don\\'t think Schrader really thought the film out ahead of time. Many of the scenes seemed to be cut short as if they were never finished or he just didn\\'t know how to finish them. He jumped from one scene to the next and you had to try and figure out or guess what was going on. I really didn\\'t get Woody\\'s (Carter) private life or boyfriend either. What were all the \"artistic\" male bondage and torture pictures (from Iraq prisons) about? What was he thinking? I think it was his very poor attempt at trying to create this dark private subculture life for Woody\\'s character (Car). It didn\\'t work. It didn\\'t even seem to make sense really.<br /><br />The only good thing about this film was Woody Harrelson. He played his character (Car) flawlessly. You really did get a great sense of what a \"walker\" may have been like (say twenty years ago). He was great and most likely will never get recognized for it. <br /><br />As for Lauren, Lily and Kristin... Boring.<br /><br />Don\\'t see it! It is painful! Unless you are a true Harrelson fan.', score=0.0)"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imdb_reviews_df.where(fn.col('score') == 0).first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The first problem that we encounter is that the reviews are in plain text. We need to split the words and then match them to `sentiment_df`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To do, we will use a transformation that takes raw text and outputs a list of words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import RegexTokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`RegexTokenizer` extracts a sequence of matches from the input text. Regular expressions are a powerful tool to extract strings with certain characteristics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tokenizer = RegexTokenizer().setGaps(False)\\\n",
    "  .setPattern(\"\\\\p{L}+\")\\\n",
    "  .setInputCol(\"review\")\\\n",
    "  .setOutputCol(\"words\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pattern `\\p{L}+` means that it will extract letters without accents (e.g., it will extract \"Acuna\" from \"Acuña\"). `setGaps` means that it will keep applying the rule until it can't extract new words. You have to set the input column from the incoming dataframe (in our case the `review` column) and the new column that will be added (e.g., `words`).\n",
    "\n",
    "We are ready to transform the input dataframe `imdb_reviews_df` with the tokenizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DataFrame[id: string, review: string, score: double, words: array<string>]\n"
     ]
    }
   ],
   "source": [
    "review_words_df = tokenizer.transform(imdb_reviews_df)\n",
    "print(review_words_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying the transformation doesn't actually do anything until you apply an action. But as you can see, a new column `words` of type `array` of `string` was added by the transformation. We can see how it looks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------------------+-----+--------------------+\n",
      "|       id|              review|score|               words|\n",
      "+---------+--------------------+-----+--------------------+\n",
      "|pos_10008|You know, Robin W...|  1.0|[you, know, robin...|\n",
      "|pos_10015|Popular radio sto...|  1.0|[popular, radio, ...|\n",
      "|pos_10024|There's so many t...|  1.0|[there, s, so, ma...|\n",
      "|pos_10026|Without Kirsten M...|  1.0|[without, kirsten...|\n",
      "|pos_10035|I think James Cam...|  1.0|[i, think, james,...|\n",
      "+---------+--------------------+-----+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "review_words_df.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we want to match every word from `sentiment_df` in the array `words` shown before. One way of doing this is to _explode_ the column `words` to create a row for each element in that list. Then, we would join that result with the dataframe `sentiment` to continue further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------+\n",
      "|       id|    word|\n",
      "+---------+--------+\n",
      "|pos_10008|     you|\n",
      "|pos_10008|    know|\n",
      "|pos_10008|   robin|\n",
      "|pos_10008|williams|\n",
      "|pos_10008|     god|\n",
      "+---------+--------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "review_words_df.select('id', fn.explode('words').alias('word')).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now if we join that with sentiment, we can see if there are positive and negative words in each review:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+---------+---------+\n",
      "|     word|       id|sentiment|\n",
      "+---------+---------+---------+\n",
      "|    bless|pos_10008|        1|\n",
      "|     dumb|pos_10008|       -1|\n",
      "|    death|pos_10008|       -1|\n",
      "|  classic|pos_10008|        1|\n",
      "|fantastic|pos_10008|        1|\n",
      "+---------+---------+---------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "review_word_sentiment_df = review_words_df.\\\n",
    "    select('id', fn.explode('words').alias('word')).\\\n",
    "    join(sentiments_df, 'word')\n",
    "review_word_sentiment_df.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can simply average the sentiment per review id and, say, pick positive when the average is above 0, and negative otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+------------------+---------+\n",
      "|       id|     avg_sentiment|predicted|\n",
      "+---------+------------------+---------+\n",
      "|pos_10035|              0.44|      1.0|\n",
      "|pos_10486|0.3076923076923077|      1.0|\n",
      "| pos_2706|0.2857142857142857|      1.0|\n",
      "| pos_3435|0.1111111111111111|      1.0|\n",
      "| pos_3930|0.5294117647058824|      1.0|\n",
      "+---------+------------------+---------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "simple_sentiment_prediction_df = review_word_sentiment_df.\\\n",
    "    groupBy('id').\\\n",
    "    agg(fn.avg('sentiment').alias('avg_sentiment')).\\\n",
    "    withColumn('predicted', fn.when(fn.col('avg_sentiment') > 0, 1.0).otherwise(0.))\n",
    "simple_sentiment_prediction_df.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, lets compute the accuracy of our prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+\n",
      "|     avg(correct)|\n",
      "+-----------------+\n",
      "|0.732231471106131|\n",
      "+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "imdb_reviews_df.\\\n",
    "    join(simple_sentiment_prediction_df, 'id').\\\n",
    "    select(fn.expr('float(score = predicted)').alias('correct')).\\\n",
    "    select(fn.avg('correct')).\\\n",
    "    show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not bad with such a simple approach! But can we do better than this?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A data-driven sentiment prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are couple of problems with the previous approach:\n",
    "1. Positive and negative words had the same weight (e.g., good == amazing)\n",
    "1. Maybe a couple of negative words make the entire review negative, whereas positive words do not\n",
    "1. While our dataset is artificially balanced (12500 positive and 12500 negative), there are usually more positive than negative reviews, and therefore we should bias our predictions towards positive ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could use __data__ to estimate the sentiment that each word is contributing to the final sentiment of a review. Given that we are trying to predict negative and positve reviews, then we can use logistic regression for such binary prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### From text to numerical features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One typical approach is to count how many times a word appears in the text and then perform a reweighting so that words that are very common are \"counted\" less."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Spark, we can achieve this by using several transformers:\n",
    "\n",
    "__Raw text => Tokens => Remove stop words => Term Frequency => Reweighting by Inverse Document frequency__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform this sequence we will create a __`Pipeline`__ to consistently represent the steps from raw text to TF-IDF."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to create a sequence to take from raw text to term frequency. This is necessary because we don't know the number of tokens in the text and therefore we need to _estimate_ such quantity from the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a',\n",
       " 'about',\n",
       " 'above',\n",
       " 'across',\n",
       " 'after',\n",
       " 'afterwards',\n",
       " 'again',\n",
       " 'against',\n",
       " 'all',\n",
       " 'almost']"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we obtain the stop words from a website\n",
    "import requests\n",
    "stop_words = requests.get('http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words').text.split()\n",
    "stop_words[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import StopWordsRemover\n",
    "sw_filter = StopWordsRemover()\\\n",
    "  .setStopWords(stop_words)\\\n",
    "  .setCaseSensitive(False)\\\n",
    "  .setInputCol(\"words\")\\\n",
    "  .setOutputCol(\"filtered\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, for this initial `Pipeline`, we define a counter vectorizer estimator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import CountVectorizer\n",
    "\n",
    "# we will remove words that appear in 5 docs or less\n",
    "cv = CountVectorizer(minTF=1., minDF=5., vocabSize=2**17)\\\n",
    "  .setInputCol(\"filtered\")\\\n",
    "  .setOutputCol(\"tf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# we now create a pipelined transformer\n",
    "cv_pipeline = Pipeline(stages=[tokenizer, sw_filter, cv]).fit(imdb_reviews_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+\n",
      "|       id|              review|score|               words|            filtered|                  tf|\n",
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+\n",
      "|pos_10008|You know, Robin W...|  1.0|[you, know, robin...|[know, robin, wil...|(26677,[0,1,2,3,4...|\n",
      "|pos_10015|Popular radio sto...|  1.0|[popular, radio, ...|[popular, radio, ...|(26677,[0,1,2,3,4...|\n",
      "|pos_10024|There's so many t...|  1.0|[there, s, so, ma...|[s, things, fall,...|(26677,[0,1,2,4,5...|\n",
      "|pos_10026|Without Kirsten M...|  1.0|[without, kirsten...|[kirsten, miller,...|(26677,[1,3,4,23,...|\n",
      "|pos_10035|I think James Cam...|  1.0|[i, think, james,...|[think, james, ca...|(26677,[0,2,3,6,7...|\n",
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# now we can make the transformation between the raw text and the counts\n",
    "cv_pipeline.transform(imdb_reviews_df).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The term frequency vector is represented with a sparse vector. We have 26,384 terms."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we build another pipeline that takes the output of the previous pipeline and _lowers_ the terms of documents that are very common."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import IDF\n",
    "idf = IDF().\\\n",
    "    setInputCol('tf').\\\n",
    "    setOutputCol('tfidf')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "idf_pipeline = Pipeline(stages=[cv_pipeline, idf]).fit(imdb_reviews_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+\n",
      "|       id|              review|score|               words|            filtered|                  tf|               tfidf|\n",
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+\n",
      "|pos_10008|You know, Robin W...|  1.0|[you, know, robin...|[know, robin, wil...|(26677,[0,1,2,3,4...|(26677,[0,1,2,3,4...|\n",
      "|pos_10015|Popular radio sto...|  1.0|[popular, radio, ...|[popular, radio, ...|(26677,[0,1,2,3,4...|(26677,[0,1,2,3,4...|\n",
      "|pos_10024|There's so many t...|  1.0|[there, s, so, ma...|[s, things, fall,...|(26677,[0,1,2,4,5...|(26677,[0,1,2,4,5...|\n",
      "|pos_10026|Without Kirsten M...|  1.0|[without, kirsten...|[kirsten, miller,...|(26677,[1,3,4,23,...|(26677,[1,3,4,23,...|\n",
      "|pos_10035|I think James Cam...|  1.0|[i, think, james,...|[think, james, ca...|(26677,[0,2,3,6,7...|(26677,[0,2,3,6,7...|\n",
      "+---------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "idf_pipeline.transform(imdb_reviews_df).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore, the `idf_pipeline` takes the raw text from the datafarme `imdb_reviews_df` and creates a feature vector vector called `tfidf`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tfidf_df = idf_pipeline.transform(imdb_reviews_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data science pipeline for estimating sentiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's split the data into training, validation, and testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "training_df, validation_df, testing_df = imdb_reviews_df.randomSplit([0.6, 0.3, 0.1], seed=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[15085, 7347, 2568]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[training_df.count(), validation_df.count(), testing_df.count()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One immediately apparent problem is that the number of features in the dataset is far larger than the number of training examples. This can lead to serious overfitting.\n",
    "\n",
    "Let's look at this more closely. Let's apply a simple prediction model known as logistic regression.\n",
    "\n",
    "[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) will take the `tfidf` features and predict whether the review is positive (`score == 1`) or negative (`score == 0`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.classification import LogisticRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lr = LogisticRegression().\\\n",
    "    setLabelCol('score').\\\n",
    "    setFeaturesCol('tfidf').\\\n",
    "    setRegParam(0.0).\\\n",
    "    setMaxIter(100).\\\n",
    "    setElasticNetParam(0.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets create a pipeline transformation by chaining the `idf_pipeline` with the logistic regression step (`lr`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lr_pipeline = Pipeline(stages=[idf_pipeline, lr]).fit(training_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets estimate the accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/danielacuna/Downloads/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/classification.py:207: UserWarning: weights is deprecated. Use coefficients instead.\n",
      "  warnings.warn(\"weights is deprecated. Use coefficients instead.\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------+\n",
      "|      avg(correct)|\n",
      "+------------------+\n",
      "|0.8395263372805226|\n",
      "+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "lr_pipeline.transform(validation_df).\\\n",
    "    select(fn.expr('float(prediction = score)').alias('correct')).\\\n",
    "    select(fn.avg('correct')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The performance is much better than before.\n",
    "\n",
    "The problem however is that we are overfitting because we have many features compared to the training examples:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, if we look at the weights of the features, there is a lot of noise:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "vocabulary = idf_pipeline.stages[0].stages[-1].vocabulary\n",
    "weights = lr_pipeline.stages[-1].coefficients.toArray()\n",
    "coeffs_df = pd.DataFrame({'word': vocabulary, 'weight': weights})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most negative words are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>26363</th>\n",
       "      <td>-7.376168</td>\n",
       "      <td>octane</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15790</th>\n",
       "      <td>-6.044970</td>\n",
       "      <td>waster</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24190</th>\n",
       "      <td>-5.310334</td>\n",
       "      <td>civility</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22733</th>\n",
       "      <td>-5.182578</td>\n",
       "      <td>necessities</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26432</th>\n",
       "      <td>-4.881442</td>\n",
       "      <td>collete</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         weight         word\n",
       "26363 -7.376168       octane\n",
       "15790 -6.044970       waster\n",
       "24190 -5.310334     civility\n",
       "22733 -5.182578  necessities\n",
       "26432 -4.881442      collete"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coeffs_df.sort_values('weight').head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the most positive:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>25089</th>\n",
       "      <td>7.635574</td>\n",
       "      <td>appreciable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25693</th>\n",
       "      <td>7.076770</td>\n",
       "      <td>enlarged</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21021</th>\n",
       "      <td>6.458757</td>\n",
       "      <td>screenwriting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22134</th>\n",
       "      <td>5.394323</td>\n",
       "      <td>sandwiched</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11673</th>\n",
       "      <td>4.549058</td>\n",
       "      <td>ringwald</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         weight           word\n",
       "25089  7.635574    appreciable\n",
       "25693  7.076770       enlarged\n",
       "21021  6.458757  screenwriting\n",
       "22134  5.394323     sandwiched\n",
       "11673  4.549058       ringwald"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coeffs_df.sort_values('weight', ascending=False).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But none of them make sense. What is happening? We are overfitting the data. Those words that don't make sense are capturing just noise in the reviews.\n",
    "\n",
    "For example, the word `helming` appears in only one review:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(id='neg_899', word='helming', review='On the 1998 summer blockbuster hit BASEketball (1998): \"This is one of those movies that is usually seen on the big jumbo-tron screen in a sports bar during the day - when everyone is quite drunk. Unfortunately, I was sober when I saw this movie.\"<br /><br />So quoted the late Gene Siskel for this lame-brained, supposed yukfest that came out two weeks after the far superior \"There\\'s Something About Mary\" in a one-upmanship game during July of 1998. \"Mary\" was a gross-out fest, but in addition to the many gags, it had a lot of heart, which is why it was the highest grossing comedy of that memorable summer.<br /><br />\"BASEketball\" tried to outdo Mary, but it fizzled in more ways that one. You take the creators of \"South Park,\" Trey Parker and Matt Stone, who are fortunately not behind the movie but in front of the camera, the only member of ZAZ David Zucker helming the picture in desperate need of a paycheck, and the other two Jim Abrahams and Jerry Zucker clearly stayed out or probably warned him against the picture, a small bit by now 90 years young Ernest Borgnine, wasting his precious time in his distinguished career, dying on a hotdog and singing \"I\\'m Too Sexy\" as he videotapes his will, Jenny McCarthy, who has little screen time as Borgnine\\'s not-too-weeping trophy widow young enough to be his granddaughter, a bigger female part by Yasmine Bleeth as a dedicated social worker whose charges are underprivileged youngsters, and the only interesting and meaningful player in this turkey, Robert Vaughn as a corrupt archrival, and pointless cameos by \"Airplane!\" alumni Kareem Abdul Jabaar and the late Robert Stack who seemed nostalgic for the 1980 masterpiece and it\\'s much fresher humor created by the ZAZ family. What do all these people make up? A desperate cast and crew trying to replicate \"Airplane!\" humor and mixing it up with the crudity of \"South Park,\" but failing in every way.<br /><br />To make this long 100-minute movie short, \"BASEketball,\" a real game invented by David Zucker and his friends in his hometown of Milwaukee, is about two lazy losers (Parker and Stone) and their pint-sized mutual friend who invent baseball and basketball (hence the title) together on the driveway of one\\'s house. After Borgnine dies, he bequeaths the ownership of his BASEketball team, the Milwaukee Beers to Parker and Stone. Sure enough, the game goes national, and archrivals Vaughn and McCarthy want to take away ownership of the Beers team from them. But Bleeth is in love with both men, particularly Parker, and one poor, sick charge in need of a liver transplant goes ga-ga over them. Those are the characters, not strongly developed.<br /><br />Now witless gags ensue. Blood, electroshock hair, egg-throwing and screaming are among them. Parker and Stone nearly kill the youngster in the hospital, but he pulls through the liver transplant. Borgnine sings and rubs ointment on his chest in the videotaped will. McCarthy, who seemed to get over Borgnine\\'s death by choking on a frank right away, quickly massages Vaughn in the next scene. Cheerleaders dance in skimpy outfits. There is plenty of music on the soundtrack that is played for the hard of hearing. And David Zucker forces the parodies of \"Riverdance\" and \"Titanic.\" Parody forcing is nothing new to ZAZ, post \"Airplane!\" and \"The Naked Gun\" series.<br /><br />And like Siskel, I was sober as well, but I was also getting sleepy. This movie should be played over and over to coarse-mannered barroom patrons who enjoy it as they chug down beers, but will they remain alert and awake, or pass out during the unfunny parts? If they pass out, then they won\\'t realize that they are luckily missing stupidity and absurdity. Hats off to them!', score=0.0)"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idf_pipeline.transform(training_df).\\\n",
    "    select('id', fn.explode('words').alias('word')).\\\n",
    "    where(fn.col('word') == 'helming').\\\n",
    "    join(training_df, 'id').\\\n",
    "    first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to prevent overfitting during training is to modify the loss function and penalize weight values that are too large.\n",
    "\n",
    "There are two major regularization techniques, one based on penalizing the squared value of the weight (called L2 or ridge regularization) and anotherbased on penalizing the absolute value of the weight (called L1 or lasso regularization).\n",
    "\n",
    "The unregularized logistic regression loss function is:\n",
    "\n",
    "\\begin{equation}\n",
    "L_\\theta(p(X),Y) = - \\left( \\sum_i Y_i p_\\theta(X_i) + (1-Y_i)(1-p_\\theta(X_i)) \\right)\n",
    "\\end{equation}\n",
    "\n",
    "where $p_\\theta(\\cdot)$ is the sigmoid function:\n",
    "\n",
    "\\begin{equation}\n",
    "p_\\theta(X) = \\frac{1}{1+\\exp(-(\\theta_0 + \\sum_{j>0} x_j \\theta_j))}\n",
    "\\end{equation}\n",
    "\n",
    "If we modify the loss function $L_\\theta$ slightly\n",
    "\n",
    "\\begin{equation}\n",
    "L_\\theta^{\\lambda}(p(X),Y) = -\\left( \\sum_i Y_i p_\\theta(X_i) + (1-Y_i)(1-p_\\theta(X_i)) \\right) + \\lambda \\sum_{j>0} \\theta_j^2\n",
    "\\end{equation}\n",
    "\n",
    "we obtain what is known as L2 regularization.\n",
    "\n",
    "Notice how we increase the loss function by $\\lambda$ times the square of the weights. In practice, this means that __we will think twice about increasing the importance of a feature__. This loss function will prevent the algorithm for fitting certain data points, such as outliers or noise, unless the decrease in loss for the data grants it. Also, notice that the penalization doesn't apply to the bias parameter $\\theta_0$.\n",
    "\n",
    "You can see more clearly the effect of such cost function when $\\lambda$ goes to infinity: the features will not be used for predicting and only the bias term will matter! This prevents the algorithm from learning altogether, forcing it to underfit!\n",
    "\n",
    "One problem with L2 regularization is that all weights go to zero uniformly. In a sense, all features will matter but less than with the unregularized loss function. This is a really strange because we do not want all features to matter. In sentiment analysis, we want to select certain features because we want to understand that only some words have effects on the sentiment.\n",
    "\n",
    "A different modification of the original loss function can achieve this. This regularization is known as L1 or lasso reguarlization and penalizes the _absolute_ value of the weight\n",
    "\n",
    "\\begin{equation}\n",
    "L_\\theta^{\\lambda}(p(X),Y) = -\\left( \\sum_i Y_i p_\\theta(X_i) + (1-Y_i)(1-p_\\theta(X_i)) \\right) + \\lambda \\sum_{j>0} \\left| \\theta_j \\right|\n",
    "\\end{equation}\n",
    "\n",
    "The practical effect of L1 regularization is that the difference between a feature having no importance vs some small importance is massively bigger than with L2 regularization. __Therefore, optimizing the L1 loss function usually brings some features to have exactly zero weight.__\n",
    "\n",
    "One problem with L1 regularization is that it will never select more features that the number of examples. This is because it can always fit the training data perfectly when the number of features equals the number of examples. In our sentimental analysis, this is the case (there are more words than examples).\n",
    "\n",
    "One way of remedying this is to have a combination of both L1 and L2. This is known as __elastic net regularization__. For this type of regularization, we have to pick a parameter ($\\alpha$) deciding to consider L1 vs L2 regularization. If $\\alpha=0$, then we choose L2, and if $\\alpha=1$ we choose L1. For example, $\\alpha=0.5$ means half L1 and half L2.\n",
    "\n",
    "\\begin{equation}\n",
    "L_\\theta^{\\lambda,\\alpha}(p(X),Y) = -\\left( \\sum_i Y_i p_\\theta(X_i) + (1-Y_i)(1-p_\\theta(X_i)) \\right) + \\lambda \\left[(1-\\alpha) \\sum_{j>0} \\theta_j^2 + \\alpha \\sum_{j>0} \\left| \\theta_j \\right| \\right]\n",
    "\\end{equation}\n",
    "\n",
    "Unfortunately, elastic net regularization comes with two additional parameters, $\\lambda$ and $\\alpha$, and we must either select them a priori or used the validation set to choose the best one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spark allows to fit elatic net regularization easily"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lambda_par = 0.02\n",
    "alpha_par = 0.3\n",
    "en_lr = LogisticRegression().\\\n",
    "        setLabelCol('score').\\\n",
    "        setFeaturesCol('tfidf').\\\n",
    "        setRegParam(lambda_par).\\\n",
    "        setMaxIter(100).\\\n",
    "        setElasticNetParam(alpha_par)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we define a new Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "en_lr_pipeline = Pipeline(stages=[idf_pipeline, en_lr]).fit(training_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/danielacuna/Downloads/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/classification.py:207: UserWarning: weights is deprecated. Use coefficients instead.\n",
      "  warnings.warn(\"weights is deprecated. Use coefficients instead.\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------------------------+\n",
      "|avg('float((prediction = score)))|\n",
      "+---------------------------------+\n",
      "|               0.8663400027221996|\n",
      "+---------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "en_lr_pipeline.transform(validation_df).select(fn.avg(fn.expr('float(prediction = score)'))).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We improve performance slightly, but whats more important is that we improve the understanding of the word sentiments. Lets take at the weights:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "en_weights = en_lr_pipeline.stages[-1].coefficients.toArray()\n",
    "en_coeffs_df = pd.DataFrame({'word': vocabulary, 'weight': en_weights})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most negative words all make sense (\"worst\" is _actually_ more negative than than \"worse\")!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>103</th>\n",
       "      <td>-0.376574</td>\n",
       "      <td>worst</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258</th>\n",
       "      <td>-0.355957</td>\n",
       "      <td>waste</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>-0.250211</td>\n",
       "      <td>bad</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>201</th>\n",
       "      <td>-0.238551</td>\n",
       "      <td>awful</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1135</th>\n",
       "      <td>-0.208919</td>\n",
       "      <td>disappointment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>190</th>\n",
       "      <td>-0.179398</td>\n",
       "      <td>boring</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>899</th>\n",
       "      <td>-0.174077</td>\n",
       "      <td>pointless</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>754</th>\n",
       "      <td>-0.173512</td>\n",
       "      <td>fails</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>254</th>\n",
       "      <td>-0.172652</td>\n",
       "      <td>worse</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1075</th>\n",
       "      <td>-0.171306</td>\n",
       "      <td>disappointing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>626</th>\n",
       "      <td>-0.170642</td>\n",
       "      <td>poorly</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>569</th>\n",
       "      <td>-0.170556</td>\n",
       "      <td>avoid</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>701</th>\n",
       "      <td>-0.170404</td>\n",
       "      <td>mess</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>531</th>\n",
       "      <td>-0.170075</td>\n",
       "      <td>dull</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>174</th>\n",
       "      <td>-0.168825</td>\n",
       "      <td>poor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        weight            word\n",
       "103  -0.376574           worst\n",
       "258  -0.355957           waste\n",
       "11   -0.250211             bad\n",
       "201  -0.238551           awful\n",
       "1135 -0.208919  disappointment\n",
       "190  -0.179398          boring\n",
       "899  -0.174077       pointless\n",
       "754  -0.173512           fails\n",
       "254  -0.172652           worse\n",
       "1075 -0.171306   disappointing\n",
       "626  -0.170642          poorly\n",
       "569  -0.170556           avoid\n",
       "701  -0.170404            mess\n",
       "531  -0.170075            dull\n",
       "174  -0.168825            poor"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en_coeffs_df.sort_values('weight').head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Same thing with positive words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0.281439</td>\n",
       "      <td>great</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>160</th>\n",
       "      <td>0.262072</td>\n",
       "      <td>excellent</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>0.197434</td>\n",
       "      <td>best</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2138</th>\n",
       "      <td>0.184789</td>\n",
       "      <td>refreshing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>320</th>\n",
       "      <td>0.169784</td>\n",
       "      <td>favorite</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>216</th>\n",
       "      <td>0.160008</td>\n",
       "      <td>wonderful</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>660</th>\n",
       "      <td>0.158946</td>\n",
       "      <td>superb</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1405</th>\n",
       "      <td>0.157252</td>\n",
       "      <td>wonderfully</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>270</th>\n",
       "      <td>0.138152</td>\n",
       "      <td>loved</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>317</th>\n",
       "      <td>0.133061</td>\n",
       "      <td>enjoyed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1270</th>\n",
       "      <td>0.132639</td>\n",
       "      <td>funniest</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>0.126882</td>\n",
       "      <td>perfect</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>291</th>\n",
       "      <td>0.123461</td>\n",
       "      <td>amazing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>518</th>\n",
       "      <td>0.120457</td>\n",
       "      <td>enjoyable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12307</th>\n",
       "      <td>0.116074</td>\n",
       "      <td>indy</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         weight         word\n",
       "13     0.281439        great\n",
       "160    0.262072    excellent\n",
       "29     0.197434         best\n",
       "2138   0.184789   refreshing\n",
       "320    0.169784     favorite\n",
       "216    0.160008    wonderful\n",
       "660    0.158946       superb\n",
       "1405   0.157252  wonderfully\n",
       "270    0.138152        loved\n",
       "317    0.133061      enjoyed\n",
       "1270   0.132639     funniest\n",
       "227    0.126882      perfect\n",
       "291    0.123461      amazing\n",
       "518    0.120457    enjoyable\n",
       "12307  0.116074         indy"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en_coeffs_df.sort_values('weight', ascending=False).head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are there words with _literarily_ zero importance for predicting sentiment? Yes, and most of them!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25554, 2)"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en_coeffs_df.query('weight == 0.0').shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact, more than 95% of features are not needed to achieve a __better__ performance than all previous models!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9579038122727443"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en_coeffs_df.query('weight == 0.0').shape[0]/en_coeffs_df.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at these _neutral_ words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>br</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.0</td>\n",
       "      <td>like</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.0</td>\n",
       "      <td>story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.0</td>\n",
       "      <td>really</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0.0</td>\n",
       "      <td>people</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>0.0</td>\n",
       "      <td>don</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0.0</td>\n",
       "      <td>way</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>0.0</td>\n",
       "      <td>movies</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>0.0</td>\n",
       "      <td>think</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>0.0</td>\n",
       "      <td>characters</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>0.0</td>\n",
       "      <td>character</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>0.0</td>\n",
       "      <td>watch</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>0.0</td>\n",
       "      <td>films</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>0.0</td>\n",
       "      <td>little</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    weight        word\n",
       "0      0.0          br\n",
       "1      0.0           s\n",
       "5      0.0        like\n",
       "9      0.0       story\n",
       "10     0.0      really\n",
       "12     0.0      people\n",
       "14     0.0         don\n",
       "15     0.0         way\n",
       "17     0.0      movies\n",
       "18     0.0       think\n",
       "19     0.0  characters\n",
       "20     0.0   character\n",
       "21     0.0       watch\n",
       "22     0.0       films\n",
       "28     0.0      little"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en_coeffs_df.query('weight == 0.0').head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But, did we choose the right $\\lambda$ and $\\alpha$ parameters? We should run an experiment where we try different combinations of them. Fortunately, Spark let us do this by using a grid - a method that generates combination of parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.tuning import ParamGridBuilder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to build a new estimator pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "en_lr_estimator = Pipeline(stages=[idf_pipeline, en_lr])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grid = ParamGridBuilder().\\\n",
    "    addGrid(en_lr.regParam, [0., 0.01, 0.02]).\\\n",
    "    addGrid(en_lr.elasticNetParam, [0., 0.2, 0.4]).\\\n",
    "    build()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the list of parameters that we will try:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.0},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.01},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.02},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.2,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.0},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.2,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.01},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.2,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.02},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.4,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.0},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.4,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.01},\n",
       " {Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.4,\n",
       "  Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.02}]"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting model 1\n",
      "Fitting model 2\n",
      "Fitting model 3\n",
      "Fitting model 4\n",
      "Fitting model 5\n",
      "Fitting model 6\n",
      "Fitting model 7\n",
      "Fitting model 8\n",
      "Fitting model 9\n"
     ]
    }
   ],
   "source": [
    "all_models = []\n",
    "for j in range(len(grid)):\n",
    "    print(\"Fitting model {}\".format(j+1))\n",
    "    model = en_lr_estimator.fit(training_df, grid[j])\n",
    "    all_models.append(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/danielacuna/Downloads/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/classification.py:207: UserWarning: weights is deprecated. Use coefficients instead.\n",
      "  warnings.warn(\"weights is deprecated. Use coefficients instead.\")\n"
     ]
    }
   ],
   "source": [
    "# estimate the accuracy of each of them:\n",
    "accuracies = [m.\\\n",
    "    transform(validation_df).\\\n",
    "    select(fn.avg(fn.expr('float(score = prediction)')).alias('accuracy')).\\\n",
    "    first().\\\n",
    "    accuracy for m in all_models]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "best_model_idx = np.argmax(accuracies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the best model we found has the following parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.2,\n",
       " Param(parent='LogisticRegression_4300b54e7e5f5111e7e0', name='regParam', doc='regularization parameter (>= 0).'): 0.01}"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid[best_model_idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "best_model = all_models[best_model_idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8721927317272357"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracies[best_model_idx]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finally, predicting tweet sentiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use this model to predict sentiments on Twitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+\n",
      "|text                                                                                                                                       |handle          |\n",
      "+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+\n",
      "|Peter Navarro: 'Trump the Bull vs. Clinton the Bear' #DrainTheSwamp \n",
      "https://t.co/mQRkfMG80j                                               |@realDonaldTrump|\n",
      "|'Democratic operative caught on camera: Hillary PERSONALLY ordered 'Donald Duck' troll campaign that broke the law'\n",
      "https://t.co/sTreHAfYUH|@realDonaldTrump|\n",
      "|Join me tomorrow in Sanford or Tallahassee, Florida!\n",
      "\n",
      "Sanford at 3pm:\n",
      "https://t.co/PZENw9Kheg\n",
      "\n",
      "Tallahassee at 6pm:\n",
      "https://t.co/WKI69e1bqD |@realDonaldTrump|\n",
      "|THANK YOU St. Augustine, Florida! Get out and VOTE! Join the MOVEMENT - and lets #DrainTheSwamp! Off to Tampa now!… https://t.co/zgwqhy2jBX|@realDonaldTrump|\n",
      "|Join me LIVE on my Facebook page in St. Augustine, Florida! Lets #DrainTheSwamp &amp; MAKE AMERICA GREAT AGAIN!… https://t.co/mPzVrcaR9L   |@realDonaldTrump|\n",
      "+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "tweets_df = sqlContext.read.parquet('tweets.parquet')\n",
    "tweets_df.show(5, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 1K tweets from each candidate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------------+--------+\n",
      "|          handle|count(1)|\n",
      "+----------------+--------+\n",
      "|@realDonaldTrump|    1000|\n",
      "| @HillaryClinton|    1000|\n",
      "+----------------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "tweets_df.groupby('handle').agg(fn.count('*')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now predict the sentiment of the Tweet using our best model, we need to rename the column so that it matches our previous pipeline (`review` => ...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+----------+\n",
      "|              review|prediction|\n",
      "+--------------------+----------+\n",
      "|Peter Navarro: 'T...|       1.0|\n",
      "|'Democratic opera...|       1.0|\n",
      "|Join me tomorrow ...|       1.0|\n",
      "|THANK YOU St. Aug...|       1.0|\n",
      "|Join me LIVE on m...|       1.0|\n",
      "|Honored to receiv...|       1.0|\n",
      "|'Hillary Clinton ...|       1.0|\n",
      "|Leaving West Palm...|       0.0|\n",
      "|'The Clinton Foun...|       0.0|\n",
      "|Departing Farmers...|       1.0|\n",
      "|Get out to VOTE o...|       1.0|\n",
      "|We are winning an...|       1.0|\n",
      "|Why has nobody as...|       0.0|\n",
      "|Major story that ...|       0.0|\n",
      "|Wow, just came ou...|       1.0|\n",
      "|'Clinton Ally Aid...|       1.0|\n",
      "|'Clinton Charity ...|       1.0|\n",
      "|Thank you Naples,...|       0.0|\n",
      "|The attack on Mos...|       0.0|\n",
      "|#CrookedHillary #...|       1.0|\n",
      "+--------------------+----------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "best_model.transform(tweets_df.withColumnRenamed('text', 'review')).select('review', 'prediction').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, lets summarize our results in a graph!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import seaborn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sentiment_pd = best_model.\\\n",
    "    transform(tweets_df.withColumnRenamed('text', 'review')).\\\n",
    "    groupby('handle').\\\n",
    "    agg(fn.avg('prediction').alias('prediction'), \n",
    "        (2*fn.stddev('prediction')/fn.sqrt(fn.count('*'))).alias('err')).\\\n",
    "    toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>handle</th>\n",
       "      <th>prediction</th>\n",
       "      <th>err</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@realDonaldTrump</td>\n",
       "      <td>0.771</td>\n",
       "      <td>0.026588</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>@HillaryClinton</td>\n",
       "      <td>0.740</td>\n",
       "      <td>0.027756</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             handle  prediction       err\n",
       "0  @realDonaldTrump       0.771  0.026588\n",
       "1   @HillaryClinton       0.740  0.027756"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentiment_pd.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAkEAAAFRCAYAAABzOnmrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X1UlHX+//HXMIOKQt4QtaxiKuFN7tFcW7fUvpab3WyU\n5i3mDppb1q6aSnpSS0szlawljR9bti4u1je0LNfglLqx5WKuuaVbuaSkZIopIsh9DDfX749O8802\nYRwZpuHzfJzTOchwXdf77bjbs2sGtFmWZQkAAMAwQf4eAAAAwB+IIAAAYCQiCAAAGIkIAgAARiKC\nAACAkYggAABgJIe/B0DTq62tU3Fxpb/H8JmOHduyX4BqybtJ7Bfo2C9wRUSEeXUcd4JaIIfD7u8R\nfIr9AldL3k1iv0DHfuYhggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJH5OEAAAflRX\nV6cvvjjSpOfs1q2H7Ha+Jb4xRBAAAH70xRdHNGvVVrVtf1mTnK+ypECr592p6OiYJjnf+Tz22ELd\ndddYVVdXq6DglO64Y9QPft3WrW/o9tvv1JEjh7Vr105NmXKvT+e6EEQQAAB+1rb9ZQrt2NnfY3jl\nl7+8rsHHN2xI1W23xSompqdiYno201SeIYIAADDMW29laOfOd1VZWanS0rOaMuVerVv3gqKiuio4\nuJXmzVugFSueUFlZqSRp1qy56tEjWps3b1Jm5l8VHn6pzp4tdp/r6NEv9MADM7R+/Z+Unb1T9fV1\nGjlyjOx2u86cOaPHHluocePitGXLZi1Zslzbt7+lV199Ra1atVaXLlGaN2+hdux4W7t379LXX3+t\nEyfyNWlSvG67Ldanvw9EEAAABqqu/lqrV6eouLhI9903WfX19brnnmm68soY/fGPz+maawZp1Kgx\nOn78mJYvX6Inn3xKr72Wrg0bNkmS7r033n0um82m3NyD+uCDf+pPf0pTbW2tXnjh/2n69Fn6y1/+\nrKVLV+iTT/4tm82m0tIS/fnPa7V+/Stq06aNnnsuSX/96+tq27atKioq9Mwza3T8+DE9/PAcIggA\nADS9q6/+uSSpY8dOCgsL09GjRxUV1VWSdOTI5/roo38pK2uHLMtSWVmp8vOPq0ePaDkc36RDnz5X\nnXO+L788qj59+kqSHA6Hpk+f5X7Msiz3xydO5Kt792i1adNGktS//wDt3btHV13V1/1y2WWXXS6X\nq8ZHm/8fvkUeAAADHTyYI0kqKjqjiooKdezYUUFB32TBFVd014QJd2vNmue1dOlK3Xzzr9WlS1fl\n5R2Ry+VSXV2dDh06eM75unbtpkOHPpMk1dbWas6c6aqpqVFQkE319XXur4uM/Km++OKIqqu/liTt\n3/+hO75sNtt3zmjJ17gTBACAn1WWFDT7uc6cOaNZs36vyspyzZ27QKtWrXA/Fh9/j1aseEJ//evr\nqqys1NSp09ShQwdNmjRZDzxwjzp06KSQkJBzzhcT01ODBl2nBx6YKsuydNddYxUcHKx+/a7WvHmz\ndc8990mS2rfvoKlTp2nGjPtlt9vVuXMX/e53D+pvf9v2vQlt8jWb9d17VGgxTp8u8/cIPhMREcZ+\nAaol7yaxX6Dz137N9XOCvrvfW29l6Msvj+r++6c36XX9JSIizKvjuBMEAIAf2e12n/9MH/wwIggA\nAMP4+ruuAgVvjAYAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYiQgC\nAABGIoIAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkRz+\nHgBN79ChQyoqKvf3GD5TXBzKfgGqJe8msV+gYz/vdOvWQ3a7vcnP2xyIoBbIueB/1bb9Zf4eAwDQ\nwlWWFGj1vDsVHR3j71G8QgS1QG3bX6bQjp39PQYAAD9qvCcIAAAYiQgCAABGIoIAAICRiCAAAGAk\nIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYiQgCAABG\nIoIAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABg\nJCIIAAAYiQgCAABGIoIAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYKSAiqCcnBzNnj1b\nkydPVnx8vOLi4vTHP/5RNTU1ys/P14QJE875+vT0dCUnJ6uwsFBLly6VJA0fPlwul0sLFixQdnb2\nRc2Tm5ur+++/X5MnT9a4ceOUnJwsSfrggw+UkJAgSXrwwQcbPMemTZtUV1d3UXMAAIAL5/D3AJ7a\ntWuX1q1bp8WLF6tbt26SpNraWqWnp+vBBx/UI488IpvN9oPHXnrppVq8eLEknfdrLlRZWZkSEhKU\nkpKiqKgoWZalWbNmaePGjerevbv7OmvWrGnwPM8//7xGjRolu93eJHMBAADPBEQEuVwupaSkaO3a\ntdqyZYvmzZun8PBw2e12OZ1OHTt2TLt37z7v8fn5+UpISNDGjRv/67Hy8nI9+uijKisrU0FBgSZN\nmqS4uDg5nU6Fh4erpKREnTp10p133qlhw4bp8OHDeuqpp3TbbbfpuuuuU1RUlKRv4ioxMVHBwcH6\n6KOP3OcfOnSosrOz5XQ61adPH+Xm5qqiokKrV6/Wrl27VFhYqISEBCUnJ2vlypX66KOPZLPZFBsb\nK6fTqQULFig4OFj5+fkqLCzUypUr1adPn6b/TQYAwDAB8XJYdna2hg8frpycHG3btk0bN27U8uXL\ntXPnTvXr10+DBw9WaWmpcnNzFR8fr/j4eDmdTq1fv959jvPdAfryyy8VGxurdevWad26dUpNTXU/\nFhsbq9TUVI0fP15vvPGGJGnz5s0aN26cCgoK3AH0rZCQEDkc5+/K/v37KzU1Vdddd50yMjI0duxY\nRUREKCkpSe+++65OnDihTZs26eWXX1ZGRoYOHTokSerSpYvWrVun3/zmNz8YcgAA4MIFxJ2gI0eO\nqFevXsrIyNDEiRMVFBSkkJAQxcTEqG3btsrPz5fNZlNMTIzS0tLcx6Wnp6uwsLDBc4eHh+svf/mL\ntm/frnbt2qm2ttb9WPfu3SVJv/zlL7Vs2TIVFRXp/fff10MPPaSvv/5aBw4cOOdcx48f18mTJ897\nrW/v4ERGRrrnsixLlmXp8OHDGjhwoCTJ4XCoX79++vzzz8857ic/+ck5d5kAALhYR95Z4fWxVn2t\nHjr4shyOYK+Of+21N72+dlMIiDtBISEhKikpUZs2bVRZWSlJevHFF9W3b1+dOnVK77zzjm688UZZ\nltXoub7/NampqRowYICeeuop3Xrrrec8HhT0f789I0eO1JNPPqkhQ4bIbrfrhhtuUHZ2to4dOyZJ\nqqmp0cqVK5Wbm3vea//Q3Si73a76+npFR0frww8/dJ9r37597ghrqvcxAQDQ1IKCgmS3e/dPRERY\nk/zjrYC4EzR48GCtXLlSjz32mBISEpSZmalevXpp9+7dKikp0eLFi1VeXu5RLHz/a2688UYtW7ZM\nmZmZCgsLU3BwsFwu13993V133aVnn31WGRkZkqTQ0FAlJibq0UcflWVZqqio0PDhwzVx4kR98MEH\njV73WwMHDtS0adOUlpamPXv2KC4uTjU1Nfr1r3/Ne38AAD7X41cLvD62vDhfK6Zdq+joGK+OP326\nzOtrf5e3IWSzPLl98iOQlJSks2fPKiEhQe3bt3d/Pi8vT8uWLdPUqVM1ZMgQn13/1KlTmj9//jnv\nGfqxunFqikI7dvb3GACAFu5iI6ipeBtBAXEnSJLmzJmjHTt2aO7cuaqqqpLNZlNdXZ0iIyM1f/58\nxcT47gnYsWOHnnvuOS1ZssRn1wAAAM0rYO4EwXPcCQIANIdAvxMUEG+MBgAAaGpEEAAAMBIRBAAA\njEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMRQQAA\nwEhEEAAAMBIRBAAAjEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQA\nAIxEBAEAACMRQQAAwEhEEAAAMBIRBAAAjEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQBAAAjOfw9\nAJpeZUmBv0cAABgg0P99Y7Msy/L3EGhahw4dUlFRub/H8JlOnULZL0C15N0k9gt07Oedbt16yG63\nN/l5L0RERJhXxxFBLdTp02X+HsFnIiLC2C9AteTdJPYLdOwXuLyNIN4TBAAAjEQEAQAAIxFBAADA\nSEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMRQQAAwEhEEAAAMBIRBAAA\njEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACN5HEFv\nvvmmkpKSVFVVpS1btvhyJgAAAJ/zKIKefvppvffee9q+fbvq6uq0efNmrVy50tezAQAA+IxHEZSd\nna1Vq1apdevWCg0NVWpqqnbu3Onr2QAAAHzGowgKCvrmy2w2myTJ5XK5PwcAABCIHJ580a233qrZ\ns2erpKRE69ev19atWxUbG+vr2QAAAHzGowiaNm2a/vGPf+inP/2pvvrqK82cOVM33nijr2cDAADw\nmQYjaO/eve6P27Rpo+HDh5/z2C9+8QvfTQYAAOBDDUbQmjVrzvuYzWZTWlpakw8EAADQHBqMoA0b\nNjTXHAAAAM2qwQhyOp3u7wj7IdwJAgAAgarBCJo5c6YkadOmTWrTpo1GjRolh8OhjIwMVVdXN8uA\nAAAAvtBgBA0aNEiSlJiYqM2bN7s/f/XVV2v06NG+nQwAAMCHPPqJh9XV1crLy3P/+uDBg6qtrfXZ\nUAAAAL7m0c8Jmj9/vpxOpy6//HLV19erqKhIzzzzjK9nAwAA8BmPImjo0KHKysrSoUOHZLPZ1KtX\nLzkcHh0KAADwo+RRyeTn5+ull15SSUmJLMtyf37FihU+GwwAAMCXPIqg2bNn65prrtE111zT4LfM\nAwAABAqPIqi2tlYPP/ywr2cBAABoNh59d9jAgQOVlZUll8vl63kAAACahUd3gt5++2299NJL53zO\nZrMpJyfHJ0MBAAD4mkcRlJ2d7es5AAAAmpVHEXTmzBm9+eabqqiokGVZqq+v1/Hjx/XUU0/5ej4A\nAACf8Og9QTNmzFBOTo62bt2qqqoqZWVlKSjIo0MBAAB+lDwqmeLiYiUmJmr48OG6+eabtWHDBuXm\n5vp6NgAAAJ/xKILat28vSerevbs+++wzhYWFqaamxqeDAQAA+JJH7wm69tpr9eCDD+rhhx/W1KlT\ndeDAAYWEhPh6NgAAAJ/xKIKmT5+u9PR07d27V3FxcbLZbOrcubOvZwMAAPAZj//ajNOnTys6Opq/\nNgMAALQIHkXQkSNH9Pbbb/t6FgAAgGbj0Ruju3btqhMnTvh6FgAAgGbT4J0gp9Mpm82moqIi3XHH\nHerdu7fsdrv78bS0NJ8PCAAA4AsNRtDMmTObaw4AAIBm1WAEDRo0qLnmAAAAaFb83RcAAMBIRBAA\nADASEQQAAIxEBAEAACMRQQAAwEhEEAAAMBIRBAAAjEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQB\nAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMRQQAAwEhEEAAAMBIRBAAAjOTw9wBoeocOHVJRUbm/\nx/CZ4uJQ9gtQLXk3if0CHfv5T7duPWS325v9ukRQC+Rc8L9q2/4yf48BAECjKksKtHrenYqOjmn2\naxNBLVDb9pcptGNnf48BAMCPGu8JAgAARiKCAACAkYggAABgJCIIAAAYiQgCAABGIoIAAICRiCAA\nAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYiQgC\nAABGIoIAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYgg\nAABgJCIIAAAYiQgCAABGIoIAAICRiCAAAGAkIggAABjJ0RwXycnJ0QsvvKDi4mJZliWXy6Vhw4bp\n3nvvVXBwsNfndTqdWrp0qfbv3681a9YoKipK9fX1stlsmj59uq699tommT85OVkRERGaMGHCOZ8f\nOnSosrOzNWXKFNXV1SkvL0+dOnVShw4dNGTIEN1///1Ncn0AAND0fB5Bu3bt0rp167R48WJ169ZN\nklRbW6v09HTNnDlTzz//vNfnttls7o/vuOMOJSQkSJLOnDmjSZMm6eWXX1Z4ePhFze+J9evXS5IW\nLFig22+/XUOHDvX5NQEAwMXxaQS5XC6lpKRo7dq12rJli+bNm6fw8HDZ7XY5nU4dO3ZM77//vp59\n9lm1atVK48ePV2RkpJKSkmS329W1a1ctXbpUVVVVevTRR1VWVqaCggJNmjRJcXFxsizrB68bHh6u\nW265RX//+981atQoLViwQMeOHZNlWZoyZYpuu+02OZ1O9enTR7m5uaqoqNDq1asVGRmpP/zhDzpw\n4ICKi4vVu3dvLV++3H3e+vp6LVq0SIcPH1aXLl1UU1PT4P7Jycnat2+fKisrtWzZMi1cuFAbN26U\nJE2YMEFJSUl6/fXXdfToURUXF+vs2bOaNGmStm3bpqNHjyoxMVHh4eGaNWuWLrvsMp08eVLXX3+9\n5syZ03RPEgAAhvJpBGVnZ2v48OHKycnRtm3btHHjRp09e1bDhg3TqlWrVF1drZycHLlcLm3atEmS\ndMstt+iVV15Rp06dtHr1ar3++uv62c9+ptjYWN10000qKCiQ0+lUXFxcg9cODw9XcXGxNm7cqPDw\ncK1atUoVFRUaPXq0+2Wy/v37a+HChUpKSlJGRoYmTpyo9u3ba926dbIsS7fffrsKCgrc59yxY4dc\nLpfS09P11Vdfafv27Y3+HkRHR2vhwoXKz88/587Vdz8OCQnRqlWrtHbtWu3cuVPPP/+8Xn/9dWVm\nZio+Pl4nTpxQamqq2rVrp7vvvls5OTnq06fPBT0XAAD40pF3Vnh1nFVfq4cOviyHw7u3x7z22pte\nHSf5OIKOHDmiXr16uQMjKChIISEhiomJUdu2bZWfn68uXbqoe/fukqSioiKdPn1as2fPliRVV1dr\n8ODB+p//+R+tX79e27dvV7t27VRbW9votU+cOKG+fftq3759Gjx4sCSpXbt2io6O1rFjxyTJHRKR\nkZEqLCxUmzZtVFhYqIceekht27ZVVVXVOdf64osv1K9fP/cxkZGRjc7x7W7fV19f7/74qquukiRd\ncsklio6Odn9cXV0tSerdu7fCwsIkSf369VNeXh4RBABoMYKCgmS3e/e9WhERYV5f16cRFBISopKS\nErVp00aVlZWSpBdffFF9+/bVqVOn9M4772jKlCnuuyIdO3ZUZGSkUlJSFBoaqqysLLVr106pqaka\nMGCA4uLitGfPHr333nv/da3vvjRWUFCgrKws/f73v1dpaan+9a9/6aabblJ5eblyc3PVpUsXSefe\njZGknTt36uTJk0pKSlJRUZH+9re/nXPeK6+8UpmZmXI6nTp16pROnjzZ6O9BUNA3T2rr1q115swZ\nWZalsrIyHT9+3P0135/j+z7//HNVV1fL4XDo448/1pgxYxq9LgAAzanHrxZ4dVx5cb5WTLtW0dEx\nXh1/+nSZ1yHk0wgaPHiwVq5cqccee0wJCQnKzMxUr169tHv3bpWUlGjx4sU6ePCgOwJsNpseeeQR\nTZs2TfX19QoLC1NiYqIkadmyZcrMzFRYWJiCg4PlcrnOiYfMzEz9+9//dkfHihUrdMkll2j8+PFa\ntGiR7r77blVXV2vGjBnq1KnTD4ZH//79lZKSIqfTKUmKioo65+WwX/3qV9q1a5cmTJigyMjIC3rT\n9aWXXqrBgwdrzJgxioqK0hVXXOHxscHBwZo1a5YKCwt16623qlevXh4fCwAAfpjNOt+7i5tIUlKS\nzp49q4SEBLVv3979+by8PC1btkxTp07VkCFDfDlCQMvPz9dDDz2k9PR0j4+5cWqKQjt29uFUAAA0\njYu9EyR5/5KYz79Ffs6cOdqxY4fmzp2rqqoq2Ww21dXVKTIyUvPnz1dMjPdLAwAAeKtZfljiiBEj\nNGLEiOa4VIvTuXPnC7oLBAAAPMNfmwEAAIxEBAEAACMRQQAAwEhEEAAAMBIRBAAAjEQEAQAAIxFB\nAADASEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMRQQAAwEhEEAAAMBIR\nBAAAjEQEAQAAIxFBAADASEQQAAAwEhEEAACMRAQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMR\nQQAAwEhEEAAAMBIRBAAAjEQEAQAAIxFBAADASA5/D4CmV1lS4O8RAADwiD//nWWzLMvy29XhE4cO\nHVJRUbm/x/CZTp1C2S9AteTdJPYLdOznP9269ZDdbvf6+IiIMK+OI4JaqNOny/w9gs9ERISxX4Bq\nybtJ7Bfo2C9weRtBvCcIAAAYiQgCAABGIoIAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAA\nYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYiQgCAABGIoIAAICRiCAAAGAkIggAABiJCAIA\nAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYiQgCAABGIoIAAICRiCAA\nAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCQiCAAAGIkIAgAARiKCAACAkYggAABgJCIIAAAYyWZZ\nluXvIQAAAJobd4IAAICRiCAAAGAkIggAABiJCAIAAEYiggAAgJGIIAAAYCSHvweAdyzL0uOPP66D\nBw+qVatWevLJJxUVFeV+PCsrSykpKXI4HBozZozGjRvnx2kvXGP7SVJVVZWmTp2q5cuXq3v37n6a\n1DuN7ZeRkaG0tDQ5HA717NlTjz/+uP+G9UJj+23btk0vvviigoKCFBsbq/j4eD9Oe+E8+fMpSYsX\nL1aHDh2UkJDghym909hu69ev12uvvaZOnTpJkpYuXapu3br5adoL19h+H3/8sRITEyVJl156qVat\nWqVWrVr5a9wL1tB+hYWFmjNnjmw2myzL0meffaa5c+dqwoQJfp7ac409f1u3btX69etlt9s1evRo\nTZw4sdETIgBt377dmj9/vmVZlrV//37rd7/7nfuxmpoaa8SIEVZZWZnlcrmsMWPGWGfOnPHXqF5p\naD/LsqxPPvnEGj16tDVkyBDryJEj/hjxojS039dff22NGDHCqq6utizLshISEqysrCy/zOmthvar\nq6uzbr75Zqu8vNyqq6uzbrnlFqu4uNhfo3qlsT+flmVZr7zyijVhwgTrmWeeae7xLkpju82dO9c6\ncOCAP0ZrEo3tN3LkSOvLL7+0LMuyXn31VSsvL6+5R7wonvzZtCzL2rdvnzV58mSrvr6+Oce7aI3t\nN2TIEKu0tNRyuVzWiBEjrNLS0gbPx8thAerDDz/U9ddfL0nq37+/Pv30U/djhw8f1hVXXKHQ0FAF\nBwdr4MCB2rt3r79G9UpD+0lSTU2NUlJS1KNHD3+Md9Ea2q9Vq1ZKT093/9dnbW2tWrdu7Zc5vdXQ\nfkFBQXrrrbfUrl07FRcXy7IsBQcH+2tUrzT253Pfvn365JNPFBcX54/xLkpjux04cEAvvPCC7r77\nbq1du9YfI16UhvbLy8tThw4dlJqaKqfTqZKSkoC6yyU1/vx964knntCSJUtks9mac7yL1th+vXv3\nVklJiaqrqyWp0f2IoABVXl6usLAw968dDofq6+t/8LF27dqprKys2We8GA3tJ0kDBgzQ5ZdfLitA\nf+B5Q/vZbDb3Sw0bNmxQVVWVBg8e7Jc5vdXY8xcUFKQdO3Zo5MiRGjRokNq2beuPMb3W0H6nT59W\ncnKyFi9eHJB/Pht77m6//XYtWbJEaWlp+vDDD/Xee+/5Y0yvNbRfcXGx9u/fL6fTqdTUVL3//vva\ns2ePv0b1SmPPn/TN2yV69uypK664ornHu2iN7RcTE6MxY8bojjvu0A033KDQ0NAGz0cEBajQ0FBV\nVFS4f11fX6+goCD3Y+Xl5e7HKioqdMkllzT7jBejof1agsb2syxLiYmJ2r17t5KTk/0x4kXx5Pkb\nMWKEsrOz5XK5tGXLluYe8aI0tN/bb7+ts2fP6r777tPatWuVkZERUPs19txNnjxZHTp0kMPh0LBh\nw/Sf//zHH2N6raH9OnTooK5du6p79+5yOBy6/vrrz3sn5cfKk//tbd26VePHj2/u0ZpEQ/sdPHhQ\n7777rrKyspSVlaUzZ85o27ZtDZ6v5fxbxTA///nP3f8Ftn//fvXs2dP9WHR0tI4eParS0lK5XC7t\n3btXV199tb9G9UpD+7UEje23aNEi90t+gfSmzG81tF95ebmcTqdcLpckKSQkJOBuyTe0n9Pp1ObN\nm5WWlqZp06YpNjZWo0aN8teoF6yx5y42NlZVVVWyLEv//Oc/1bdvX3+N6pWG9ouKilJlZaWOHTsm\n6ZuXXq688kq/zOktT/6/89NPP9WAAQOae7Qm0dB+YWFhCgkJUatWrdx31EtLSxs8H3+BaoCyvvMO\neUlasWKFDhw4oKqqKo0bN07vvvuukpOTZVmWxo4d2/g75H9kGtvvW/Hx8VqyZElAf3eYdO5+ffv2\n1dixYzVw4EBJ37w8Fh8fr5tuusmfI1+Qxp6/V199Va+++qqCg4PVq1cvLVq0KKBCyNM/n2+88Yby\n8vIC9rvDpP/ebevWrUpLS1Pr1q113XXXacaMGX6e+MI0tt+ePXv09NNPS/rmZfeFCxf6c9wL1th+\nRUVF+u1vf6s33njDz5N6p7H90tPTtXnzZrVq1Updu3bVE088IYfj/N8ITwQBAAAj8XIYAAAwEhEE\nAACMRARQEVZ+AAAAJ0lEQVQBAAAjEUEAAMBIRBAAADASEQQAAIxEBAEAACMRQQAAwEj/Hx8EHz/2\nL9YNAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1181b1eb8>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sentiment_pd.plot(x='handle', y='prediction', xerr='err', kind='barh');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But let's examine some \"negative\" tweets by Trump"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(review='Leaving West Palm Beach, Florida now - heading to St. Augustine for a 3pm rally. Will be in Tampa at 7pm - join me:… https://t.co/eLunEQRxZq'),\n",
       " Row(review=\"'The Clinton Foundation’s Most Questionable Foreign Donations'\\n#PayToPlay #DrainTheSwamp\\nhttps://t.co/IkeqMRjX5z\"),\n",
       " Row(review='Why has nobody asked Kaine about the horrible views emanated on WikiLeaks about Catholics? Media in the tank for Clinton but Trump will win!'),\n",
       " Row(review='Major story that the Dems are making up phony polls in order to suppress the the Trump . We are going to WIN!'),\n",
       " Row(review='Thank you Naples, Florida! Get out and VOTE #TrumpPence16 on 11/8. \\nLets #MakeAmericaGreatAgain! \\nFull Naples rally… https://t.co/5ZbteSJ00K')]"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best_model.\\\n",
    "    transform(tweets_df.withColumnRenamed('text', 'review')).\\\n",
    "    where(fn.col('handle') == '@realDonaldTrump').\\\n",
    "    where(fn.col('prediction') == 0).\\\n",
    "    select('review').\\\n",
    "    take(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And Clinton"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(review='When Trump trivializes the sacrifice of our military and veterans, he makes it clear: He has no idea what service t… https://t.co/taRFZh6Ny5'),\n",
       " Row(review='Good question. https://t.co/wrd7SUI4cI https://t.co/Gpio1LA5Z8'),\n",
       " Row(review='Last night, Trump called a military effort to push terrorists out of Mosul a “total disaster.”\\n\\nThat’s dangerous. https://t.co/1MzyauM3Nw'),\n",
       " Row(review='RT @dougmillsnyt: .@SenWarren with @HillaryClinton during a campaign rally at Saint Anselm College in Manchester, NH https://t.co/ZsCfgVPKoz'),\n",
       " Row(review='\"Donald Trump aggressively disrespects more than half the people in this country.” —@ElizabethForMA https://t.co/Lvsb5mkLSt')]"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best_model.\\\n",
    "    transform(tweets_df.withColumnRenamed('text', 'review')).\\\n",
    "    where(fn.col('handle') == '@HillaryClinton').\\\n",
    "    where(fn.col('prediction') == 0).\\\n",
    "    select('review').\\\n",
    "    take(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, there are lots of room for improvement."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Test yourself"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. From the IMDB dataframe (`imdb_reviews_df`), compute the average review length between positive and negative reviews. Hint: use the spark sql function `length`. In particular, as we imported the funcions with the name `fn` (using `from pyspark.sql import function as fn`), use `fn.length` with the name of the column.\n",
    "2. In the IMDB review database, are positive reviews longer than negative reviews?\n",
    "3. Using the sentiment dataframe `sentiments_df`, find the imdb reviews with the most number of negative words. __Hint__: You need to tokenize the `review` field in `imdb_review_df` and then join with `sentiments_df`. Finally, perform selection and summary query\n",
    "4. Similar to 3, find the imdb review with the most number of positive words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3: On our own\n",
    "\n",
    "1) Using the best model fitted (`best_model`), estimate the generalization error in the testing set (`testing_df`)\n",
    "\n",
    "2) One way of analyzing what is wrong with a model is to examine when they fail the hardest. In our case, we could do this by looking at cases in which logistic regression is predicting with high probability a positive sentiment when in fact the actual sentiment is negative. \n",
    "\n",
    "To extract the probability of positive sentiment, however, we must extract it from the prediction with a custom function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+--------------------+-----+--------------------+\n",
      "|       id|              review|score|probability_positive|\n",
      "+---------+--------------------+-----+--------------------+\n",
      "|    neg_1|Robert DeNiro pla...|  0.0| 0.06587158723273108|\n",
      "|neg_10023|Shame on Yash Raj...|  0.0|0.009391984685694108|\n",
      "|neg_10032|The storyline was...|  0.0|0.014166856725052536|\n",
      "|neg_10049|I love the freque...|  0.0| 0.43207946659537233|\n",
      "|neg_10140|This movie should...|  0.0|  0.3240853568563492|\n",
      "|neg_10179|This movie is a d...|  0.0|  0.3965396032278934|\n",
      "|  neg_102|My girlfriend onc...|  0.0|  0.1672654011976563|\n",
      "| neg_1026|Wow. I do not thi...|  0.0|0.009696097987398388|\n",
      "|neg_10260|This movie was te...|  0.0|0.031003726981102722|\n",
      "|neg_10370|I regret every si...|  0.0|  0.6504615507599262|\n",
      "|neg_10392|I'm not going to ...|  0.0|7.405368733500202E-4|\n",
      "|neg_10404|This, the direct-...|  0.0| 0.08210968623231926|\n",
      "|neg_10415|2 stars out of a ...|  0.0|0.015189026075388143|\n",
      "|neg_10446|B movie at best. ...|  0.0| 0.16661889108274267|\n",
      "|neg_10478|I sincerely consi...|  0.0|  0.0340540263906761|\n",
      "|neg_10487|OMG! The only rea...|  0.0|0.002294097751429...|\n",
      "|neg_10504|I couldn't watch ...|  0.0|0.004374382075886643|\n",
      "|neg_10517|From the start th...|  0.0|2.473302768274908E-4|\n",
      "|neg_10529|1st watched 12/6/...|  0.0|  0.2774598198530337|\n",
      "|neg_10536|William Cooke and...|  0.0|  0.7412655949318276|\n",
      "+---------+--------------------+-----+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import types\n",
    "\n",
    "def probability_positive(probability_column):\n",
    "    return float(probability_column[1])\n",
    "func_probability_positive = fn.udf(probability_positive, types.DoubleType())\n",
    "\n",
    "prediction_probability_df = best_model.transform(validation_df).\\\n",
    "    withColumn('probability_positive', func_probability_positive('probability')).\\\n",
    "    select('id', 'review', 'score', 'probability_positive')\n",
    "prediction_probability_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analyze the worst predictions that are __very__ wrong and suggest some ways of improving the model. __Hint__: Do a query that would get the highest `probability_positive` values for cases where `score` is `0`, and vice versa."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3) Using the best model (`best_model`), predict the sentiment of the following sentences.\n",
    "\n",
    "a) \"Make America great again\"\n",
    "\n",
    "b) \"Cats are not always the best companion\"\n",
    "\n",
    "c) \"This sentence is not a sentence\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}