{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 3 - Spark MLlib\n", "\n", "\"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E\"\n", "-Tom M. Mitchell\n", "\n", "Machine Learning - the science of getting computers to act without being explicitly programmed\n", "\n", "MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering (this example!), dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.\n", "\n", "It divides into two packages:\n", "1. spark.mllib contains the original API built on top of RDDs.\n", "2. spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.\n", "\n", "\n", "Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming.\n", "\n", "http://spark.apache.org/docs/latest/mllib-guide.html\n", "\n", "## Online Purchase Recommendations\n", "\n", "Learn how to create a recommendation engine using the Alternating Least Squares algorithm in Spark's machine learning library\n", "\n", "\n", "\n", "## The data\n", "\n", "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.\n", "\n", "http://archive.ics.uci.edu/ml/datasets/Online+Retail\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Create an RDD from the CSV File \n", "### 1.1 - Download the data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2016-10-17 17:01:52-- https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 7483128 (7.1M) [application/octet-stream]\n", "Saving to: 'OnlineRetail.csv.gz'\n", "\n", "100%[======================================>] 7,483,128 --.-K/s in 0.1s \n", "\n", "2016-10-17 17:01:54 (69.4 MB/s) - 'OnlineRetail.csv.gz' saved [7483128/7483128]\n", "\n" ] } ], "source": [ "#Download the data from github to the local directory\n", "!rm 'OnlineRetail.csv.gz' -f\n", "!wget https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Put the csv into an RDD (at first, each row in the RDD is a string which correlates to a line in the csv) and show the first three lines.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the Spark context (sc) to get the list of possible methods. sc.<TAB>
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the textFile() method
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", "loadRetailData = sc.textFile(\"OnlineRetail.csv.gz\")
\n", "loadRetailData.take(3)
\n", "
\n", "
\n", "
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Prepare and shape the data: \"80% of a Data Scientists job\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 - Remove the header from the RDD and split the remaining lines by comma.\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
The header is the first line in the RDD -- use first() to obtain it.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the filter() method to filter out all lines which are not equal to the header line.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Map the split() method to the remaining lines to split on \",\"
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
Type:
\n", "\n", "header = loadRetailData.first()
\n", "splitColumns = loadRetailData.filter(lambda line: line != header).map(lambda l: l.split(\",\"))
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 - Filter the remaining lines using regular expressions\n", "The original file at UCI's Machine Learning Repository has commas in the product description. Those have been removed to expediate the lab.\n", "Only keep rows that have a quantity greater than 0, a non-empty customerID, and a non-blank stock code after removing non-numeric characters.

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Examine the header to determine which fields need to be used to filter the data.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the filter() method for the first two requirements. Note -- you may have to cast values.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Look at the re.sub() method
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
Type:
\n", "import re
\n", "filteredRetailData = splitColumns.filter(lambda l: int(l[3]) > 0 and len(re.sub(\"\\D\", \"\", l[1])) != 0 and l[6] != \"\")
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 - Map each line to a SQL Row and create a Dataframe from the result. Register the Dataframe as an SQL temp table.\n", "
\n", "Use the following for the Row column names: inv, stockCode, description, quant, invDate, price, custId, country. inv, stockCode, quant and custId should be integers. \n", "price is a float. description and country are strings (the default).\n", "

\n", "Hint: When you replaced non-digit characters using the regular expression above, you replaced them in the context of a test. You'll have to do it again when creating the stockCode Row value.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
We haven't used SQLContext or Row in this notebook, so you will have to import them from the pyspark.sql package and then create a SQLContext.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
You can create a Row using a map(). For example:
\n", " example = myRDD.map(lambda x: Row(v1=x[1], v2=int(x[2]), v3=float(x[3]))
\n", " Note how we set the column names this way.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Use createDataFrame() on your SQLContext. Then register the DataFrame with registerTempTable().
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
Type:
\n", "from pyspark.sql import SQLContext, Row
\n", "sqlContext = SQLContext(sc)
\n", "\n", "retailRows = filteredRetailData.map(lambda l: Row(inv=int(l[0]), stockCode=int(re.sub(\"\\D\", \"\", l[1])), description=l[2], quant=int(l[3]), invDate=l[4], price=float(l[5]), custId=int(l[6]), country=l[7]))
\n", "\n", "retailDf = sqlContext.createDataFrame(retailRows)
\n", "retailDf.registerTempTable(\"retailPurchases\")
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyspark.sql import SQLContext, Row\n", "sqlContext = SQLContext(sc)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 - Keep only the data we need (custId, stockCode, and rank)\n", "
\n", "The Alternating Least Squares algorithm requires three values. In this case, we're going to use the Customer ID (custId), stock code (stockCode) and a ranking value. In this situation there is not a ranking value within the data, so we will create one. We will set a value of 1 to indicate a purchase since these are all actual orders. Set that value to \"purch\".\n", "

\n", "After doing the select, group by custId and stockCode.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
To add a fixed value within a select statement, use something like select x,y,1 as purch from z
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the group by statement to group results. To group by two values, separate them by commas (i.e. group by x,y)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:\n", "\n", "query = \"\n", "SELECT \n", " custId, stockCode, 1 as purch\n", "FROM \n", " retailPurchases \n", "group \n", " by custId, stockCode\"
\n", "uniqueCombDf = sqlContext.sql(query)
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 - Randomly split the data into a testing set (10% of the data), a cross validation set (10% of the data) a training set (80% of the data)\n", "

\n", "We wish to split up the data into three parts. A training set (80%) to train the algorithm, a testing set (10%) and a cross-validation set (10%). The data for each set should be randomly selected.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the randomSplit() method
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Type:
\n", " testDf, cvDf, trainDf = uniqueCombDf.randomSplit([.1,.1,.8])
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
randomSplit() takes an optional seed parameter. At the end of the exercise give a random seed and see whether the results change.
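For example (a sketch using the variable names above; the seed value 42 is arbitrary), a fixed seed makes the split reproducible across runs:
testDf, cvDf, trainDf = uniqueCombDf.randomSplit([.1, .1, .8], seed=42)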
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Build recommendation models\n", "\n", "### 3.1 - Use the training dataframe to train a model with Alternating Least Squares using the ALS class\n", "
\n", "ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.\n", "

\n", "Latent Factors / rank
\n", "    The number of columns in the user-feature and product-feature matricies
\n", "Iterations / maxIter
\n", "    The number of factorization runs

\n", "To use the ALS class type:\n", "
\n", "from pyspark.ml.recommendation import ALS
\n", "
\n", "When running ALS, we need to create two separate instances. For both instances userCol is custId, itemCol is stockCode and ratingCol is purch.

\n", "For the first instance, use a rank of 15 and set iterations to 5.
\n", "For the second instance, use a rank of 2 and set iterations to 10.
\n", "Run fit() on both instances using the training dataframe.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
Create an empty instance of the ALS class and run the explainParams method on it to see the default values.
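For example (a minimal sketch; it assumes ALS has already been imported as described above):
print ALS().explainParams()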
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
als1 = ALS(rank=15, maxIter=5, userCol=\"custId\", itemCol=\"stockCode\", ratingCol=\"purch\")
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
model1 = als1.fit(trainDf)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:\n", "
\n", "from pyspark.ml.recommendation import ALS
\n", "\n", "als1 = ALS(rank=15, maxIter=5, userCol=\"custId\", itemCol=\"stockCode\", ratingCol=\"purch\")
\n", "model1 = als1.fit(trainDf)
\n", "\n", "als2 = ALS(rank=2, maxIter=10, userCol=\"custId\", itemCol=\"stockCode\", ratingCol=\"purch\")
\n", "model2 = als2.fit(trainDf)\n", "
\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "from pyspark.ml.recommendation import ALS\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Test the models\n", "\n", "Use the models to predict what the user will rate a certain item. The closer our model is to 1 for an item a user has already purchased, the better.\n", "\n", "### 4.1 - Evaluate the model with the cross validation dataframe by using the transform function.\n", "\n", "Some of the users or purchases in the cross validation data may not have been in the training data. Let's remove the ones that aren't. To do this obtain all the the custId and stockCode values from the training data and filter out any lines with those values from the cross-validation data.\n", "

\n", "At the end, print out how many cross-validation lines we had at the start -- and the new number afterwords.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use map() to return a specific value (i.e. foo = foo.map(lambda x: x.value)) and put them all in a set (i.e. foo1 = set(foo))
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
You need all the returned values (remember they might be spread all across the cluster!) so run collect() on the results of the map(). (i.e. foo1 = set(foo.collect()))
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Use the filter() to filter out any values in the cross-validation dataframe which are in the stockCode or custId sets. Use toDF() to change the results to a dataframe.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
Type:
\n", "customers = set(trainDf.rdd.map(lambda line: line.custId).collect())
\n", "stock = set(trainDf.rdd.map(lambda line: line.stockCode).collect())
\n", "\n", "filteredCvDf = cvDf.rdd.filter(lambda line: line.stockCode in stock and line.custId in customers).toDF()
\n", "\n", "print cvDf.count()
\n", "print filteredCvDf.count()
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4.2 - Make Predictions using transform()\n", "\n", "Type:\n", "\n", "predictions1 = model1.transform(filteredCvDf)
\n", "predictions2 = model2.transform(filteredCvDf)\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 - Calculate and print the Mean Squared Error. For all ratings, subtract the prediction from the actual purchase (1), square the result, and take the mean of all of the squared differences.\n", "\n", "The lower the result number, the better the model.\n", "\n", "Type:\n", "\n", "meanSquaredError1 = predictions1.map(lambda line: (line.purch - line.prediction)\\*\\*2).mean()
\n", "meanSquaredError2 = predictions2.map(lambda line: (line.purch - line.prediction)\\*\\*2).mean()

\n", " \n", "print 'Mean squared error = %.4f for our first model' % meanSquaredError1
\n", "print 'Mean squared error = %.4f for our second model' % meanSquaredError2\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.4 - Confirm the model by testing it with the test data and the best hyperparameters found during cross-validation\n", "\n", "Filter the test dataframe (testDf) the same way as the cross-validation dataframe. Then run the transform() and calculate the mean squared error. It should be the same as the value calcuated above.\n", "\n", "Type:\n", "\n", "filteredTestDf = testDf.rdd.filter(lambda line: line.stockCode in stock and line.custId in customers).toDF()
\n", "predictions3 = model2.transform(filteredTestDf)
\n", "meanSquaredError3 = predictions3.map(lambda line: (line.purch - line.prediction)\\*\\*2).mean()

\n", " \n", "print 'Mean squared error = %.4f for our best model' % meanSquaredError3\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 - Implement the model\n", "\n", "### 5.1 - First, create a dataframe in which each row has the user id and an item id.\n", "
\n", "Use the Dataframe methods to create a Dataframe with a specific user and that user's purchased products.
\n", "    First, use the Dataframe filter() to filter out all custId's but 15544.
\n", "    Then use the select() to only return the custId column.
\n", "    Now use distinct() to ensure we only have the single custId.
\n", "    Do a join() with the distinct values from the stockCode column.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
Use the Dataframe filter() method to filter out all users but 15544
\n", " user = trainDf.filter(trainDf.custId == 15544)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use the Dataframe select() method to only select the custId column
\n", " userCustId = user.select(\"custId\")
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Use the Dataframe distinct() method to only return unique rows.
\n", " userCustIdDistinct = userCustId.distinct()
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 4\n", "

\n", "
\n", "
\n", "
Use the Dataframe join() method to join the results with distinct stockCodes
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 5\n", "

\n", "
\n", "
\n", "
Type:
\n", "user = trainDf.filter(trainDf.custId == 15544)
\n", "userCustId = user.select(\"custId\")
\n", "userCustIdDistinct = userCustId.distinct()
\n", "stockCode = trainDf.select(\"stockCode\")
\n", "stockCodeDistinct = stockCode.distinct()
\n", "userItems = userCustIdDistinct.join(stockCodeDistinct)
\n", "OR\n", "userItems = trainDf.filter(trainDf.custId == 15544).select(\"custId\").distinct().join( trainDf.select(\"stockCode\").distinct())
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 - Use 'transform' to rate each item.\n", "\n", "Type:\n", "\n", "bestRecsDf = model2.transform(userItems)
\n", "bestRecsDf.first()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 - Print the top 5 recommendations sorted on prediction.\n", "\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 1\n", "

\n", "
\n", "
\n", "
In order to print the top five recommendations, we need to sort() them in descending order
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 2\n", "

\n", "
\n", "
\n", "
Use take() to get the top 5 values.
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Hint 3\n", "

\n", "
\n", "
\n", "
Type:
\n", " print bestRecsDf.sort(\"prediction\",ascending=False).take(5)
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Advanced Optional 1\n", "

\n", "
\n", "
\n", "
Select from the retailPurchases temp table, filtering on stockCode, to see descriptions of some of the items that were recommended.
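For example (a sketch using the names created above; topStockCode is just a local variable introduced here), you could look up the descriptions stored for the top-ranked recommendation:
topStockCode = bestRecsDf.sort(\"prediction\", ascending=False).first().stockCode
sqlContext.sql(\"SELECT DISTINCT description FROM retailPurchases WHERE stockCode = \" + str(topStockCode)).show()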
\n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look up this user and the recommended product ID's in the excel file...\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This user seems to have purchased a lot of childrens gifts and some holiday items. The recommendation engine we created suggested some items along these lines\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Citation\n", "Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2 with Spark 1.6", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }