{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Project 3\n", "\n", "## Classification and Inference with Machine Learning\n", "
\n", "This notebook is arranged in cells. Texts are usually written in the markdown cells, and here you can use html tags (make it bold, italic, colored, etc). You can double click on this cell to see the formatting.
\n", "
\n", "The ellipsis (...) are provided where you are expected to write your solution but feel free to change the template (not over much) in case this style is not to your taste.
\n", "
\n", "Hit \"Shift-Enter\" on a code cell to evaluate it. Double click a Markdown cell to edit.
\n", "\n", " Write your partner's name here (if you have one).

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Link Okpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from client.api.notebook import Notebook\n", "ok = Notebook('project3.ok')\n", "_ = ok.auth(inline = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from matplotlib.colors import LogNorm\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this project, we will get acquainted with some of the well known machine learning techniques and use them for classification and regression. Specifically, we will use\n", "- Linear models\n", "- k-Nearest Neighbors\n", "- Random Forests\n", "- Support Vector Machines\n", "- Gaussian Process\n", "- Neural Networks\n", "\n", "The performance of these algorithms does depend on 'hyperparameters' which need to be tuned to get optimal. This is primarily what we will investigate. Since a lot of these tunings are common to different algorithms, to avoid repition, we will not investigate all in each of them. That being said, still a lot of the knobs will be repeated and its recommended to write utility functions to make plots and minimize manual labor (copy-pasting)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data\n", "\n", "The data is provided in the file \"specz_data.txt\". The columns of the file (length of 13) correspond to -
\n", "spectroscopic redshift ('zspec'), RA, DEC, magnitudes in 5 bands - u, g, r, i, z (denoted as 'mu,' 'mg,' 'mr,' 'mi,' 'mz' respectively); Exponential and de Vaucouleurs model magnitude fits ('logExp' and 'logDev' http://www.sdss.org/dr12/algorithms/magnitudes/); zebra fit ('pz_zebra); Neural Network fit ('pz_NN') and its error estimate ('pz_NN_Err')
\n", "\n", "We will undertake 2 exercises - \n", "- Regression\n", " - We will use the magnitude of object in different bands ('mu, mg, mr, mi, mz') and do a regression exercise to estimate the redshift of the object. Hence our feature space is 5.\n", " - The correct redshift is given by 'zspec', which is the spectroscopic redshift of the object. We will use this for training and testing purpose. \n", " \n", " Sidenote: Photometry vs. Spectroscopy\n", " \n", "     The amount of energy we receive from celestial objects – in the form of radiation – is called the flux, and an astro- nomical technique of measuring the flux is photometry. Flux is usually measured over broad wavelength bands, and with the estimate of the distance to an object, it can infer the object’s luminosity, temperature, size, etc. Usually light is passed through colored filters, and we measure the intensity of the filtered light. \n", " \n", "     On the other hand, spectroscopy deals with the spectrum of the emitted light. This tells us what the object is made of, how it is moving, the pressure of the material in it, etc. Note that for faint objects making photometric observation is much easier.\n", " \n", "     Photometric redshift (photoz) is an estimate of the distance to the object using photometry. Spectroscopic redshift observes the object’s spectral lines and measures their shifts due to the Doppler effect to infer the distance.\n", " \n", "\n", "- Classification\n", " - We will use the same magnitudes and now also the redshift of the object ('zspec') to classify the object as either Elleptical or Spiral. Hence our feature space is now 6.\n", " - The correct class is given by compring 'logExp' and 'logDev' which are the fits for Exponential and Devocular profiles. If logExp > logDev, its a spiral and vice-versa. We will use this for training and testing purpose. Since the classes are not explicitly given, generate a column for those (Classes can be $\\pm 1$. If it is $0$, it does not belong to either of the class.)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Prep 1. Cleaning \n", "\n", "Read in the files to create the data (X and Y) for both regression and classification.
\n", "You will have to clean the data - \n", "- Drop the entries that are nan or infinite\n", "- Drop the unrealistic numbers such as 999, -999; and magnitudes that are unrealistic. Since these are absolute magnitudes, they should be positive and high. Lets choose a magnitude limit of 15 as safe bet.\n", "- For classification, drop the entries that do not belong to either of the class\n", "\n", "For regression, X and Y data is cleaned magnitudes (5 feature space) and spectroscopic redshifts respectively.\n", "For classification, X and Y data is cleaned magnitudes+spectroscopic redshifts respectively (6 feature space) and classees respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Read in and create data\n", "\n", "fname = 'specz_data.txt'\n", "spec_dat=np.genfromtxt(fname,names=True)\n", "print(spec_dat.dtype.fields.keys())\n", "\n", "#convenience variable\n", "zspec = spec_dat['zspec']\n", "logExp, logDev = spec_dat['logExp'], spec_dat['logDev']\n", "...\n", "\n", "#Cleaning data for Regression\n", "\n", "#Cleaning data for Classification\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** What is the size of your data (number of objects) before and after cleaning? (For both regression and classification)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Prep 2. Visualization \n", "\n", "The next step should be to visualize the data.
\n", "For regression\n", "- Make a histogram for the distribution of the data (spectroscopic redshift). \n", "- Make 5 2D histograms of the distribution of the magnitude as function of redshift (Hint: https://matplotlib.org/devdocs/api/_as_gen/matplotlib.axes.Axes.hist2d.html)\n", "\n", "For classification
\n", "- Make 6 1-d histogram for the distribution of the data (6 features - zspec and 5 magnitudes) for both class 1 and -1 separately " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Redshift distribution of objects and colors \n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Redshift distribution of objects and colors based on the class\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Prep 3. Preprocessing \n", "\n", "- Next, split the sample into training data and the testing data. We will be using the training data to train different algorithms and then compare the performance over the testing data. In this project, keep 80% data as training data and uses the remaining 20% data for testing.
\n", "- Often, the data can be ordered in a specific manner, hence shuffle the data prior to splitting it into training and testing samples.
\n", "- Many algorithms are also not scale invariant, and hence scale the data (different features to a uniform scale). All this comes under preprocessing the data.\n", "http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
\n", "Use StandardScaler from sklearn (or write your own routine) to center the data to 0 mean and 1 variance. Note that you only center the training data and then use its mean and variance to scale the testing data before using it.

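\n", "(Explicitly, StandardScaler transforms each feature as $x' = (x - \\mu_{\\rm train})/\\sigma_{\\rm train}$, where $\\mu_{\\rm train}$ and $\\sigma_{\\rm train}$ are the per-feature mean and standard deviation of the training data.)\n", "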
\n", "(Hint: How to get a scaled training data:
\n", "Let the training data be: train = (\"training X data\", \"training Y data\")
\n", "You can first define a StandardScaler: scale_xdata, scale_ydata = preprocessing.StandardScaler(), preprocessing.StandardScaler()
\n", "Then, do the fit: scale_xdata.fit(train[0]), scaley.fit(train[1])
\n", "Next, transform: scaled_train_data = (scale_xdata.fit_transform(train[0]), scale_ydata.fit_transform(train[1]))

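\n", "\n", "Putting the pieces together, a minimal sketch of the shuffle-split-scale step (a sketch only; it assumes the illustrative Xreg, Yreg arrays from Prep 1):\n", "\n", "```python\n", "from sklearn import preprocessing\n", "\n", "# shuffle, then split 80/20 into training and testing sets\n", "idx = np.random.permutation(len(Yreg))\n", "ntrain = int(0.8 * len(Yreg))\n", "# Y is kept as a column so that StandardScaler accepts it\n", "train = (Xreg[idx[:ntrain]], Yreg[idx[:ntrain], None])\n", "test = (Xreg[idx[ntrain:]], Yreg[idx[ntrain:], None])\n", "\n", "# fit the scalers on the training data only\n", "scale_xdata, scale_ydata = preprocessing.StandardScaler(), preprocessing.StandardScaler()\n", "scale_xdata.fit(train[0])\n", "scale_ydata.fit(train[1])\n", "\n", "# apply the same transformation to both samples\n", "scaled_train = (scale_xdata.transform(train[0]), scale_ydata.transform(train[1]))\n", "scaled_test = (scale_xdata.transform(test[0]), scale_ydata.transform(test[1]))\n", "```\n", "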
\n", "- For some algorithms, the size of this training data will be unweildy (for eg. for GP, we will have to use ake inverses of matrices). For this purpose, create a smaller training set of 2000 samples. Again, create another copy of this which is normalized to standard scale.\n", "- Do this for both, classification and regression.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def prepdata(...): \n", " '''Function to prepare train, test and validation data given X and Y data set given\n", " '''\n", " ...\n", " \n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Training and validation fraction\n", "tf, vf = 0.8, 0\n", "\n", "#Create the data (and subsized data) for Regression\n", "...\n", "\n", "#scale the data\n", "scalex, scaley = preprocessing.StandardScaler(), preprocessing.StandardScaler()\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Create the data (and subsized data) for Classification\n", "...\n", "\n", "#scale the data\n", "scalexc = preprocessing.StandardScaler()\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Prep 4. Metrics \n", "\n", "The last remaining preperatory step is to write metric for gauging the performance of the algorithm. Write a function to calculate the 'RMS' error given (y_predict, y_truth) to gauge regression and another function to evaluate accuracy of classification.
\n", "In addition, for classification, we will also use confusion matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "def rms(y1, y2, ...):\n", " '''Calculate the RMS error given the truth and the prediction\n", " '''\n", "\n", "def accuracy(y1, y2, ...):\n", " '''Calculate the accuracy given the truth and the prediction\n", " '''\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Linear Regression " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Try to fit a linear regression model to the data and answer the following questions* - \n", "- What is the error (rms) for the training sample and the testing sample? Make a scatter plot of the truth against the predictions. (Though left unasked hereafter, you should do this for every algorithm and after doing any kind of regression)\n", "- Does the answer change if you use preprocessed vs raw data? Should it?\n", "- Look at the coefficients best fit by the linear model, Does the order of importance agree with your intuition based on the previous visualization of the data?\n", "\n", "(Hint:
\n", "Let \"lin = LinearRegression()\" (This is our model)
\n", "Also, let testN and trainingN be our test and training data (either scaled or unscaled). testN = (\"test X data\", \"test Y data\")
\n", "First, do the fit using the training data: lin.fit(*trainN)
\n", "Then, predict: predict = lin.predict(testN[0]) where testN[0] is test X data.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Linear Regression\n", "print('For linear regression\\n')\n", "\n", "lin = LinearRegression()\n", "\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('Linear coefficients')\n", "lin.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation\n", "\n", "...\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification\n", "\n", "Use logistic regression (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) from linear model to perform classification and calculate the accuracy. (You can use LogisticRegressionCV().fit(...) and LogisticRegressionCV().predict(...)) Check the accuracy by measuring the same from confusion matrix as well. (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegressionCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "...\n", "\n", "print('Confusion Matrix\\n', ...)\n", "print('Accuracy\\n', ...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Quadratic Regression " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest extension is fitting a polynomial model to the data where we take combinations of features to 'n'th order. Try to fit the quadratic model to the previous data and answer the same questions again.
\n", "\n", "Use the Pipeline and PolynomialFeatures from sklearn to create quadratic polynomial from the features." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.pipeline import Pipeline\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#For Quadratic Regression\n", "\n", "qmodel = Pipeline([('poly', PolynomialFeatures(degree=2, interaction_only=False)),\n", " ('linear', LinearRegression(fit_intercept=False))])\n", "\n", "...\n", "\n", "#Scatter plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('Quadratic coefficients')\n", "qmodel.steps[1][1].coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification\n", "\n", "*Do a classification in a similar fashion and see if accuracy improves.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Classification\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hyperparamter methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following sections algorithms, we will be varying hyperparameters to get the best model and build some intuition. There are various ways to do this and we will use 'Grid Search' methodology which simply tries all the combinations along with some cross-validation scheme. For most part, we will use 4-fold cross validation.
\n", "Sklearn provides GridSearchCV functionality for this purpose.
\n", "\n", "
\n", "*Do not overwrite these grid search-ed variables (and not only their result) since we will compare all the models together in the end*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. k Nearest Neighbors " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For regression, lets play with grid search using knn to tune hyperparmeters. Lets consider the following 3 hyperparameters - \n", "- Number of neighbors (vary this between 2-100, say)\n", "- Weights of leaves (Uniform or Inverse Distance weighing)\n", "- Distance metric (Eucledian or Manhattan distance - parameter 'p')\n", "\n", "\n", "Do a grid search on these parameters using 4 fold cross validation. Identify top 10 models and plot their mean scores, along the standard deviation.
\n", "Answer the following questions- \n", "- Is it always better to use more neighbors?\n", "- Is it better to weigh the leaves, if yes, which distance metric performs better?\n", "- For every parameter, make plots for the mean test score while marginalizing over other parameters. Which parameters seem to affect the performance most. and try to see which parameter is more important than others (we will do this for each and every method...so spend some time to see the format of output and write a function to do so)
\n", "- GridCV returns fitting and scoring time for every combination. You will find that scoring time is higher than training time. Why do you think is that the case?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hint: (Read the documentations carefully for more detail.)\n", "\n", "First, define the hyperparameters: parameters = {'n_neighbors':[2, 3, 5, 10, 15, 20, 25, 50, 100], 'weights':['uniform', 'distance'], 'p':[1, 2]}\n", "\n", "Specify the algorithm you want to use: e.g. knnr = KNeighborsRegressor() \n", "\n", "Then, Do a grid search on these parameters using 4 fold cross validation: gcknn = GridSearchCV(knnr, parameters, cv=4)\n", "\n", "Do the fit: gcknn.fit(*train) \n", "\n", "(Let \"train\" be the training data where \"train = (\"train X data\", \"train Y data\")\"\n", "\n", "Get results: $results = gcknn.cv_results_$\n", "\n", "$cv_results_$ has the following dictionaries: \"rank_test_score,\" \"mean_test_score,\" \"std_test_score,\" and \"params\" (See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) \n", "\n", "Then, you can identify top 10 models based on \"rank_test_score\" and print out their \"params,\" along with their \"mean_test_score\" and \"std_test_score\". Plot their mean scores, along the standard deviation. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#An example of parameters\n", "parameters = {'n_neighbors':[2, 3, 5, 10, 15, 20, 25, 50, 100], 'weights':['uniform', 'distance'], 'p':[1, 2]}\n", "knnr = KNeighborsRegressor()\n", "\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define some useful functions here\n", "You might want to return here after the first algorithm to write some utility functions ad avoid copy pasting for finding the best models, creating plots etc. Declaration for 2 functions that might be useful are given, feel free to make more (or less)\n", "\n", "Its recommended to spend some time to go through output format of GridSearchCV and write some utility functions to make the recurring plots for every parameter.
\n", "Grid Search returns a dictionary with self explanatory keys for the most part. Mostly, the keys correspond to (masked) numpy arrays of size = #(all possible combination of parameters). The value of individual parameter in every combination is given in arrays with keys starting from 'param_\\*' and this should help you to match the combination with the corresponding scores.
\n", "For masked arrays, you can access the data values by using \\*.data\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Hint:\n", "\n", "Try this:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i, key in enumerate(parameters.keys()):\n", " order = results['param_%s'%key]\n", " print(key)\n", " print(order.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does it print out? Think about how you can use this to make plots for the mean test score while marginalizing over other parameters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is only a suggestion. Do it as you like.\n", "def topN(results, n=10, plot=True):\n", " '''Parse the result of CV and return top N results based on the score\n", " '''\n", " args = np.argsort(results['rank_test_score'])\n", " for i in range(n):\n", " ...\n", " \n", "def plotparams(results, parameters):\n", " '''Parse the result of CV and plot the score by varying a single parameter \n", " '''\n", " ...\n", " \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Identify top 10 models and plot their mean scores, along the standard deviation. \n", "topN(results)\n", "\n", "# For each hyperparameter, make plots for the mean test score while marginalizing over other parameters\n", "plotparams(results, parameters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Plot timings for fitting and scoring*\n", "\n", "Hint: Assume that you got results from: $results = gcknn.cv_results_$\n", "\n", "Then, get the scoring time: results['mean_score_time']\n", "\n", "and the fitting time: results['mean_fit_time']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Time for fitting (For each hyperparameter, make plots for the mean_fit_time while marginalizing over other parameters))\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Time for scoring (For each hyperparameter, make plots for the mean_score_time while marginalizing over other parameters))\n", "...\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observations\n", "\n", "...\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('RMS error on the training data set is ')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will look at 4 different type of cross-validation schemes - \n", "- Kfold\n", "- Stratified Kfold\n", "- Shuffle Split\n", "- Stratified Shuffle Split\n", "\n", "Do 4 different grid searches, one for each of these cross validation schemes, and identify top 3 models for each. remember to initiate each model with same random state
\n", "Answer the following questions-\n", "- Do the conclusions for any parameter from the regression case?\n", "- Does the mean accuracy change?\n", "- Does the variance in mean accuracy change?\n", "\n", "Give justification for these results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "parameters = {'n_neighbors':[2, 3, 5, 10, 15, 20, 25, 50, 100], 'weights':['uniform', 'distance'], 'p':[1, 2]}\n", "knnc = KNeighborsClassifier()\n", "\n", "#Grid Search\n", "gc = GridSearchCV(knnc, parameters, cv=KFold(4, random_state=100))\n", "#Do the fit\n", "...\n", "\n", "gc2 = GridSearchCV(knnc, parameters, cv=StratifiedKFold(4, random_state = 100))\n", "#Do the fit\n", "...\n", "\n", "gc3 = GridSearchCV(knnc, parameters, cv=ShuffleSplit(4, 0.1, random_state = 100))\n", "#Do the fit\n", "...\n", "\n", "gc4 = GridSearchCV(knnc, parameters, cv=StratifiedShuffleSplit(4, 0.1, random_state = 100))\n", "#Do the fit\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Comparing different cross-validation schemes " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Make plot for differet schemes (just as in regression)\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation\n", "\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('The accuracy for the testing data set is ')\n", "print('For KFold\\n',)\n", "...\n", "print('For Stratified shuffle split\\n', )\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Best model\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Henceforth, which cross validation scheme should be used for classification?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Random Forests " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important feature of the random forest is the number of trees in the ensemble. We will also play with the maximum depth of the trees. \n", "- Do a combined grid search on both these parameters and identify top 10 models. \n", "- Are the scores of these models statistically different? Based on this, which architecture will you choose for your model?\n", "- For every parameter, make the plot for fitting time. Based on this and the previous question, how many trees do you recommend keeping in the ensemble?\n", "- Random forest also gives the importance of different parameters. 
See how this compares with the coefficients given by the linear model and your expectations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Grid Search\n", "# This will take a few minutes\n", "rf = RandomForestRegressor()\n", "parameters = {'n_estimators':[10, 50, 150, 200, 300], 'max_depth':[10, 50, 100]}\n", "\n", "gcrf = GridSearchCV(rf, parameters, cv=5)\n", "# Do the fit\n", "...\n", "results = gcrf.cv_results_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Identify the top 10 models and plot their mean scores, along with the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots of the mean test score while marginalizing over the other parameters\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Time scaling for different parameters*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Time for fitting (For each hyperparameter, make plots of the mean_fit_time while marginalizing over the other parameters)\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('RMS error on the data is')\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('Importance of features')\n", "gcrf.best_estimator_.feature_importances_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation\n", "Based on the above, we recommend using 100 trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Grid search (This will take a few minutes)\n", "\n", "rfc = RandomForestClassifier()\n", "parameters = {'n_estimators':[10, 50, 150, 200, 300], 'max_depth':[10, 50, 100]}\n", "\n", "gcrfc = GridSearchCV(rfc, parameters, cv=StratifiedShuffleSplit(4, 0.1, random_state = 100))\n", "\n", "#Do the fit\n", "...\n", "\n", "results = gcrfc.cv_results_\n", "\n", "# Identify the top 10 models and plot their mean scores, along with the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots of the mean test score while marginalizing over the other parameters\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('The accuracy for the testing data set is ')\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('Importance of features')\n", "gcrfc.best_estimator_.feature_importances_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. 
Support Vector Machines " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.svm import SVR\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since SVMs involve evaluating a kernel, which under the current implementation of sklearn scales as at least $n^2$ in the number of samples, we subsample our training data to 2000 samples for the grid search. Then we will evaluate the best model on the full training set.
\n", "\n", "Further, since SVM are not scale invariant, hence use the scaled data.
\n", "\n", "The most important feature is kernel. We will try 3 kernels - linear, rbf and polynomial. \n", "\n", "Since all three have different parameters, we will use different grid searches for all three with 3fold CV.\n", "- For polynomial kernel, use gamma, C, coef0 as parameters (this will be slow, so do not spam parameter space)\n", "- For linear kernel, use epsilon, C as parameters\n", "- For RBF kernel, use gamma, C as parameters. **Here, choose the values to be used on log-scale instead of linear scale**. This is one of the recommended practices for RBF kernel and SVM.\n", "\n", "It is instructive to read the documentation to see hwo different parameters affect behavior (and hence which are more important), and to first change a couple parameters manually to find the reasonable limits and the time taken in fitting. \n", "\n", "*For each, print the top 5 models and plot the scores marginalizing over parameters to identify the most import parameters *\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "parametersp = {\"gamma\":[0.1, 1], 'C':[0.1, 1, 2], 'coef0':[0, 1]}\n", "parametersl = {'C':[0.1, 1, 2], \"epsilon\":[0.005, 0.01, 0.1, 0.5]}\n", "\n", "C_range = np.logspace(-2, 2, 5)\n", "gamma_range = np.logspace(-2, 2, 5)\n", "parametersf = {\"gamma\":gamma_range, 'C':C_range}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Grid Search for polynomial kernel (This will take few minutes)\n", "svrp = SVR(kernel='poly')\n", "gcsvrp = GridSearchCV(svrp, parametersp, cv=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Grid Search for linear kernel (This will take few minutes)\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Grid Search for RBF kernel (This will take few minutes)\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# (This will take few minutes)\n", "# Do the fit \n", "...\n", "\n", "# Get the result for polynomial kernel\n", "results1 = gcsvrp.cv_results_\n", "# Get the result for linear kernel\n", "...\n", "# Get the result for RBF kernel\n", "...\n", "\n", "# Identify top 10 models and plot their mean scores, along the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots for the mean test score while marginalizing over other parameters\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('RMS error for best fit method of different kernels is ')\n", "print('For polynomial, %0.3f'%)\n", "print('For linear, %0.3f'%)\n", "print('For rbf, %0.3f'%)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Make a 2-D heat map using pyplot.pcolor to see how the score changes with gamma and C. Do you see a trend in the values of gamma and C? 
Based on what you know about these parameters, does this make sense?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make2dkey(dt, k1, k2):\n", " k1 = 'param_%s'%k1\n", " k2 = 'param_%s'%k2\n", " l1 = np.unique(dt[k1])\n", " l2 = np.unique(dt[k2])\n", " w = np.zeros([l1.size, l2.size])\n", " for i, iv in enumerate(l1):\n", " for j, jv in enumerate(l2):\n", " w[i, j] = dt['mean_test_score'][(dt[k1] == iv) & (dt[k2] == jv)]\n", " return l1, l2, w\n", "\n", "def plot2dkey(l1, l2, w, k1=False, k2=False):\n", "\n", " fig, ax = plt.subplots()\n", " im = ax.pcolor(w)\n", " plt.colorbar(im)\n", " # set the tick positions before the labels so that they line up\n", " ax.set_yticks(np.arange(w.shape[0]) + 0.5, minor=False)\n", " ax.set_xticks(np.arange(w.shape[1]) + 0.5, minor=False)\n", " ax.set_xticklabels(l2)\n", " ax.set_yticklabels(l1)\n", " ax.invert_yaxis()\n", " if k1:\n", " ax.set_ylabel(k1)\n", " if k2:\n", " ax.set_xlabel(k2)\n", " \n", " return fig, ax" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Make a 2D heat map using the above routine\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*It is also instructive to do this exercise without normalizing the data. Do so for the polynomial and the rbf kernel. Report on the difference you find.
\n", "Again make the pcolor map between gamma and C. Has the trend changed, and is it in line with your expectations*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Grid search\n", "svrf2 = SVR(kernel='rbf')\n", "gcsvrf2 = GridSearchCV(svrf2, parametersf, cv=3)\n", "\n", "# Do the fit\n", "...\n", "# Get the result\n", "results = gcsvrf2.cv_results_\n", "\n", "# Identify top 10 models and plot their mean scores, along the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots for the mean test score while marginalizing over other parameters\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Make a 2D heat map\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('RMS error for best fit method of different kernels is ')\n", "print('For rbf kernel with normalized data, %0.4f')\n", "print('For rbf kernel with unnormalized data, %0.4f')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observations\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try this using class weights and without. What do you naively expect and what do you get?
\n", "**For some reason putting weights seems to make the performance worse**\n", "Find the best fit values of other parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.svm import SVC\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Grid search and plot\n", "svc = SVC()\n", "parameters = {'C':[0.1, 1, 2], 'gamma':[0.1, 1]}\n", "C_range = np.logspace(-2, 2, 5)\n", "gamma_range = np.logspace(-2, 2, 5)\n", "parameters = {\"gamma\":gamma_range, 'C':C_range}\n", "\n", "gsvcn = GridSearchCV(svc, parameters, cv=4)\n", "# Do the fit\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Gaussian Process " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GPs are another kernel method of regression and classification. Here, since we need to invert the kernel matrix which is an N^3 process, it is not possible for us to work with full data as such. Hence we need methods to lower the rank of the kernel.
\n", "The most trivial way is to use less training data. We will do this and used reduced training data for the sklearn algorithm. \n", "\n", "Another rank reduction technique is to decompose kernel matrix. This is based on the section 8.1 of http://www.gaussianprocess.org/gpml/chapters/RW8.pdf
\n", "However since this process can lead to numerical instabilities, we will follow the algorithm in https://pubs.giss.nasa.gov/docs/2009/2009_Foster_fo04000r.pdf, the V method elaborated in secion 5.2. You will need to go through section 2 and 3 as well to develop notation.
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the above, write a class GPsub with fit, reduce and predict function to model this reduced rank GP. The structure (functions in the classe) should broadly be as follows - \n", "- __init__ - take in the value of lower rank, regularizer (alpha), error(sigma) on values, kernel and its assoicated parameters from the user.\n", "- fit function - create the kernel matrix given the X and Y data\n", "- predict - predict the mean values for the new data set X\n", "- reduce - implement the V method form the paper above\n", "\n", "You do not need to define kernels, you can instead use them from the inbuilt GP class. We will try 2 kernels, polynomial (dot) and rbf. \n", "\n", "Setup this class and make a scatter plot of the prediciton vs the truth for the training data set. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.gaussian_process import kernels\n", "# http://scikit-learn.org/stable/modules/gaussian_process.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is only a suggestion\n", "class GPsub:\n", " \n", " def __init__(self, sigma=2, alpha=1e-10, sub=100, l=1, kernel='poly'):\n", " \n", " rbf = kernels.RBF(length_scale=l)\n", " dot = kernels.DotProduct(sigma_0=sigma)\n", " if kernel is 'poly':\n", " self.ker = dot\n", " elif kernel is 'rbf':\n", " self.ker = rbf\n", " self.m = sub\n", " self.alpha = alpha\n", " \n", " def fit(self, X, Y):\n", " ...\n", " \n", " def reduce(self):\n", " ...\n", " \n", " def predict(self, Xt):\n", " ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grid Search\n", "One this is set up, do the hyperparameter search for both the kernels. The parameters that need to be varied are \n", "- For rbf kernel, the rank to which the kernel is lowered and the length_scale\n", "- For polynomial kernel, the rank to which the kernel is lowered and 'sigma'\n", "\n", "Make plot for the score on the test data set to judge the importance of every parameter.\n", "- How does increasing the rank of the parameter affect the score?\n", "- What is the best length-scale. Is this in the ballpark of where you would naively expect it to be? Why or why not?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Trial\n", "gpsub = GPsub(l=20, kernel='rbf', sub=50)\n", "# Let train = (train_xdata, train_ydata) be the training data\n", "gpsub.fit(train[0], train[1])\n", "# Let test = (test_xdata, test_ydata) be the test data\n", "yym = gpsub.predict(test[0])\n", "plt.plot(test[1], yym, 'b.')\n", "plt.plot(test[1], test[1], 'k.')\n", "plt.show()\n", "print(rms(test[1], yym))\n", "\n", "# Do a manual grid search for both the kernels and make an example scatter plot\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Best model\n", "gpsub = GPsub(l=..., kernel='rbf', sub=...)\n", "# Do the fit\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inbuilt GP\n", "\n", "*Compare this with the GP from sklearn. Use the subsized training set with this model. 
Calculate the rms error and comment if our decomposition of the kernel helped us to improve in RMS error.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.gaussian_process import GaussianProcessRegressor\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "kerdot = kernels.DotProduct(sigma_0=2)\n", "gpdot = GaussianProcessRegressor(kernel=kerdot)\n", "gpdot.fit(...)\n", "zgpdot = scalesuby.inverse_transform(gpdot.predict(...))\n", "\n", "kerrbf = kernels.RBF(length_scale=0.1)\n", "gprbf = GaussianProcessRegressor(kernel=kerrbf, alpha=1e-1)\n", "gprbf.fit(...)\n", "zgprbf = (gprbf.predict(...))\n", "\n", "# Make plot\n", "...\n", "\n", "# Calculate the rms\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Comaprison\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*For this, feel free to extend the above GPsub class to include classification or simply use the inbuilt GPClassifier with subsized data*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.gaussian_process import GaussianProcessClassifier\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This will take 30-40 minutes!\n", "kerrbf = kernels.RBF(length_scale=10)\n", "gprbfc = GaussianProcessClassifier(kernel=kerrbf)\n", "gprbfc.fit(...)\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Neural Network " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neural_network import MLPRegressor\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html\n", "from sklearn.neural_network import MLPClassifier\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a neural network, the important hyperparameters are - \n", "- Number and size of layers\n", "- Activation function\n", "- Strength of regularization\n", "- Batch size for training\n", "- Training algorithm\n", "\n", "There are other parameters such as initial learning rate, parameters corresponding to the training algorithm etc., however we will not bother with these at the moment.\n", "\n", "First, lets make a decision on the activation function and training algorithm since those are finite in numbers and then do the grid search on other parameters that are unconstrained\n", "\n", "For neural network, we will only be using the normalized data set\n", "\n", "*Train a double layer network of size [30, 15] (arbitrary) using all the available activation functions and calculate the rms error. 
Based on this, decide on an activation function*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nnr = MLPRegressor([30, 10], max_iter = 1000)\n", "parameters = {'activation':['relu', 'logistic', 'identity']}\n", "gcnn = GridSearchCV(nnr, parameters, cv=4)\n", "\n", "# Do the fit\n", "...\n", "# Get the result\n", "results = gcnn.cv_results_\n", "\n", "print(results['mean_test_score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Training Algorithm\n", "Using the best activation function from above, try the 3 available algorithms - adam, SGD (without Nesterov momentum), SGD (with Nesterov momentum). Use a starting learning rate of 0.001. Plot the loss_curve for all three algorithms, and also look at the wall clock time (use the time.time function or the %timeit functionality) for all three. To keep things consistent, start from the same random state for all algorithms.
\n", "*Which is the best algorithm?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from time import time" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#train 3 networks and output time for them. Make a plot for loss_curve\n", "\n", "lrate = [1e-3]\n", "nnl = [[], [], []]\n", "\n", "for lr in lrate:\n", " nnr1 = MLPRegressor([50, 30], max_iter = 1000, solver='adam', random_state=100, learning_rate_init=lr)\n", " %timeit nnr1.fit(...)\n", " nnl[0].append(nnr1)\n", " \n", " nnr2 = MLPRegressor([50, 30], max_iter = 1000, solver='sgd', random_state=100, learning_rate_init=lr)\n", " %timeit nnr2.fit(...)\n", " nnl[1].append(nnr2)\n", "\n", " nnr3 = MLPRegressor([50, 30], max_iter = 1000, solver='sgd', nesterovs_momentum=False, random_state=100, \\\n", " learning_rate_init=lr)\n", " %timeit nnr3.fit(...)\n", " nnl[2].append(nnr3)\n", "\n", "# Make plot\n", "\n", "i = 0\n", "plt.plot(nnl[0][i].loss_curve_,'r', label='Adam')\n", "plt.plot(nnl[1][i].loss_curve_,'b', label='SGD+nestrov')\n", "plt.plot(nnl[2][i].loss_curve_, 'g', label='SGD')\n", "\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Batch Size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For different batch sizes, ranging from 10 to the size of training sample, plot the loss_curve as well as wall clock time. Again, remember to start from the same random state.
\n", "*Explain the trend (roughly) seen in the loss curve and the wall clock time as a function of batch size*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batches = np.logspace(3, 8.2, 10, dtype=int, base=np.e)\n", "batches\n", "\n", "nnrb = []\n", "times = []\n", "for batch in batches:\n", " nnr1 = MLPRegressor([50, 30], max_iter = 1000, solver='adam', random_state=100, batch_size=int(batch), \\\n", " early_stopping=False)\n", " start = time()\n", " nnr1.fit(...)\n", " end = time()\n", " times.append(end - start)\n", " nnrb.append(nnr1)\n", " \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize = (12, 5))\n", "\n", "...\n", "\n", "plt.suptitle('Effect of batch size')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Observations\n", "Upon increasing the batch size the number of iterations decreases because the weights are getting updated more often. The wall clock time however first decreases and then increases because now we are inverting biggere matrices, however " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Grid Search\n", "Do a grid search on different architecture of layers (number and sizes), batch sizes, and regularizing strength and find the top 10 models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This will take few minutes.\n", "nnr = MLPRegressor(max_iter = 1000)\n", "parameters = {'hidden_layer_sizes':[5, 100, [5, 10], [100, 50]], \n", " 'alpha':[1e-1, 1e-3, 1e-5], \n", " 'batch_size':[50, 500, 2000]}\n", "\n", "gcnn = GridSearchCV(nnr, parameters, cv=4)\n", "\n", "# Do the fit\n", "gcnn.fit(...)\n", "results = gcnn.cv_results_\n", "\n", "# Identify top 10 models and plot their mean scores, along the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots for the mean test score while marginalizing over other parameters\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('RMS error')\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*First, Lets confirm which activation function works the best for classification. 
Is the difference as significant as for the regression problem?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mlpc = MLPClassifier([30, 10], max_iter = 1000)\n", "\n", "parameters = {'activation':['relu', 'logistic', 'identity']}\n", "gcnnc = GridSearchCV(mlpc, parameters, cv=4)\n", "\n", "# Do the fit\n", "gcnnc.fit(...)\n", "results = gcnnc.cv_results_\n", "print(results['mean_test_score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Grid Search\n", "Do a grid search on the hidden layers, batch size and regularization strength to get the best model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This will take a few minutes\n", "mlpc = MLPClassifier(max_iter = 1000)\n", "\n", "parameters = {'hidden_layer_sizes':[5, 100, [5, 10], [100, 50]], \n", " 'alpha':[1e-1, 1e-3, 1e-5], \n", " 'batch_size':[50, 500, 2000]}\n", "\n", "gcnnc = GridSearchCV(mlpc, parameters, cv=4)\n", "# Do the fit\n", "gcnnc.fit(...)\n", "results = gcnnc.cv_results_\n", "\n", "# Identify the top 10 models and plot their mean scores, along with the standard deviation. \n", "...\n", "\n", "# For each hyperparameter, make plots of the mean test score while marginalizing over the other parameters\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8. Compare! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Make a plot of the RMS error for regression on the testing data set, using the best model from the grid search for each algorithm\n", "- Make a plot of the accuracy for classification on the testing data set, using the best model from the grid search for each algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Testing error for regression\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Testing error for classification\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## To Submit\n", "Execute the following cell to submit.\n", "If you make changes, execute the cell again to resubmit the final copy of the notebook; it does not get updated automatically.
\n", "__We recommend that all the above cells should be executed (their output visible) in the notebook at the time of submission.__
\n", "Only the final submission before the deadline will be graded. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "_ = ok.submit()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }