{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This workshop is divided into 4 sections.  \n",
    "- Section 1: Data Preprocessing\n",
    "- Section 2: Regression\n",
    "- Section 3: Dimensionality Reduction Algorithms\n",
    "- Section 4: Classification\n",
    "\n",
    "All of these topics are taught in Python using the machine learning package [scikit-learn](http://scikit-learn.org/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For sections 1 and 2, I will be using a dataset of avocado prices and volume sold in U.S. cities.\n",
    "\n",
    "For sections 3 and 4, I will be using a dataset of antibiotic resistance in gonorrhea strains.\n",
    "\n",
    "**Note: In addition to sklearn, this workshop requires the pandas, numpy and matplotlib packages (pickle is also used, but it ships with Python's standard library). Although not required, it would be great if you had the [graphviz](https://graphviz.gitlab.io/download/) package.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy\n",
    "import pandas as pd\n",
    "import pickle\n",
    "import matplotlib\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Avocado U.S. Cities Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset contains avocado prices and total volume sold in U.S. cities. I obtained the data from [Kaggle](https://www.kaggle.com/neuromusic/avocado-prices-across-regions-and-seasons/data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "avocado_path = 'avocado.csv'  # please make sure avocado.csv is in the same directory as this notebook\n",
    "avocado_df = pd.read_csv(avocado_path,header=0)\n",
    "avocado_df.drop('Unnamed: 0', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The avocado dataset is stored as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), **avocado_df**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The DataFrame has columns: ',avocado_df.columns.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From Kaggle, the descriptions of the columns are:\n",
    "\n",
    "- Date - The date of the observation\n",
    "- AveragePrice - the average price of a single avocado\n",
    "- type - conventional or organic\n",
    "- year - the year\n",
    "- region - the city or region of the observation\n",
    "- Total Volume - Total number of avocados sold\n",
    "- 4046 - Total number of avocados with PLU 4046 sold\n",
    "- 4225 - Total number of avocados with PLU 4225 sold\n",
    "- 4770 - Total number of avocados with PLU 4770 sold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, I add a column for the month of observation to the dataset loaded above. I also print the `head()` of the dataset. The `head` is the first `n` (default `n=5`) rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "month = [int(date[5:7]) for date in avocado_df['Date'].values]\n",
    "avocado_df['Month'] = month\n",
    "print(avocado_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will apply the following preprocessing techniques:\n",
    "\n",
    "- Normalization\n",
    "- Standardization\n",
    "- Label Encoding\n",
    "- Training-Test Split\n",
    "\n",
    "to the avocado dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's work through the cells below!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalization and Standardization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I create the `total_volume_and_bags` variable to store the Total Volume and Total Bags columns. \n",
    "\n",
    "We'll normalize and standardize these columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_volume_and_bags = avocado_df.loc[:,['Total Volume','Total Bags']]\n",
    "print('The first 5 entries of Total Volume and Total Bags are:\\n',total_volume_and_bags.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1.1\n",
    "\n",
    "Below is an example that normalizes `total_volume_and_bags` with an instance of sklearn's [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) class and prints the first five rows of the normalized values.\n",
    "\n",
    "**As an exercise**: Standardize `total_volume_and_bags` using an instance of [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class. Print the standardized values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "normalizer_scaler = Normalizer()\n",
    "normalized_volume_bags = normalizer_scaler.fit_transform(total_volume_and_bags)\n",
    "print('The head of the normalized Total Volume and Total Bags are:\\n',normalized_volume_bags[0:5,:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "standardized_scaler = StandardScaler()\n",
    "standardized_volume_bags = standardized_scaler.fit_transform(total_volume_and_bags)\n",
    "print('The standardized Total Volume and Total Bags are:\\n',standardized_volume_bags[0:5,:])"
   ]
  },
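  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two scalers act along different axes: `Normalizer` rescales each row (sample) to unit norm, while `StandardScaler` rescales each column (feature) to zero mean and unit variance. A minimal sketch on a made-up toy array (the values and the names `toy`, `row_normalized` and `column_standardized` are illustrative assumptions, not part of the avocado data):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import Normalizer, StandardScaler\n",
    "\n",
    "# Hypothetical toy data: 3 samples (rows) and 2 features (columns)\n",
    "toy = np.array([[3.0, 4.0],\n",
    "                [6.0, 8.0],\n",
    "                [0.0, 5.0]])\n",
    "\n",
    "row_normalized = Normalizer().fit_transform(toy)           # default norm is L2, applied per row\n",
    "column_standardized = StandardScaler().fit_transform(toy)  # applied per column\n",
    "\n",
    "print('Row L2 norms after Normalizer:', np.linalg.norm(row_normalized, axis=1))\n",
    "print('Column means after StandardScaler:', column_standardized.mean(axis=0))\n",
    "print('Column standard deviations after StandardScaler:', column_standardized.std(axis=0))\n",
    "```\n",
    "\n",
    "The row norms should all be 1, and the column means and standard deviations should be approximately 0 and 1. Because `Normalizer` works per row, the first two toy rows, which point in the same direction, map to the same normalized row."
   ]
  },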
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Label Encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use the categorical variables `region` and `type` in regression, we must encode them into integer values. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `region_categories` and `type_categories` variables store the [unique](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) categories in the `region` and `type` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "region_categories = avocado_df['region'].unique()\n",
    "type_categories = avocado_df['type'].unique()\n",
    "\n",
    "print('Region categories are: \\n',region_categories,'\\n')\n",
    "print('Type categories are: \\n',type_categories,'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1.2\n",
    "\n",
    "Using the code below for encoding `region_categories` as an example, encode `type_categories` using an instance of [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and print the encoded type categories.\n",
    "\n",
    "Note: you must use different instances of [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode `region_categories` and `type_categories`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "region_encoder = LabelEncoder()\n",
    "encoded_region_cats = region_encoder.fit_transform(region_categories)\n",
    "print('The encoded region categories are:', encoded_region_cats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "type_encoder  = LabelEncoder()\n",
    "encoded_type_cats = type_encoder.fit_transform(type_categories)\n",
    "print('The encoded types are:', encoded_type_cats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To encode the `region` and `type` columns, run the cells below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Before the region column was: \\n', avocado_df['region'].head())\n",
    "print('\\n')\n",
    "print('Before the type column was: \\n', avocado_df['type'].head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## RUN THIS ONLY ONCE. IF YOU RUN IT TWICE OR MORE, YOU WILL GET AN ERROR ##\n",
    "\n",
    "avocado_df['region'] = region_encoder.transform(avocado_df['region'])\n",
    "avocado_df['type'] = type_encoder.transform(avocado_df['type'])\n",
    "\n",
    "## RUN THIS ONLY ONCE. IF YOU RUN IT TWICE OR MORE, YOU WILL GET AN ERROR ##"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('After the region column is now: \\n', avocado_df['region'].head())\n",
    "print('\\n')\n",
    "print('After the type column is now: \\n', avocado_df['type'].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train-Test Split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will attempt to predict the price of an avocado given the demand, time of year and place of purchase. We will use the columns\n",
    "```\n",
    "'4046', '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region', 'Month'\n",
    "```\n",
    "as *explanatory* variables and the price as the *response* variable.\n",
    "\n",
    "I use `avocado_explanatory_variables` to store the explanatory variables and `avocado_response_variables` to store the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "avocado_response_variables = avocado_df['AveragePrice']\n",
    "avocado_explanatory_variables = avocado_df.drop(['Date','Total Volume','Total Bags','AveragePrice'], axis=1)\n",
    "\n",
    "print('avocado_explanatory_variables stores: \\n',avocado_explanatory_variables.head())\n",
    "print('\\n')\n",
    "print('avocado_response_variables stores: \\n',avocado_response_variables.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code standardizes the columns of the `avocado_explanatory_variables`, **except the categorical variables**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#### RUN ONLY ONCE. YOU WILL GET AN ERROR IF YOU RUN TWICE ###\n",
    "standardized_scaler_explanatory = StandardScaler()\n",
    "avocado_standardized_explanatory_columns = standardized_scaler_explanatory.fit_transform(\n",
    "                                            avocado_explanatory_variables.loc[:,['4046', \n",
    "                                            '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags']])\n",
    "\n",
    "avocado_explanatory_variables.loc[:,['4046', '4225', '4770', 'Small Bags', \n",
    "                                     'Large Bags', 'XLarge Bags']] =  avocado_standardized_explanatory_columns\n",
    "\n",
    "print('The standard explanatory variables are: \\n\\n', avocado_explanatory_variables.loc[0:5,:],'\\n\\n')\n",
    "#### RUN ONLY ONCE. YOU WILL GET AN ERROR IF YOU RUN TWICE ###"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, I use [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split:\n",
    "- `avocado_explanatory_variables` into `training_set` and `test_set`\n",
    "- `avocado_response_variables` into `y_training_set` and `y_test_set`\n",
    "\n",
    "I use a $70\\%:30\\%$ training to test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "training_set, test_set, y_training_set, y_test_set = train_test_split(avocado_explanatory_variables,\n",
    "                                                                      avocado_response_variables,test_size=0.30,\n",
    "                                                                      train_size=0.70,random_state=0)\n",
    "\n",
    "print('The training division of the explanatory variables has head: \\n\\n',training_set.head(),'\\n\\n')\n",
    "print('The training division of the response variables has head: \\n\\n',y_training_set.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1.3\n",
    "\n",
    "The Avocado dataset has 18248 observations. Instead of splitting the data set using fractions, do a $70\\%:30\\%$ split by passing integers (numbers of observations) to the arguments `train_size` and `test_size`.\n",
    "\n",
    "That is, change `train_size` and `test_size` to integers so that we get a $70\\%:30\\%$ split.\n",
    "\n",
    "Note: $18248 \\times 30\\% \\approx 5474$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "from sklearn.model_selection import train_test_split\n",
    "training_set, test_set, y_training_set, y_test_set = train_test_split(avocado_explanatory_variables,\n",
    "                                                                      avocado_response_variables,\n",
    "                                                                      test_size=5474,train_size=18248-5474,\n",
    "                                                                      random_state=0)\n",
    "print('The training division of the explanatory variables has head: \\n\\n',training_set.head(),'\\n\\n')\n",
    "print('The training division of the response variables has head: \\n\\n',y_training_set.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below plots each explanatory variable against the price in the training data. This helps visualize the relationship between each explanatory variable and the price."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for label in training_set.columns.values:\n",
    "    plt.scatter(training_set.loc[:, label],y_training_set.values)\n",
    "    plt.xlabel(label)  # numeric columns were standardized above; categorical columns are label-encoded\n",
    "    plt.ylabel('Price')\n",
    "    plt.title('Price vs. ' + label)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using an instance of [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), I individually regress each column of `training_set` against `y_training_set` and plot the best-fit line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "lr = LinearRegression()\n",
    "column_names = avocado_explanatory_variables.columns.values\n",
    "\n",
    "for label in column_names:\n",
    "    x_training = training_set.loc[:,label].values.reshape(-1,1)\n",
    "    x_test = test_set.loc[:,label].values.reshape(-1,1)\n",
    "    lr.fit(x_training,y_training_set)\n",
    "    m = lr.coef_       # slope of the fitted line\n",
    "    b = lr.intercept_  # intercept of the fitted line\n",
    "    \n",
    "    plt.scatter(x_training,y_training_set.values)\n",
    "    plt.plot(x_training, b + m * x_training, 'r-')\n",
    "    plt.xlabel('Training set - ' + label)\n",
    "    plt.ylabel('Price')\n",
    "    plt.title('Training set : Price vs. ' + label + ', $R^2=$'\n",
    "              +str(lr.score(x_training,y_training_set)))\n",
    "    plt.show()\n",
    "    \n",
    "    plt.scatter(x_test,y_test_set.values)\n",
    "    plt.plot(x_test, b + m * x_test, 'r-')\n",
    "    plt.xlabel('Test set - ' + label)\n",
    "    plt.ylabel('Price')\n",
    "    plt.title('Test set : Price vs. ' + label + ', $R^2=$'\n",
    "              +str(lr.score(x_test,y_test_set)))\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall, a linear model performs pretty poorly in each instance.\n",
    "\n",
    "What if we considered all the variables in a single linear model? How would the linear model perform?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr.fit(training_set,y_training_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2.1\n",
    "\n",
    "Print the coefficients of the [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) instance `lr` (modeled with all variables). Also print the $R^2$ value of the model using the training set and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "#Solution\n",
    "print('The coefficients, except the intercept, are: ', lr.coef_,'\\n')\n",
    "print('The intercept is: ',lr.intercept_,'\\n')\n",
    "print('The R^2 from the training set is: ', str(lr.score(training_set,y_training_set)), '\\n')\n",
    "print('The R^2 from the test set is: ',str(lr.score(test_set,y_test_set)),'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With more variables, the model performs a bit better! However, the model is far from perfect.\n",
    "\n",
    "Could LASSO or Ridge regression help us? \n",
    "\n",
    "They're worth a try."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LASSO and Ridge Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using an instance of [LASSO](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) with `alpha=1.0`, I regress the columns of training_set against y_training_set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "lasso = Lasso(alpha=1)\n",
    "lasso.fit(training_set,y_training_set)\n",
    "\n",
    "print('The coefficients, except the intercept, are: ', lasso.coef_,'\\n')\n",
    "print('The intercept is: ',lasso.intercept_,'\\n')\n",
    "print('The R^2 from the training set is: ', str(lasso.score(training_set,y_training_set)), '\\n')\n",
    "print('The R^2 from the test set is: ',str(lasso.score(test_set,y_test_set)),'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, LASSO doesn't help. It actually performs worse."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2.2 \n",
    "\n",
    "Now it's your turn!\n",
    "\n",
    "Using an instance of [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) with `alpha=1.0`, regress the columns of `training_set` against `y_training_set`. Print the coefficients, the intercept and the $R^2$ on the training and test sets. Comment on what you observe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#Solution\n",
    "ridge = Ridge(alpha=1.0)\n",
    "ridge.fit(training_set,y_training_set)\n",
    "\n",
    "print('The coefficients, except the intercept, are: ', ridge.coef_,'\\n')\n",
    "print('The intercept is: ',ridge.intercept_,'\\n')\n",
    "print('The R^2 from the training set is: ', str(ridge.score(training_set,y_training_set)), '\\n')\n",
    "print('The R^2 from the test set is: ',str(ridge.score(test_set,y_test_set)),'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Decision Trees\n",
    "A linear model is a poor fit for this problem.\n",
    "\n",
    "What about using a decision tree?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I use an instance of [DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) with `max_depth=5` to regress training_set against y_training_set. \n",
    "\n",
    "I print the feature importance and $R^2$ value using training set and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "dtr = DecisionTreeRegressor(max_depth=5,random_state=0)\n",
    "dtr.fit(training_set,y_training_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n",
    "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))\n",
    "print('\\n')\n",
    "for feature,importance in zip(avocado_explanatory_variables.columns.values.tolist(),dtr.feature_importances_):\n",
    "    print(feature,'variable has importance,', importance,'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I plot the tree produced below with `graphviz`. This will not work unless graphviz is installed both as a Python package AND on your local computer (the Graphviz executables must be on your system's PATH)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import graphviz \n",
    "import sklearn.tree as tree\n",
    "dot_data = tree.export_graphviz(dtr, out_file=None, \n",
    "                         feature_names=column_names,  \n",
    "                         filled=True, rounded=True,  \n",
    "                         special_characters=True)  \n",
    "graph = graphviz.Source(dot_data)  \n",
    "graph "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unsurprisingly, increasing the max depth to 10 **increases** the model's $R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "dtr = DecisionTreeRegressor(max_depth=10,random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dtr.fit(training_set,y_training_set)\n",
    "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n",
    "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Forest Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying [RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), we see similar successes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor\n",
    "rf = RandomForestRegressor(n_estimators=10,max_depth=10,max_features=5,random_state=0)\n",
    "rf.fit(training_set,y_training_set)\n",
    "\n",
    "print('The R^2 value for training set is: ',rf.score(training_set,y_training_set))\n",
    "print('The R^2 value for test set is: ',rf.score(test_set,y_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2.3\n",
    "\n",
    "Play around with the `max_depth` parameter for the decision tree and random forest regression. Which value gives the lowest error (that is, the highest $R^2$ on the test set)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution\n",
    "dtr = DecisionTreeRegressor(max_depth=20,random_state=0)\n",
    "dtr.fit(training_set,y_training_set)\n",
    "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n",
    "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))\n",
    "print('\\n')\n",
    "\n",
    "rf = RandomForestRegressor(n_estimators=10,max_depth=20,max_features=5,random_state=0)\n",
    "rf.fit(training_set,y_training_set)\n",
    "\n",
    "print('The R^2 value for training set is: ',rf.score(training_set,y_training_set))\n",
    "print('The R^2 value for test set is: ',rf.score(test_set,y_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-Mer Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using data from [PATRIC](https://www.patricbrc.org/), I created pseudo-genomes of *Neisseria gonorrhoeae* bacterial strains. Each genome is labelled by its antibiotic resistance to *azithromycin*.\n",
    "\n",
    "For each strain, I split the DNA into k-mers. k-mers are consecutive substrings of a DNA strand, each containing *k* nucleotides.\n",
    "\n",
    "The image below shows the 7-mers of ATGGAAGTCGCGGAATC.\n",
    "\n",
    "![7mers.png](7mers.png)\n",
    "\n",
    "I collected all unique 31-mers from each genome. I then constructed a dataset in which each row represents a strain and each column represents a 31-mer. A 31-mer column holds 1 if the strain's genome contains the 31-mer and 0 if it does not.\n",
    "\n",
    "A genome is labelled 1 if it is susceptible to *azithromycin* and 0 if it is resistant to *azithromycin*.\n",
    "\n",
    "**I aim to build a classifier that can correctly label antibiotic resistance in *Neisseria gonorrhoeae*.**"
   ]
  },
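  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The k-mer construction described above can be sketched in a few lines. This assumes the usual overlapping, sliding-window definition of k-mers (consecutive windows shifted by one nucleotide); the helper `kmers` and the second, made-up sequence are hypothetical and are not the code used to build the dataset:\n",
    "\n",
    "```python\n",
    "def kmers(sequence, k):\n",
    "    # Slide a window of width k along the sequence, one nucleotide at a time\n",
    "    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]\n",
    "\n",
    "# The 7-mers of the example sequence from the text (17 - 7 + 1 = 11 windows)\n",
    "seven_mers = kmers('ATGGAAGTCGCGGAATC', 7)\n",
    "print(len(seven_mers), seven_mers[:3])\n",
    "\n",
    "# Presence/absence encoding for two toy 'genomes': one column per unique k-mer\n",
    "genomes = ['ATGGAAGTCGCGGAATC', 'ATGGAAGTTTTTTTTTT']  # second sequence is made up\n",
    "kmer_sets = [set(kmers(genome, 7)) for genome in genomes]\n",
    "all_kmers = sorted(set().union(*kmer_sets))\n",
    "presence = [[int(kmer in kmer_set) for kmer in all_kmers] for kmer_set in kmer_sets]\n",
    "print(len(all_kmers), 'unique 7-mers;', [sum(row) for row in presence], 'present per genome')\n",
    "```\n",
    "\n",
    "As described above, the workshop dataset uses the same presence/absence idea with `k=31` and one row per strain."
   ]
  },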
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, I load the k-mers data set. I have already divided the set into training set and test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'kmer_data'\n",
    "\n",
    "def load_obj(name):\n",
    "    \"\"\"\n",
    "    Load a pickle file. Taken from\n",
    "    https://stackoverflow.com/questions/19201290/how-to-save-a-dictionary-to-a-file\n",
    "    :param name: Name of file\n",
    "    :return: the file inside the pickle\n",
    "    \"\"\"\n",
    "\n",
    "    with open(path + name + '.pkl', 'rb') as f:\n",
    "        return pickle.load(f)\n",
    "\n",
    "    \n",
    "# .toarray() returns a plain NumPy array; .todense() returns np.matrix,\n",
    "# which recent scikit-learn versions no longer accept\n",
    "kmers_training_set = load_obj(\"/train\").toarray()\n",
    "label_training_set = load_obj(\"/label_train\")\n",
    "kmers_test_set = load_obj(\"/test\").toarray()\n",
    "label_test_set = load_obj(\"/label_test\")\n",
    "kmerlist = load_obj(\"/kmerlist\")\n",
    "\n",
    "#Just to show head of data\n",
    "kmer_df = pd.DataFrame(kmers_training_set, columns=kmerlist)\n",
    "print('The kmer data looks like:\\n',kmer_df.head())\n",
    "print('\\n')\n",
    "print('The first 5 categories are:\\n',label_training_set[0:5])\n",
    "print('The training set has shape: ',kmers_training_set.shape)\n",
    "print('The test set has shape: ',kmers_test_set.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Dimensionality Reduction: PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I apply [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) without whitening to the k-mer training data set and reduce the number of features (dimensions) to 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "pca_without_whitening = PCA(n_components=2,whiten=False)\n",
    "pca_without_whitening.fit(kmers_training_set)\n",
    "\n",
    "kmer_training_pca_without_whitening = pca_without_whitening.transform(kmers_training_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, I generate a two-dimensional plot of the PCA-transformed training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colours = ['red','blue']\n",
    "\n",
    "presence_0 = [int(element) == 0 for element in label_training_set]\n",
    "presence_1 = [int(element) == 1 for element in label_training_set]\n",
    "\n",
    "plt.scatter(kmer_training_pca_without_whitening[presence_0, 0],\n",
    "            kmer_training_pca_without_whitening[presence_0, 1],\n",
    "            label='label = 0 (Resistant)',\n",
    "            c='r')\n",
    "\n",
    "plt.scatter(kmer_training_pca_without_whitening[presence_1, 0],\n",
    "            kmer_training_pca_without_whitening[presence_1, 1],\n",
    "            label='label = 1 (Susceptible)',\n",
    "            c='b')\n",
    "\n",
    "plt.xlabel('$X_0$')\n",
    "plt.ylabel('$X_1$')\n",
    "plt.legend()\n",
    "plt.title('PCA plot of k-mer training data')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I also apply the PCA without whitening, fitted on the training set, to the k-mer test data set, reducing the number of features (dimensions) to 2.\n",
    "\n",
    "I then plot the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kmer_test_pca_without_whitening = pca_without_whitening.transform(kmers_test_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "colours = ['red','blue']\n",
    "\n",
    "presence_0 = [int(element) == 0 for element in label_test_set]\n",
    "presence_1 = [int(element) == 1 for element in label_test_set]\n",
    "\n",
    "plt.scatter(kmer_test_pca_without_whitening[presence_0, 0],\n",
    "            kmer_test_pca_without_whitening[presence_0, 1],\n",
    "            label='label = 0 (Resistant)',\n",
    "            c='r')\n",
    "\n",
    "plt.scatter(kmer_test_pca_without_whitening[presence_1, 0],\n",
    "            kmer_test_pca_without_whitening[presence_1, 1],\n",
    "            label='label = 1 (Susceptible)',\n",
    "            c='b')\n",
    "\n",
    "plt.xlabel('$X_0$')\n",
    "plt.ylabel('$X_1$')\n",
    "plt.title('PCA plot of k-mer test data')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I then print the explained variance and singular value for each principal component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num = len(pca_without_whitening.explained_variance_)\n",
    "for i in range(num):\n",
    "    print('Component',i, 'explains',pca_without_whitening.explained_variance_[i],'variance')\n",
    "    print('Component',i, 'has singular value', pca_without_whitening.singular_values_[i])"
   ]
  },
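  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two quantities printed above are closely related. As a minimal sketch on synthetic data (the array `X_demo` and the model `pca_demo` are illustrative names, not part of the workshop data), the explained variance of each component can be recovered from its singular value by squaring it and dividing by the number of samples minus one:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# synthetic stand-in for the k-mer count matrix (assumption, not the workshop data)\n",
    "rng = np.random.RandomState(0)\n",
    "X_demo = rng.normal(size=(50, 6))\n",
    "\n",
    "pca_demo = PCA(n_components=2).fit(X_demo)\n",
    "\n",
    "# explained variance recovered from the singular values\n",
    "variance_from_sv = pca_demo.singular_values_ ** 2 / (X_demo.shape[0] - 1)\n",
    "print(pca_demo.explained_variance_)\n",
    "print(variance_from_sv)\n",
    "\n",
    "# fraction of the total variance captured by each component\n",
    "print(pca_demo.explained_variance_ratio_)\n",
    "```\n",
    "\n",
    "The attribute `explained_variance_ratio_` is often easier to interpret than the raw variance, because it is expressed as a fraction of the total variance in the data."
   ]
  },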
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3.1\n",
    "\n",
    "Using the variables `kmers_training_set` and `kmers_test_set`, apply [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) with whitening to the k-mer training and test data set. The variables are initialized in the cell below.\n",
    "\n",
    "Store the reduced training data in the variable `kmer_training_pca_with_whitening` and the reduced test data in the variable `kmer_test_pca_with_whitening`.\n",
    "\n",
    "Run the code below to generate the plots. Comment on the differences between the whitened and non-whitened plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#solution\n",
    "pca_with_whitening = PCA(n_components=2,whiten=True)\n",
    "kmer_training_pca_with_whitening = pca_with_whitening.fit_transform(kmers_training_set)\n",
    "\n",
    "kmer_test_pca_with_whitening = pca_with_whitening.transform(kmers_test_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#generate plots\n",
    "\n",
    "colours = ['red','blue']\n",
    "\n",
    "presence_0 = [int(element) == 0 for element in label_training_set]\n",
    "presence_1 = [int(element) == 1 for element in label_training_set]\n",
    "\n",
    "plt.scatter(kmer_training_pca_with_whitening[presence_0, 0],\n",
    "            kmer_training_pca_with_whitening[presence_0, 1],\n",
    "            label='label = 0 (Resistant)',\n",
    "            c='r')\n",
    "\n",
    "plt.scatter(kmer_training_pca_with_whitening[presence_1, 0],\n",
    "            kmer_training_pca_with_whitening[presence_1, 1],\n",
    "            label='label = 1 (Susceptible)',\n",
    "            c='b')\n",
    "\n",
    "plt.xlabel('$X_0$')\n",
    "plt.ylabel('$X_1$')\n",
    "plt.legend()\n",
    "plt.title('PCA plot with whitening of k-mer training data')\n",
    "plt.show()\n",
    "\n",
    "\n",
    "presence_0 = [int(element) == 0 for element in label_test_set]\n",
    "presence_1 = [int(element) == 1 for element in label_test_set]\n",
    "\n",
    "plt.scatter(kmer_test_pca_with_whitening[presence_0, 0],\n",
    "            kmer_test_pca_with_whitening[presence_0, 1],\n",
    "            label='label = 0 (Resistant)',\n",
    "            c='r')\n",
    "\n",
    "plt.scatter(kmer_test_pca_with_whitening[presence_1, 0],\n",
    "            kmer_test_pca_with_whitening[presence_1, 1],\n",
    "            label='label = 1 (Susceptible)',\n",
    "            c='b')\n",
    "\n",
    "plt.xlabel('$X_0$')\n",
    "plt.ylabel('$X_1$')\n",
    "plt.legend()\n",
    "plt.title('PCA plot with whitening of k-mer test data')\n",
    "plt.show()"
   ]
  },
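  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the effect of whitening concrete, here is a small sketch on synthetic data (`X_demo` is an assumed toy matrix, not the k-mer data). Without whitening, the projected components keep their own variances; with whitening, each component is rescaled to unit variance on the data used for fitting:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "# synthetic data whose features have very different scales (assumption)\n",
    "X_demo = rng.normal(size=(200, 4)) * np.array([10.0, 5.0, 1.0, 0.1])\n",
    "\n",
    "plain = PCA(n_components=2, whiten=False).fit_transform(X_demo)\n",
    "whitened = PCA(n_components=2, whiten=True).fit_transform(X_demo)\n",
    "\n",
    "# sample variance of each projected component\n",
    "plain_var = plain.var(axis=0, ddof=1)\n",
    "whitened_var = whitened.var(axis=0, ddof=1)\n",
    "print('variance without whitening:', plain_var)\n",
    "print('variance with whitening:   ', whitened_var)\n",
    "```\n",
    "\n",
    "In a scatter plot this shows up mainly as a change of axis scale: the relative positions of the points along each axis are preserved, but both axes end up with comparable spread."
   ]
  },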
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Naive Bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I build a [Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier to learn and predict resistance in the genomes. I print the model parameters for the first 5 k-mers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import GaussianNB\n",
    "gNB=GaussianNB()\n",
    "gNB.fit(kmers_training_set,label_training_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "print('On the training set, Naive Bayes has an accuracy of',gNB.score(kmers_training_set,label_training_set)*100,'%')\n",
    "print('On the test set, Naive Bayes has an accuracy of', gNB.score(kmers_test_set,label_test_set)*100,'%')\n",
    "\n",
    "print('\\n')\n",
    "\n",
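    "# class_count_ holds the number of training samples observed for each class\n",
    "for i in range(len(gNB.class_count_)):\n",
    "    print('Class',i,'has',gNB.class_count_[i], 'training samples')\n",
    "    \n",
    "print('\\n')\n",
    "# per-class feature variances: `var_` in scikit-learn >= 1.0, `sigma_` in older releases\n",
    "variances = gNB.var_ if hasattr(gNB, 'var_') else gNB.sigma_\n",
    "for i in range(5):\n",
    "    print('When class = 0, k-mer,', kmerlist[i],', has mean',gNB.theta_[0,i], 'and variance', variances[0,i])\n",
    "    print('\\n')\n",
    "    print('When class = 1, k-mer,', kmerlist[i],', has mean',gNB.theta_[1,i],'and variance',variances[1,i])\n",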
    "    print('\\n')"
   ]
  },
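  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The means printed above are simply per-class averages of each feature. The sketch below checks this on toy data (`X_demo`, `y_demo` and `nb_demo` are illustrative names, and the data is random rather than the k-mer counts):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "# toy data: 40 samples, 3 features, alternating binary labels (assumption)\n",
    "X_demo = rng.normal(size=(40, 3))\n",
    "y_demo = np.array([0, 1] * 20)\n",
    "\n",
    "nb_demo = GaussianNB().fit(X_demo, y_demo)\n",
    "\n",
    "# theta_[c] is the mean of each feature over the training samples of class c\n",
    "manual_means = np.vstack([X_demo[y_demo == c].mean(axis=0) for c in nb_demo.classes_])\n",
    "print(nb_demo.theta_)\n",
    "print(manual_means)\n",
    "\n",
    "# class_count_ is the number of training samples per class\n",
    "print(nb_demo.class_count_)\n",
    "```\n",
    "\n",
    "The fitted variances are not compared here because scikit-learn adds a small smoothing term to them for numerical stability, so they differ slightly from the raw per-class variances."
   ]
  },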
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Naive Bayes performs fairly well at predicting antibiotic resistance. It has an $88\\%$ accuracy rate on the test set.\n",
    "\n",
    "Below, I also construct the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "predict_label_train_set = gNB.predict(kmers_training_set)\n",
    "predict_label_test_set = gNB.predict(kmers_test_set)\n",
    "\n",
    "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set))\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
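  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reading aid for the matrices printed above, here is a minimal sketch with hand-made labels (the lists `y_true` and `y_pred` are invented for illustration). In scikit-learn's convention, rows correspond to the true labels and columns to the predicted labels, so correct predictions sit on the diagonal:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "# hand-made labels for illustration (0 = resistant, 1 = susceptible)\n",
    "y_true = [0, 0, 0, 1, 1, 1]\n",
    "y_pred = [0, 1, 0, 1, 1, 0]\n",
    "\n",
    "cm = confusion_matrix(y_true, y_pred)\n",
    "print(cm)\n",
    "\n",
    "# accuracy is the sum of the diagonal divided by the total number of samples\n",
    "accuracy = np.trace(cm) / cm.sum()\n",
    "print('accuracy =', accuracy)\n",
    "```\n",
    "\n",
    "Here `cm[0, 1]` counts resistant genomes predicted as susceptible and `cm[1, 0]` counts susceptible genomes predicted as resistant, which is worth inspecting separately because the two kinds of error rarely have the same cost."
   ]
  },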
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4.1\n",
    "Apply [Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) to the PCA data, stored in `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from sklearn.metrics import confusion_matrix\n",
    "pca = PCA(n_components=2,whiten=False)\n",
    "\n",
    "kmers_pca_training = pca.fit_transform(kmers_training_set)\n",
    "kmers_pca_test = pca.transform(kmers_test_set)\n",
    "gNB_pca=GaussianNB()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "\n",
    "gNB_pca.fit(kmers_pca_training,label_training_set)\n",
    "print('On the training set, Naive Bayes has an accuracy of',gNB_pca.score(kmers_pca_training,label_training_set)*100,'%')\n",
    "print('On the test set, Naive Bayes has an accuracy of', gNB_pca.score(kmers_pca_test,label_test_set)*100,'%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set = gNB_pca.predict(kmers_pca_training)\n",
    "predict_label_test_set = gNB_pca.predict(kmers_pca_test)\n",
    "\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I build a [Logistic Regression classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to learn and predict resistance in the genomes. I print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "lgr=LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgr.fit(kmers_training_set,label_training_set)\n",
    "\n",
    "print('The coefficients for the Logistic Regression classifier are:\\n',lgr.coef_)\n",
    "print('The intercept for the Logistic Regression classifier is:\\n',lgr.intercept_)\n",
    "print('\\n')\n",
    "print('On the training set, Logistic Regression has an accuracy of',lgr.score(kmers_training_set,label_training_set)*100 , '%')\n",
    "print('On the test set, Logistic Regression has an accuracy of', lgr.score(kmers_test_set,label_test_set)*100, '%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set_lgr = lgr.predict(kmers_training_set)\n",
    "predict_label_test_set_lgr = lgr.predict(kmers_test_set)\n",
    "\n",
    "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_lgr))\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_lgr))"
   ]
  },
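  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The coefficients and intercept printed above determine the predicted probabilities. As a sketch on synthetic binary data (`X_demo`, `y_demo` and `lgr_demo` are illustrative names), the probability of class 1 is the logistic (sigmoid) function applied to the linear score:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "# synthetic binary problem driven mostly by the first feature (assumption)\n",
    "X_demo = rng.normal(size=(60, 3))\n",
    "y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)\n",
    "\n",
    "lgr_demo = LogisticRegression().fit(X_demo, y_demo)\n",
    "\n",
    "# linear score for each sample, then the logistic (sigmoid) function\n",
    "scores = X_demo @ lgr_demo.coef_.ravel() + lgr_demo.intercept_[0]\n",
    "manual_prob = 1.0 / (1.0 + np.exp(-scores))\n",
    "\n",
    "# probability of class 1 as reported by the fitted model\n",
    "model_prob = lgr_demo.predict_proba(X_demo)[:, 1]\n",
    "print(np.allclose(manual_prob, model_prob))\n",
    "```\n",
    "\n",
    "A positive coefficient therefore pushes the prediction towards class 1 as the corresponding feature increases, and a negative coefficient pushes it towards class 0."
   ]
  },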
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4.2\n",
    "Apply [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgr_pca=LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#solution\n",
    "lgr_pca.fit(kmers_pca_training,label_training_set)\n",
    "print('On the training set, Logistic Regression has an accuracy of',lgr_pca.score(kmers_pca_training,label_training_set)*100,'%')\n",
    "print('On the test set, Logistic Regression has an accuracy of', lgr_pca.score(kmers_pca_test,label_test_set)*100,'%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set = lgr_pca.predict(kmers_pca_training)\n",
    "predict_label_test_set = lgr_pca.predict(kmers_pca_test)\n",
    "\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification Trees\n",
    "\n",
    "I build a [Classification Tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) classifier with a maximum depth of 5 to learn and predict resistance in the genomes. I print the five most important k-mers and the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "dtc = DecisionTreeClassifier(max_depth=5,random_state=0)\n",
    "dtc.fit(kmers_training_set,label_training_set)\n",
    "\n",
    "kmerlist_sorted = [kmer for _,kmer in sorted(zip(dtc.feature_importances_,kmerlist), reverse=True)]\n",
    "for i in range(5):\n",
    "    print(kmerlist_sorted[i],'variable has importance,', sorted(dtc.feature_importances_, reverse=True)[i])\n",
    "print('\\n')\n",
    "print('The accuracy for training set is: ',dtc.score(kmers_training_set,label_training_set)*100,'%')\n",
    "print('The accuracy for test set is: ',dtc.score(kmers_test_set,label_test_set)*100,'%')\n",
    "predict_label_train_set_dtc = dtc.predict(kmers_training_set)\n",
    "predict_label_test_set_dtc =  dtc.predict(kmers_test_set)\n",
    "print('\\n')\n",
    "\n",
    "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_dtc))\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_dtc))"
   ]
  },
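  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative way to rank features by importance is to sort their indices with `numpy.argsort`. The sketch below uses synthetic data and made-up feature names (`feature_names` is hypothetical and stands in for `kmerlist`):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# synthetic binary data with 8 features (assumption)\n",
    "X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)\n",
    "feature_names = ['feature_%d' % i for i in range(X_demo.shape[1])]  # hypothetical names\n",
    "\n",
    "tree_demo = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)\n",
    "\n",
    "# indices of the features, from most to least important\n",
    "order = np.argsort(tree_demo.feature_importances_)[::-1]\n",
    "for idx in order[:5]:\n",
    "    print(feature_names[idx], 'has importance', tree_demo.feature_importances_[idx])\n",
    "```\n",
    "\n",
    "The importances of a fitted tree sum to 1, and features that are never used in a split have an importance of 0."
   ]
  },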
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I plot the decision tree with max depth, 5, below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import graphviz \n",
    "import sklearn.tree as tree\n",
    "dot_data = tree.export_graphviz(dtc, out_file=None, \n",
    "                         feature_names=kmerlist,  \n",
    "                         filled=True, rounded=True,  \n",
    "                         special_characters=True)  \n",
    "graph = graphviz.Source(dot_data)  \n",
    "graph "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4.3\n",
    "\n",
    "Apply a [decision tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) with `max_depth=5` to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtc_pca=DecisionTreeClassifier(max_depth=5,random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#Solution\n",
    "dtc_pca.fit(kmers_pca_training,label_training_set)\n",
    "print('On the training set, a classification tree with max depth 5 has an accuracy of',dtc_pca.score(kmers_pca_training,label_training_set)*100,'%')\n",
    "print('On the test set, a classification tree with max depth 5 has an accuracy of', dtc_pca.score(kmers_pca_test,label_test_set)*100,'%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set = dtc_pca.predict(kmers_pca_training)\n",
    "predict_label_test_set = dtc_pca.predict(kmers_pca_test)\n",
    "\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the code below to plot your classification tree.\n",
    "\n",
    "#### IF YOU INCREASE THE DEPTH BEYOND 5, DO NOT RUN THE CODE BELOW. IPYTHON MAY CRASH OR IT MAY TAKE A VERY LONG TIME TO LOAD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import graphviz \n",
    "import sklearn.tree as tree\n",
    "colours = ['red','blue']\n",
    "\n",
    "presence_0 = [int(element) == 0 for element in label_training_set]\n",
    "presence_1 = [int(element) == 1 for element in label_training_set]\n",
    "\n",
    "plt.scatter(kmer_training_pca_without_whitening[presence_0, 0],\n",
    "            kmer_training_pca_without_whitening[presence_0, 1],\n",
    "            label='label = 0 (Resistant)',\n",
    "            c='r')\n",
    "\n",
    "plt.scatter(kmer_training_pca_without_whitening[presence_1, 0],\n",
    "            kmer_training_pca_without_whitening[presence_1, 1],\n",
    "            label='label = 1 (Susceptible)',\n",
    "            c='b')\n",
    "\n",
    "plt.xlabel('$X_0$')\n",
    "plt.ylabel('$X_1$')\n",
    "plt.legend()\n",
    "plt.title('PCA plot of k-mer training data')\n",
    "plt.show()\n",
    "dot_data = tree.export_graphviz(dtc_pca, out_file=None, \n",
    "                         feature_names=['Component 0','Component 1'],  \n",
    "                         filled=True, rounded=True,  \n",
    "                         special_characters=True)  \n",
    "graph = graphviz.Source(dot_data)  \n",
    "graph "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### AdaBoost\n",
    "\n",
    "I build an [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) classifier that uses classification trees of max depth 1 (decision stumps) as weak learners, with 10 estimators, to learn and predict resistance in the genomes. I print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "\n",
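    "# the weak learner is passed positionally: its keyword is `base_estimator` in older scikit-learn releases and `estimator` in newer ones\n",
    "adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),\n",
    "                              n_estimators=10, random_state=0)\n",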
    "adaboost.fit(kmers_training_set,label_training_set)\n",
    "\n",
    "\n",
    "kmerlist_sorted = [kmer for _,kmer in sorted(zip(adaboost.feature_importances_,kmerlist), reverse=True)]\n",
    "for i in range(5):\n",
    "    print(kmerlist_sorted[i],'variable has importance,', sorted(adaboost.feature_importances_, reverse=True)[i])\n",
    "\n",
    "print('\\n')\n",
    "print('The accuracy for training set is: ',adaboost.score(kmers_training_set,label_training_set)*100,'%')\n",
    "print('The accuracy for test set is: ',adaboost.score(kmers_test_set,label_test_set)*100, '%')\n",
    "predict_label_train_set_ada = adaboost.predict(kmers_training_set)\n",
    "predict_label_test_set_ada =  adaboost.predict(kmers_test_set)\n",
    "print('\\n')\n",
    "\n",
    "print('For training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_ada))\n",
    "print('For test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_ada))"
   ]
  },
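  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how boosting builds up its prediction one stump at a time, the sketch below uses `staged_score`, which reports the accuracy after each boosting round. The data is synthetic (`X_demo`, `y_demo` and `ada_demo` are illustrative names), so the numbers say nothing about the k-mer data:\n",
    "\n",
    "```python\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# synthetic binary data (assumption)\n",
    "X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)\n",
    "\n",
    "# the weak learner is passed positionally because its keyword name differs between scikit-learn versions\n",
    "ada_demo = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),\n",
    "                              n_estimators=10, random_state=0)\n",
    "ada_demo.fit(X_demo, y_demo)\n",
    "\n",
    "# training accuracy after 1, 2, ... boosting rounds\n",
    "staged = list(ada_demo.staged_score(X_demo, y_demo))\n",
    "for n_stumps, acc in enumerate(staged, start=1):\n",
    "    print(n_stumps, 'stumps: training accuracy', acc)\n",
    "```\n",
    "\n",
    "Calling `staged_score` on held-out data instead of the training data is one way to judge whether adding more estimators still helps."
   ]
  },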
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4.4\n",
    "Apply [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with classification tree stumps and 10 estimators to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
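    "# the weak learner is passed positionally: its keyword is `base_estimator` in older scikit-learn releases and `estimator` in newer ones\n",
    "adaboost_pca = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),\n",
    "                              n_estimators=10, random_state=0)"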
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "adaboost_pca.fit(kmers_pca_training,label_training_set)\n",
    "print('On the training set, AdaBoost with 10 stumps has an accuracy of',adaboost_pca.score(kmers_pca_training,label_training_set)*100,'%')\n",
    "print('On the test set, AdaBoost with 10 stumps has an accuracy of', adaboost_pca.score(kmers_pca_test,label_test_set)*100,'%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set = adaboost_pca.predict(kmers_pca_training)\n",
    "predict_label_test_set = adaboost_pca.predict(kmers_pca_test)\n",
    "\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### KNN\n",
    "\n",
    "I build a [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier with 5 neighbors (`n_neighbors=5`) to learn and predict resistance in the genomes. I print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "knn=KNeighborsClassifier(n_neighbors=5)\n",
    "knn.fit(kmers_training_set,label_training_set)\n",
    "\n",
    "print('The accuracy for training set is: ',knn.score(kmers_training_set,label_training_set)*100,'%')\n",
    "print('The accuracy for test set is: ',knn.score(kmers_test_set,label_test_set)*100,'%')\n",
    "predict_label_train_set_knn = knn.predict(kmers_training_set)\n",
    "predict_label_test_set_knn =  knn.predict(kmers_test_set)\n",
    "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_knn))\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_knn))"
   ]
  },
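  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the default uniform weights, a KNN prediction is a majority vote among the labels of the nearest training points. The sketch below reproduces that vote by hand on toy data (`X_demo`, `y_demo`, `X_query` and `knn_demo` are illustrative names):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "# toy training data: the label depends on the sign of the first feature (assumption)\n",
    "X_demo = rng.normal(size=(30, 2))\n",
    "y_demo = (X_demo[:, 0] > 0).astype(int)\n",
    "X_query = rng.normal(size=(5, 2))\n",
    "\n",
    "knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)\n",
    "\n",
    "# indices of the 5 closest training points for each query point\n",
    "_, neighbor_idx = knn_demo.kneighbors(X_query)\n",
    "\n",
    "# majority vote over the neighbors' labels (no ties with 5 neighbors and 2 classes)\n",
    "manual_pred = (y_demo[neighbor_idx].mean(axis=1) > 0.5).astype(int)\n",
    "print(manual_pred)\n",
    "print(knn_demo.predict(X_query))\n",
    "```\n",
    "\n",
    "Choosing an odd number of neighbors avoids tied votes in a two-class problem, which is one reason `n_neighbors=5` is a convenient default."
   ]
  },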
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4.5\n",
    "Apply [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) with `n_neighbors=5` to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "knn_pca=KNeighborsClassifier(n_neighbors=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Solution\n",
    "knn_pca.fit(kmers_pca_training,label_training_set)\n",
    "print('On the training set, KNN with 5 neighbors has an accuracy of',knn_pca.score(kmers_pca_training,label_training_set)*100,'%')\n",
    "print('On the test set, KNN with 5 neighbors has an accuracy of', knn_pca.score(kmers_pca_test,label_test_set)*100,'%')\n",
    "print('\\n')\n",
    "\n",
    "predict_label_train_set = knn_pca.predict(kmers_pca_training)\n",
    "predict_label_test_set = knn_pca.predict(kmers_pca_test)\n",
    "\n",
    "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}