{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workshop is divided into 4 sections. \n", "- Section 1: Data Preprocessing\n", "- Section 2: Regression\n", "- Section 3: Dimensionality Reduction Algorithms\n", "- Section 4: Classification\n", "\n", "All of these topics will be taught in Python using the machine learning package [scikit-learn](http://scikit-learn.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For sections 1 and 2, I will be using a dataset of avocado prices and volume sold in U.S. cities.\n", "\n", "For sections 3 and 4, I will be using a dataset of antibiotic resistance in gonorrhea strains.\n", "\n", "**Note: In addition to sklearn, this workshop requires the pandas, numpy, pickle and matplotlib packages. Although not required, it would be great if you had the [graphviz](https://graphviz.gitlab.io/download/) package.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "import pandas as pd\n", "import pickle\n", "import matplotlib\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Avocado U.S. Cities Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains avocado prices and total volume sold in U.S. cities. I obtained the data from [Kaggle](https://www.kaggle.com/neuromusic/avocado-prices-across-regions-and-seasons/data)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avocado_path = 'avocado.csv' # Please make sure avocado.csv is in the same directory as this notebook\n", "avocado_df = pd.read_csv(avocado_path,header=0)\n", "avocado_df.drop('Unnamed: 0', axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The avocado dataset is stored as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), **avocado_df**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('The DataFrame has columns: ',avocado_df.columns.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From Kaggle, the descriptions of the columns are:\n", "\n", "- Date - The date of the observation\n", "- AveragePrice - the average price of a single avocado\n", "- type - conventional or organic\n", "- year - the year\n", "- region - the city or region of the observation\n", "- Total Volume - Total number of avocados sold\n", "- 4046 - Total number of avocados with PLU 4046 sold\n", "- 4225 - Total number of avocados with PLU 4225 sold\n", "- 4770 - Total number of avocados with PLU 4770 sold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I add a column for the month of observation to the dataset. I also print the `head()` of the dataset. The `head` is the first `n` (default `n=5`) rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "month = [int(date[5:7]) for date in avocado_df['Date'].values]\n", "avocado_df['Month'] = month\n", "print(avocado_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. 
Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will apply the following preprocessing techniques:\n", " \n", "- Normalization\n", "- Standardization\n", "- Label Encoding\n", "- Training-Test Split\n", "\n", "to the avocado dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's work through the cells below!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalization and Standardization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I create the `total_volume_and_bags` variable to store the Total Volume and Total Bags columns. \n", "\n", "We'll normalize and standardize these columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_volume_and_bags = avocado_df.loc[:,['Total Volume','Total Bags']]\n", "print('The first 5 entries of Total Volume and Total Bags are:\\n',total_volume_and_bags.head()) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.1\n", "\n", "Below is an example to normalize `total_volume_and_bags` with an instance of sklearn's [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) class and print the normalized values. Note that `Normalizer` rescales each row (sample) to unit norm, whereas `StandardScaler` rescales each column to zero mean and unit variance.\n", "\n", "**As an exercise**: Standardize `total_volume_and_bags` using an instance of the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class. Print the standardized values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.preprocessing import Normalizer\n", "\n", "normalizer_scaler = Normalizer()\n", "normalized_volume_bags = normalizer_scaler.fit_transform(total_volume_and_bags)\n", "print('The head of the normalized Total Volume and Total Bags is:\\n',normalized_volume_bags[0:5,:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Solution\n", "standardized_scaler = StandardScaler()\n", "standardized_volume_bags = standardized_scaler.fit_transform(total_volume_and_bags)\n", "print('The head of the standardized Total Volume and Total Bags is:\\n',standardized_volume_bags[0:5,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label Encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the categorical variables `region` and `type` in regression, we must encode them into integer values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `region_categories` and `type_categories` variables store the [unique](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) categories in the `region` and `type` columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "region_categories = avocado_df['region'].unique()\n", "type_categories = avocado_df['type'].unique()\n", "\n", "print('Region categories are: \\n',region_categories,'\\n')\n", "print('Type categories are: \\n',type_categories,'\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.2\n", "\n", "Using the code for encoding `region_categories` below as an example, encode `type_categories` using an instance of [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and print the encoded type categories. 
\n", "\n", "Note: you must use different instances of [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode `region_categories` and `type_categories`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "region_encoder = LabelEncoder()\n", "encoded_region_cats = region_encoder.fit_transform(region_categories)\n", "print('The encoded region categories are:', encoded_region_cats)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Solution\n", "type_encoder = LabelEncoder()\n", "encoded_type_cats = type_encoder.fit_transform(type_categories)\n", "print('The encoded types are:', encoded_type_cats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To encode the `region` and `type` column run the cells below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Before the region column was: \\n', avocado_df['region'].head())\n", "print('\\n')\n", "print('Before the type column was: \\n', avocado_df['type'].head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## RUN THIS ONLY ONCE. IF YOU RUN IT TWICE OR MORE, YOU WILL GET AN ERROR ##\n", "\n", "avocado_df['region'] = region_encoder.transform(avocado_df['region'])\n", "avocado_df['type'] = type_encoder.transform(avocado_df['type'])\n", "\n", "## RUN THIS ONLY ONCE. 
IF YOU RUN IT TWICE OR MORE, YOU WILL GET AN ERROR ##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('After the region column is now: \\n', avocado_df['region'].head())\n", "print('\\n')\n", "print('After the type column is now: \\n', avocado_df['type'].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train-Test Split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will attempt to predict the price of an avocado given the demand, time of year and place of purchase. We will use the columns \n", "```\n", "'4046', '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region', 'Month'\n", "```\n", "as *explanatory* variables and the price as the *response* variable.\n", " \n", "I use `avocado_explanatory_variables` to store the explanatory variables and `avocado_response_variables` to store the response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avocado_response_variables = avocado_df['AveragePrice']\n", "avocado_explanatory_variables = avocado_df.drop(['Date','Total Volume','Total Bags','AveragePrice'], axis=1)\n", "\n", "print('avocado_explanatory_variables stores: \\n',avocado_explanatory_variables.head())\n", "print('\\n')\n", "print('avocado_response_variables stores: \\n',avocado_response_variables.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code standardizes the columns of the `avocado_explanatory_variables`, **except the categorical variables**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#### RUN ONLY ONCE. 
YOU WILL GET AN ERROR IF YOU RUN TWICE ###\n", "standardized_scaler_explanatory = StandardScaler()\n", "avocado_standardized_explanatory_columns = standardized_scaler_explanatory.fit_transform(\n", " avocado_explanatory_variables.loc[:,['4046', \n", " '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags']])\n", "\n", "avocado_explanatory_variables.loc[:,['4046', '4225', '4770', 'Small Bags', \n", " 'Large Bags', 'XLarge Bags']] = avocado_standardized_explanatory_columns\n", "\n", "print('The standard explanatory variables are: \n\n', avocado_explanatory_variables.loc[0:5,:],'\n\n')\n", "#### RUN ONLY ONCE. YOU WILL GET AN ERROR IF YOU RUN TWICE ###" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I use [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split\n", "- `avocado_explanatory_variables` into `training_set` and `test_set`\n", "- `avocado_response_variables` into `y_training_set` and `y_test_set`.\n", "\n", "I use a $70\\%:30\\%$ training to test split." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "training_set, test_set, y_training_set, y_test_set = train_test_split(avocado_explanatory_variables,\n", " avocado_response_variables,test_size=0.30,\n", " train_size=0.70,random_state=0)\n", "\n", "print('The training division of the explanatory variables has head: \n\n',training_set.head(),'\n\n')\n", "print('The training division of the response variables has head: \n\n',y_training_set.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.3\n", "\n", "The Avocado dataset has 18248 observations. 
Instead of splitting the data set using a fraction, do a $70\\%:30\\%$ split using integers for the arguments `train_size` and `test_size`.\n", "\n", "That is, change `train_size` and `test_size` to integers so that we get a $70\\%:30\\%$ split.\n", "\n", "Note: $18248 * 30\\% \\approx 5474$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Solution\n", "from sklearn.model_selection import train_test_split\n", "training_set, test_set, y_training_set, y_test_set = train_test_split(avocado_explanatory_variables,\n", " avocado_response_variables,\n", " test_size=5474,train_size=18248-5474,\n", " random_state=0)\n", "print('The training division of the explanatory variables has head: \\n\\n',training_set.head(),'\\n\\n')\n", "print('The training division of the response variables has head: \\n\\n',y_training_set.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below plots each explanatory variable against the price in the training data. This helps visualize the relationship between the two variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for label in training_set.columns.values:\n", "    plt.scatter(training_set.loc[:, label],y_training_set.values)\n", "    plt.xlabel('Normalized ' + label)\n", "    plt.ylabel('Price')\n", "    plt.title('Price vs. Normalized '+ label)\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using an instance of [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), I individually regress each column of the training_set against y_training_set and plot the best fit line." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "lr = LinearRegression()\n", "column_names = avocado_explanatory_variables.columns.values\n", "\n", "for label in column_names:\n", "    x_training = training_set.loc[:,label].values.reshape(-1,1)\n", "    x_test = test_set.loc[:,label].values.reshape(-1,1)\n", "    lr.fit(x_training,y_training_set)\n", "    m = lr.coef_\n", "    b = lr.intercept_\n", "    \n", "    plt.scatter(x_training,y_training_set.values)\n", "    plt.plot(x_training, b + m * x_training, 'r-')\n", "    plt.xlabel('Training set - Normalized ' + label)\n", "    plt.ylabel('Price')\n", "    plt.title('Training set : Price vs. Normalized '+ label + ', $R^2=$'\n", "              +str(lr.score(x_training,y_training_set)))\n", "    plt.show()\n", "    \n", "    plt.scatter(x_test,y_test_set.values)\n", "    plt.plot(x_test, b + m * x_test, 'r-')\n", "    plt.xlabel('Test set - Normalized ' + label)\n", "    plt.ylabel('Price')\n", "    plt.title('Test set : Price vs. Normalized '+ label + ', $R^2=$'\n", "              +str(lr.score(x_test,y_test_set)))\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, a linear model performs pretty poorly in each instance.\n", "\n", "What if we considered all the variables in a single linear model? How would the linear model perform?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr.fit(training_set,y_training_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.1\n", "\n", "Print the coefficients of the [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) instance `lr` (modeled with all variables). Also print the $R^2$ value of the model using the training set and test set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "#Solution\n", "print('The coefficients, except the intercept, are: ', lr.coef_,'\\n')\n", "print('The intercept is: ',lr.intercept_,'\\n')\n", "print('The R^2 from the training set is: ', str(lr.score(training_set,y_training_set)), '\\n')\n", "print('The R^2 from the test set is: ',str(lr.score(test_set,y_test_set)),'\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With more variables, the model performs a bit better! However, the model is far from perfect.\n", "\n", "Could LASSO or Ridge regression help us? \n", "\n", "They're worth a try." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LASSO and Ridge Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using an instance of [LASSO](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) with `alpha=1.0`, I regress the columns of training_set against y_training_set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn.linear_model import Lasso\n", "lasso = Lasso(alpha=1)\n", "lasso.fit(training_set,y_training_set)\n", "\n", "print('The coefficients, except the intercept, are: ', lasso.coef_,'\\n')\n", "print('The intercept is: ',lasso.intercept_,'\\n')\n", "print('The R^2 from the training set is: ', str(lasso.score(training_set,y_training_set)), '\\n')\n", "print('The R^2 from the test set is: ',str(lasso.score(test_set,y_test_set)),'\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, LASSO doesn't help. It actually performs worse." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.2 \n", "\n", "Now it's your turn!\n", "\n", "Using an instance of [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) with `alpha=1.0`, regress the columns of training_set against y_training_set. Print the coefficients, intercept and $R^2$ on the training and test sets. Comment on what you observe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#Solution\n", "ridge = Ridge(alpha=1.0)\n", "ridge.fit(training_set,y_training_set)\n", "\n", "print('The coefficients, except the intercept, are: ', ridge.coef_,'\\n')\n", "print('The intercept is: ',ridge.intercept_,'\\n')\n", "print('The R^2 from the training set is: ', str(ridge.score(training_set,y_training_set)), '\\n')\n", "print('The R^2 from the test set is: ',str(ridge.score(test_set,y_test_set)),'\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision Trees\n", "A linear model is just wrong for this problem.\n", "\n", "What about using a decision tree?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I use an instance of [DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) with `max_depth=5` to regress training_set against y_training_set. \n", "\n", "I print the feature importances and the $R^2$ value on the training set and test set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "dtr = DecisionTreeRegressor(max_depth=5,random_state=0)\n", "dtr.fit(training_set,y_training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n", "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))\n", "print('\\n')\n", "for feature,importance in zip(avocado_explanatory_variables.columns.values.tolist(),dtr.feature_importances_):\n", "    print(feature,'variable has importance,', importance,'\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I plot the tree produced below with `graphviz`. This will not work unless graphviz is installed both as a Python package and on your local computer (the Graphviz executables must be on your system's PATH)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import graphviz \n", "import sklearn.tree as tree\n", "dot_data = tree.export_graphviz(dtr, out_file=None, \n", " feature_names=column_names, \n", " filled=True, rounded=True, \n", " special_characters=True) \n", "graph = graphviz.Source(dot_data) \n", "graph " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, increasing the max depth to 10 **increases** the model accuracy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "dtr = DecisionTreeRegressor(max_depth=10,random_state=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dtr.fit(training_set,y_training_set)\n", "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n", "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Forest Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying [RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), we see similar successes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "rf = RandomForestRegressor(n_estimators=10,max_depth=10,max_features=5,random_state=0)\n", "rf.fit(training_set,y_training_set)\n", "\n", "print('The R^2 value for training set is: ',rf.score(training_set,y_training_set))\n", "print('The R^2 value for test set is: ',rf.score(test_set,y_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.3\n", "\n", "Play around with the `max_depth` parameter for the decision tree and random forest regression. What parameter gives the lowest error?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Solution\n", "dtr = DecisionTreeRegressor(max_depth=20,random_state=0)\n", "dtr.fit(training_set,y_training_set)\n", "print('The R^2 value for training set is: ',dtr.score(training_set,y_training_set))\n", "print('The R^2 value for test set is: ',dtr.score(test_set,y_test_set))\n", "print('\\n')\n", "\n", "rf = RandomForestRegressor(n_estimators=10,max_depth=20,max_features=5,random_state=0)\n", "rf.fit(training_set,y_training_set)\n", "\n", "print('The R^2 value for training set is: ',rf.score(training_set,y_training_set))\n", "print('The R^2 value for test set is: ',rf.score(test_set,y_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Mer Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using data from [PATRIC](https://www.patricbrc.org/), I created pseudo-genomes of *Neisseria gonorrhoeae* bacteria strains. Each genome is labelled with its antibiotic resistance to *azithromycin*.\n", "\n", "For each strain, I split the DNA into k-mers. A k-mer is a substring of *k* consecutive nucleotides in a DNA strand.\n", "\n", "The image below shows the 7-mers of ATGGAAGTCGCGGAATC.\n", "\n", "![7mers.png](7mers.png)\n", "\n", "I collected all possible unique 31-mers from each genome. I then constructed a dataset whose rows represent strains and whose columns represent 31-mers. A 31-mer column has 0 if the strain's genome does not contain the 31-mer and 1 if the strain's genome contains the 31-mer.\n", "\n", "A genome is labelled 1 if it is susceptible to *azithromycin* and 0 if it is resistant to *azithromycin*.\n", "\n", "**I aim to build a classifier that can correctly label antibiotic resistance in *Neisseria gonorrhoeae*.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I load the k-mers data set. I have already divided the set into training set and test set. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = 'kmer_data'\n", "\n", "def load_obj(name):\n", " \"\"\"\n", " Load a pickle file. Taken from\n", " https://stackoverflow.com/questions/19201290/how-to-save-a-dictionary-to-a-file\n", " :param name: Name of file\n", " :return: the file inside the pickle\n", " \"\"\"\n", "\n", " with open(path + name + '.pkl', 'rb') as f:\n", " return pickle.load(f)\n", "\n", " \n", "address = \"\"\n", "kmers_training_set = load_obj( \"/train\").todense()\n", "label_training_set = load_obj(\"/label_train\")\n", "kmers_test_set = load_obj(\"/test\").todense()\n", "label_test_set = load_obj(\"/label_test\")\n", "kmerlist = load_obj(\"/kmerlist\")\n", "\n", "#Just to show head of data\n", "kmer_df = pd.DataFrame(kmers_training_set, columns=kmerlist)\n", "print('The kmer data looks like:\\n',kmer_df.head())\n", "print('\\n')\n", "print('The first 5 categories are:\\n',label_training_set[0:5])\n", "print('The training set has shape: ',kmers_training_set.shape)\n", "print('The test set has shape: ',kmers_test_set.shape)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Dimensionality Reduction: PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I apply [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) without whitening to the k-mer training data set and reduce the number of features (dimensions) to 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "pca_without_whitening = PCA(n_components=2,whiten=False)\n", "pca_without_whitening.fit(kmers_training_set)\n", "\n", "kmer_training_pca_without_whitening = pca_without_whitening.transform(kmers_training_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I generate a two dimensional plot of PCA data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colours = ['red','blue']\n", "\n", "presence_0 = [int(element) == 0 for element in label_training_set]\n", "presence_1 = [int(element) == 1 for element in label_training_set]\n", "\n", "plt.scatter(kmer_training_pca_without_whitening[presence_0, 0],\n", " kmer_training_pca_without_whitening[presence_0, 1],\n", " label='label = 0 (Resistant)',\n", " c='r')\n", "\n", "plt.scatter(kmer_training_pca_without_whitening[presence_1, 0],\n", " kmer_training_pca_without_whitening[presence_1, 1],\n", " label='label = 1 (Susceptible)',\n", " c='b')\n", "\n", "plt.xlabel('$X_0$')\n", "plt.ylabel('$X_1$')\n", "plt.legend()\n", "plt.title('PCA plot of k-mer training data')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I also apply the PCA without whitening, learned from the training set, to the k-mer test data set, reducing the number of features (dimensions) to 2. \n", "\n", "I then plot the test data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmer_test_pca_without_whitening = pca_without_whitening.transform(kmers_test_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "colours = ['red','blue']\n", "\n", "presence_0 = [int(element) == 0 for element in label_test_set]\n", "presence_1 = [int(element) == 1 for element in label_test_set]\n", "\n", "plt.scatter(kmer_test_pca_without_whitening[presence_0, 0],\n", " kmer_test_pca_without_whitening[presence_0, 1],\n", " label='label = 0 (Resistant)',\n", " c='r')\n", "\n", "plt.scatter(kmer_test_pca_without_whitening[presence_1, 0],\n", " kmer_test_pca_without_whitening[presence_1, 1],\n", " label='label = 1 (Susceptible)',\n", " c='b')\n", "\n", "plt.xlabel('$X_0$')\n", "plt.ylabel('$X_1$')\n", "plt.title('PCA plot of k-mer test data')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I then print the explained variance and singular value for each component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num = len(pca_without_whitening.explained_variance_)\n", "for i in range(num):\n", "    print('Component',i, 'explains',pca_without_whitening.explained_variance_[i],'variance')\n", "    print('Component',i, 'has singular value', pca_without_whitening.singular_values_[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3.1\n", "\n", "Using the variables `kmers_training_set` and `kmers_test_set`, apply [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) with whitening to the k-mer training and test data sets. 
The variables are initialized in the cell below.\n", "\n", "Store the reduced training data in the variable `kmer_training_pca_with_whitening`, and the reduced test data in the variable `kmer_test_pca_with_whitening`.\n", "\n", "Run the code below to generate the plots. Comment on the differences between the whitened and non-whitened plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#solution\n", "pca_with_whitening = PCA(n_components=2,whiten=True)\n", "kmer_training_pca_with_whitening = pca_with_whitening.fit_transform(kmers_training_set)\n", "\n", "kmer_test_pca_with_whitening = pca_with_whitening.transform(kmers_test_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#generate plots\n", "\n", "colours = ['red','blue']\n", "\n", "presence_0 = [int(element) == 0 for element in label_training_set]\n", "presence_1 = [int(element) == 1 for element in label_training_set]\n", "\n", "plt.scatter(kmer_training_pca_with_whitening[presence_0, 0],\n", " kmer_training_pca_with_whitening[presence_0, 1],\n", " label='label = 0 (Resistant)',\n", " c='r')\n", "\n", "plt.scatter(kmer_training_pca_with_whitening[presence_1, 0],\n", " kmer_training_pca_with_whitening[presence_1, 1],\n", " label='label = 1 (Susceptible)',\n", " c='b')\n", "\n", "plt.xlabel('$X_0$')\n", "plt.ylabel('$X_1$')\n", "plt.legend()\n", "plt.title('PCA plot with whitening of k-mer training data')\n", "plt.show()\n", "\n", "\n", "presence_0 = [int(element) == 0 for element in label_test_set]\n", "presence_1 = [int(element) == 1 for element in label_test_set]\n", "\n", "plt.scatter(kmer_test_pca_with_whitening[presence_0, 0],\n", " kmer_test_pca_with_whitening[presence_0, 1],\n", " label='label = 0 (Resistant)',\n", " c='r')\n", "\n", "plt.scatter(kmer_test_pca_with_whitening[presence_1, 0],\n", " kmer_test_pca_with_whitening[presence_1, 1],\n", " label='label = 1 (Susceptible)',\n", " c='b')\n", "\n", 
"plt.xlabel('$X_0$')\n", "plt.ylabel('$X_1$')\n", "plt.legend()\n", "plt.title('PCA plot with whitening of k-mer test data')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive Bayes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I build a [Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier to learn and predict resistance in the genomes. I print the model parameters for the first 5 k-mers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "gNB=GaussianNB()\n", "gNB.fit(kmers_training_set,label_training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "print('On the training set, Naive Bayes has an accuracy of',gNB.score(kmers_training_set,label_training_set)*100,'%')\n", "print('On the test set, Naive Bayes has an accuracy of', gNB.score(kmers_test_set,label_test_set)*100,'%')\n", "\n", "print('\\n')\n", "\n", "for i in range(len(gNB.class_count_)):\n", "    print('Class',i,'has',gNB.class_count_[i], 'training samples')\n", "    \n", "print('\\n')\n", "for i in range(5):\n", "    print('When class = 0, k-mer,', kmerlist[i],', has mean',gNB.theta_[0,i], 'and variance, ', gNB.sigma_[0,i])\n", "    print('\\n')\n", "    print('When class = 1, k-mer,', kmerlist[i],', has mean',gNB.theta_[1,i],'and variance ',gNB.sigma_[1,i])\n", "    print('\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naive Bayes performs fairly well at predicting antibiotic resistance. It has an $88\\%$ accuracy rate on the test set.\n", "\n", "Below, I also construct the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "predict_label_train_set = gNB.predict(kmers_training_set)\n", "predict_label_test_set = gNB.predict(kmers_test_set)\n", "\n", "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set))\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.1\n", "Apply [Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) to the PCA data, stored in `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.metrics import confusion_matrix\n", "pca = PCA(n_components=2,whiten=False)\n", "\n", "kmers_pca_training = pca.fit_transform(kmers_training_set)\n", "kmers_pca_test = pca.transform(kmers_test_set)\n", "gNB_pca=GaussianNB()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Solution\n", "\n", "gNB_pca.fit(kmers_pca_training,label_training_set)\n", "print('On the training set, Naive Bayes has an accuracy of',gNB_pca.score(kmers_pca_training,label_training_set)*100,'%')\n", "print('On the test set, Naive Bayes has an accuracy of', gNB_pca.score(kmers_pca_test,label_test_set)*100,'%')\n", "print('\\n')\n", "\n", "predict_label_train_set = gNB_pca.predict(kmers_pca_training)\n", "predict_label_test_set = gNB_pca.predict(kmers_pca_test)\n", "\n", "# The matrix below is computed on the test set, so it is labelled accordingly\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "I build [Logistic Regression classifer](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to learn and predict resistance in the genomes. I print for the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "lgr=LogisticRegression()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgr.fit(kmers_training_set,label_training_set)\n", "\n", "print('The coefficients for Logistic Regression classifer are:\\n',lgr.coef_) \n", "print('The intercept for Logistic Regression classifer are:\\n',lgr.intercept_)\n", "print('\\n')\n", "print('On the training set, Logistic Regression has an accuracy of',lgr.score(kmers_training_set,label_training_set)*100 , '%')\n", "print('On the test set, Logistic Regression has an accuracy of', lgr.score(kmers_test_set,label_test_set)*100, '%')\n", "print('\\n')\n", "\n", "predict_label_train_set_lgr = lgr.predict(kmers_training_set)\n", "predict_label_test_set_lgr = lgr.predict(kmers_test_set)\n", "\n", "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_lgr))\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_lgr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.2\n", "Apply [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgr_pca=LogisticRegression()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#solution\n", "lgr_pca.fit(kmers_pca_training,label_training_set)\n", "print('On the training set, Logistic Regression has an accuracy of',lgr_pca.score(kmers_pca_training,label_training_set)*100,'%')\n", "print('On the test set, Logistic Regression has an accuracy of', lgr_pca.score(kmers_pca_test,label_test_set)*100,'%')\n", "print('\\n')\n", "\n", "predict_label_train_set = lgr_pca.predict(kmers_pca_training)\n", "predict_label_test_set = lgr_pca.predict(kmers_pca_test)\n", "\n", "# The matrix below is computed on the test set, so it is labelled accordingly\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification Trees\n", "\n", "I build a [classification tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) classifier with a maximum depth of 5 to learn and predict resistance in the genomes. I print the five most important k-mers and the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "dtc = DecisionTreeClassifier(max_depth=5,random_state=0)\n", "dtc.fit(kmers_training_set,label_training_set)\n", "\n", "# Rank the k-mers from most to least important according to the fitted tree\n", "kmerlist_sorted = [kmer for _,kmer in sorted(zip(dtc.feature_importances_,kmerlist), reverse=True)]\n", "for i in range(5):\n", "    print(kmerlist_sorted[i],'variable has importance,', sorted(dtc.feature_importances_, reverse=True)[i])\n", "print('\\n')\n", "print('The accuracy for training set is: ',dtc.score(kmers_training_set,label_training_set)*100,'%')\n", "print('The accuracy for test set is: ',dtc.score(kmers_test_set,label_test_set)*100,'%')\n", "predict_label_train_set_dtc = dtc.predict(kmers_training_set)\n", "predict_label_test_set_dtc = dtc.predict(kmers_test_set)\n", "print('\\n')\n", "\n", "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_dtc))\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_dtc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I plot the decision tree with a maximum depth of 5."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import graphviz\n", "import sklearn.tree as tree\n", "dot_data = tree.export_graphviz(dtc, out_file=None,\n", "                                feature_names=kmerlist,\n", "                                filled=True, rounded=True,\n", "                                special_characters=True)\n", "graph = graphviz.Source(dot_data)\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.3\n", "\n", "Apply a [decision tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) with `max_depth=5` to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dtc_pca=DecisionTreeClassifier(max_depth=5,random_state=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#Solution\n", "dtc_pca.fit(kmers_pca_training,label_training_set)\n", "print('On the training set, a classification tree with max depth 5 has an accuracy of',dtc_pca.score(kmers_pca_training,label_training_set)*100,'%')\n", "print('On the test set, a classification tree with max depth 5 has an accuracy of', dtc_pca.score(kmers_pca_test,label_test_set)*100,'%')\n", "print('\\n')\n", "\n", "predict_label_train_set = dtc_pca.predict(kmers_pca_training)\n", "predict_label_test_set = dtc_pca.predict(kmers_pca_test)\n", "\n", "# The matrix below is computed on the test set, so it is labelled accordingly\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the code below to plot your classification tree.\n", "\n", "#### IF YOU INCREASE THE DEPTH BEYOND 5, DO NOT RUN THE CODE BELOW. IPYTHON MAY CRASH OR IT MAY TAKE A VERY LONG TIME TO LOAD."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import graphviz\n", "import sklearn.tree as tree\n", "colours = ['red','blue']\n", "\n", "presence_0 = [int(element) == 0 for element in label_training_set]\n", "presence_1 = [int(element) == 1 for element in label_training_set]\n", "\n", "plt.scatter(kmer_training_pca_without_whitening[presence_0, 0],\n", "            kmer_training_pca_without_whitening[presence_0, 1],\n", "            label='label = 0 (Resistant)',\n", "            c='r')\n", "\n", "plt.scatter(kmer_training_pca_without_whitening[presence_1, 0],\n", "            kmer_training_pca_without_whitening[presence_1, 1],\n", "            label='label = 1 (Susceptible)',\n", "            c='b')\n", "\n", "plt.xlabel('$X_0$')\n", "plt.ylabel('$X_1$')\n", "plt.legend()\n", "plt.title('PCA plot of k-mer training data')\n", "plt.show()\n", "dot_data = tree.export_graphviz(dtc_pca, out_file=None,\n", "                                feature_names=['Component 0','Component 1'],\n", "                                filled=True, rounded=True,\n", "                                special_characters=True)\n", "graph = graphviz.Source(dot_data)\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AdaBoost\n", "\n", "I build an [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) classifier whose base estimator is a classification tree of maximum depth 1 (a decision stump) to learn and predict resistance in the genomes. I print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Note: scikit-learn 1.2+ renames the base_estimator argument to estimator\n", "adaboost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),\n", "                              n_estimators=10, random_state=0)\n", "adaboost.fit(kmers_training_set,label_training_set)\n", "\n", "\n", "kmerlist_sorted = [kmer for _,kmer in sorted(zip(adaboost.feature_importances_,kmerlist), reverse=True)]\n", "for i in range(5):\n", "    print(kmerlist_sorted[i],'variable has importance,', sorted(adaboost.feature_importances_, reverse=True)[i])\n", "\n", "print('\\n')\n", "print('The accuracy for training set is: ',adaboost.score(kmers_training_set,label_training_set)*100,'%')\n", "print('The accuracy for test set is: ',adaboost.score(kmers_test_set,label_test_set)*100, '%')\n", "predict_label_train_set_ada = adaboost.predict(kmers_training_set)\n", "predict_label_test_set_ada = adaboost.predict(kmers_test_set)\n", "print('\\n')\n", "\n", "print('For training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_ada))\n", "print('For test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_ada))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.4\n", "Apply [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with classification tree stumps and 10 estimators to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adaboost_pca = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),\n", "                                  n_estimators=10, random_state=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"#Solution\n", "adaboost_pca.fit(kmers_pca_training,label_training_set)\n", "print('On the training set, a classification tree with max depth, 5, has an accuracy of',adaboost_pca.score(kmers_pca_training,label_training_set)*100,'%')\n", "print('On the test set, a classification tree with max depth, 5,', adaboost_pca.score(kmers_pca_test,label_test_set)*100,'%')\n", "print('\\n')\n", "\n", "predict_label_train_set = adaboost_pca.predict(kmers_pca_training)\n", "predict_label_test_set = adaboost_pca.predict(kmers_pca_test)\n", "\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KNN\n", "\n", "I build an [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier with classification tree with $n = 5$ to learn and predict resistance in the genomes. I print for the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "knn=KNeighborsClassifier(n_neighbors=5)\n", "knn.fit(kmers_training_set,label_training_set)\n", "\n", "print('The accuracy for training set is: ',knn.score(kmers_training_set,label_training_set)*100,'%')\n", "print('The accuracy for test set is: ',knn.score(kmers_test_set,label_test_set)*100,'%')\n", "predict_label_train_set_knn = knn.predict(kmers_training_set)\n", "predict_label_test_set_knn = knn.predict(kmers_test_set)\n", "print('On the training set,\\n',confusion_matrix(label_training_set,predict_label_train_set_knn))\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set_knn))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4.5\n", "Apply [KNN](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) with `n_neighbors=5` to the PCA data, `kmers_pca_training` and `kmers_pca_test`, and print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "knn_pca=KNeighborsClassifier(n_neighbors=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Solution\n", "knn_pca.fit(kmers_pca_training,label_training_set)\n", "print('On the training set, KNN with 5 neighbours has an accuracy of',knn_pca.score(kmers_pca_training,label_training_set)*100,'%')\n", "print('On the test set, KNN with 5 neighbours has an accuracy of', knn_pca.score(kmers_pca_test,label_test_set)*100,'%')\n", "print('\\n')\n", "\n", "predict_label_train_set = knn_pca.predict(kmers_pca_training)\n", "predict_label_test_set = knn_pca.predict(kmers_pca_test)\n", "\n", "print('On the test set,\\n',confusion_matrix(label_test_set,predict_label_test_set))" ] 
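}, { "cell_type": "markdown", "metadata": {}, "source": [
"To make the idea behind KNN concrete, here is a minimal sketch of a majority vote among the nearest neighbours, assuming Euclidean distance and equally weighted votes. The helper `knn_predict_one` and the arrays `toy_points` and `toy_labels` are hypothetical names made up for illustration. This is not scikit-learn's implementation and it does not use the k-mer data."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"from collections import Counter\n",
"\n",
"def knn_predict_one(train_points, train_labels, query, k=5):\n",
"    # Euclidean distance from the query to every training point\n",
"    distances = np.linalg.norm(train_points - query, axis=1)\n",
"    # Indices of the k closest training points\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # Majority vote among the labels of those neighbours\n",
"    votes = Counter(train_labels[nearest].tolist())\n",
"    return votes.most_common(1)[0][0]\n",
"\n",
"# Hypothetical 2-D toy data (assumption for illustration): two well-separated groups\n",
"toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],\n",
"                       [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])\n",
"toy_labels = np.array([0, 0, 0, 1, 1, 1])\n",
"\n",
"print(knn_predict_one(toy_points, toy_labels, np.array([0.15, 0.1]), k=3))\n",
"print(knn_predict_one(toy_points, toy_labels, np.array([1.0, 1.1]), k=3))"
] 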
}, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }