{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Training and Testing Datasets\n", "Author: Ravin Poudel" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Our goal in statistics or machine learning is to build a model. Often we start with a set of data, fit a model of choice to the data, and publish the model. However, it is equally important to test the model with new data and evaluate its performance. Model validation requires a new set of data: data that were not used in fitting the model, so the model has never seen them. From an agricultural perspective, this means running an additional experiment to generate data for model validation. Instead, we can __randomly__ divide a single dataset into two sets, then use one set for training the model and the other set for testing/evaluating the learned model.\n", "\n", "\n", "\n", "> Train data set: A data set used to __construct/train/learn__ a model. \n", "\n", "> Test data set: A data set used to __evaluate__ the model.\n", "\n", "\n", "\n", "#### How do we split a single dataset into two?\n", "\n", "There is no single best solution. Conventionally, more data is used for model training than for model testing. Schemes such as a `75%/25%` or `90%/10%` train/test split are often used. Regardless of how we decide to split the dataset, there are trade-offs. For instance, a larger training dataset allows the model to learn better, while a larger testing dataset increases confidence in the model evaluation. _(Don't forget to compare the sd of model accuracy among the approaches discussed in the sections below)_.\n", "\n", "Before we split the data, we also need to keep the following questions in mind.\n", "\n", "> Can we apply a similar data-splitting scheme when we have a small dataset? This is often the case in agriculture or life sciences - \"as of now.\"\n", "\n", "> Does a single random split make our predictive model random? 
Do we want a stable model or a random model?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's start working in Python. Here, using the `iris dataset`, we will explore the data splitting scheme, then build and evaluate the model. We will also briefly explore various cross-validation methods. The main goal of this module is to provide a general overview of creating train and test datasets, applying them to build a model, and evaluating the model performance. Beyond this module, you will be using this concept of train/test data throughout the other advanced modules in the workshop or the rest of your research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `iris dataset` contains:\n", "\n", "- 50 samples from each of 3 different species of iris flower (150 samples in total)\n", "- Iris flower: Setosa, Versicolour, and Virginica\n", "- Measurements: sepal length, sepal width, petal length, petal width\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# import modules\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# import iris data from scikit-learn and prepare it\n", "\n", "iris = datasets.load_iris() # built-in dataset\n", "iris_X = iris['data'] # feature data\n", "iris_y = iris['target'] # flower type, coded as 0, 1, or 2\n", "names = iris['target_names'] # flower type names\n", "feature_names = iris['feature_names'] # feature names\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check data shape\n", "\n", "iris_X.shape\n" ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n" ] } ], "source": [ "print(iris_y)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['setosa' 'versicolor' 'virginica']\n" ] } ], "source": [ "print(names)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n" ] } ], "source": [ "print(feature_names)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# splitting into train and test data. 
For example, test dataset = 25% of the original dataset\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.25, random_state=0)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((112, 4), (112,))" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# shape of train dataset\n", "\n", "X_train.shape, y_train.shape" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((38, 4), (38,))" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# shape of test dataset\n", "\n", "X_test.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# instantiate a K-Nearest Neighbors (KNN) model and fit it with the training X and y\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "model = KNeighborsClassifier()\n", "model_tt = model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NOTE: Here we are using the `KNeighborsClassifier` model. Any other model appropriate to your study can be used. 
If you are interested in learning more about models, see the [scikit-learn](https://scikit-learn.org/stable/) documentation." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9732142857142857" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check the accuracy on the training set\n", "model_tt.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", " 2]\n" ] } ], "source": [ "# predict class labels for the test set\n", "predicted = model_tt.predict(X_test)\n", "print(predicted)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0\n", " 1]\n" ] } ], "source": [ "print(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Did you see any differences between the predicted and test outputs?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Predicted Values')" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# scatter plot\n", "plt.scatter(y_test, predicted)\n", "plt.xlabel(\"True Values\")\n", "plt.ylabel(\"Predicted Values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Many predicted and true values overlap in the scatter plot. Although only four distinct points are visible, the plot contains 38 points." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9736842105263158\n" ] } ], "source": [ "# generate evaluation metrics\n", "from sklearn import metrics\n", "print(metrics.accuracy_score(y_test, predicted))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion Matrix\n", "Also known as an error matrix. In scikit-learn's `confusion_matrix`, each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class (some references use the transposed convention)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[13 0 0]\n", " [ 0 15 1]\n", " [ 0 0 9]]\n" ] } ], "source": [ "print(metrics.confusion_matrix(y_test, predicted))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NOTE:\n", "\n", "> Never train your model on your test dataset.\n", "\n", "> Be suspicious: If you ever happen to have 100% accuracy in your model __(over-fitting)__ with test data, be suspicious and double-check that you have not used the test dataset for training your model. \n", "\n", "> __over-fitting:__ the model performs very well on the training data but poorly on the test data. The model follows exactly the same trend as the training dataset. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Evaluation via Cross-Validation\n", "\n", "Results of a train/test evaluation are based on a single random split. Given that we are randomly splitting the dataset, each time we run the model with a different split we might get slightly different results. To reduce this stochasticity in the evaluation, we can instead use cross-validation approaches, which are more robust to it. These approaches are also more suitable when you have a smaller dataset. There are many [methods](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) of cross-validation available in scikit-learn, but just to get started we will be learning:\n", "\n", "- K-Folds Cross-Validation\n", "\n", "- Leave One Out Cross-Validation (LOOCV)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### K-Folds Cross-Validation\n", "In K-Folds Cross-Validation, we first randomly divide the dataset into k subsets (bins). One bin is used to validate the model, whereas the remaining bins are used for training the model. We repeat the process for k rounds, so that each bin is used for validation exactly once. The model performances from all rounds are averaged to give the overall performance of the model.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "model = KNeighborsClassifier()\n", "kfold = model_selection.KFold(n_splits=5, random_state=12323, shuffle=True) # shuffle=True so that samples are randomly assigned to the folds" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.93333333, 0.96666667, 0.96666667, 1. 
, 0.96666667])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = model_selection.cross_val_score(model, iris_X, iris_y, cv=kfold)\n", "results" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 96.667% (2.108%)\n" ] } ], "source": [ "print(\"Accuracy: %.3f%% (%.3f%%)\" % (results.mean()*100.0, results.std()*100.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Leave One Out Cross Validation (LOOCV)\n", "\n", "In LOOCV, we hold out a single data point for testing and use the remaining data points for building the model, repeating this once for every data point in the dataset. Because each test set contains only one point, the accuracy of each round is either 0 or 1, so the sd of model accuracy is higher.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 96.667% (17.951%)\n" ] } ], "source": [ "model = KNeighborsClassifier()\n", "loocv = model_selection.LeaveOneOut()\n", "results = model_selection.cross_val_score(model, iris_X, iris_y, cv=loocv)\n", "print(\"Accuracy: %.3f%% (%.3f%%)\" % (results.mean()*100.0, results.std()*100.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison of model accuracy among various approaches:\n", "\n", "|Methods|Accuracy%|Sd%|Notes|\n", "| :---: | :---: | :---: |:---:|\n", "| Train/Test (75/25)|97.37|NA| |\n", "| K-Folds (5)|96.67|2.11||\n", "| LOOCV|96.67|17.95|Higher sd|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ### !!! Now it's your turn :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- How does the train/test split ratio affect the model performance? \n", " - Try changing the percentage of test data to 50% and evaluate the model.\n", " \n", "- Evaluate the model performance by increasing the number of K-folds. 
\n", " - Check the model accuracy and sd.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }