{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) 2015, 2016 [Sebastian Raschka](https://sebastianraschka.com)\n",
"\n",
"https://github.com/rasbt/python-machine-learning-book\n",
"\n",
"[MIT License](https://github.com/rasbt/python-machine-learning-book/blob/master/LICENSE.txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Machine Learning - Code Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 6 - Learning Best Practices for Model Evaluation and Hyperparameter Tuning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sebastian Raschka \n",
"last updated: 2016-09-29 \n",
"\n",
"CPython 3.5.2\n",
"IPython 5.1.0\n",
"\n",
"numpy 1.11.1\n",
"pandas 0.18.1\n",
"matplotlib 1.5.1\n",
"sklearn 0.18\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,sklearn"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"*The use of `watermark` is optional. You can install this IPython extension via \"`pip install watermark`\". For more information, please see: https://github.com/rasbt/watermark.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Streamlining workflows with pipelines](#Streamlining-workflows-with-pipelines)\n",
" - [Loading the Breast Cancer Wisconsin dataset](#Loading-the-Breast-Cancer-Wisconsin-dataset)\n",
" - [Combining transformers and estimators in a pipeline](#Combining-transformers-and-estimators-in-a-pipeline)\n",
"- [Using k-fold cross-validation to assess model performance](#Using-k-fold-cross-validation-to-assess-model-performance)\n",
" - [The holdout method](#The-holdout-method)\n",
" - [K-fold cross-validation](#K-fold-cross-validation)\n",
"- [Debugging algorithms with learning and validation curves](#Debugging-algorithms-with-learning-and-validation-curves)\n",
" - [Diagnosing bias and variance problems with learning curves](#Diagnosing-bias-and-variance-problems-with-learning-curves)\n",
" - [Addressing overfitting and underfitting with validation curves](#Addressing-overfitting-and-underfitting-with-validation-curves)\n",
"- [Fine-tuning machine learning models via grid search](#Fine-tuning-machine-learning-models-via-grid-search)\n",
" - [Tuning hyperparameters via grid search](#Tuning-hyperparameters-via-grid-search)\n",
" - [Algorithm selection with nested cross-validation](#Algorithm-selection-with-nested-cross-validation)\n",
"- [Looking at different performance evaluation metrics](#Looking-at-different-performance-evaluation-metrics)\n",
" - [Reading a confusion matrix](#Reading-a-confusion-matrix)\n",
" - [Optimizing the precision and recall of a classification model](#Optimizing-the-precision-and-recall-of-a-classification-model)\n",
" - [Plotting a receiver operating characteristic](#Plotting-a-receiver-operating-characteristic)\n",
" - [The scoring metrics for multiclass classification](#The-scoring-metrics-for-multiclass-classification)\n",
"- [Summary](#Summary)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from IPython.display import Image\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Import version helpers so that later cells can check for scikit-learn >= 0.18\n",
"from distutils.version import LooseVersion as Version\n",
"from sklearn import __version__ as sklearn_version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Streamlining workflows with pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the Breast Cancer Wisconsin dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'\n",
" '/breast-cancer-wisconsin/wdbc.data', header=None)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(569, 32)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |\n",
"| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |\n",
"| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |\n",
"| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |\n",
"| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |\n",
"\n",
"5 rows × 32 columns
\n", "