{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to scikit-learn\n", "\n", "## Basic preprocessing and model fitting\n", "\n", "In this notebook, we present how to build predictive models on tabular\n", "datasets.\n", "\n", "In particular we will highlight:\n", "* the importance of scaling numerical variables;\n", "* how to train predictive models when you only have numerical variables;\n", "* how to evaluate the performance of a model via cross-validation.\n", "\n", "## Introducing the dataset\n", "\n", "To this aim, we will use data from the 1994 Census bureau database. The goal\n", "with this data is to predict whether a person earns more than 50K a year from\n", "heterogeneous data such as age, employment, education, family information,\n", "etc.\n", "\n", "Let's first load the data from OpenML (a local copy is also available in the\n", "`datasets` folder)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\n", " \"https://www.openml.org/data/get_csv/1595261/adult-census.csv\")\n", "\n", "# Or use the local copy:\n", "# df = pd.read_csv('../datasets/adult-census.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the first records of this data frame:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>workclass</th>\n", " <th>fnlwgt</th>\n", " <th>education</th>\n", " <th>education-num</th>\n", " <th>marital-status</th>\n", " <th>occupation</th>\n", " <th>relationship</th>\n", " <th>race</th>\n", " <th>sex</th>\n", " 
<th>capital-gain</th>\n", " <th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " <th>native-country</th>\n", " <th>class</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>25</td>\n", " <td>Private</td>\n", " <td>226802</td>\n", " <td>11th</td>\n", " <td>7</td>\n", " <td>Never-married</td>\n", " <td>Machine-op-inspct</td>\n", " <td>Own-child</td>\n", " <td>Black</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " <td><=50K</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>38</td>\n", " <td>Private</td>\n", " <td>89814</td>\n", " <td>HS-grad</td>\n", " <td>9</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Farming-fishing</td>\n", " <td>Husband</td>\n", " <td>White</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>50</td>\n", " <td>United-States</td>\n", " <td><=50K</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>28</td>\n", " <td>Local-gov</td>\n", " <td>336951</td>\n", " <td>Assoc-acdm</td>\n", " <td>12</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Protective-serv</td>\n", " <td>Husband</td>\n", " <td>White</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " <td>>50K</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>44</td>\n", " <td>Private</td>\n", " <td>160323</td>\n", " <td>Some-college</td>\n", " <td>10</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Machine-op-inspct</td>\n", " <td>Husband</td>\n", " <td>Black</td>\n", " <td>Male</td>\n", " <td>7688</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " <td>>50K</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>18</td>\n", " <td>?</td>\n", " <td>103497</td>\n", " <td>Some-college</td>\n", " <td>10</td>\n", " <td>Never-married</td>\n", " <td>?</td>\n", " <td>Own-child</td>\n", " <td>White</td>\n", " <td>Female</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>30</td>\n", " 
<td>United-States</td>\n", " <td><=50K</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age workclass fnlwgt education education-num marital-status \\\n", "0 25 Private 226802 11th 7 Never-married \n", "1 38 Private 89814 HS-grad 9 Married-civ-spouse \n", "2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse \n", "3 44 Private 160323 Some-college 10 Married-civ-spouse \n", "4 18 ? 103497 Some-college 10 Never-married \n", "\n", " occupation relationship race sex capital-gain \\\n", "0 Machine-op-inspct Own-child Black Male 0 \n", "1 Farming-fishing Husband White Male 0 \n", "2 Protective-serv Husband White Male 0 \n", "3 Machine-op-inspct Husband Black Male 7688 \n", "4 ? Own-child White Female 0 \n", "\n", " capital-loss hours-per-week native-country class \n", "0 0 40 United-States <=50K \n", "1 0 50 United-States <=50K \n", "2 0 40 United-States >50K \n", "3 0 40 United-States >50K \n", "4 0 30 United-States <=50K " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target variable in our study will be the \"class\" column while we will use\n", "the other columns as input variables for our model. This target column divides\n", "the samples (also known as records) into two groups: high income (>50K) vs low\n", "income (<=50K). The resulting prediction problem is therefore a binary\n", "classification problem.\n", "\n", "For simplicity, we will ignore the \"fnlwgt\" (final weight) column that was\n", "crafted by the creators of the dataset when sampling the dataset to be\n", "representative of the full census database." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([' <=50K', ' <=50K', ' >50K', ..., ' <=50K', ' <=50K', ' >50K'],\n", " dtype=object)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_name = \"class\"\n", "target = df[target_name].to_numpy()\n", "target" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>workclass</th>\n", " <th>education</th>\n", " <th>education-num</th>\n", " <th>marital-status</th>\n", " <th>occupation</th>\n", " <th>relationship</th>\n", " <th>race</th>\n", " <th>sex</th>\n", " <th>capital-gain</th>\n", " <th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " <th>native-country</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>25</td>\n", " <td>Private</td>\n", " <td>11th</td>\n", " <td>7</td>\n", " <td>Never-married</td>\n", " <td>Machine-op-inspct</td>\n", " <td>Own-child</td>\n", " <td>Black</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>38</td>\n", " <td>Private</td>\n", " <td>HS-grad</td>\n", " <td>9</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Farming-fishing</td>\n", " <td>Husband</td>\n", " <td>White</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>50</td>\n", " <td>United-States</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>28</td>\n", " <td>Local-gov</td>\n", " <td>Assoc-acdm</td>\n", " 
<td>12</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Protective-serv</td>\n", " <td>Husband</td>\n", " <td>White</td>\n", " <td>Male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>44</td>\n", " <td>Private</td>\n", " <td>Some-college</td>\n", " <td>10</td>\n", " <td>Married-civ-spouse</td>\n", " <td>Machine-op-inspct</td>\n", " <td>Husband</td>\n", " <td>Black</td>\n", " <td>Male</td>\n", " <td>7688</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>United-States</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>18</td>\n", " <td>?</td>\n", " <td>Some-college</td>\n", " <td>10</td>\n", " <td>Never-married</td>\n", " <td>?</td>\n", " <td>Own-child</td>\n", " <td>White</td>\n", " <td>Female</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>30</td>\n", " <td>United-States</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age workclass education education-num marital-status \\\n", "0 25 Private 11th 7 Never-married \n", "1 38 Private HS-grad 9 Married-civ-spouse \n", "2 28 Local-gov Assoc-acdm 12 Married-civ-spouse \n", "3 44 Private Some-college 10 Married-civ-spouse \n", "4 18 ? Some-college 10 Never-married \n", "\n", " occupation relationship race sex capital-gain \\\n", "0 Machine-op-inspct Own-child Black Male 0 \n", "1 Farming-fishing Husband White Male 0 \n", "2 Protective-serv Husband White Male 0 \n", "3 Machine-op-inspct Husband Black Male 7688 \n", "4 ? 
Own-child White Female 0 \n", "\n", " capital-loss hours-per-week native-country \n", "0 0 40 United-States \n", "1 0 50 United-States \n", "2 0 40 United-States \n", "3 0 40 United-States \n", "4 0 30 United-States " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = df.drop(columns=[target_name, \"fnlwgt\"])\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of samples and the number of features available in\n", "the dataset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The dataset contains 48842 samples and 13 features\n" ] } ], "source": [ "print(\n", " f\"The dataset contains {data.shape[0]} samples and {data.shape[1]} \"\n", " \"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with numerical data\n", "\n", "Numerical data is the most natural type of data used in machine learning\n", "and can (almost) directly be fed to predictive models. We can quickly have a\n", "look at such data by selecting the subset of numerical columns from the\n", "original data.\n", "\n", "We will use this subset of data to fit a linear classification model to\n", "predict the income class." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['age', 'workclass', 'education', 'education-num', 'marital-status',\n", " 'occupation', 'relationship', 'race', 'sex', 'capital-gain',\n", " 'capital-loss', 'hours-per-week', 'native-country'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age int64\n", "workclass object\n", "education object\n", "education-num int64\n", "marital-status object\n", "occupation object\n", "relationship object\n", "race object\n", "sex object\n", "capital-gain int64\n", "capital-loss int64\n", "hours-per-week int64\n", "native-country object\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dtypes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "# \"i\" denotes integer type, \"f\" denotes float type\n", "numerical_columns = [\n", " c for c in data.columns if data[c].dtype.kind in [\"i\", \"f\"]]\n", "numerical_columns" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>education-num</th>\n", " <th>capital-gain</th>\n", " 
<th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>25</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>38</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>50</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>28</td>\n", " <td>12</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>44</td>\n", " <td>10</td>\n", " <td>7688</td>\n", " <td>0</td>\n", " <td>40</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>18</td>\n", " <td>10</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>30</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age education-num capital-gain capital-loss hours-per-week\n", "0 25 7 0 0 40\n", "1 38 9 0 0 50\n", "2 28 12 0 0 40\n", "3 44 10 7688 0 40\n", "4 18 10 0 0 30" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_numeric = data[numerical_columns]\n", "data_numeric.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When building a machine learning model, it is important to leave out a\n", "subset of the data which we can use later to evaluate the trained model.\n", "The data used to fit a model are called training data while the data used to\n", "assess a model are called testing data.\n", "\n", "Scikit-learn provides a helper function `train_test_split` which splits\n", "the dataset into a training and a testing set. By default, it shuffles the\n", "data randomly before splitting." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The training dataset contains 36631 samples and 5 features\n", "The testing dataset contains 12211 samples and 5 features\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data_numeric, target, random_state=42)\n", "\n", "print(\n", " f\"The training dataset contains {data_train.shape[0]} samples and \"\n", " f\"{data_train.shape[1]} features\")\n", "print(\n", " f\"The testing dataset contains {data_test.shape[0]} samples and \"\n", " f\"{data_test.shape[1]} features\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We will build a linear classification model called \"Logistic Regression\". The\n", "`fit` method is called to train the model from the input (features) and\n", "target data. Only the training data should be given for this purpose.\n", "\n", "In addition, check the time required to train the model and the number of\n", "iterations done by the solver to find a solution." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The model LogisticRegression was trained in 0.381 seconds for [100] iterations\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/lesteve/miniconda3/envs/scikit-learn-tutorial/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:947: ConvergenceWarning: lbfgs failed to converge. 
Increase the number of iterations.\n", " \"of iterations.\", ConvergenceWarning)\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "import time\n", "\n", "model = LogisticRegression(solver='lbfgs')\n", "start = time.time()\n", "model.fit(data_train, target_train)\n", "elapsed_time = time.time() - start\n", "\n", "print(f\"The model {model.__class__.__name__} was trained in \"\n", " f\"{elapsed_time:.3f} seconds for {model.n_iter_} iterations\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's ignore the convergence warning for now and instead let's try\n", "to use our model to make some predictions on the first five records\n", "of the held-out test set:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_predicted = model.predict(data_test)\n", "target_predicted[:5]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_test[:5]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>education-num</th>\n", " <th>capital-gain</th>\n", " <th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " 
<th>predicted-class</th>\n", " <th>expected-class</th>\n", " <th>correct</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>7762</td>\n", " <td>56</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td><=50K</td>\n", " <td><=50K</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <td>23881</td>\n", " <td>25</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td><=50K</td>\n", " <td><=50K</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <td>30507</td>\n", " <td>43</td>\n", " <td>13</td>\n", " <td>14344</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td>>50K</td>\n", " <td>>50K</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <td>28911</td>\n", " <td>32</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>40</td>\n", " <td><=50K</td>\n", " <td><=50K</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <td>19484</td>\n", " <td>39</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>30</td>\n", " <td><=50K</td>\n", " <td><=50K</td>\n", " <td>True</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age education-num capital-gain capital-loss hours-per-week \\\n", "7762 56 9 0 0 40 \n", "23881 25 9 0 0 40 \n", "30507 43 13 14344 0 40 \n", "28911 32 9 0 0 40 \n", "19484 39 13 0 0 30 \n", "\n", " predicted-class expected-class correct \n", "7762 <=50K <=50K True \n", "23881 <=50K <=50K True \n", "30507 >50K >50K True \n", "28911 <=50K <=50K True \n", "19484 <=50K <=50K True " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = data_test.copy()\n", "predictions['predicted-class'] = target_predicted\n", "predictions['expected-class'] = target_test\n", "predictions['correct'] = target_predicted == target_test\n", "predictions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To quantitatively evaluate our model, we can use the method `score`. 
It will\n", "compute the classification accuracy when dealing with a classification\n", "problem." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The test accuracy using a LogisticRegression is 0.818\n" ] } ], "source": [ "print(f\"The test accuracy using a {model.__class__.__name__} is \"\n", " f\"{model.score(data_test, target_test):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is equivalent to computing the fraction of test samples for which the\n", "model makes a correct prediction:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8177053476373761" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(target_test == target_predicted).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1\n", "\n", "- What would be the score of a model that always predicts `' >50K'`?\n", "- What would be the score of a model that always predicts `' <=50K'`?\n", "- Is 81% or 82% accuracy a good score for this problem?\n", "\n", "Hint: You can use a [DummyClassifier](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) and compute its cross-validated score to evaluate the performance of such baselines.\n", "\n", "Use the dedicated notebook to do this exercise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now consider the `ConvergenceWarning` message that was raised previously\n", "when calling the `fit` method to train our model. This warning informs us that\n", "our model stopped learning because it reached the maximum number of\n", "iterations allowed by the user. This could potentially be detrimental to the\n", "model accuracy. We can follow the (bad) advice given in the warning message\n", "and increase the maximum number of iterations allowed." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "model = LogisticRegression(solver='lbfgs', max_iter=50000)\n", "start = time.time()\n", "model.fit(data_train, target_train)\n", "elapsed_time = time.time() - start" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy using a LogisticRegression is 0.818 with a fitting time of 0.353 seconds in [105] iterations\n" ] } ], "source": [ "print(\n", " f\"The accuracy using a {model.__class__.__name__} is \"\n", " f\"{model.score(data_test, target_test):.3f} with a fitting time of \"\n", " f\"{elapsed_time:.3f} seconds in {model.n_iter_} iterations\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solver now runs a few more iterations until it converges, but we observe\n", "no significant improvement in the predictive performance. Instead of\n", "increasing the number of iterations, we can try to help fit the model faster\n", "by scaling the data first. A range of preprocessing algorithms in\n", "scikit-learn allows us to transform the input data before training a model.\n", "We can easily combine these sequential operations with a scikit-learn\n", "`Pipeline`, which chains operations together and can be used like any other\n", "classifier or regressor. The helper function `make_pipeline` will create a\n", "`Pipeline` by giving as arguments the successive transformations to perform\n", "followed by the classifier or regressor model.\n", "\n", "In our case, we will standardize the data and then train a new logistic\n", "regression model on that new version of the dataset." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>education-num</th>\n", " <th>capital-gain</th>\n", " <th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>count</td>\n", " <td>36631.000000</td>\n", " <td>36631.000000</td>\n", " <td>36631.000000</td>\n", " <td>36631.000000</td>\n", " <td>36631.000000</td>\n", " </tr>\n", " <tr>\n", " <td>mean</td>\n", " <td>38.642352</td>\n", " <td>10.078131</td>\n", " <td>1087.077721</td>\n", " <td>89.665311</td>\n", " <td>40.431247</td>\n", " </tr>\n", " <tr>\n", " <td>std</td>\n", " <td>13.725748</td>\n", " <td>2.570143</td>\n", " <td>7522.692939</td>\n", " <td>407.110175</td>\n", " <td>12.423952</td>\n", " </tr>\n", " <tr>\n", " <td>min</td>\n", " <td>17.000000</td>\n", " <td>1.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <td>25%</td>\n", " <td>28.000000</td>\n", " <td>9.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>40.000000</td>\n", " </tr>\n", " <tr>\n", " <td>50%</td>\n", " <td>37.000000</td>\n", " <td>10.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>40.000000</td>\n", " </tr>\n", " <tr>\n", " <td>75%</td>\n", " <td>48.000000</td>\n", " <td>12.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>45.000000</td>\n", " </tr>\n", " <tr>\n", " <td>max</td>\n", " <td>90.000000</td>\n", " <td>16.000000</td>\n", " <td>99999.000000</td>\n", " <td>4356.000000</td>\n", " 
<td>99.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age education-num capital-gain capital-loss hours-per-week\n", "count 36631.000000 36631.000000 36631.000000 36631.000000 36631.000000\n", "mean 38.642352 10.078131 1087.077721 89.665311 40.431247\n", "std 13.725748 2.570143 7522.692939 407.110175 12.423952\n", "min 17.000000 1.000000 0.000000 0.000000 1.000000\n", "25% 28.000000 9.000000 0.000000 0.000000 40.000000\n", "50% 37.000000 10.000000 0.000000 0.000000 40.000000\n", "75% 48.000000 12.000000 0.000000 0.000000 45.000000\n", "max 90.000000 16.000000 99999.000000 4356.000000 99.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train.describe()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.17177061, 0.35868902, -0.14450843, 5.71188483, -2.28845333],\n", " [ 0.02605707, 1.1368665 , -0.14450843, -0.22025127, -0.27618374],\n", " [-0.33822677, 1.1368665 , -0.14450843, -0.22025127, 0.77019645],\n", " ...,\n", " [-0.77536738, -0.03039972, -0.14450843, -0.22025127, -0.03471139],\n", " [ 0.53605445, 0.35868902, -0.14450843, -0.22025127, -0.03471139],\n", " [ 1.48319243, 1.52595523, -0.14450843, -0.22025127, -2.69090725]])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "data_train_scaled = scaler.fit_transform(data_train)\n", "data_train_scaled" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " 
<thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>education-num</th>\n", " <th>capital-gain</th>\n", " <th>capital-loss</th>\n", " <th>hours-per-week</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>count</td>\n", " <td>3.663100e+04</td>\n", " <td>3.663100e+04</td>\n", " <td>3.663100e+04</td>\n", " <td>3.663100e+04</td>\n", " <td>3.663100e+04</td>\n", " </tr>\n", " <tr>\n", " <td>mean</td>\n", " <td>-2.273364e-16</td>\n", " <td>1.219606e-16</td>\n", " <td>3.530310e-17</td>\n", " <td>3.840667e-17</td>\n", " <td>1.844684e-16</td>\n", " </tr>\n", " <tr>\n", " <td>std</td>\n", " <td>1.000014e+00</td>\n", " <td>1.000014e+00</td>\n", " <td>1.000014e+00</td>\n", " <td>1.000014e+00</td>\n", " <td>1.000014e+00</td>\n", " </tr>\n", " <tr>\n", " <td>min</td>\n", " <td>-1.576792e+00</td>\n", " <td>-3.532198e+00</td>\n", " <td>-1.445084e-01</td>\n", " <td>-2.202513e-01</td>\n", " <td>-3.173852e+00</td>\n", " </tr>\n", " <tr>\n", " <td>25%</td>\n", " <td>-7.753674e-01</td>\n", " <td>-4.194885e-01</td>\n", " <td>-1.445084e-01</td>\n", " <td>-2.202513e-01</td>\n", " <td>-3.471139e-02</td>\n", " </tr>\n", " <tr>\n", " <td>50%</td>\n", " <td>-1.196565e-01</td>\n", " <td>-3.039972e-02</td>\n", " <td>-1.445084e-01</td>\n", " <td>-2.202513e-01</td>\n", " <td>-3.471139e-02</td>\n", " </tr>\n", " <tr>\n", " <td>75%</td>\n", " <td>6.817680e-01</td>\n", " <td>7.477778e-01</td>\n", " <td>-1.445084e-01</td>\n", " <td>-2.202513e-01</td>\n", " <td>3.677425e-01</td>\n", " </tr>\n", " <tr>\n", " <td>max</td>\n", " <td>3.741752e+00</td>\n", " <td>2.304133e+00</td>\n", " <td>1.314865e+01</td>\n", " <td>1.047970e+01</td>\n", " <td>4.714245e+00</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age education-num capital-gain capital-loss hours-per-week\n", "count 3.663100e+04 3.663100e+04 3.663100e+04 3.663100e+04 3.663100e+04\n", "mean -2.273364e-16 1.219606e-16 3.530310e-17 3.840667e-17 1.844684e-16\n", "std 
1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00\n", "min -1.576792e+00 -3.532198e+00 -1.445084e-01 -2.202513e-01 -3.173852e+00\n", "25% -7.753674e-01 -4.194885e-01 -1.445084e-01 -2.202513e-01 -3.471139e-02\n", "50% -1.196565e-01 -3.039972e-02 -1.445084e-01 -2.202513e-01 -3.471139e-02\n", "75% 6.817680e-01 7.477778e-01 -1.445084e-01 -2.202513e-01 3.677425e-01\n", "max 3.741752e+00 2.304133e+00 1.314865e+01 1.047970e+01 4.714245e+00" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_train_scaled = pd.DataFrame(data_train_scaled,\n", " columns=data_train.columns)\n", "data_train_scaled.describe()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "model = make_pipeline(StandardScaler(),\n", " LogisticRegression(solver='lbfgs'))\n", "start = time.time()\n", "model.fit(data_train, target_train)\n", "elapsed_time = time.time() - start" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy using a Pipeline is 0.818 with a fitting time of 0.086 seconds in [13] iterations\n" ] } ], "source": [ "print(\n", " f\"The accuracy using a {model.__class__.__name__} is \"\n", " f\"{model.score(data_test, target_test):.3f} with a fitting time of \"\n", " f\"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "We can see that the training time and the number of iterations are much lower\n", "while the predictive performance (accuracy) stays the same.\n", "\n", "In the previous example, we split the original data into a training set and a\n", "testing set. 
This strategy has several issues: in the setting where the amount\n", "of data is limited, the subset of data used to train or test will be small;\n", "and since the split was made at random, we have no information\n", "regarding the confidence of the results obtained.\n", "\n", "Instead, we can use cross-validation. Cross-validation consists of\n", "repeating this random splitting into training and testing sets and aggregating\n", "the model performance. By repeating the experiment, one can get an estimate of\n", "the variability of the model performance.\n", "\n", "The function `cross_val_score` implements such an experimental protocol: it\n", "takes the model, the data and the target. Since there exist several\n", "cross-validation strategies, `cross_val_score` takes a parameter `cv` which\n", "defines the splitting strategy." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The different scores obtained are: \n", "[0.81216092 0.8096018 0.81337019 0.81326781 0.82207207]\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "scores = cross_val_score(model, data_numeric, target, cv=5)\n", "print(f\"The different scores obtained are: \\n{scores}\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean cross-validation accuracy is: 0.814 +/- 0.004\n" ] } ], "source": [ "print(f\"The mean cross-validation accuracy is: \"\n", " f\"{scores.mean():.3f} +/- {scores.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that by computing the standard deviation of the cross-validation scores\n", "we can get an idea of the uncertainty of our estimation of the predictive\n", "performance of the model: in the above results, only the first 2 decimals seem\n", "to be trustworthy. 
Using a single train / test split would not allow us to\n", "know anything about the level of uncertainty of the accuracy of the model.\n", "\n", "Setting `cv=5` created 5 distinct splits to get 5 variations for the training\n", "and testing sets. Each training set is used to fit one model which is then\n", "scored on the matching test set. This strategy is called K-fold\n", "cross-validation where `K` corresponds to the number of splits.\n", "\n", "The following matplotlib code helps visualize how the dataset is partitioned\n", "into train and test samples at each iteration of the cross-validation\n", "procedure:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from sklearn.model_selection import KFold\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from matplotlib.patches import Patch\n", "\n", "cmap_cv = plt.cm.coolwarm\n", "\n", "\n", "def plot_cv_indices(cv, X, y, ax, lw=20):\n", "    \"\"\"Create a sample plot for indices of a cross-validation object.\"\"\"\n", "    splits = list(cv.split(X=X, y=y))\n", "    n_splits = len(splits)\n", "\n", "    # Generate the training/testing visualizations for each CV split\n", "    for ii, (train, test) in enumerate(splits):\n", "        # Mark training samples with 1; test samples keep the value 0\n", "        indices = np.zeros(shape=X.shape[0], dtype=np.int32)\n", "        indices[train] = 1\n", "\n", "        # Visualize the results\n", "        ax.scatter(range(len(indices)), [ii + .5] * len(indices),\n", "                   c=indices, marker='_', lw=lw, cmap=cmap_cv,\n", "                   vmin=-.2, vmax=1.2)\n", "\n", "    # Formatting: one tick per CV iteration, matching the number of labels\n", "    yticklabels = list(range(n_splits))\n", "    ax.set(yticks=np.arange(n_splits) + .5,\n", "           yticklabels=yticklabels, xlabel='Sample index',\n", "           ylabel=\"CV iteration\", ylim=[n_splits + .2,\n", "                                        -.2], xlim=[0, 100])\n", "    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)\n", "    return ax" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { 
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAmEAAAGFCAYAAAC1yCRCAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAZCUlEQVR4nO3dfbRlZ10f8O+PmQSCwUKIKE0mmaABRUoDDhTFKgKlvESwBoEUK7LAVIIFqS4FQlFaqYuuKtRqgBiQqAgiQY0oCgsSIiiRGYiEEMJLJC8lEmJ4iYIJCb/+cfYkl8vMnTt35tznnjufz1pn3XOes/d+fjd77ZvvPPvZe1d3BwCA9XWH0QUAAByKhDAAgAGEMACAAYQwAIABhDAAgAGEMACAAYQwYMOpql+squuXtd2hql5fVf9cVY+qqguqqvfwetF+9tVV9VP7WObkabnt+//bAOzZ1tEFAOxLVVWS30zyI0lO6e63V9ULk5yf5IXLFr96vesDWAshDFgEv57kaUme3N1/sqT9hu5+36CaAA6I05HAhlZVv5LkJ5P8WHefu5/rnlBVf1RVX6yqG6vqT6rq2/axTk2nQ6+b1vntJN94AL8CwB4JYcCGVVUvTfK8JM/s7t/b8yK1delryRd3TPLOJN+R5CeS/HiSE5K8u6qOWqHb5yR5cZKzkjwxyZeT/K+D8fsALOV0JLBR3T2z+V4v7+7f2ssyP5zkK0sbquqw7r4lydOTHJfk3t19xfTdRUmuSPKfk/zy8o1V1ZYkP5/k1d29e4L/X1TVO5Icc+C/EsDtjIQBG9UXk1yU5BlVddJelnlXkgctfU0BLEkenOQDuwNYknT3NUnem+R797K9bUnumeSPl7W/ZU2/AcAKjIQBG9VXkjwuyXuSvK2qHro0UE0+190797L+PZN8Zg/tn0ly/F7W+Zbp53XL2pd/BjhgRsKADau7/yHJo5LcktlpwXvsx+rXJtnT8t+c5Ia9rPP308/l6+1PvwCrIoQBG1p3X53k0ZnNEXtbVd1llatelOS7quqE3Q1VdUyS78lsdG1Prs4siD1hWfsP71fRAKsghAEbXndfmuTkzK50/MOqOnwVq70uyVWZBbcnVdUpSf48yfVJXr2Xfm7N7ErI06rqf0x35n/11C/AQSWEAQuhu/8qyZOSfH+S38k+/n51901JHpnko0lek+ScJFcmeVh37+10ZJK8Isn/zOzeZOcmOTLJzx1o/QDLVXePrgEA4JBjJAwAYAAhDABgACEMAGAAIQwAYAAhDABggA312KKjjz66t2/fProMAIB92rVr1/Xd/U1rXX9DhbDt27dn5869PQYOAGDjqKorD2R9pyMBAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAG2Dq6gKWu/Ptb85Mv+9zoMlilM6585ugSAGBdbTvz3IO2LSNhAAADCGEAAAMIYQAAAwhhAAADCGEAAAPMNYRV1aOr6vKq+kRVPX+efQEALJK5hbCq2pLkN5I8Jsl9k5xaVfedV38AAItkniNhD07yie6+ortvTvLGJE+YY38AAAtjniHsmCRXL/l8zdQGAHDIm2cIqz209dctVHVaVe2sqp3//E/Xz7EcAICNY54h7Jok25Z8PjbJp5cv1N1ndfeO7t5xp284eo7lAABsHPMMYe9PcmJVnVBVhyd5SpLz5tgfAMDCmNsDvLv7lqr6qSR/kWRLktd296Xz6g8AYJHMLYQlSXf/WZI/m2cfAACLyB3zAQAGEMIAAAYQwgAABhDCAAAGEMIAAAaY69WR++v4b9mSV/383UaXwa
qdO7oAAFhYRsIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAGEMIAAAYQwgAABhDCAAAG2Dq6gKVuvuqTufr0U0aXwSq99PizR5fAATjjymeOLgFg4Ww789yDti0jYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAPMLYRV1Wur6rqq+vC8+gAAWFTzHAl7XZJHz3H7AAALa24hrLsvTHLDvLYPALDIhs8Jq6rTqmpnVe284cs3jS4HAGBdDA9h3X1Wd+/o7h1HHXHH0eUAAKyL4SEMAOBQJIQBAAwwz1tUvCHJXye5T1VdU1XPmFdfAACLZuu8Ntzdp85r2wAAi87pSACAAYQwAIABhDAAgAGEMACAAaq7R9dwmx07dvTOnTtHlwEAsE9Vtau7d6x1fSNhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADbF3NQlV1TJLjly7f3RfOqygAgM1unyGsql6W5MlJPpLk1qm5kwhhAABrtJqRsB9Kcp/uvmnexQAAHCpWMyfsiiSHzbsQAIBDyWpGwr6U5OKqemeS20bDuvs5c6sKAGCTW00IO296AQBwkOwzhHX3OVV1eJJ7T02Xd/dX5lsWAMDmtpqrIx+W5Jwkn0pSSbZV1dPcogIAYO1WczryV5I8qrsvT5KquneSNyT5rnkWBgCwma3m6sjDdgewJOnuj8XVkgAAB2Q1I2E7q+o1SX5n+vzUJLvmVxIAwOa3mhD2rCTPTvKczOaEXZjkzHkWBQCw2a3m6sibkvzq9AIA4CDYawirqjd195Oq6pLMnhX5Nbr7/nOtDABgE1tpJOy508+T16MQAIBDyV6vjuzua6e3p3f3lUtfSU5fn/IAADan1dyi4t/toe0xB7sQAIBDyUpzwp6V2YjXvarqQ0u+ukuS9867MACAzWylOWG/l+RtSX45yfOXtN/Y3TfMo5ibr/pkrj79lHlsGljmpcefPboE1uiMK585ugQ4ZG0789yDtq29hrDu/kKSLyQ5NUmq6h5J7pTkyKo6sruvOmhVAAAcYvY5J6yqfrCqPp7k75K8O7MHeb9tznUBAGxqq5mY/0tJHpLkY919QpJHxJwwAIADspoQ9pXu/ockd6iqO3T3+UlOmnNdAACb2mqeHfn5qjoys2dGvr6qrktyy3zLAgDY3FYzEvaEJF9K8rwkf57kk0l+cF8rVdW2qjq/qi6rqkur6rn7WgcA4FCx4khYVW1J8sfd/cgkX01yzn5s+5YkP9PdH6iquyTZVVXv6O6PrL1cAIDNYcWRsO6+NcmXqupf7O+Gu/va7v7A9P7GJJclOWZNVQIAbDKrmRP2z0kuqap3JPmn3Y3d/ZzVdlJV25M8IMlF+1kfAMCmtJoQ9qfTa02mSf3nJvnp7v7iHr4/LclpSXLMkUestRsAgIWyzxDW3edU1RFJjuvuy/dn41V1WGYB7PXd/Za9bP+sJGclyf3vcbfen+0DACyqVd0xP8nFmV0Zmao6qarOW8V6leQ1SS7r7l890EIBADaT1dyi4heTPDjJ55Okuy9OcsIq1ntokv+U5OFVdfH0euxaCwUA2ExWMyfslu7+wmxg6zb7PG3Y3e9JUvtaDgDgULSaEPbhqvqPSbZU1YlJnpPkr+ZbFgDA5raa05H/Jcl3Jrkpye8l+UISd78HADgAqxkJe1x3n5HkjN0NVf
UjSf5gblUBAGxyqxkJe8Eq2wAAWKW9joRV1WOSPDbJMVX1a0u++sbMngsJAMAarXQ68tNJdiZ5fJJdS9pvTPK8eRRz+HHfmm1nnjuPTQPLvGp0ARwAfydhM9hrCOvuv03yt1X1+u428gUAcBCtdDryTd39pCQfrKqvuy9Yd99/rpUBAGxiK52O3H0bipPXoxAAgEPJSqcjr51+Xrl+5QAAHBpWc4sKAAAOMiEMAGCAvYawqvrZqtq2nsUAABwqVhoJOybJX1XVhVX1rKo6er2KAgDY7PYawrr7eUmOS/Lfktw/yYeq6m1V9WNVdZf1KhAAYDNacU5Yz7y7u5+VZFuSV2R2t/zPrEdxAACb1Ur3CbtNVf2rJE9J8uQk/5DkhfMsCgBgs1vpjvknJjk1s/B1a5I3JnlUd1+xTrUBAGxaK42E/UWSNyR5cndfsk71AAAcElYKYf8+yTcvD2BV9W+TfLq7PznXygAANrGVJua/PMkX99D+5cwm6AMAsEYrhbDt3f2h5Y3dvTPJ9rlVBABwCFgphN1phe+OONiFAAAcSlYKYe+vqp9Y3lhVz0iya34lAQBsfitNzP/pJH9YVU/N7aFrR5LDk/yHeRcGALCZ7TWEdfdnknxPVf1AkvtNzX/a3e9al8oAADaxfd4xv7vPT3L+OtQCAHDIWPHZkQAAzIcQBgAwgBAGADCAEAYAMIAQBgAwgBAGADCAEAYAMIAQBgAwgBAGADCAEAYAMIAQBgAwgBAGADCAEAYAMIAQBgAwgBAGADCAEAYAMIAQBgAwgBAGADDA1tEFLHXzVZ/M1aefMroMgA3tpcefPboEDsAZVz5zdAkcgG1nnnvQtmUkDABgACEMAGAAIQwAYAAhDABgACEMAGAAIQwAYIC5hbCqulNV/U1V/W1VXVpVL5lXXwAAi2ae9wm7KcnDu/sfq+qwJO+pqrd19/vm2CcAwEKYWwjr7k7yj9PHw6ZXz6s/AIBFMtc5YVW1paouTnJdknd090V7WOa0qtpZVTtv+PJN8ywHAGDDmGsI6+5bu/ukJMcmeXBV3W8Py5zV3Tu6e8dRR9xxnuUAAGwY63J1ZHd/PskFSR69Hv0BAGx087w68puq6q7T+yOSPDLJR+fVHwDAIpnn1ZH3THJOVW3JLOy9qbvfOsf+AAAWxjyvjvxQkgfMa/sAAIvMHfMBAAYQwgAABhDCAAAGEMIAAAao2dOFNoYdO3b0zp07R5cBALBPVbWru3esdX0jYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAA2wdXcBSN1/1yVx9+imjywCAuXnp8WePLoED8Kqfv9tB25aRMACAAYQwAIABhDAAgAGEMACAAYQwAIAB5h7CqmpLVX2wqt46774AABbFeoyEPTfJZevQDwDAwphrCKuqY5M8LombogAALDHvkbBXJPm5JF+dcz8AAAtlbiGsqk5Ocl1379rHcqdV1c6q2nnDl2+aVzkAABvKPEfCHprk8VX1qSRvTPLwqvrd5Qt191ndvaO7dxx1xB3nWA4AwMYxtxDW3S/o7mO7e3uSpyR5V3f/6Lz6AwBYJO4TBgAwwNb16KS7L0hywXr0BQCwCIyEAQAMIIQBAAwghAEADCCEAQAMIIQBAAywLldHrtbhx31rtp157ugyAGBuXjW6ADYMI2EAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCG
EAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAAMIYQAAAwhhAAADCGEAAANUd4+u4TZVdWOSy0fXwZocneT60UWwZvbf4rLvFpv9t9ju0913WevKWw9mJQfB5d29Y3QR7L+q2mnfLS77b3HZd4vN/ltsVbXzQNZ3OhIAYAAhDABggI0Wws4aXQBrZt8tNvtvcdl3i83+W2wHtP821MR8AIBDxUYbCQMAOCRsiBBWVY+uqsur6hNV9fzR9bCyqtpWVedX1WVVdWlVPXdqP6qq3lFVH59+3m10rexZVW2pqg9W1VunzydU1UXTvvv9qjp8dI3sWVXdtareXFUfnY7B73bsLYaqet70N/PDVfWGqrqTY2/jqqrXVtV1VfXhJW17PNZq5temHPOhqnrgavoYHsKqakuS30jymCT3TXJqVd13bFXswy1Jfqa7vyPJQ5I8e9pnz0/yzu4+Mck7p89sTM9NctmSzy9L8vJp330uyTOGVMVq/J8kf97d357kX2e2Hx17G1xVHZPkOUl2dPf9kmxJ8pQ49jay1yV59LK2vR1rj0ly4vQ6LckrV9PB8BCW5MFJPtHdV3T3zUnemOQJg2tiBd19bXd/YHp/Y2b/Ezgms/12zrTYOUl+aEyFrKSqjk3yuCRnT58rycOTvHlaxL7boKrqG5N8X5LXJEl339zdn49jb1FsTXJEVW1Ncuck18axt2F194VJbljWvLdj7QlJfrtn3pfkrlV1z331sRFC2DFJrl7y+ZqpjQVQVduTPCDJRUm+ubuvTWZBLck9xlXGCl6R5OeSfHX6fPckn+/uW6bPjsGN615JPpvkt6bTyWdX1TfEsbfhdff/S/K/k1yVWfj6QpJdcewtmr0da2vKMhshhNUe2lyyuQCq6sgk5yb56e7+4uh62LeqOjnJdd29a2nzHhZ1DG5MW5M8MMkru/sBSf4pTj0uhGnu0BOSnJDkXyb5hsxOYS3n2FtMa/o7uhFC2DVJti35fGySTw+qhVWqqsMyC2Cv7+63TM2f2T38Ov28blR97NVDkzy+qj6V2an/h2c2MnbX6RRJ4hjcyK5Jck13XzR9fnNmocyxt/E9Msnfdfdnu/srSd6S5Hvi2Fs0ezvW1pRlNkIIe3+SE6crRA7PbKLieYNrYgXTHKLXJLmsu391yVfnJXna9P5pSf54vWtjZd39gu4+tru3Z3asvau7n5rk/CRPnBaz7zao7v77JFdX1X2mpkck+Ugce4vgqiQPqao7T39Dd+87x95i2duxdl6SH5uuknxIki/sPm25kg1xs9aqemxm/xrfkuS13f3SwSWxgqr63iR/meSS3D6v6IWZzQt7U5LjMvuD8yPdvXxSIxtEVT0syc9298lVda/MRsaOSvLBJD/a3TeNrI89q6qTMruo4vAkVyR5emb/oHbsbXBV9ZIkT87sCvMPJnlmZvOGHHsbUFW9IcnDkhyd5DNJfiHJH2UPx9oUrH89s6spv5Tk6d29z4d7b4gQBgBwqNkIpyMBAA45QhgAwABCGADAAEIYAMAAQhgAwABCGDBXVXVGVV1aVR+qqour6t/Mub8LqmrHfiz/36vqkfvZx6eq6uj9rw7gdlv3vQjA2lTVdyc5OckDu/umKbgcPrisr9HdLx5dA3BoMhIGzNM9k1y/++aT3X19d386SarqxVX1/qr6cFWdNd3scPdI1sur6sKquqyqHlRVb6mqj1fVL03LbK+qj1bVOdMI25ur6s7LO6+qR1XVX1fVB6rqD6bnnS5f5nVV9cTp/aeq6iXT8pdU1bdP7XevqrdPD81+dZY8J66qfrSq/mYa5Xt1VW2pquOneo+uqjtU1V9W1aMO/n9eYJEJYcA8vT3Jtqr6WFWdWVXfv+S7X+/uB3X3/ZIckdmI2W43d/f3JXlVZo8FeXaS+yX58aq6+7TMfZKc1d33T/LFJKcv7XgadXtRkkd29w
OT7EzyX1dR8/XT8q9M8rNT2y8kec/00OzzMrtbdqrqOzK7A/pDu/ukJLcmeWp3X5nkZVP9P5PkI9399lX0DRxChDBgbrr7H5N8V5LTknw2ye9X1Y9PX/9AVV1UVZdk9iDx71yy6u7nx16S5NLuvnYaTbsitz8k9+rufu/0/neTfO+y7h+S5L5J3ltVF2f2nLfjV1H27gfS70qyfXr/fVMf6e4/TfK5qf0R0+/3/qmPRyS517Tc2UnukuQnc3uYA7iNOWHAXHX3rUkuSHLBFLieVlVvTHJmkh3dfXVV/WKSOy1Zbfez87665P3uz7v/bi1/5tryz5XkHd196n6WvLu/W/O1fyP39Iy3SnJOd7/g676YnR49dvp4ZJIb97MOYJMzEgbMTVXdp6pOXNJ0UpIrc3vgun6ap/XENWz+uGnif5KcmuQ9y75/X5KHVtW3TbXcuaruvYZ+kuTCJE+dtvOYJHeb2t+Z5IlVdY/pu6Oqavdo28uSvD7Ji5P85hr7BTYxI2HAPB2Z5P9W1V2T3JLkE0lO6+7PV9VvZna68VNJ3r+GbV+W2ajaq5N8PLM5XLfp7s9Opz7fUFV3nJpflORja+jrJdN2PpDk3Umumvr4SFW9KMnbq+oOSb6S5NlVtT3JgzKbK3ZrVZ1SVU/v7t9aQ9/AJlXdexphB9i4ppDz1mlSP8BCcjoSAGAAI2EAAAMYCQMAGEAIAwAYQAgDABhACAMAGEAIAwAYQAgDABjg/wMRQZjH9LWq6wAAAABJRU5ErkJggg==\n", "text/plain": [ "<Figure size 720x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Some random data points\n", "n_points = 100\n", "X = np.random.randn(n_points, 10)\n", "y = np.random.randn(n_points)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 6))\n", "cv = KFold(5)\n", "_ = plot_cv_indices(cv, X, y, ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "In this notebook we have seen:\n", "* how to train predictive models when you only have numerical variables;\n", " the importance of scaling numerical variables through\n", " `sklearn.preprocessing.StandardScaler`;\n", "* how to chain multiple steps (e.g. preprocessing with `StandardScaler` and\n", " a `LogisticRegression` model) in a single `scikit-learn` estimator through\n", " `sklearn.compose.Pipeline`\n", "* how to evaluate the performance of a model via cross-validation through\n", "* `sklearn.model_selection.cross_val_score`." 
] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-", "formats": "python_scripts//py:percent,notebooks//ipynb" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }