{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# First model with scikit-learn\n", "\n", "In this notebook, we present how to build predictive models on tabular\n", "datasets, with only numerical features.\n", "\n", "In particular we highlight:\n", "\n", "* the scikit-learn API: `.fit(X, y)`/`.predict(X)`/`.score(X, y)`;\n", "* how to evaluate the generalization performance of a model with a train-test\n", " split.\n", "\n", "Here API stands for \"Application Programming Interface\" and refers to a set of\n", "conventions to build self-consistent software. Notice that you can visit the\n", "Glossary for more info on technical jargon.\n", "\n", "## Loading the dataset with Pandas\n", "\n", "We use the \"adult_census\" dataset described in the previous notebook. For more\n", "details about the dataset see .\n", "\n", "Numerical data is the most natural type of data used in machine learning and\n", "can (almost) directly be fed into predictive models. Here we load a subset of\n", "the original data with only the numerical columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "adult_census = pd.read_csv(\"../datasets/adult-census-numeric.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the first records of this dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adult_census.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that this CSV file contains all information: the target that we would\n", "like to predict (i.e. `\"class\"`) and the data that we want to use to train our\n", "predictive model (i.e. the remaining columns). 
The first step is to separate\n", "columns to get on one side the target and on the other side the data.\n", "\n", "## Separate the data and the target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_name = \"class\"\n", "target = adult_census[target_name]\n", "target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = adult_census.drop(columns=[target_name])\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now take a closer look at the variables, also called features, that we\n", "later use to build our predictive model. In addition, we can also check how\n", "many samples are available in our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"The dataset contains {data.shape[0]} samples and \"\n", " f\"{data.shape[1]} features\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit a model and make predictions\n", "\n", "We now build a classification model using the \"K-nearest neighbors\" strategy.\n", "To predict the target of a new sample, a k-nearest neighbors model takes into\n", "account its `k` closest samples in the training set and predicts the majority\n", "target of these samples.\n", "\n", "
\n", "

Caution!

\n", "

We use a k-nearest neighbors model here. However, be aware that it is seldom\n", "useful in practice. We use it because it is an intuitive algorithm. In the\n", "next notebook, we will introduce better models.

\n", "
\n", "\n", "The `fit` method is called to train the model from the input (features) and\n", "target data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "model = KNeighborsClassifier()\n", "_ = model.fit(data, target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learning can be represented as follows:\n", "\n", "![Predictor fit diagram](../figures/api_diagram-predictor.fit.svg)\n", "\n", "In scikit-learn an object that has a `fit` method is called an **estimator**.\n", "The method `fit` is composed of two elements: (i) a **learning algorithm** and\n", "(ii) some **model states**. The learning algorithm takes the training data and\n", "training target as input and sets the model states. These model states are\n", "later used to either predict (for classifiers and regressors) or transform\n", "data (for transformers).\n", "\n", "Both the learning algorithm and the type of model states are specific to each\n", "type of model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

Here and later, we use the names `data` and `target` to be explicit. In the\n", "scikit-learn documentation, `data` is commonly named `X` and `target` is\n", "commonly called `y`.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use our model to make some predictions using the same dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_predicted = model.predict(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An estimator (an object with a `fit` method) with a `predict` method is called\n", "a **predictor**. We can illustrate the prediction mechanism as follows:\n", "\n", "![Predictor predict diagram](../figures/api_diagram-predictor.predict.svg)\n", "\n", "To predict, a model uses a **prediction function** that uses the input data\n", "together with the model states. As for the learning algorithm and the model\n", "states, the prediction function is specific for each type of model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now have a look at the computed predictions. For the sake of simplicity,\n", "we look at the five first predicted targets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_predicted[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, we can compare these predictions to the actual data..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...and we could even check if the predictions agree with the real targets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target[:5] == target_predicted[:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " \"Number of correct predictions: \"\n", " f\"{(target[:5] == target_predicted[:5]).sum()} / 5\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we see that our model makes a mistake when predicting for the first\n", "sample.\n", "\n", "To get a better assessment, we can compute the average success rate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(target == target_predicted).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result means that the model makes a correct prediction for approximately\n", "82 samples out of 100. Note that we used the same data to train and evaluate\n", "our model. Can this evaluation be trusted or is it too good to be true?\n", "\n", "## Train-test data split\n", "\n", "When building a machine learning model, it is important to evaluate the\n", "trained model on data that was not used to fit it, as **generalization** is\n", "more than memorization (meaning we want a rule that generalizes to new data,\n", "without comparing to data we memorized). It is harder to predict correctly on\n", "never-seen instances than on already seen ones.\n", "\n", "Correct evaluation is easily done by leaving out a subset of the data when\n", "training the model and using it afterwards for model evaluation. 
The data used\n", "to fit a model is called training data, while the data used to assess a model\n", "is called testing data.\n", "\n", "We can load more data, which was left out of the original dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adult_census_test = pd.read_csv(\"../datasets/adult-census-numeric-test.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this new data, we separate our input features and the target to predict,\n", "as in the beginning of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_test = adult_census_test[target_name]\n", "data_test = adult_census_test.drop(columns=[target_name])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of features and samples available in this new set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"The testing dataset contains {data_test.shape[0]} samples and \"\n", " f\"{data_test.shape[1]} features\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of computing the predictions and manually computing the average\n", "success rate, we can use the `score` method. For classifiers, this method\n", "returns the accuracy, i.e. the fraction of correctly classified samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy = model.score(data_test, target_test)\n", "model_name = model.__class__.__name__\n", "\n", "print(f\"The test accuracy using a {model_name} is {accuracy:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the generic term **model** for objects whose goodness of fit can be\n", "measured using the `score` method. 
Let's check the underlying mechanism when\n", "calling `score`:\n", "\n", "![Predictor score diagram](../figures/api_diagram-predictor.score.svg)\n", "\n", "To compute the score, the predictor first computes the predictions (using the\n", "`predict` method) and then uses a scoring function to compare the true target\n", "`y` and the predictions. Finally, the score is returned." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we compare with the accuracy obtained by wrongly evaluating the model on\n", "the training set, we find that this evaluation was indeed optimistic compared\n", "to the score obtained on a held-out test set.\n", "\n", "This shows the importance of always testing the generalization performance of\n", "predictive models on a different set than the one used to train them.\n", "We will discuss later in more detail how predictive models should be\n", "evaluated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
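The `fit`/`predict`/`score` mechanism described above can be sketched with a small, standard-library-only toy classifier. This is only an illustration under stated assumptions: `ToyKNNClassifier` is a hypothetical class written for this sketch (it is not scikit-learn's `KNeighborsClassifier`), the data points and the `"low"`/`"high"` labels are made up, and `n_neighbors=1` is chosen on purpose so that the gap between training and testing accuracy is easy to see.

```python
import math
from collections import Counter


class ToyKNNClassifier:
    """Hypothetical toy classifier for illustration; not scikit-learn's code."""

    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Learning algorithm: a nearest-neighbors model simply stores the
        # training set. The stored values play the role of the model states.
        self.X_ = [tuple(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        # Prediction function: combine the input data with the model states.
        predictions = []
        for row in X:
            # Rank training samples by Euclidean distance to the query row.
            order = sorted(
                range(len(self.X_)), key=lambda i: math.dist(row, self.X_[i])
            )
            nearest_targets = [self.y_[i] for i in order[: self.n_neighbors]]
            # Majority vote among the k closest training targets.
            predictions.append(Counter(nearest_targets).most_common(1)[0][0])
        return predictions

    def score(self, X, y):
        # Scoring: predict first, then compare predictions to the true targets
        # and return the fraction of correct predictions (the accuracy).
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Made-up training and testing samples (assumption, for illustration only).
data_train = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (3, 3)]
target_train = ["low", "low", "low", "high", "high", "high"]
data_test = [(2, 2), (6, 7), (3, 2), (7, 7)]
target_test = ["low", "high", "low", "high"]

model = ToyKNNClassifier(n_neighbors=1).fit(data_train, target_train)
train_accuracy = model.score(data_train, target_train)
test_accuracy = model.score(data_test, target_test)
print(f"Accuracy on the training set: {train_accuracy:.2f}")
print(f"Accuracy on the testing set: {test_accuracy:.2f}")
```

With a single neighbor, each training sample is its own closest neighbor, so the training accuracy is 1.00 by construction, while the testing accuracy on these made-up points is 0.75. The training score only reflects memorization; the held-out score is the one that says something about generalization.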
\n", "

Note

\n", "

In this MOOC, we refer to generalization performance of a model when\n", "referring to the test score or test error obtained by comparing the prediction\n", "of a model and the true targets. Equivalent terms for generalization\n", "performance are predictive performance and statistical performance. We refer\n", "to computational performance of a predictive model when assessing the\n", "computational costs of training a predictive model or using it to make\n", "predictions.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Recap\n", "\n", "In this notebook we:\n", "\n", "* fitted a **k-nearest neighbors** model on a training dataset;\n", "* evaluated its generalization performance on the testing data;\n", "* introduced the scikit-learn API `.fit(X, y)` (to train a model),\n", " `.predict(X)` (to make predictions) and `.score(X, y)` (to evaluate a\n", " model);\n", "* introduced the jargon for estimator, predictor and model." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }