\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1- Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Now that we have explored the data structures provided by the Pandas library, we will investigate how to learn from them using **Scikit-learn**.\n",
"\n",
"Scikit-learn is one of the most celebrated and widely used machine learning libraries. It features a broad set of efficiently implemented machine learning algorithms for classification, regression, and clustering. Scikit-learn is designed to operate on NumPy, SciPy, and Pandas data structures.\n",
"\n",
"**Links:** [Scikit-learn webpage](http://scikit-learn.org), [Wikipedia article](https://en.wikipedia.org/wiki/Scikit-learn)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Machine Learning problems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Machine learning* is the task of predicting properties from data. The *dataset* consists of several *examples* or *samples*; the associated target properties can be fully available, partially available, or not available at all, and we call these settings *supervised*, *semi-supervised*, and *unsupervised*, respectively. Each example is made up of one or several *features* or *attributes*, which can be of different types (real numbers, discrete values, strings, booleans, etc.).\n",
"\n",
"Learning problems can be broadly divided into a few categories:\n",
"* **supervised learning**\n",
"    * **classification:** Assign incoming data to one of a finite number of classes by learning from labeled data. Example: classifying iris flowers into species based on petal and sepal sizes recorded for the 3 species.\n",
"    * **regression:** Predict a value from example data. Unlike classification, the output value is continuous. Example: predicting the carbon monoxide concentration for the coming years based on previous measurements.\n",
"* **unsupervised learning**\n",
"    * **clustering:** Group the data (both the dataset and new data) into a finite number of classes. Unlike classification, no labeled data is provided. Example: creating market segments from customer information for targeted advertising.\n",
"    * **dimension reduction:** Discard uninformative features (or combine features into fewer, more informative ones) for the purpose of visualization or efficient storage. Example: computing eigenfaces for face recognition.\n",
"\n",
"The following flowchart can be found on the [Scikit Learn website](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html):\n",
"\n",
"![Scikit Learn Algorithm cheatsheet](img/ml_map.png \"Scikit Learn Algorithm cheatsheet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Learning with Scikit Learn\n",
"\n",
"The process of learning and predicting with Scikit Learn follows three main steps: \n",
"**1. Selecting and adjusting a model** \n",
"**2. Fitting the model to the data** \n",
"**3. Predicting from this fitted model**"
]
},
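{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of these three steps (an illustration only), the snippet below runs them on the iris data bundled with Scikit-learn. The choice of `KNeighborsClassifier` and of `n_neighbors=3` is an assumption made for this example; other estimators follow the same select / fit / predict pattern.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Load the iris features (X) and species labels (y) as NumPy arrays\n",
"X, y = load_iris(return_X_y=True)\n",
"\n",
"# 1. Select and adjust a model (illustrative estimator and hyperparameter)\n",
"model = KNeighborsClassifier(n_neighbors=3)\n",
"\n",
"# 2. Fit the model to the data\n",
"model.fit(X, y)\n",
"\n",
"# 3. Predict from the fitted model (here, on the first five samples)\n",
"predictions = model.predict(X[:5])\n",
"print(predictions)\n",
"```\n",
"\n",
"Predicting on samples that were used for fitting is only meant to show the API; evaluating a model requires data it has not seen."
]
},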
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will illustrate this process on a simple *linear model*\n",
"$$ y = a x + b + \\nu$$\n",
"where \n",
"* $(x,y)\\in\\mathbb{R}^m\\times\\mathbb{R}^m$ are the data points: $x$ contains the $m$ examples and $y$ the associated outputs\n",
"* $a,b$ are the model coefficients to estimate\n",
"* $\\nu$ is standard (zero-mean, unit-variance) white Gaussian noise"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD8CAYAAABuHP8oAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAAWWUlEQVR4nO3df5DcdX3H8deLyyHHj3pQr5lw4UqmOsmoKVzdIjYdC0GEiiNntCqjNrbMpJ2pLVKLhv4DbXUSGyr6h+M0CppOKeJADAxYkSGxjEwn7YVLIRAYKP7KGkgYcmqaiJfw7h/3vWSz2dv97t7++H7v+3zMZLL7vb3d9w748sPn+/58Po4IAQDy55ReFwAAaA0BDgA5RYADQE4R4ACQUwQ4AOQUAQ4AOZU6wG332Z6wfX/yfInt7bafs32X7VM7VyYAoFozI/DrJO2ueP45SbdGxOslHZB0bTsLAwDUlyrAbS+WdJWkrybPLWmlpLuTl2ySNNaB+gAAs1iQ8nVfkPQpSWclz39d0mREHEme75E0XOsXba+RtEaSzjjjjLcsW7as5WIBoIh27NjxUkQMVV9vGOC23y1pX0TssH1Jsx8cERslbZSkUqkU4+Pjzb4FABSa7R/Vup5mBL5C0ntsv0vSaZJ+TdIXJQ3aXpCMwhdLKrerWABAYw3nwCPixohYHBHnS/qQpK0R8WFJ2yS9P3nZakn3dqxKAMBJ5tIH/mlJf237OU3Pid/WnpIAAGmkvYkpSYqI70n6XvL4eUkXtb8kAEAarMQEgJxqagQOAGhsy0RZGx58Rj+dPKxzBwd0wxVLNTZas9N6TghwAGijLRNl3bj5CR2eOipJKk8e1o2bn5Cktoc4UygA0EYbHnzmWHjPODx1VBsefKbtn0WAA0Ab/XTycFPX54IAB4A2OndwoKnrc0GAA0Ab3XDFUg30951wbaC/TzdcsbTtn0WAA0AbjY0Oa92q5RoeHJAlDQ7067T+U3T9XTu1Yv1WbZlo364jBDgAtNnY6LAeXbtSt37wQr1y5FUdODSl0PGOlHaFOAEOAB3S6Y4U+sABoEq7FuJ0uiOFETgAVJhZiFOePHxs2uMTd+3U6N9/t+mpj053pDACB4AKtaY9JOnAoSnduPkJjf/oZW17ev9Jo/Nao/Ybrlh6wqpMqb0dKY6ItrxRGpzIAyDrlqx9QM2k4kB/n973lmHds6N8UlCvW7VckuY8HWN7R0SUqq8zAgeACucODqjcxBz14amjunP7T3S0ajA8c7Py0bUrO7KRlcQcOACcoNZCnEaqw3tGJ5bPV2IEDmDea6arZOb6zfc9qcnDU6nev8+uGeKdWD5fiRE4gHmtVldJo8U0Y6PD2nnTO/WFD154bEVln13ztZZ0zVvP69ry+UoEOIB5rdXFNNWj9lohbUkfvnhEnxlbfsLy+eHBAa1btbxjc9/HPr9RF4rt0yQ9Iuk1mp5yuTsibrL9dUl/IOlnyUs/FhE7670XXSgAuq1eV8nwLNMp1YcySMe7TWq1EHbaXLpQXpG0MiIO2u6X9H3b/5787IaIuLudhQJAOw2e3q8Dh2rPZc92Ws5so/ZtT+/Xo2tXdq7YJjWcQolpB5On/cmf7jWPA0CLtkyUdfCXR+q+ptZ0SjcPZZiLVHPgtvts75S0T9JDEbE9+dFnbT9u+1bbr5nld9fYHrc9vn///vZUDQApbHjwGU292ni8WR3M3TyUYS5SBXhEHI2ICyUtlnSR7TdLulHSMkm/K+kcSZ+e5Xc3RkQpIkpDQ0PtqRoAUkg7Yq4O5m4eyjAXTfWBR8Sk7W2SroyIW5LLr9j+mqS/aXt1ANBAvR7vNKsqawXzzO+3Y0fCTmoY4LaHJE0l4T0g6XJJn7O9KCL22rakMUm7OlsqAJyouluk+qZkrc2k+k+xzjxtgSYPTdUN5rHR4cwFdrU0I/BFkjbZ7tP0lMs3I+J+21uTcLeknZL+vHNlAsDJ6vV4VwZw1k
fSrWoY4BHxuKTRGtez00sDoJDSdIvkYSTdKlZiAsitvHSLdAoBDiC38tIt0insRgggt+b7HHcjBDiAXJvPc9yNMIUCADlFgANATjGFAiDTmjlNp2gIcACZ1WilZdExhQIgs1o9TacoCHAAmZWXfbl7hQAHkFlFX2nZCAEOILOKvtKyEW5iAsisoq+0bIQAB5BpRV5p2QgBDqCt6NvuHgIcQNvQt91d3MQE0Db0bXcXAQ6gbejb7q6GAW77NNv/Zft/bD9p+++S60tsb7f9nO27bJ/a+XIBZBl9292VZgT+iqSVEXGBpAslXWn7Ykmfk3RrRLxe0gFJ13asSgC5QN92dzUM8Jh2MHnan/wJSSsl3Z1c3yRprBMFAsiPsdFhrVu1XMODA7Kk4cEBrVu1nBuYHZKqC8V2n6Qdkl4v6UuS/lfSZEQcSV6yR1LNf0K210haI0kjIyNzrRdAxtG33T2pAjwijkq60PagpG9JWpb2AyJio6SNklQqlaKFGgFkBD3e2dJUH3hETNreJultkgZtL0hG4YsllTtRIIBsoMc7e9J0oQwlI2/ZHpB0uaTdkrZJen/ystWS7u1QjQAygB7v7EkzAl8kaVMyD36KpG9GxP22n5L0DdufkTQh6bYO1gmgx+jxzp6GAR4Rj0sarXH9eUkXdaIoANlz7uCAyjXCmh7v3mElJoBU6PHOHjazAgqsma4S9ubOHgIcKKhWukro8c4WplCAgqKrJP8IcKCg6CrJPwIcKCh2Dsw/AhwoqEuXDclV1+gqyRcCHCigLRNl3bOjrMrNiSzpfW/hJmWeEOBAAdW6gRmSHnh8b28KQksIcKCAZrtReeDQlLZMsC9dXhDgQI5tmShrxfqtWrL2Aa1YvzV1+Na7UUkbYX4Q4EBOzSzEKU8eVuj4Qpw0IV7vRiVthPlBgAM5dfN9T7a8EGdsdFiDA/01f0YbYX4Q4EAObZkoa/LwVM2fpR1B3/yeN7E5Vc6xFwqQQ/VG2WlH0GxOlX8EOJBD9UbZzYyg2Zwq3whwIIcGT+/XgUMnT6Gcffr0vPaK9VsZVRcAAQ7kzJaJsg7+8shJ1/v7rKt+exEHDxcINzGBnNnw4DOaejVOun7GqQu07en9bBFbIGlOpT/P9jbbT9l+0vZ1yfWbbZdt70z+vKvz5QKYbf77Z4en2CK2YNJMoRyR9MmIeMz2WZJ22H4o+dmtEXFL58oDUK3R4cIcPFwcDUfgEbE3Ih5LHv9C0m5JTKYBLWp1+fuMeocLc/BwsTR1E9P2+ZJGJW2XtELSx23/saRxTY/SD9T4nTWS1kjSyMjIXOsFcq2Vcyirpenfpre7GBxx8s2Qmi+0z5T0H5I+GxGbbS+U9JKmd6H8B0mLIuJP671HqVSK8fHxOZYM5NeK9VtrTnEMDw7o0bUre1AR8sD2jogoVV9P1YViu1/SPZLuiIjNkhQRL0bE0Yh4VdJXJF3UzoKB+YibjGinNF0olnSbpN0R8fmK64sqXvZeSbvaXx4wv3AOJdopzQh8haSPSlpZ1TL4j7afsP24pEslXd/JQoH5gJuMaKeGNzEj4vvSSWefStK3218OML/VuwG5ZaLMzUc0haX0QJfV2kCqHd0pKB6W0gMZUOuQYZbAoxECHMgAulPQCgIcyAC6U9AKAhzIALpT0ApuYgIZwPFmaAUBDmQEx5uhWUyhAEBOMQIHJBbRIJcIcBQei2iQV0yhoPBYRIO8IsBReLX25653HcgKAhyFVu84sz7X2sMNyA4CHIU1M/c9m6MpT6sCeoUAR2HVmvuuNMwydmQcAY7CqrdRFMvYkQcEOAprto2i+mytW7WcFkJkHgGOwpptA6l/+sAFhDdyIc2hxufZ3mb7KdtP2r4uuX6O7YdsP5v8fXbnywXaZ2x0WOtWLdfw4ICs6TlvRt7IE0eDO+3J6fOLIuIx22dJ2iFpTNLHJL0cEettr5V0dkR8ut57lUqlGB8fb0vhAFAUtndERKn6eppDjfdK2p
s8/oXt3ZKGJV0t6ZLkZZskfU9S3QAHOoF9TFBUTe2FYvt8SaOStktamIS7JL0gaWF7SwMaYx8TFFnqm5i2z5R0j6RPRMTPK38W0/MwNedibK+xPW57fP/+/XMqFqjGPiYoslQBbrtf0+F9R0RsTi6/mMyPz8yT76v1uxGxMSJKEVEaGhpqR83AMRwGjCJL04ViSbdJ2h0Rn6/40X2SViePV0u6t/3lAfVxGDCKLM0IfIWkj0paaXtn8uddktZLutz2s5LekTwHuorDgFFkabpQvi9ptm3ZLmtvOUBz6h0GTHcK5jtO5EHu1ToMmO4UFAEBjp7q1Ci5XncKAY75ggBHz3RylEx3CoqAzazQM53s4aY7BUVAgKNnOjlKpjsFRUCAo2c6OUpmp0EUAXPg6Jkbrlh6why41N5Rcq3uFGA+IcDRM/V6uAE0RoCjpxglA60jwJFZrKQE6iPAkUmspAQaowsFmcQ+30BjBDgyiZWUQGMEODJly0RZK9ZvrX28k6RTbG2ZKHe1JiCrmANHZlTPe9dyNIK5cCDBCByZUWveuxbmwoFpBDgyo5n5bebCAQIcGdLMHijsKggQ4MiQWjsI9p9i9fedeKIfuwoC09KcSn+77X22d1Vcu9l2ueqQY2BOau0guOGPLtCG91/AroJADY6YrWEreYH9dkkHJf1LRLw5uXazpIMRcUszH1YqlWJ8fLzFUgGgmGzviIhS9fU0p9I/Yvv8jlSF3GO/EqB35jIH/nHbjydTLGfP9iLba2yP2x7fv3//HD4OWTPTt12ePKzQ8f1KWGgDdEerAf5lSb8l6UJJeyX902wvjIiNEVGKiNLQ0FCLH4csYr8SoLdaCvCIeDEijkbEq5K+Iumi9paFPGC/EqC3Wgpw24sqnr5X0q7ZXov5i5Pfgd5qeBPT9p2SLpH0Ott7JN0k6RLbF0oKST+U9GedKxFZM3Pjsjx5WJZO2HiKHm2ge9J0oVxT4/JtHagFOVC94VRIx0J8mC4UoKvYjRBNqXXjcia8H127sjdFAQXFUno0hRuXQHYQ4GgKNy6B7CDA0ZRaG05x4xLoDebA0ZSZG5Qsnwd6jwBH08ZGhwlsIAOYQgGAnCLAASCnmEIpOLaDBfKLAC+w6lWVM9vBSiLEgRxgCqXA2A4WyDcCvMBYVQnkGwFeYKyqBPKNAC8wVlUC+cZNzAJjVSWQbwR4wbGqEsgvplAAIKcIcADIqYYBbvt22/ts76q4do7th2w/m/x9dmfLBABUSzMC/7qkK6uurZX0cES8QdLDyXMAQBc1DPCIeETSy1WXr5a0KXm8SdJYe8sCADTS6hz4wojYmzx+QdLCNtUDAEhpzm2EERG2Y7af214jaY0kjYyMzPXj5h12AwTQqlZH4C/aXiRJyd/7ZnthRGyMiFJElIaGhlr8uPlpZjfA8uRhhY7vBrhlotzr0gDkQKsBfp+k1cnj1ZLubU85xcJugADmIk0b4Z2S/lPSUtt7bF8rab2ky20/K+kdyXM0qTzLrn+zXQeASg3nwCPimll+dFmba5nXas11A8BcsBdKF9Q7+QYAWsVS+i6Yba57Nn12p0sCMA8Q4F3Q7Ak317z1vA5VAmA+IcC7YLYTboYHB/SRi0eOjbj7bH3k4hF9Zmz5sddsmShrxfqtWrL2Aa1Yv5UWQwDHMAfeBTdcsfSEOXDp+Mk3Y6PDJwR2JU6NB1APAV6hU6siWz35pl6fOAEOgABPdHq028rJN5waD6Ae5sATaVdFdnNOmlPjAdRDgCfSjHa7vXcJp8YDqIcAT6QZ7XZ775Kx0WGtW7Vcw4MDsqa7VtatWs78NwBJzIEfU69TZEYv5qQ5NR7AbBiBJ9KMdpmTBpAljMArNBrtphmlA0C3EOBNaLWfGwA6gQBXcwt4mJMGkBWFD3CWqwPIq8LfxORYMwB5VfgAZ7k6gLwqfIDTGgggr+YU4LZ/aPsJ2zttj7erqG5iuTqAvGrHTcxLI+KlNrxPT9AaCCCvCt
+FItEaCCCf5hrgIem7tkPSP0fExuoX2F4jaY0kjYyMzPHjOqNTBzkAQCfNNcB/PyLKtn9D0kO2n46IRypfkIT6RkkqlUoxx887pjJ0XzvQL1uaPDTVdADTBw4gr+Z0EzMiysnf+yR9S9JF7Siqkep9uScPT+nAoamW9uimDxxAXrUc4LbPsH3WzGNJ75S0q12F1VMrdCs1E8D0gQPIq7lMoSyU9C3bM+/zbxHxnbZU1UCacK0+SWe2Oe5zBwdUrvF+9IEDyLqWR+AR8XxEXJD8eVNEfLadhdWTJlxnXtPoGDT6wAHkVeZXYtY6RLhW6FaqDOBGc9wcWwYgrxzRtsaQhkqlUoyPp1+wWd0hIk2H87pVyyUpVRfKkrUPqNY3tKQfrL9qLl8HALrC9o6IKFVfz/RCnnqj50fXrkw1SmaOG8B8lekplHZ0iDDHDWC+ynSAt2OnQOa4AcxXmZ5Cadchwux1AmA+ynSA19op8NJlQ9rw4DO6/q6d7FsCoNAyHeDSiaNn9i0BgOMyPQdejX1LAOC4XAU4+5YAwHG5CnDOrwSA43IV4PR0A8Bxmb+JWYnzKwHguFwFuERPNwDMyNUUCgDgOAIcAHKKAAeAnCLAASCnCHAAyKmunshje7+kH7Xwq6+T9FKby8mLon73on5vqbjfvajfW2r83X8zIoaqL3Y1wFtle7zWcUJFUNTvXtTvLRX3uxf1e0utf3emUAAgpwhwAMipvAT4xl4X0ENF/e5F/d5Scb97Ub+31OJ3z8UcOADgZHkZgQMAqhDgAJBTmQ9w21fafsb2c7bX9rqebrB9nu1ttp+y/aTt63pdU7fZ7rM9Yfv+XtfSLbYHbd9t+2nbu22/rdc1dYvt65N/13fZvtP2ab2uqVNs3257n+1dFdfOsf2Q7WeTv89O816ZDnDbfZK+JOkPJb1R0jW239jbqrriiKRPRsQbJV0s6S8K8r0rXSdpd6+L6LIvSvpORCyTdIEK8v1tD0v6K0mliHizpD5JH+ptVR31dUlXVl1bK+nhiHiDpIeT5w1lOsAlXSTpuYh4PiJ+Jekbkq7ucU0dFxF7I+Kx5PEvNP0/5MJsgm57saSrJH2117V0i+3XSnq7pNskKSJ+FRGTPS2quxZIGrC9QNLpkn7a43o6JiIekfRy1eWrJW1KHm+SNJbmvbIe4MOSflLxfI8KFGSSZPt8SaOStve4lG76gqRPSXq1x3V00xJJ+yV9LZk6+qrtM3pdVDdERFnSLZJ+LGmvpJ9FxHd7W1XXLYyIvcnjFyQtTPNLWQ/wQrN9pqR7JH0iIn7e63q6wfa7Je2LiB29rqXLFkj6HUlfjohRSf+nlP8ZnXfJfO/Vmv4/sXMlnWH7I72tqndiurc7VX931gO8LOm8iueLk2vznu1+TYf3HRGxudf1dNEKSe+x/UNNT5mttP2vvS2pK/ZI2hMRM/+ldbemA70I3iHpBxGxPyKmJG2W9Hs9rqnbXrS9SJKSv/el+aWsB/h/S3qD7SW2T9X0jY37elxTx9m2pudCd0fE53tdTzdFxI0RsTgiztf0P++tETHvR2MR8YKkn9hemly6TNJTPSypm34s6WLbpyf/7l+mgtzArXCfpNXJ49WS7k3zS5k+1Dgijtj+uKQHNX1n+vaIeLLHZXXDCkkflfSE7Z3Jtb+NiG/3riR0wV9KuiMZrDwv6U96XE9XRMR223dLekzTHVgTmsfL6m3fKekSSa+zvUfSTZLWS/qm7Ws1veX2B1K9F0vpASCfsj6FAgCYBQEOADlFgANAThHgAJBTBDgA5BQBDgA5RYADQE79P8YZtRCd8beXAAAAAElFTkSuQmCC\n",
"text/plain": [
"