{ "metadata": { "name": "", "signature": "sha256:4c5c82048733cdbf75467f019dd99a78107b5c684747f698fcf7390509004f84" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Introduction to scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scikit-learn` is a machine learning library for Python. In this tutorial we will learn the `scikit-learn` interface by walking through a regression problem. In detail, we will:\n", "\n", "* load a dataset in `.csv` format with `pandas`\n", "* preprocess the data to make it suitable for `scikit-learn`\n", "* fit a Decision Tree to the data\n", "* compare the predicted and true values\n", "* plot the learning curve\n", "* use cross-validation to tune model parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First install `scikit-learn` with `conda` (recommended) or `pip`:\n", "\n", "    conda install scikit-learn\n", "\n", "or:\n", "\n", "    pip install scikit-learn" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The dataset: predict Abalone age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Abalone is a mollusc with a peculiar ear-shaped shell lined with mother-of-pearl.\n", "Its age can be estimated by counting the rings in its shell under a microscope, but this is a time-consuming process. In this tutorial we will use machine learning to predict the age from physical measurements."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Picture of the shell of an abalone:\n", "![Abalone](http://upload.wikimedia.org/wikipedia/commons/0/0b/AbaloneInside.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is available from the [University of California Irvine Machine Learning data repository](http://archive.ics.uci.edu/ml/datasets/Abalone). Download the [data in Comma-Separated Values format](http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data) into the current folder." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data file has no column labels, so we copy them from the documentation, then use `pandas` to read the file and print the first few lines of the dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# The file has no header row, so the column names are supplied explicitly.\n", "column_names = [\"sex\", \"length\", \"diameter\", \"height\", \"whole weight\",\n", "                \"shucked weight\", \"viscera weight\", \"shell weight\", \"rings\"]\n", "data = pd.read_csv(\"abalone.data\", names=column_names)\n", "print(\"Number of samples: %d\" % len(data))\n", "# Show the first five rows.\n", "data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Number of samples: 4177\n" ] }, { "html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>sex</th><th>length</th><th>diameter</th><th>height</th><th>whole weight</th><th>shucked weight</th><th>viscera weight</th><th>shell weight</th><th>rings</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>M</td><td>0.455</td><td>0.365</td><td>0.095</td><td>0.5140</td><td>0.2245</td><td>0.1010</td><td>0.150</td><td>15</td></tr>\n",
"<tr><th>1</th><td>M</td><td>0.350</td><td>0.265</td><td>0.090</td><td>0.2255</td><td>0.0995</td><td>0.0485</td><td>0.070</td><td>7</td></tr>\n",
"<tr><th>2</th><td>F</td><td>0.530</td><td>0.420</td><td>0.135</td><td>0.6770</td><td>0.2565</td><td>0.1415</td><td>0.210</td><td>9</td></tr>\n",
"<tr><th>3</th><td>M</td><td>0.440</td><td>0.365</td><td>0.125</td><td>0.5160</td><td>0.2155</td><td>0.1140</td><td>0.155</td><td>10</td></tr>\n",
"<tr><th>4</th><td>I</td><td>0.330</td><td>0.255</td><td>0.080</td><td>0.2050</td><td>0.0895</td><td>0.0395</td><td>0.055</td><td>7</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 9 columns</p>
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>length</th><th>diameter</th><th>height</th><th>whole weight</th><th>shucked weight</th><th>viscera weight</th><th>shell weight</th><th>rings</th><th>M</th><th>F</th><th>I</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0.455</td><td>0.365</td><td>0.095</td><td>0.5140</td><td>0.2245</td><td>0.1010</td><td>0.150</td><td>15</td><td>True</td><td>False</td><td>False</td></tr>\n",
"<tr><th>1</th><td>0.350</td><td>0.265</td><td>0.090</td><td>0.2255</td><td>0.0995</td><td>0.0485</td><td>0.070</td><td>7</td><td>True</td><td>False</td><td>False</td></tr>\n",
"<tr><th>2</th><td>0.530</td><td>0.420</td><td>0.135</td><td>0.6770</td><td>0.2565</td><td>0.1415</td><td>0.210</td><td>9</td><td>False</td><td>True</td><td>False</td></tr>\n",
"<tr><th>3</th><td>0.440</td><td>0.365</td><td>0.125</td><td>0.5160</td><td>0.2155</td><td>0.1140</td><td>0.155</td><td>10</td><td>True</td><td>False</td><td>False</td></tr>\n",
"<tr><th>4</th><td>0.330</td><td>0.255</td><td>0.080</td><td>0.2050</td><td>0.0895</td><td>0.0395</td><td>0.055</td><td>7</td><td>False</td><td>False</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 11 columns</p>\n", "