{ "metadata": { "name": "iris" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Iris classification using copper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try copper using the Iris dataset just downloaded from the [UCI machine learning repository](http://archive.ics.uci.edu/ml/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First step is import copper and the magnificent pandas library, load the data using and create a Dataset using the DataFrame." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import copper\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.read_csv('iris.csv', header=None)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "ds = copper.Dataset(data)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is possible to access and modify the DataFrame directly using the `Dataset.frame` property. On this case I wanted to check for any missing values and to drop them if any. One missing value on this case." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ds.frame" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n",
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 151 entries, 0 to 150\n",
        "Data columns (total 5 columns):\n",
        "0    150  non-null values\n",
        "1    150  non-null values\n",
        "2    150  non-null values\n",
        "3    150  non-null values\n",
        "4    150  non-null values\n",
        "dtypes: float64(4), object(1)\n",
        "
" ], "output_type": "pyout", "prompt_number": 4, "text": [ "\n", "Int64Index: 151 entries, 0 to 150\n", "Data columns (total 5 columns):\n", "0 150 non-null values\n", "1 150 non-null values\n", "2 150 non-null values\n", "3 150 non-null values\n", "4 150 non-null values\n", "dtypes: float64(4), object(1)" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "ds.frame = ds.frame.dropna()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The objective of the `Dataset` is to introduce metadata to a DataFrame so is easy to make simple data transformations for machine learning.\n", "\n", "On this case the Target is a string, we can take see the metadata and the first rows of the data to undestand better what are we dealing with." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ds.metadata" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RoleTypedtype
Columns
0 Input Number float64
1 Input Number float64
2 Input Number float64
3 Input Number float64
4 Input Category object
\n", "
" ], "output_type": "pyout", "prompt_number": 6, "text": [ " Role Type dtype\n", "Columns \n", "0 Input Number float64\n", "1 Input Number float64\n", "2 Input Number float64\n", "3 Input Number float64\n", "4 Input Category object" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "ds.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Columns01234
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
\n", "
" ], "output_type": "pyout", "prompt_number": 7, "text": [ "Columns 0 1 2 3 4\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "On this case the only change we need to make is the role of the 4th column to Target instead of Input. \n", "\n", "Also is necessary to change the dtype of the Target from `str` (object) to numbers (`np.int` or `np.float`) but this is going to be handled by copper automaticly " ] }, { "cell_type": "code", "collapsed": false, "input": [ "ds.role[4] = ds.TARGET" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To easy compare different classifiers and different combinations of those we use the ModelComparison module.\n", "\n", "We can pass a Dataset to randomly split it into training and testing datasets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc = copper.ModelComparison()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "mc.train_test_split(ds, test_size=0.3, random_state=12345)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If want to take a look at the data that is going to be used can see the ModelComparison.X_train, y_train, X_test and y_test properties." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc.X_train[:5], mc.y_train[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 11, "text": [ "(array([[ 6. 
, 2.9, 4.5, 1.5],\n", " [ 4.9, 3.1, 1.5, 0.1],\n", " [ 6.2, 2.2, 4.5, 1.5],\n", " [ 6.5, 2.8, 4.6, 1.5],\n", " [ 6.9, 3.1, 4.9, 1.5]]),\n", " array([1, 0, 1, 1, 1]))" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that the target was encoded to numbers, so it is possible to use the scikit-learn classifiers.\n", "\n", "For this example I am going to compare a Logistic Regression with a Support Vector Machine." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LogisticRegression" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to define a unique name for each algorithm." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc['LR'] = LogisticRegression()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "mc['SVM'] = SVC(probability=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we can fit/train all the classifiers and predict using the testing dataset that was split off." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc.fit()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "mc.predict().head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SVMLR
0 1 1
1 0 0
2 1 2
3 0 0
4 0 0
\n", "
" ], "output_type": "pyout", "prompt_number": 17, "text": [ " SVM LR\n", "0 1 1\n", "1 0 0\n", "2 1 2\n", "3 0 0\n", "4 0 0" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like both classifiers agree in the first items.\n", "\n", "We get some metrics about the classifiers, my personal favorite is the Confusion Matrix." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc.cm('LR')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Iris-setosaIris-versicolorIris-virginica
Iris-setosa 16 0 0
Iris-versicolor 0 12 5
Iris-virginica 0 0 12
\n", "
" ], "output_type": "pyout", "prompt_number": 18, "text": [ " Iris-setosa Iris-versicolor Iris-virginica\n", "Iris-setosa 16 0 0\n", "Iris-versicolor 0 12 5\n", "Iris-virginica 0 0 12" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "mc.cm('SVM')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Iris-setosaIris-versicolorIris-virginica
Iris-setosa 16 0 0
Iris-versicolor 0 16 1
Iris-virginica 0 0 12
\n", "
" ], "output_type": "pyout", "prompt_number": 19, "text": [ " Iris-setosa Iris-versicolor Iris-virginica\n", "Iris-setosa 16 0 0\n", "Iris-versicolor 0 16 1\n", "Iris-virginica 0 0 12" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the Model Comparison still has the encoding for the target variables and does the inverse when necessary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another always good metric is the accuracy for the models lets see who is the winner." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mc.accuracy_score()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 20, "text": [ "SVM 0.977778\n", "LR 0.888889\n", "dtype: float64" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now imagine that sklearn don't have the fbeta_score metric, and also imagine that you need it. Then you can create a custom metric and use the `ModelComparison.metric` utility.\n", "\n", "On this example I am going to import `sklearn.metrics.fbeta_score`, but it works with any function that follows the same format: `func(y_true, y_pred, **args)`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import fbeta_score" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "mc.metric(fbeta_score, beta=0.1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 22, "text": [ "SVM 0.979441\n", "LR 0.920566\n", "dtype: float64" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }