{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Agenda\n", "\n", "- Define the problem and the approach\n", "- Data basics: loading data, looking at your data, basic commands\n", "- Handling missing values\n", "- Intro to scikit-learn\n", "- Grouping and aggregating data\n", "- Feature selection\n", "- Fitting and evaluating a model\n", "- Deploying your work" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##In this notebook you will\n", "\n", "- Take a tour of `scikit-learn` and learn what it's used for\n", "- Build a toy classifier\n", "- Make and visualize a decision tree\n", "- Build your own regression model" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import pylab as pl" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##[scikit-learn](http://scikit-learn.org/stable/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Consistent APIs\n", "Algorithms are implemented with the same core functions:\n", "\n", "- fit = train an algorithm\n", "- predict = predict the value for a given record\n", "- predict_proba = predict the probability of each possible class for a given record (classification only)\n", "- transform = alter your data based on a given preprocessor (e.g. 
normalize or scale your data) (preprocessing/unsupervised)\n", "- fit_transform = train a preprocessor and then transform the data in a single step (preprocessing/unsupervised)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "df = pd.DataFrame(iris.data, columns=iris.feature_names)\n", "df['species'] = iris.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC\n", "from sklearn.neighbors import KNeighborsClassifier" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# Every estimator exposes the same fit/predict interface\n", "clfs = [\n", "    (\"svc\", SVC()),\n", "    (\"KNN\", KNeighborsClassifier())\n", "]\n", "for name, clf in clfs:\n", "    clf.fit(df[iris.feature_names], df.species)\n", "    # Formatted string prints the same on Python 2 and 3\n", "    print(\"%s %s\" % (name, clf.predict(iris.data)))\n", "    print(\"*\"*80)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "svc [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n", "********************************************************************************\n", "KNN [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1\n", " 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2\n", " 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n", 
"********************************************************************************\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Try a [RandomForestClassifier](http://scikit-learn.org/stable/modules/ensemble.html#random-forests) (`from sklearn.ensemble import RandomForestClassifier`) and see how it does" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import RandomForestClassifier\n", "clf = RandomForestClassifier()\n", "clf.fit(df[iris.feature_names], df.species)\n", "clf.predict(df[iris.feature_names])\n", "pd.crosstab(df.species, clf.predict(df[iris.feature_names]))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "/usr/local/lib/python2.7/site-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n", "/usr/local/lib/python2.7/site-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n" ] }, { "html": [ "Scikit-Learn reply to today's @wiseio Random Forest benchmark: https://t.co/El5at9KvHS \u2026 Coming soon in the next 0.14 stable release!
— Gilles Louppe (@glouppe) July 16, 2013
| species / col_0 | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 50 | 0 | 0 |
| 1 | 0 | 50 | 0 |
| 2 | 0 | 1 | 49 |
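
The table above has the true `species` in the rows and the predicted label in the columns, so the diagonal holds the correct predictions. As a minimal sketch of how such a table is counted and turned into an accuracy figure, the snippet below uses only the standard library. The helper names `crosstab` and `accuracy` are hypothetical (they are not pandas or scikit-learn functions), and the label lists are rebuilt from the counts in the table, not taken from the real prediction order. Since the classifier was scored on the same rows it was trained on, the resulting figure is probably optimistic.

```python
from collections import Counter


def crosstab(actual, predicted):
    """Count (actual, predicted) label pairs, like a confusion matrix."""
    return Counter(zip(actual, predicted))


def accuracy(table):
    """Share of records on the diagonal, i.e. where predicted == actual."""
    total = sum(table.values())
    correct = sum(n for (a, p), n in table.items() if a == p)
    return correct / float(total)


# Labels rebuilt from the counts in the table (assumed order, for illustration).
actual = [0] * 50 + [1] * 50 + [2] * 50
predicted = [0] * 50 + [1] * 50 + [1] * 1 + [2] * 49

table = crosstab(actual, predicted)
print(table[(2, 1)])    # number of species-2 records predicted as class 1
print(accuracy(table))  # 149 correct out of 150 records
```

A fairer estimate would come from fitting on one part of the data and building the same table on rows the model has not seen.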
|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |