{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Andreas Mueller, Kyle Kastner, Sebastian Raschka \n", "last updated: 2016-06-28 \n", "\n", "CPython 3.5.1\n", "IPython 4.2.0\n", "\n", "numpy 1.11.0\n", "scipy 0.17.1\n", "matplotlib 1.5.1\n", "scikit-learn 0.17.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SciPy 2016 Scikit-learn Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study - Titanic Survival" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will talk about an important piece of machine learning: the extraction of\n", "quantitative features from data. By the end of this section you will\n", "\n", "- Know how features are extracted from real-world data.\n", "- See an example of extracting numerical features from textual data\n", "\n", "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What Are Features?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numerical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size\n", "**n_samples** $\\times$ **n_features**.\n", "\n", "Previously, we looked at the iris dataset, which has 150 samples and 4 features" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "print(iris.data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These features are:\n", "\n", "- sepal length in cm\n", "- sepal width in cm\n", "- petal length in cm\n", "- petal width in cm\n", "\n", "Numerical features such as these are pretty straightforward: each sample contains a list\n", "of floating-point numbers corresponding to the features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you have categorical features? For example, imagine there is data on the color of each\n", "iris:\n", "\n", " color in [red, blue, purple]\n", "\n", "You might be tempted to assign numbers to these features, i.e. *red=1, blue=2, purple=3*\n", "but in general **this is a bad idea**. Estimators tend to operate under the assumption that\n", "numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike\n", "than 1 and 3, and this is often not the case for categorical features.\n", "\n", "In fact, the example above is a subcategory of \"categorical\" features, namely, \"nominal\" features. Nominal features don't imply an order, whereas \"ordinal\" features are categorical features that do imply an order. An example of ordinal features would be T-shirt sizes, e.g., XL > L > M > S. \n", "\n", "One work-around for parsing nominal features into a format that prevents the classification algorithm from asserting an order is the so-called one-hot encoding representation. Here, we give each category its own dimension. 
\n", "\n", "The enriched iris feature set would hence be in this case:\n", "\n", "- sepal length in cm\n", "- sepal width in cm\n", "- petal length in cm\n", "- petal width in cm\n", "- color=purple (1.0 or 0.0)\n", "- color=blue (1.0 or 0.0)\n", "- color=red (1.0 or 0.0)\n", "\n", "Note that using many of these categorical features may result in data which is better\n", "represented as a **sparse matrix**, as we'll see with the text classification example\n", "below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the DictVectorizer to encode categorical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the source data is encoded has a list of dicts where the values are either strings names for categories or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features unimpacted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "measurements = [\n", " {'city': 'Dubai', 'temperature': 33.},\n", " {'city': 'London', 'temperature': 12.},\n", " {'city': 'San Francisco', 'temperature': 18.},\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "\n", "vec = DictVectorizer()\n", "vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vec.fit_transform(measurements).toarray()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vec.get_feature_names()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Derived Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common feature type are **derived features**, where some pre-processing step is\n", "applied to the data to generate features that are somehow more informative. Derived\n", "features may be based in **feature extraction** and **dimensionality reduction** (such as PCA or manifold learning),\n", "may be linear or nonlinear combinations of features (such as in polynomial regression),\n", "or may be some more sophisticated transform of the features. The latter is often used\n", "in image processing.\n", "\n", "For example, [scikit-image](http://scikit-image.org/) provides a variety of feature\n", "extractors designed for image data: see the ``skimage.feature`` submodule.\n", "We will see some *dimensionality*-based feature extraction routines later in the tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining Numerical and Categorical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example of how to work with both categorical and numerical data, we will perform survival predicition for the passengers of the HMS Titanic.\n", "\n", "We will use a version of the Titanic (titanic3.xls) dataset from Thomas Cason, as retrieved from Frank Harrell's webpage [here](http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html). We converted the .xls to .csv for easier manipulation without involving external libraries, but the data is otherwise unchanged.\n", "\n", "We need to read in all the lines from the (titanic3.csv) file, set aside the keys from the first line, and find our labels (who survived or died) and data (attributes of that person). 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Combining Numerical and Categorical Features" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.\n", "\n", "We will use a version of the Titanic (titanic3.xls) dataset from Thomas Cason, as retrieved from Frank Harrell's webpage [here](http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html). We converted the .xls to .csv for easier manipulation without involving external libraries, but the data is otherwise unchanged.\n", "\n", "We need to read in all the lines from the titanic3.csv file, set aside the keys from the first line, and find our labels (who survived or died) and data (attributes of that person). Let's look at the keys and some corresponding example lines." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "\n", "f = open(os.path.join('datasets', 'titanic', 'titanic3.csv'))\n", "print(f.readline())\n", "lines = []\n", "for i in range(3):\n", "    lines.append(f.readline())\n", "print(lines)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The site [linked here](http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic3info.txt) gives a broad description of the keys and what they mean - we show it here for completeness:\n", "\n", "```\n", "pclass     Passenger Class\n", "           (1 = 1st; 2 = 2nd; 3 = 3rd)\n", "survival   Survival\n", "           (0 = No; 1 = Yes)\n", "name       Name\n", "sex        Sex\n", "age        Age\n", "sibsp      Number of Siblings/Spouses Aboard\n", "parch      Number of Parents/Children Aboard\n", "ticket     Ticket Number\n", "fare       Passenger Fare\n", "cabin      Cabin\n", "embarked   Port of Embarkation\n", "           (C = Cherbourg; Q = Queenstown; S = Southampton)\n", "boat       Lifeboat\n", "body       Body Identification Number\n", "home.dest  Home/Destination\n", "```\n", "\n", "In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` may be candidates for categorical features, while the rest appear to be numerical features. A function that extracts features from a single text line, `process_titanic_line`, is provided in `helpers.py`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's process an example line using the `process_titanic_line` function from `helpers` to see the expected output." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from helpers import process_titanic_line\n", "\n", "print(process_titanic_line(lines[0]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that we see the expected format from the line, we can call a dataset helper which uses this processing to read in the whole dataset. See ``helpers.py`` for more details." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from helpers import load_titanic\n", "\n", "keys, train_data, test_data, train_labels, test_labels = load_titanic(\n", "    test_size=0.2, feature_skip_tuple=(), random_state=1999)\n", "print(\"Key list: %s\" % keys)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With all of the hard data loading work out of the way, evaluating a classifier on this data becomes straightforward. Setting up the simplest possible model, we can establish a baseline score with `DummyClassifier`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "from sklearn.dummy import DummyClassifier\n", "\n", "# always predict the most frequent class as a baseline\n", "clf = DummyClassifier(strategy='most_frequent')\n", "clf.fit(train_data, train_labels)\n", "pred_labels = clf.predict(test_data)\n", "print(\"Prediction accuracy: %f\" % accuracy_score(test_labels, pred_labels))" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Exercise\n", "=====\n", "Try executing the above classification, using `RandomForestClassifier` instead of `DummyClassifier`.\n", "\n", "Can you remove or create new features to improve your score? Try printing the feature importances as shown in this [sklearn example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) and removing features by adding arguments to `feature_skip_tuple`. One possible starting point is sketched below." ] },
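{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible starting point for the exercise, reusing the `train_data`/`test_data` split loaded above and assuming the entries of `keys` line up with the feature columns returned by `load_titanic`; this is a sketch, not the only solution:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score\n", "\n", "# one possible starting point, not the only solution\n", "rf = RandomForestClassifier(n_estimators=100, random_state=1999)\n", "rf.fit(train_data, train_labels)\n", "pred_labels = rf.predict(test_data)\n", "print(\"Prediction accuracy: %f\" % accuracy_score(test_labels, pred_labels))\n", "\n", "# inspect which features the forest found most informative\n", "# (assumes keys align with the columns returned by load_titanic)\n", "for key, importance in zip(keys, rf.feature_importances_):\n", "    print(\"%s: %f\" % (key, importance))" ] }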
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }