{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Machine Learning with Text in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Agenda\n", "\n", "1. Model building in scikit-learn (refresher)\n", "2. Representing text as numerical data\n", "3. Reading a text-based dataset into pandas\n", "4. Vectorizing our dataset\n", "5. Building and evaluating a model\n", "6. Comparing models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# for Python 2: use print only as a function\n", "from __future__ import print_function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Model building in scikit-learn (refresher)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# load the iris dataset as an example\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# store the feature matrix (X) and response vector (y)\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# check the shapes of X and y\n", "print(X.shape)\n", "print(y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\"Observations\"** are also known as samples, instances, or records." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the first 5 rows of the feature matrix (including the feature names)\n", "import pandas as pd\n", "pd.DataFrame(X, columns=iris.feature_names).head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the response vector\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# import the class\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# instantiate the model (with the default parameters)\n", "knn = KNeighborsClassifier()\n", "\n", "# fit the model with data (occurs in-place)\n", "knn.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# predict the response for a new observation\n", "knn.predict([[3, 5, 4, 2]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Representing text as numerical data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# example text for model training (SMS messages)\n", "simple_train = ['call you tonight', 'Call me a cab', 'please call me... 
PLEASE!']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# example response vector\n", "is_desperate = [0, 0, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", "\n", "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", "\n", "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import and instantiate CountVectorizer (with the default parameters)\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "vect = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# learn the 'vocabulary' of the training data (occurs in-place)\n", "vect.fit(simple_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the fitted vocabulary\n", "vect.get_feature_names()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# transform training data into a 'document-term matrix'\n", "simple_train_dtm = vect.transform(simple_train)\n", "simple_train_dtm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# convert sparse matrix to a dense matrix\n", "simple_train_dtm.toarray()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the vocabulary and document-term matrix together\n", "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", "\n", "> In this scheme, features and samples are defined as follows:\n", "\n", "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", "\n", "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", "\n", "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# check the type of the document-term matrix\n", "type(simple_train_dtm)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# examine the sparse matrix contents\n", "print(simple_train_dtm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", "\n", "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", "\n", "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", "\n", "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# build a model to predict desperation\n", "knn = KNeighborsClassifier(n_neighbors=1)\n", "knn.fit(simple_train_dtm, is_desperate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# example text for model testing\n", "simple_test = [\"please don't call me\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# transform testing data into a document-term matrix (using existing vocabulary)\n", "simple_test_dtm = vect.transform(simple_test)\n", "simple_test_dtm.toarray()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the vocabulary and document-term matrix together\n", "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# predict whether simple_test is desperate\n", "knn.predict(simple_test_dtm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Summary:**\n", "\n", "- `vect.fit(train)` **learns the vocabulary** of the training data\n", "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Reading a text-based dataset into pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read file into pandas from the working directory\n", "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# alternative: read file into pandas from a URL\n", "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n", "# sms = pd.read_table(url, header=None, names=['label', 'message'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the shape\n", "sms.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the first 10 rows\n", "sms.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the class distribution\n", "sms.label.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# convert label to a numerical variable\n", "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# check that the conversion worked\n", "sms.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# how to define X and y (from the iris data) for use with a MODEL\n", "X = iris.data\n", "y = iris.target\n", "print(X.shape)\n", "print(y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", "X = sms.message\n", "y = sms.label_num\n", "print(X.shape)\n", "print(y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# split X and y into training and testing sets\n", "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_test = 
train_test_split(X, y, random_state=1)\n", "print(X_train.shape)\n", "print(X_test.shape)\n", "print(y_train.shape)\n", "print(y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Vectorizing our dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# instantiate the vectorizer\n", "vect = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# learn training data vocabulary, then use it to create a document-term matrix\n", "vect.fit(X_train)\n", "X_train_dtm = vect.transform(X_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# equivalently: combine fit and transform into a single step\n", "X_train_dtm = vect.fit_transform(X_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# examine the document-term matrix\n", "X_train_dtm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# transform testing data (using fitted vocabulary) into a document-term matrix\n", "X_test_dtm = vect.transform(X_test)\n", "X_test_dtm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Building and evaluating a model\n", "\n", "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", "\n", "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
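] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quote above notes that fractional counts such as **tf-idf** may also work. As an optional aside, [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) follows the same fit/transform pattern as `CountVectorizer`, but downweights tokens that appear in many documents. A minimal sketch on the toy corpus from Part 2 (we will stick with raw counts for the rest of this tutorial):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# ASIDE: tf-idf weights as an alternative to raw token counts\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf = TfidfVectorizer()\n", "pd.DataFrame(tfidf.fit_transform(simple_train).toarray(), columns=tfidf.get_feature_names())"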
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import and instantiate a Multinomial Naive Bayes model\n", "from sklearn.naive_bayes import MultinomialNB\n", "nb = MultinomialNB()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", "%time nb.fit(X_train_dtm, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# make class predictions for X_test_dtm\n", "y_pred_class = nb.predict(X_test_dtm)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate accuracy of class predictions\n", "from sklearn import metrics\n", "metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# print the confusion matrix\n", "metrics.confusion_matrix(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# print message text for the false positives (ham incorrectly classified as spam)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# print message text for the false negatives (spam incorrectly classified as ham)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# example false negative\n", "X_test[3132]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", "y_pred_prob" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate AUC\n", "metrics.roc_auc_score(y_test, y_pred_prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6: Comparing models\n", "\n", "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", "\n", "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import and instantiate a logistic regression model\n", "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# train the model using X_train_dtm\n", "%time logreg.fit(X_train_dtm, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# make class predictions for X_test_dtm\n", "y_pred_class = logreg.predict(X_test_dtm)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", "y_pred_prob" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate accuracy\n", "metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# calculate AUC\n", "metrics.roc_auc_score(y_test, y_pred_prob)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }