{ "metadata": { "name": "", "signature": "sha256:f45424aed977e5ffec61f468f2fa9fdc640284b1f3eb7535ddf1fd1516eb7c51" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "If we have learned anything so far, it is that machine learning is not just applying an algorithm and getting predictions. (Not so much commoditization after all.) A given problem has many different, moving parts, yet the steps (feature extraction, feature selection, classification, evaluation) follow a sequential order. Wouldn't it be convenient if we could wrap all of these steps in one object and then run a parameter search (e.g. grid search) with cross-validation on that object? Further, if we have two estimators in the __pipeline__ (say we apply PCA to reduce the dimensionality of the input and then apply an SVM), we need to call `fit` only once; the pipeline automatically applies each step in order to produce the correct output at the end. Still not convinced? Serializing one `pipeline` instead of serializing a `vectorizer`, a `feature_selector`, and a `classifier` separately makes deploying to production much easier. (More on this in the next notebook.) That was a quick win for the pipeline; let's see how one might use it in classification." 
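A minimal sketch of the idea, assuming scikit-learn's `Pipeline` with a PCA step feeding an SVM; the `iris` data and the step names `'pca'`/`'svc'` here are illustrative, not part of this notebook:

```python
# Sketch: chain PCA dimensionality reduction into an SVM classifier.
# A single fit/predict call drives every step of the pipeline in order.
from sklearn import datasets, decomposition, pipeline, svm

iris = datasets.load_iris()

pipe = pipeline.Pipeline([
    ('pca', decomposition.PCA(n_components=2)),  # step 1: reduce input dims
    ('svc', svm.SVC()),                          # step 2: classify
])

pipe.fit(iris.data, iris.target)   # PCA is fit/applied, its output feeds SVC.fit
preds = pipe.predict(iris.data)    # transforms then predicts, end to end
```

The same `pipe` object can be handed to a grid search, with per-step parameters addressed as `stepname__param` (e.g. `pca__n_components`), which is exactly the "search over the whole object" convenience described above.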
] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import csv\n", "import json\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "from sklearn import cross_validation \n", "from sklearn import datasets\n", "from sklearn import decomposition\n", "from sklearn import ensemble\n", "from sklearn import feature_extraction\n", "from sklearn import feature_selection\n", "from sklearn import grid_search\n", "from sklearn import metrics\n", "from sklearn import naive_bayes\n", "from sklearn import pipeline\n", "from sklearn import tree\n", "\n", "import seaborn as sns\n", "\n", "pd.set_option('display.max_columns', None)\n", "\n", "_DATA_DIR ='data'\n", "_SPAM_DATA_PATH = os.path.join(_DATA_DIR, 'SMSSpamCollection')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 95 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Dataset Explanation](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)\n", "- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link]. \n", "- A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. 
The NUS SMS Corpus is available at: [Web Link]. \n", "- A list of 450 SMS ham messages collected from Caroline Tagg's PhD Thesis available at [Web Link]. \n", "- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link]. This corpus has been used in the following academic research: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_csv(_SPAM_DATA_PATH, sep='\\t', header=None, names=['Label', 'Text'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Label</th>\n", "      <th>Text</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>ham</td>\n", "      <td>Go until jurong point, crazy.. Available only ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>ham</td>\n", "      <td>Ok lar... Joking wif u oni...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>spam</td>\n", "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>ham</td>\n", "      <td>U dun say so early hor... U c already then say...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>ham</td>\n", "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n", "    </tr>\n", "