{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. \n", "\n", "The feature vectors generated in this notebook are composed of simple summaries of the text data. We begin by loading in the data produced by [the generator notebook.](00-generator.ipynb) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_parquet(\"data/training.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded in above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.random.seed(0xc0fee)\n", "df_samp = df.sample(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', -1) #ensures that all the text is visible\n", "\n", "df_samp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The summmaries we will compute for each document are: \n", "* number of pieces of punctuation \n", "* number of words\n", "* average word length\n", "* maximum word length\n", "* minimum word length\n", "* 10th percentile word length\n", "* 90th percentile word length\n", "* number of words containing upper case letters\n", "* number 'stop words'\n", " \n", "To begin, we count the number of pieces of punctuation in each piece of text. We will remove the punctuation from the text as it is counted. This will make computing the later summaries a little simpler." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def strip_punct(doc):\n", " \"\"\"\n", " takes in a document _doc_ and\n", " returns a tuple of the punctuation-free\n", " _doc_ and the count of punctuation in _doc_\n", " \"\"\"\n", " \n", " return re.subn(r\"\"\"[!.><:;',@#~{}\\[\\]\\-_+=£$%^&()?]\"\"\", \"\", doc, count=0, flags=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp[\"text_str\"]= df_samp[\"text\"].apply(strip_punct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will store the count of punctuation in a new summaries vector: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries = pd.DataFrame({'num_punct' :df_samp[\"text_str\"].apply(lambda x: x[1])})\n", "df_summaries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp.reset_index(inplace=True) \n", "\n", "#note level and index coincide for the legitimate documents, but not for the spam - \n", " #for spam, index = level_0 mod 20,000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of the summaries we will compute require us to consider each word in the text, one by one. To prevent needing to 'split' the text multiple times, we split once, then apply each function to the resultant words. \n", "\n", "To do this, we \"explode\" the text into words, so that each word occupies a row of the data frame, and retains the associated \"level_0\", \"index\" and \"label\". " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows = []\n", "_ = df_samp.apply(lambda row: [rows.append([ row['level_0'], row['index'], row['label'], word]) \n", " for word in row.text_str[0].split()], axis=1)\n", "df_samp_explode = pd.DataFrame(rows, columns=df_samp.columns[0:4])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp_explode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column `level_0` contains the index we want to aggregate any calculations over. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing the number of words in each document is now simply calculating the number of rows for each value of `level_0`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries[\"num_words\"] = df_samp_explode['level_0'].value_counts()\n", "df_summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of the remaining summaries require word length to be computed. To save us from recomputing this every time, we will add a column containing this information to our 'exploded' data frame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp_explode[\"word_len\"] = df_samp_explode[\"text\"].apply(len) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_samp_explode.sample(10) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next cell we compute the average word length as well as the minimum and maximum, for each document. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries[\"av_wl\"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length\n", "df_summaries[\"max_wl\"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length\n", "df_summaries[\"min_wl\"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute quantiles of the word length: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries[\"10_quantile\"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th quantile word length\n", "df_summaries[\"90_quantile\"]= df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th quantile word length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute: \n", "\n", "* the number of words which contain at least one capital letter\n", "* the number of stop words\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#item.islower returns true if all characters are lowercase, else false.\n", "#nb: isupper only returns true if all characters are upper case. \n", "def caps(word):\n", " return not word.islower()\n", "df_samp_explode[\"upper_case\"]=df_samp_explode['text'].apply(caps)\n", "df_summaries[\"upper_case\"] = df_samp_explode.groupby('level_0')['upper_case'].sum() " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def isstopword(word):\n", " return word in ENGLISH_STOP_WORDS\n", "\n", "df_samp_explode[\"stop_words\"]=df_samp_explode['text'].apply(isstopword)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df_samp_explode.sample(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries[\"stop_words\"] = df_samp_explode.groupby('level_0')['stop_words'].sum() " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've illustrated how to compute the summaries on a subsample of our data, we will go ahead and compute the summaries for each of the texts in the full dataset. In order to minimise clutter in this notebook we have [introduced a helper function called `features_simple`](mlworkflows/featuressimple.py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import featuressimple" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simple_summary = featuressimple.SimpleSummaries()\n", "\n", "summaries = simple_summary.transform(df[\"text\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "feat_pipeline = Pipeline([\n", " ('features',simple_summary)\n", "])\n", "\n", "from mlworkflows import util\n", "util.serialize_to(feat_pipeline, \"feature_pipeline.sav\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = pd.concat([df[[\"index\", \"label\"]],\n", " pd.DataFrame(summaries)], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features.columns = features.columns.astype(str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Visualisation:\n", "\n", "As in earlier notebooks, we use PCA to project the space of summaries to 2 dimensions, which we can then plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.decomposition\n", "\n", "DIMENSIONS = 2\n", "\n", "pca = sklearn.decomposition.PCA(DIMENSIONS)\n", "\n", "pca_summaries = pca.fit_transform(features.iloc[:,2:features.shape[1]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import plot\n", "\n", "pca_summaries_plot_data = pd.concat([df, pd.DataFrame(pca_summaries, columns=[\"x\", \"y\"])], axis=1)\n", "\n", "plot.plot_points(pca_summaries_plot_data, x=\"x\", y=\"y\", color=\"label\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features.to_parquet(\"data/features.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a feature engineering approach, next step is to train a model. Again, you have two choices for your next step: [click here](04-model-logistic-regression.ipynb) for a model based on *logistic regression*, or [click here](04-model-random-forest.ipynb) for a model based on *ensembles of decision trees*." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "jupyter" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }