In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

The feature vectors generated in this notebook are composed of simple summaries of the text data. We begin by loading in the data produced by [the generator notebook](00-generator.ipynb).

In [None]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded above.

In [None]:
import numpy as np

np.random.seed(0xc0fee)
df_samp = df.sample(3)

In [None]:
pd.set_option('display.max_colwidth', None)  # show the full text of each document (-1 is deprecated in recent pandas)

df_samp

The summaries we will compute for each document are:
* number of pieces of punctuation 
* number of words
* average word length
* maximum word length
* minimum word length
* 10th percentile word length
* 90th percentile word length
* number of words containing upper case letters
* number of 'stop words'
 
To begin, we count the number of pieces of punctuation in each piece of text. We will remove the punctuation from the text as it is counted. This will make computing the later summaries a little simpler.

In [None]:
import re

def strip_punct(doc):
 """
 takes in a document _doc_ and
 returns a tuple of the punctuation-free
 _doc_ and the count of punctuation in _doc_
 """
 
 return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc, count=0, flags=0)
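
As a quick check of what `re.subn` returns, the sketch below applies the same character class to a made-up sentence. The sentence and the names `PUNCT`, `example`, `stripped` and `n_punct` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

# Same character class as in strip_punct above.
PUNCT = r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]"""

# Hypothetical example sentence, not taken from the dataset.
example = "Well, this is great!"

# re.subn returns a tuple: (new_string, number_of_substitutions).
stripped, n_punct = re.subn(PUNCT, "", example)

print(stripped)  # Well this is great
print(n_punct)   # 2
```

Because the result is a tuple, the `text_str` column created below holds `(text, count)` pairs, which is why later cells index it with `[0]` and `[1]`.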

In [None]:
df_samp["text_str"]= df_samp["text"].apply(strip_punct)

In [None]:
df_samp

We will store the count of punctuation in a new data frame of summaries:

In [None]:
df_summaries = pd.DataFrame({'num_punct' :df_samp["text_str"].apply(lambda x: x[1])})
df_summaries

In [None]:
df_samp.reset_index(inplace=True) 

# Note: `level_0` and `index` coincide for the legitimate documents, but not for the spam;
# for spam, index = level_0 mod 20,000.

In [None]:
df_samp

Many of the summaries we will compute require us to consider each word in the text, one by one. To avoid splitting the text multiple times, we split it once and then apply each function to the resulting words.

To do this, we "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [None]:
rows = []
_ = df_samp.apply(lambda row: [rows.append([ row['level_0'], row['index'], row['label'], word]) 
 for word in row.text_str[0].split()], axis=1)
df_samp_explode = pd.DataFrame(rows, columns=df_samp.columns[0:4])
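
For comparison, here is a minimal sketch of the same reshaping using `DataFrame.explode` (available in pandas 0.25 and later). The toy frame and its values are made up, and for simplicity its `clean` column holds plain strings rather than the `(text, count)` tuples stored in `text_str`.

```python
import pandas as pd

# Toy frame standing in for df_samp; all values are made up for illustration.
toy = pd.DataFrame({
    "level_0": [7, 42],
    "label": ["legitimate", "spam"],
    "clean": ["a fine day", "buy now"],
})

# Split each document into a list of words, then give every word its own row.
# The other columns are repeated for each word of the same document.
toy_explode = (
    toy.assign(text=toy["clean"].str.split())
       .drop(columns="clean")
       .explode("text")
       .reset_index(drop=True)
)

print(toy_explode)
```

Each word keeps the `level_0` and `label` of the document it came from, which is what the later `groupby('level_0')` calls rely on.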

In [None]:
df_samp_explode

Column `level_0` contains the index we want to aggregate any calculations over. 

Computing the number of words in each document now amounts to counting the rows for each value of `level_0`.

In [None]:
df_summaries["num_words"] = df_samp_explode['level_0'].value_counts()
df_summaries

Many of the remaining summaries require word length to be computed. To save us from recomputing this every time, we will add a column containing this information to our 'exploded' data frame:

In [None]:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len) 

In [None]:
df_samp_explode.sample(10) 

In the next cell we compute the average word length as well as the minimum and maximum, for each document. 

In [None]:
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length

We can also compute quantiles of the word length: 

In [None]:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th quantile word length
df_summaries["90_quantile"]= df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th quantile word length

In [None]:
df_summaries
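
The word-length summaries above each call `groupby` separately. As a sketch of an alternative, they can be computed in one pass with named aggregation. The toy data and the helper names `q10` and `q90` are assumptions made for this example.

```python
import pandas as pd

# Toy exploded frame; values are made up for illustration.
toy = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "word_len": [1, 4, 7, 3, 5],
})

def q10(s):
    """10th percentile of a Series (linear interpolation, the pandas default)."""
    return s.quantile(0.1)

def q90(s):
    """90th percentile of a Series."""
    return s.quantile(0.9)

# One groupby, one output column per keyword argument.
summary = toy.groupby("level_0")["word_len"].agg(
    av_wl="mean",
    max_wl="max",
    min_wl="min",
    q10=q10,
    q90=q90,
)

print(summary)
```

The result is indexed by `level_0`, so it could be joined onto a summaries frame that shares that index.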

As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute: 

* the number of words which contain at least one capital letter
* the number of stop words



In [None]:
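# str.islower() returns True only if the string has at least one cased character and all cased characters are lowercase.
# Note: str.isupper() is True only if all cased characters are upper case, so we negate islower() instead.
# Tokens with no cased characters (e.g. "123") also fail islower() and are therefore counted.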
def caps(word):
 return not word.islower()
df_samp_explode["upper_case"]=df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum() 
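
To see how `not word.islower()` behaves on different kinds of token, the sketch below compares it with a stricter test that looks for at least one upper case character. The tokens are made up, and the stricter test is shown only as a possible variant, not as what the notebook uses.

```python
# Hypothetical tokens chosen to show the edge cases.
tokens = ["hello", "Hello", "iPhone", "123"]

# For each token: (result of the notebook's test, result of a stricter test).
results = {
    tok: (not tok.islower(), any(ch.isupper() for ch in tok))
    for tok in tokens
}

for tok, (not_lower, has_upper) in results.items():
    print(f"{tok!r}: not islower -> {not_lower}, any isupper -> {has_upper}")
```

The two tests agree on ordinary words and differ only for tokens without cased characters, such as numbers.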

In [None]:
df_summaries

Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'.

In [None]:
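# The stop_words module was removed from recent scikit-learn releases; the list is importable from `text`.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS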

In [None]:
def isstopword(word):
 return word in ENGLISH_STOP_WORDS

df_samp_explode["stop_words"]=df_samp_explode['text'].apply(isstopword)

In [None]:
df_samp_explode.sample(10)

In [None]:
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum() 
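
Note that the membership test above is case-sensitive, and scikit-learn's stop word list is all lowercase, so a capitalised word such as 'The' is not counted. If a case-insensitive count were wanted, one option is to lowercase each word first. The sketch below uses a tiny hand-written set, `TOY_STOP_WORDS`, as a stand-in for the real list; that set and the function name `is_stop_word` are assumptions for this example.

```python
# Tiny hand-written stand-in for a stop word list (an assumption for this sketch;
# the notebook itself uses scikit-learn's ENGLISH_STOP_WORDS).
TOY_STOP_WORDS = frozenset({"in", "the", "at", "otherwise"})

def is_stop_word(word, stop_words=TOY_STOP_WORDS):
    """Case-insensitive membership test against a set of lowercase stop words."""
    return word.lower() in stop_words

words = ["The", "cat", "sat", "at", "home"]
flags = [is_stop_word(w) for w in words]
n_stop = sum(flags)

print(flags)
print(n_stop)
```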

In [None]:
df_summaries

Now that we've illustrated how to compute the summaries on a subsample of our data, we will go ahead and compute the summaries for each of the texts in the full dataset. In order to minimise clutter in this notebook we have [introduced a helper module called `featuressimple`](mlworkflows/featuressimple.py), which provides the `SimpleSummaries` transformer used below.

In [None]:
df.reset_index(inplace=True)

In [None]:
from mlworkflows import featuressimple

In [None]:
simple_summary = featuressimple.SimpleSummaries()

summaries = simple_summary.transform(df["text"])
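
The source of `SimpleSummaries` is not shown in this notebook. As a rough idea of the shape such a transformer could take, the sketch below defines a hypothetical class, `SketchSummaries`, with the `fit`/`transform` interface that scikit-learn pipelines expect. It is not the real implementation and computes only five of the summaries.

```python
import re
import numpy as np

PUNCT = re.compile(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""")

class SketchSummaries:
    """Hypothetical stand-in, NOT the real featuressimple.SimpleSummaries."""

    def fit(self, X, y=None):
        # Nothing to learn: the summaries are fixed functions of the text.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            # Strip punctuation and count it in one step.
            clean, n_punct = PUNCT.subn("", doc)
            words = clean.split()
            # Fall back to a single zero length for empty documents.
            lengths = [len(w) for w in words] or [0]
            rows.append([
                n_punct,           # number of pieces of punctuation
                len(words),        # number of words
                np.mean(lengths),  # average word length
                max(lengths),      # maximum word length
                min(lengths),      # minimum word length
            ])
        return np.array(rows, dtype=float)

# Made-up documents, not taken from the dataset.
docs = ["Well, this is great!", "Buy now"]
sketch = SketchSummaries().fit(docs).transform(docs)
print(sketch.shape)  # (2, 5)
```

Each document becomes one row of numbers, which is the form of output the `summaries` variable above is later wrapped in a `DataFrame` to hold.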

In [None]:
from sklearn.pipeline import Pipeline

feat_pipeline = Pipeline([
 ('features',simple_summary)
])

from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")

In [None]:
features = pd.concat([df[["index", "label"]],
 pd.DataFrame(summaries)], axis=1)

In [None]:
features

In [None]:
features.columns = features.columns.astype(str)

#### Visualisation:

As in earlier notebooks, we use PCA to project the space of summaries to 2 dimensions, which we can then plot. 

In [None]:
import sklearn.decomposition

DIMENSIONS = 2

pca = sklearn.decomposition.PCA(DIMENSIONS)

pca_summaries = pca.fit_transform(features.iloc[:, 2:])  # skip the "index" and "label" columns
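
To make the projection more concrete, here is a minimal NumPy sketch of the computation behind a two-dimensional PCA: centre the columns, take a singular value decomposition, and project onto the first two principal directions. The random matrix `X` is made-up data standing in for the summaries, and the result may differ from scikit-learn's output in the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature matrix: 100 documents, 9 summaries on different scales.
X = rng.normal(size=(100, 9)) * np.arange(1, 10)

# Centre each column, as PCA does.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the centred data onto the first two principal directions.
projected = Xc @ Vt[:2].T

print(projected.shape)  # (100, 2)
```

Because the columns are only centred and not rescaled, summaries with large numeric ranges dominate the leading components.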

In [None]:
from mlworkflows import plot

pca_summaries_plot_data = pd.concat([df, pd.DataFrame(pca_summaries, columns=["x", "y"])], axis=1)

plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")

In [None]:
features.to_parquet("data/features.parquet")

Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: [click here](04-model-logistic-regression.ipynb) for a model based on *logistic regression*, or [click here](04-model-random-forest.ipynb) for a model based on *ensembles of decision trees*.