We'll start by using the [markovify](https://github.com/jsvine/markovify/) library to make some individual sentences in the style of Jane Austen. These will be the basis for generating a stream of synthetic documents.

In [None]:
import markovify
import codecs
import random

# Markovify uses a single random generator -- notebooks using it will thus 
# only be reproducible if you set a random seed before each cell using markovify
random.seed(0xbaff1ed)

with codecs.open("data/austen.txt", "r", "cp1252") as f:
    text = f.read()

austen_model = markovify.Text(text, retain_original=False, state_size=3)

for i in range(10):
    print(austen_model.make_short_sentence(200))
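
To make the idea behind the model more concrete, here is a minimal sketch of a word-level Markov chain text generator. It is not markovify's implementation (markovify does more, such as tracking where sentences begin and end); the `build_chain` and `generate` helpers and the toy corpus are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict

def build_chain(words, state_size=2):
    """Map each run of `state_size` words to the words observed right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain

def generate(chain, start, max_words=20, rng=random):
    """Walk the chain from `start`, choosing each next word at random."""
    out = list(start)
    state = tuple(start)
    while len(out) < max_words and state in chain:
        nxt = rng.choice(chain[state])
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)

# Toy corpus (illustrative only, not taken from data/austen.txt)
toy_words = "the cat sat on the mat and the cat ran to the mat".split()
toy_chain = build_chain(toy_words, state_size=2)

rng = random.Random(0)
sentence = generate(toy_chain, start=("the", "cat"), rng=rng)
print(sentence)
```

A larger `state_size` (the cell above uses 3) makes each step depend on more context, so the output tends to stay closer to phrases that actually occur in the training text.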

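Constructing single sentences is interesting, but we'd really rather construct larger documents. Here we'll construct a series of documents of varying length: each document gets one sentence plus a Poisson-distributed number of additional sentences with mean five, so documents are never empty and average about six sentences.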

In [None]:
from scipy.stats import poisson
import numpy as np

def make_basic_documents(sentence_count=5, document_count=1, model=austen_model, seed=None):
    def shortsentence(ct):
        # ct + 1 guarantees at least one sentence per document
        sentences = [model.make_short_sentence(200) for _ in range(ct + 1)]
        # make_short_sentence returns None when no sentence fits, so skip those
        return " ".join(s for s in sentences if s is not None)

    if seed is not None:
        # seed both the Python generator and the NumPy one used by SciPy
        random.seed(seed)
        np.random.seed(seed)

    return [shortsentence(ct) for ct in poisson.rvs(sentence_count, size=document_count)]

for doc in make_basic_documents(5, 10, seed=0xdecaf):
    print(doc)
    print("\n###\n")
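
Because `shortsentence` generates `ct + 1` sentences, the number of sentences per document is a Poisson draw shifted up by one. The sketch below checks what that does to document lengths. It uses a hypothetical pure-Python `poisson_draw` helper (Knuth's multiplication method) as a stand-in for `poisson.rvs`, so it runs without SciPy; it is only meant to illustrate the distribution, not to reproduce the exact values the cell above draws.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's method: multiply uniform draws until the product drops to exp(-mean) or below."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0xdecaf)

# Same shape as the notebook: a Poisson(5) draw plus one sentence per document
sentence_counts = [poisson_draw(5, rng) + 1 for _ in range(10_000)]
mean_count = sum(sentence_counts) / len(sentence_counts)

print("shortest document:", min(sentence_counts), "sentence(s)")
print("mean sentences per document:", round(mean_count, 2))
```

The shift guarantees that no document is empty, at the cost of raising the average length from five sentences to about six.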

We're going to use the Austen model as the main basis for _legitimate messages_ in our sample data set. For the _spam messages_, we'll train two Markov models on positive and negative product reviews (taken from the [public-domain Amazon fine foods reviews dataset on Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews/)). We'll combine the models from these sources in different proportions so that all words are _possible_ in certain kinds of messages but some words are _more likely_ in legitimate messages or in spam.

In [None]:
import gzip

def train_markov_gz(fn):
    """ trains a Markov model on gzipped text data """
    with gzip.open(fn, "rt", encoding="utf-8") as f:
        text = f.read()
    return markovify.Text(text, retain_original=False, state_size=3)

negative_model = train_markov_gz("data/reviews-1.txt.gz")
positive_model = train_markov_gz("data/reviews-5-100k.txt.gz")

We can combine these models with relative weights, but this yields somewhat unusual results:

In [None]:
legitimate_model = markovify.combine([austen_model, negative_model, positive_model], [196, 2, 2])
spam_model = markovify.combine([austen_model, negative_model, positive_model], [3, 30, 40])
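
To see why weighting keeps every word possible while shifting which words are likely, consider a minimal sketch. It assumes that combining models amounts to scaling each model's transition counts by its weight and adding them up; that is our reading of what `markovify.combine` does, not a statement of its exact internals. The `combine_counts` and `probabilities` helpers, the counts, and the two-source weights are all hypothetical.

```python
from collections import Counter

def combine_counts(count_tables, weights):
    """Scale each table of follower counts by its weight, then sum them."""
    combined = Counter()
    for table, weight in zip(count_tables, weights):
        for word, count in table.items():
            combined[word] += count * weight
    return combined

def probabilities(counts):
    """Normalize counts into next-word probabilities."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical follower counts for one shared state in two source models
austen_like = {"estate": 10}
review_like = {"coffee": 10}

# Illustrative weights: mostly Austen for legitimate text, mostly reviews for spam
legit = probabilities(combine_counts([austen_like, review_like], [98, 2]))
spam = probabilities(combine_counts([austen_like, review_like], [3, 70]))

print("legitimate:", legit)
print("spam:", spam)
```

In this toy setup, both words can follow the state under either mixture, but "estate" dominates the legitimate mixture and "coffee" dominates the spam mixture.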

In [None]:
# seed both the Python generator and the NumPy one used by SciPy
random.seed(0xc0ffee)
np.random.seed(0xc0ffee)

for s in make_basic_documents(5, 20, legitimate_model):
    print(s)
    print("\n###\n")

In [None]:
random.seed(0xf00)
np.random.seed(0xf00)

for s in make_basic_documents(5, 20, spam_model):
    print(s)
    print("\n###\n")

We can then generate 20,000 example documents for each class and save them to a Parquet file for use in the next notebook.

In [None]:
import pandas as pd
import numpy as np

pd.set_option("io.parquet.engine", "pyarrow")

random.seed(0xda7aba5e)
np.random.seed(0xda7aba5e)

df = pd.DataFrame(columns=["label", "text"], dtype=np.dtype(str))

mean_sentences_per_example = 5
examples_per_class = 20000

for (label, model) in [("legitimate", legitimate_model), ("spam", spam_model)]:
    docs = [{"label" : label, "text" : txt} for txt in make_basic_documents(mean_sentences_per_example, examples_per_class, model)]
    df = pd.concat([df, pd.DataFrame(docs)])

df["text"] = df["text"].astype("str")
df["label"] = df["label"].astype("category")
df.reset_index().to_parquet("data/training.parquet")

Let's go to [the next notebook](01-vectors-and-visualization.ipynb) now!