<a href="https://colab.research.google.com/github/PhilChodrow/PIC16B/blob/master/lectures/tf/tf-5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with Recurrent Neural Networks

In this set of lecture notes, we'll consider a new kind of machine learning task. Previously, we've focused on *classification* problems. In classification problems, the goal is to assign a given piece of data to one of several categories. Today, we'll instead consider a simple  *generation* problem. A *generative* model can be used to create "realistic" examples after it's been trained. Generative models are at the heart of machine learning topics like [deepfakes](https://en.wikipedia.org/wiki/Deepfake), [language generation](https://aiweirdness.com/post/140219420017/the-silicon-gourmet-training-a-neural-network-to), and [style transfer](https://www.tensorflow.org/tutorials/generative/style_transfer).  

*Parts of these lecture notes were based on [this tutorial](https://keras.io/examples/generative/lstm_character_level_text_generation/). It is recommended to run the code contained in these notes in a Google Colab instance with GPU acceleration enabled.* 

In [2]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

In [3]:
# link to Google Drive to read in trained model
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## Our Task

Today, we are going to see whether we can teach an algorithm to understand and reproduce the pinnacle of cultural achievement; the benchmark against which all art is to be judged; the mirror that reveals to humanity its truest self. I speak, of course, of *Star Trek: Deep Space Nine.*

<figure class="image" style="width:300px">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/_images/DS9.jpg" alt="">
  <figcaption><i></i></figcaption>
</figure>

In particular, we are going to attempt to teach a neural network to generate *episode scripts*. This is a text generation task: after training, our hope is that our model will be able to create scripts that are reasonably realistic in their appearance. 


In [4]:
## miscellaneous data cleaning

start_episode = 20 # Start in Season 2, Season 1 is not very good
num_episodes = 50  # only pick this many episodes to train on

url = "https://github.com/PhilChodrow/PIC16B/blob/master/datasets/star_trek_scripts.json?raw=true"
star_trek_scripts = pd.read_json(url)

cleaned = star_trek_scripts["DS9"].str.replace("\n\n\n\n\n\nThe Deep Space Nine Transcripts -", "")
cleaned = cleaned.str.split("\n\n\n\n\n\n\n").str.get(-2)
text = "\n\n".join(cleaned[start_episode:(start_episode + num_episodes)])
for char in ['\xa0', 'à', 'é', "}", "{"]:
    text = text.replace(char, "")

The result is one long string containing the scripts of 50 episodes of Star Trek: Deep Space 9. How glorious!

In [5]:
len(text)

1570834

In [6]:
print(text[0:500])

  Last
time on Deep Space Nine.  
SISKO: This is the emblem of the Alliance for Global Unity. They call
themselves the Circle. 
O'BRIEN: What gives them the right to mess up our station? 
ODO: They're an extremist faction who believe in Bajor for the
Bajorans. 
SISKO: I can't loan you a Starfleet runabout without knowing where you
plan on taking it. 
KIRA: To Cardassia Four to rescue a Bajoran prisoner of war. 
(The prisoners are rescued.) 
KIRA: Come on. We have a ship waiting. 
JARO: What you 


Our first step, as usual, is data preparation. What we need to do is format the data in such a way that we can treat the situation as a classification problem after all. That is: 

> Given a string of text, predict the next character in that string. 

Doing this repeatedly will allow the model to generate large bodies of text. 

To do this, we want to split our data like so: 

```
predictor = "to boldly g"
target    = "o"
```

The following function will do this for us. The `max_len` argument gives the number of characters that should be in the predictor string, and the `step_size` argument lets us skip indices if we want to in order to decrease the size of the data. 

In [7]:
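def split(raw_text, max_len, step_size = 1):
    """Pair each window of max_len characters with the character that follows it."""

    lines = []
    next_chars = []

    # slide a window of width max_len across raw_text, 
    # advancing step_size characters each time
    for i in range(0, len(raw_text) - max_len, step_size):
        lines.append(raw_text[i:i+max_len])
        next_chars.append(raw_text[i+max_len])
    
    return lines, next_chars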

In [8]:
max_len = 20

lines, next_chars =  split(text, max_len = max_len, step_size = 5)
for i in range(10, 15):
    print(lines[i] + "     =>    " + next_chars[i])

he emblem of the All     =>    i
blem of the Alliance     =>     
of the Alliance for      =>    G
e Alliance for Globa     =>    l
iance for Global Uni     =>    t


Our next step is to vectorize the characters. This is similar to the word vectorization task, but it's simple enough in this case that it's arguably more convenient to handle it outside of TensorFlow. It is also possible to handle vectorization using TensorFlow tools, as demonstrated in [this tutorial](https://www.tensorflow.org/tutorials/text/text_generation). 

In [9]:
chars = sorted(set(text))
char_indices = {char : chars.index(char) for char in chars}
X = np.zeros((len(lines), max_len, len(chars)))
y = np.zeros((len(lines), 1), dtype = np.int32)
for i, line in enumerate(lines):
	for t, char in enumerate(line):
		X[i, t, char_indices[char]] = 1
	y[i] = char_indices[next_chars[i]]

Let's take a look at the results. 

In [10]:
X.shape, y.shape

((314163, 20, 78), (314163, 1))
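
These shapes can be checked by hand. The loop in `split()` runs over `range(0, len(text) - max_len, step_size)`, so the number of windows is that difference divided by the step size, rounded up. The sketch below redoes the arithmetic with the numbers printed above. The variable names are illustrative, and the memory estimate assumes the `np.zeros` default dtype of `float64` (8 bytes per entry).

```python
import math

text_len = 1570834   # len(text), from the output above
max_len = 20
step_size = 5
num_chars = 78       # number of distinct characters, from X.shape

# number of windows produced by range(0, text_len - max_len, step_size)
num_windows = math.ceil((text_len - max_len) / step_size)
print(num_windows)

# the same count, read directly off the range used in split()
assert num_windows == len(range(0, text_len - max_len, step_size))

# rough size of X in memory, assuming 8-byte float64 entries
bytes_needed = num_windows * max_len * num_chars * 8
print(round(bytes_needed / 1e9, 2), "GB")
```

Under that assumption `X` occupies several gigabytes. If memory is tight, a smaller dtype or a larger `step_size` would reduce it, though these notes do neither.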

In [11]:
X[0]

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

What does this matrix say? `X[0]` is a single sequence of 20 characters. The first row says that the first character of this sequence is the character corresponding to integer `1`. The second row says that the second character corresponds to integer `1`, and so on (`1` is the space character `" "`, in case you're wondering).  
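
To make the encoding concrete, here is a minimal sketch on a toy string. It repeats the same one-hot construction and then inverts it with `argmax`. The names `toy_text`, `toy_chars`, `one_hot`, and `decode` are made up for illustration and are not used elsewhere in these notes.

```python
import numpy as np

toy_text = "to boldly go"
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot(s):
    """One row per character, with a single 1 in that character's column."""
    M = np.zeros((len(s), len(toy_chars)))
    for t, c in enumerate(s):
        M[t, toy_char_indices[c]] = 1
    return M

def decode(M):
    """Invert one_hot: the position of the 1 in each row is the character index."""
    return "".join(toy_chars[j] for j in M.argmax(axis=1))

encoded = one_hot("to boldly g")
print(encoded.shape)      # (number of characters, size of alphabet)
print(decode(encoded))
```

The same `argmax` trick, applied to `X[0]` with the full `chars` list, recovers the first 20 characters of the script.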

In [12]:
y[0]

array([42], dtype=int32)

This says that the 21st character (the character after the first 20 characters of the string) is the character with index `42`. 

Now we're ready to perform a train-validation split: 

In [13]:
train_len = int(0.7*X.shape[0])
X_train = X[0:train_len]
X_val = X[train_len:]

y_train = y[0:train_len]
y_val  = y[train_len:]

## Modeling

Model time! We'll use a simple *Long Short-Term Memory* (LSTM) model for this example. LSTMs are one example of *recurrent* neural network layers. Here's a diagram illustrating the schematic functioning of a recurrent layer. 

![](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
*Image credit: [Chris Olah](https://colah.github.io/posts/2015-08-Understanding-LSTMs/), OpenAI*

On the lefthand side, we have a "zoomed out" picture of a recurrent neural network layer. On the righthand side, we see the "zoomed in" version. The key point here is that output $h_2$ depends not only on input $x_2$, but also, indirectly, on inputs $x_0$ and $x_1$. This means that recurrent neural networks are highly suitable for modeling processes that have temporal structure. Text is an example: the last few characters are the "history" of the text. Timeseries data are another clear example, and indeed, we can use a very similar workflow to the one we'll use today in order to do forecasting in timeseries. 
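
The dependence on history can be seen in a few lines of NumPy. The sketch below is a plain recurrent cell with a `tanh` activation, not an LSTM, and its weights are random rather than trained; the sizes and names (`W_x`, `W_h`, `run_rnn`) are assumptions chosen for illustration. It only shows that changing $x_0$ changes $h_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

# random, untrained weights for a plain recurrent cell
W_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def run_rnn(xs):
    """Return the hidden states h_0, h_1, ... for the input sequence xs."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        # the new state mixes the current input with the previous state
        h = np.tanh(x @ W_x + h @ W_h + b)
        states.append(h)
    return np.array(states)

xs = rng.normal(size=(3, input_dim))   # inputs x_0, x_1, x_2
hs = run_rnn(xs)

# perturb only x_0 and recompute
xs_changed = xs.copy()
xs_changed[0] += 1.0
hs_changed = run_rnn(xs_changed)

print("h_2 unchanged after perturbing x_0?", np.allclose(hs[2], hs_changed[2]))
```

An LSTM layer adds gating on top of this recurrence so that information can persist over longer spans, but the basic loop over time steps is the same.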

Since training for this kind of task gets expensive fast, we'll use just one LSTM layer followed by a `Dense` output layer. 

In [14]:
model = tf.keras.models.Sequential([
    layers.LSTM(128, name = "LSTM", input_shape=(max_len, len(chars))),
    layers.Dense(len(chars))        
])
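
As a rough sense of scale, the number of trainable parameters in this model can be worked out by hand. The sketch below assumes the standard Keras LSTM layout with biases enabled: four weight blocks (input, forget, and output gates plus the candidate cell state), each with an input kernel, a recurrent kernel, and a bias. It can be compared against the output of `model.summary()`.

```python
units = 128        # LSTM units, as in the model above
input_dim = 78     # len(chars), from X.shape

# four weight blocks, each with input kernel, recurrent kernel, and bias
lstm_params = 4 * (input_dim * units + units * units + units)

# Dense layer: one weight per (unit, character) pair plus one bias per character
dense_params = units * input_dim + input_dim

print(lstm_params, dense_params, lstm_params + dense_params)
```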

In [15]:
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), 
              optimizer = "adam")

Time for training. We'll do just one epoch for now, mostly just to prove that we've set up our model correctly. 

In [16]:
# code I used to train and save the model
# model.fit(X_train, 
#           y_train,
#           validation_data= (X_val, y_val),
#           batch_size=128, epochs = 200)
# model.save('/content/drive/MyDrive/DS9_model') 

model.fit(X_train, 
          y_train,
          validation_data= (X_val, y_val),
          batch_size=128, epochs = 1)



<keras.callbacks.History at 0x7f88eb6dd290>

Rather than training the entire model live during lecture, I'm instead going to load in a saved model that I previously trained for 200 epochs on Google Colab. On Colab, each epoch takes around 10s or so. 200 epochs corresponds to roughly 30 minutes. 

In [17]:
model = tf.keras.models.load_model('/content/drive/MyDrive/DS9_model')



Generative models define *probability distributions* over the space of possible outputs. So, our overall algorithm is going to generate new text in a partially randomized way. To make this happen, we define a `sample` function which will take the model outputs, turn them into probabilities, and then sample from the probabilities to produce a single character (well, technically, an integer corresponding to a single character). 

An important parameter here is the so-called *temperature* (this terminology comes from statistical physics). When the temperature is high, the model will more frequently choose low-probability characters. This is sometimes interpreted as "creativity," and leads to more unpredictable outputs. When the temperature is low, on the other hand, the model will "play it safe" and tend to stick to known patterns. In the extreme limiting case as the temperature approaches 0, the model will ultimately get stuck in "loops" in which it repeats common phrases over and over again. 

I'm not going to dwell on the math here, but if you're familiar with phrases like "softmax" or "Boltzmann distribution," this code implements and samples from such a distribution using the model predictions.

In [18]:
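def sample(preds, temp):
    
    # format the model predictions
    preds = np.asarray(preds).astype("float64")
    
    # construct normalized Boltzmann with temp
    # (subtracting the max leaves the probabilities unchanged 
    # but keeps np.exp from overflowing at low temperatures)
    scaled = preds / temp
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()
    
    # sample from Boltzmann
    samp = np.random.multinomial(1, probs, 1)
    return np.argmax(samp)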

Note that this function takes in some model predictions and returns a single integer, which we can interpret as a character. 
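
To see the effect of temperature in isolation, the sketch below computes the Boltzmann probabilities for a made-up set of three scores at several temperatures. The helper `boltzmann` and the example `logits` are illustrative and not part of the notes' pipeline.

```python
import numpy as np

def boltzmann(logits, temp):
    """Softmax of logits / temp, computed in a numerically stable way."""
    z = np.asarray(logits, dtype="float64") / temp
    z = z - z.max()          # does not change the result, avoids overflow
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.0]     # hypothetical model scores for three characters

for temp in [0.1, 1.0, 10.0]:
    print(temp, np.round(boltzmann(logits, temp), 3))
```

At the lowest temperature nearly all of the probability sits on the highest-scoring character, while at the highest temperature the three probabilities are close to uniform.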

Now that we know how to sample from the model predictions to create a new character, let's now define a convenient function that will allow us to create entire strings of specified length using this process. There's some index management here that can be a little tricky. 

In [19]:
def generate_string(seed_index, temp, gen_length, model): 
    
    # sequence of integer indices for generated text
    gen_seq = np.zeros((max_len + gen_length, len(chars)))
    
    # first part of the generated indices actually corresponds to the real text
    seed = X[seed_index]
    gen_seq[0:max_len] = seed
    
    # character version
    gen_text = lines[seed_index]
    
    # main loop. 
    # at each stage we are going to get a single 
    # character from the model prediction (with the sample function)
    # and then feed that character BACK into the model as "data"
    # for the next prediction
    for i in range(0, gen_length):
        
        # this corresponds to the part of the generated
        # text that the model can "see"
        window = gen_seq[i: i + max_len]
        
        # get the prediction and sample a single index
        preds = model.predict(np.array([window]))[0]
        next_index = sample(preds, temp)
        
        # add sampled index to the current output
        gen_seq[max_len + i, next_index] = True
        
        # create the string version
        next_char = chars[next_index]
        gen_text += next_char
    
    # only return the string version because that's what we care about
    return(gen_text)
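
The index bookkeeping is easier to follow on a toy example. The sketch below uses plain integer indices instead of one-hot rows, and replaces the trained model with a hypothetical deterministic stand-in, `stub_predict`, so that the sliding window can be traced by hand. The small sizes are chosen only for readability.

```python
import numpy as np

max_len, gen_length, vocab_size = 4, 6, 5

def stub_predict(window):
    # hypothetical stand-in for model.predict followed by sample():
    # deterministically picks an index based on the visible window
    return int(window.sum()) % vocab_size

seed = np.array([1, 2, 3, 4])

# the first max_len slots hold the seed, the rest are filled in one at a time
gen_seq = np.zeros(max_len + gen_length, dtype=int)
gen_seq[0:max_len] = seed

for i in range(gen_length):
    # the window always holds exactly max_len entries, ending just before slot max_len + i
    window = gen_seq[i:i + max_len]
    # write the prediction into the first slot the window has not yet covered
    gen_seq[max_len + i] = stub_predict(window)

print(gen_seq)
```

After the first `max_len` steps the window no longer contains any of the seed, so every later prediction is based entirely on generated output.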

Let's try it out! We'll generate 200 new characters for each string, separating the real seed text from the generated text. We'll also vary the temperature parameter of the `sample()` function, which controls how random the model's predictions can be. 

In [20]:
gen_length = 200
seed_index = 10000

for temp in [0.01, 0.02, 0.03, 0.04, 0.05]:

    gen = generate_string(seed_index, temp, gen_length, model)

    print(4*"-")
    print("TEMPERATURE: " + str(temp))
    print(gen[:-gen_length], end="")
    print(" => ", end = "")
    print(gen[-gen_length:], "")

----
TEMPERATURE: 0.01
tioning. 
KIRA: No p => roblem still and with the system no one kink that I know that. 
ODO: I think you're get your own has now
the replions are anything throks we can to had a moment. 
ODO: I don't take you won, I was just age. 
BASHIR: Now deathsmest we could have become betterany. 
KOVAT: It's Curzoa, I'll be ride this there's all your something our security begnapt sutptends all and I can't seems a Fedenal ship. 
DAX: I've must kind out an actoptation enter) 
KIRA: I'm not sure in trill through the starting duct.  
----
TEMPERATURE: 0.02
tioning. 
KIRA: No p => rourcted to know why it was the station? 
KIRA: I'm not service, Commander, but I can't dectine that that you're going to risk's ready. 
ODO: What don't say her. 
GARAK: I'm sure you don't have a ship. 
DAX: I'm not sure our sensors areved. 
KIRA: They're not a lot for business to say. 
DAX: Light, I was justignten to reques the evencent way to be a Karama. Be'turatiss.) 
k'GYEUS KiAn and Hore]

SISKO

In [41]:
gen_length = 1000
seed_index = 10002

gen = generate_string(seed_index, 0.02, gen_length, model)

In [42]:
import re
cast = set(re.findall(r"[A-Z]+(?=:)",gen))
print("CAST OF CHARACTERS: ", end = "")
print(cast)
print("-"*80)
print(gen)

CAST OF CHARACTERS: {'DAX', 'SISKO', 'ORAT', 'BAREIL', 'KIRA'}
--------------------------------------------------------------------------------
KIRA: No problem. 
DAX: No. 
KIRA: Are you sure? 
(Enter adobs, Ber give an office and go planty starting the would werk to get it and we can be about to the station. 
KIRA: Thanks. I have a nood for the Necold. 
SISKO: I'm not surprised any here. 
(The only wined of realing. 
DAX: It much to Odo,
we had a man long. 
KIRA: Oh, the one whose hime. 
KIRA: I'm sorry, Che Sector warry hele with you. 
KIRA: Are you done that you're not then bus hasteres. 
KIRA: I'm not sure in trill through the starting condicy for the station? 
DAX: Your way. I'll be thrown the station to the station and see if we can tell my half to be a ) 
ORAT: The ollywate, Commander had nothing to talk to the station? 
BAREIL: He's edjonents are yoursely? 
DAX: No. 
KIRA: It's been to compution. 
KIRA: Kira, and no will took about your door around the security to Lemar ifnact 

Let's make a few observations. 

1. First of all, it can take a surprisingly long time to make predictions using our model. This is because we have to call the `predict()` method *for each character*, in order to ensure that the model appropriately takes into account its recent predictions. This can take a pretty long time! 
2. Second, determining a good value for the temperature can take some experimentation. Note that low temperatures don't necessarily correspond to "more realistic" text -- they just correspond to highlighting common patterns in the text, possibly in excess. Higher temperatures also don't necessarily correspond to a "creative" algorithm in any normal sense of the word -- set the temperature too high, and you'll just get gibberish. 

## Specialization

In this case, we were able to create a model for generating Star Trek scripts using an instance of Google Colab in roughly 30 minutes. This model is highly limited. Although it has clearly learned some relevant features of Star Trek scripts, there's no way that you'd mistake the output of the model for an actual script by a screenwriter. Considering how hard this was, imagine how much effort and computational resources are required to create more general language models! Indeed, as highlighted in a [recent and controversial paper](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf), training a single large language model can require energy expenditure comparable to a trans-American flight! 