# textgenrnn 1.1 Demo

by [Max Woolf](http://minimaxir.com)

*Max's open-source projects are supported by his [Patreon](https://www.patreon.com/minimaxir). If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.*

## Intro

textgenrnn is a Python module on top of Keras/TensorFlow which can easily generate text using a pretrained recurrent neural network:

In [1]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.


## Generate Text

The `generate` function generates `n` text documents:

In [2]:
textgen.generate(5)

Some seeing settings for the most of a story of the Nexus 6 comments to be a book in a trip in the Ballshore Price of the Streets of Reddit

What is this crappy? I want to see this in the game. Any ideas to recover the bathroom and this is looking at the day of this game of the most story

[PC] [H] 5 GTX 1080 [W] PayPal

New Culture is in the team for the screenshots of the top of a holiday control and realistically a speaker to the trip in 2017 and there was a super of the world?

TIFU by telling someone please add this with a card



In addition, you can set the `temperature` to modify the amount of creativity (default 0.5; I do not recommend setting it higher than 1.0), set a `prefix` to force the document to start with certain characters and generate the following characters accordingly, and set a `return_as_list` flag (default False) to use the generated texts elsewhere in your application (e.g. as an API).

In [3]:
generated_texts = textgen.generate(n=5, prefix="Trump", temperature=0.2, return_as_list=True)
print(generated_texts)

["Trump is the best thing and I don't know what to do anymore.", 'Trump is a lot of products and some of the subreddits is the best state of the price of the same to stop the only one of the starts and started the dead of the police of the state of the same time in the comments to be a lot of the same time in the state of the game to the season to a community in', 'Trump starting a post and a big state and what is the day?', 'Trump is a bit of the same subreddit and a bad story of the states of the state of the state of the party in the same to the state of the state of the same series is a good day.', 'Trump control the state of the season to the story of the state of the state of the state of the states of the state of the new company that has a state of the story of the state of the state of the season to the state of the state of the story of the state of the state of the state of the same ti']
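
To make the effect of `temperature` more concrete, here is a minimal sketch of temperature-scaled sampling in plain Python. It is not textgenrnn's actual implementation; the helper names (`apply_temperature`, `sample_index`) and the example probabilities are made up for illustration. The idea is that dividing log-probabilities by a temperature below 1.0 sharpens the distribution (safer, more repetitive text), while 1.0 leaves it unchanged.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution (all entries > 0) by a temperature."""
    # Divide log-probabilities by the temperature, then renormalize (softmax).
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


# Hypothetical next-character probabilities, standing in for a model's output.
next_char_probs = [0.5, 0.3, 0.15, 0.05]

for t in (0.2, 0.5, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(next_char_probs, t)])
```

At a temperature of 0.2 almost all of the probability mass moves to the most likely character, which is consistent with the repetitive low-temperature samples in this notebook.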


Using `generate_samples()` is a great way to test the model at different temperatures.

In [4]:
textgen.generate_samples()

####################
Temperature: 0.2
####################
The same car can be a good day!

[Specific] Can someone please remove the state of the same community and send me the state of the most poster is a stream to the computer to the state of the parents in the problems in the same state with a stream on the same controller of the state of the same second of the background of the same

The best state of the same series is a community where the control is to show off the state of the state of the state of the state of the same second things and the most poster with a girl in the results.

####################
Temperature: 0.5
####################
Me irl

An Apparent Confirm is a card straight of the health store and the most beautiful story of the desk today. Need advice to start in the state today.

Why does this sub mean the man like the content was out of the personal. (Artiscouns the box)

####################
Temperature: 1.0
####################
(18f) when your dead spider chil

You may also use `generate_to_file()` to make the generated texts easier to copy/paste to other sources (e.g. blog/social media):

In [5]:
textgen.generate_to_file('textgenrnn_texts.txt', n=5)

## Train on New Text

As shown above, the results from the pretrained model vary greatly, since a lot of data is compressed into a small model. The `train_on_texts` function fine-tunes the model on a new dataset.

In [6]:
texts = ['Never gonna give you up, never gonna let you down',
 'Never gonna run around and desert you',
 'Never gonna make you cry, never gonna say goodbye',
 'Never gonna tell a lie and hurt you']

textgen.train_on_texts(texts, num_epochs=2, gen_epochs=2)

Training on 174 character sequences.
Epoch 1/2
Epoch 2/2
####################
Temperature: 0.2
####################
Never down and gonna gonna gonna you gonnally you cry and gonnalland you gonnallen you cry you gonnallen you and gonna gonna gonna you and gonna let you down

Never down and gonna gonna gonna you and desert you and gonna let yount you

Never down and gonna gonna gonna you cry and gonna gonna gonna you to gonna let you up gonna you gonnally you to gonna gonna you gonnally you to gonna gonna gonna gonna you cry and dennat you and gonna let yount you

####################
Temperature: 0.5
####################
Never dond genny gonnally granglet and gonnall you gonnallen you and gonna gonna year you and gonna let you

Never down and gonna gonna you gonnally you

Never down around you up, and gonna gonna revent you and dungeon crying

####################
Temperature: 1.0
####################
Gay you son nugete in today, and gonna givan>e gonnallacu gonna gonna gonna love you (

Although the network was fine-tuned on only 4 texts, the original network still transfers its latent knowledge of modern grammar and can incorporate that knowledge into generated texts, which becomes evident at higher temperatures or when using a prefix containing a character not present in the new dataset.
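
The "Training on 174 character sequences" line in the log can be reproduced with a small sketch. It assumes (this is an assumption, not confirmed by the notebook) that each text yields one training pair per character plus one pair for an end-of-text marker, with the context limited to the previous `max_length` characters. The function name `char_sequences` and the `<end>` marker are hypothetical. Under that assumption the four lyrics give 170 characters plus 4 end markers, i.e. 174 pairs.

```python
def char_sequences(texts, max_length=40, end_marker="<end>"):
    """Yield (context, next_token) pairs: one per character, plus one end marker per text."""
    for text in texts:
        tokens = list(text) + [end_marker]
        for i, target in enumerate(tokens):
            # The context is at most `max_length` characters preceding the target.
            context = text[max(0, i - max_length):i]
            yield context, target


texts = ['Never gonna give you up, never gonna let you down',
         'Never gonna run around and desert you',
         'Never gonna make you cry, never gonna say goodbye',
         'Never gonna tell a lie and hurt you']

pairs = list(char_sequences(texts))
print(len(pairs))  # total number of (context, next_token) pairs
print(pairs[0])    # empty context, first character as the target
print(pairs[5])    # context 'Never', target is the following space
```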

You can reset a trained model back to the original state by calling `reset()`.

In [7]:
textgen.reset()

Included in the repository is a `hacker_news_2000.txt` file containing a list of the Top 2000 [Hacker News](https://news.ycombinator.com/news) submissions by score. Let's retrain the model using that dataset.

For this example, I will use only a single epoch to demonstrate how easily the model learns with just one pass over the data. In practice, I recommend leaving the default of 50 epochs, or setting it even higher for complex datasets. On my 2016 15" MacBook Pro (quad-core Skylake CPU), the dataset trains at about 1.5 minutes per epoch.

In [8]:
textgen.train_from_file('../datasets/hacker_news_2000.txt', num_epochs=1)

2,000 texts collected.
Training on 83,491 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
How to start to start to make a defallen to the top defenderes

The Firefox and Startup Continument

A Free Reserver

####################
Temperature: 0.5
####################
Tesla is a boa to die

GitHub is now opening to the protest and the internet piece you one on a show in the the sight on a crime to transformanet

How to Favon Problem and Operman News

####################
Temperature: 1.0
####################
Open Mission

"Gomain Freath Boots

How Perrating Interativation Zh.



Now we can create distinctly HN-style titles, even with very little training, thanks to the pretrained nature of textgenrnn:

In [9]:
textgen.generate(5, prefix="Apple")

Apple Dedicter Has to Portal Asked A George To Be Winter About An AP Ask Defenderation

Apple Card Continumes Code

Apple and Hell Care Contersonal Program Consistent of The Port

Apple Honest used a tech probered your browser

Apple says he supports do your favount



Other runtime parameters for `train_on_texts` and `train_from_file` are:

* `num_epochs`: Number of epochs to train for (default: 50)
* `gen_epochs`: Number of epochs to run between generating sample outputs; good for measuring model progress (default: 1)
* `batch_size`: Batch size for training; may want to increase if running on a GPU for faster training (default: 128)
* `train_size`: Proportion of sequence samples, chosen at random, to train on; good for controlling overfitting. The rest are used as the validation set. (default: 1.0/all). To disable the validation pass (for speed), set `validation=False`.
* `dropout`: Proportion of tokens to randomly ignore each epoch. Good for controlling overfitting and making the model more resilient against typos, but setting it too high will cause the network to converge prematurely. (default: 0.0)
* `is_csv`: Use with `train_from_file` if the source file is a one-column CSV (e.g. an export from BigQuery or Google Sheets) for proper quote/newline escaping.
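
For the `is_csv` case, the following sketch shows one way to produce a one-column CSV with proper quote/newline escaping using Python's standard `csv` module. The helper names and the demo file name are hypothetical, and whether the loader expects (and skips) a header row is an assumption you should verify against your textgenrnn version.

```python
import csv
import os
import tempfile


def write_one_column_csv(texts, path, header="text"):
    """Write each text as one quoted CSV row so embedded quotes/newlines survive."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        if header is not None:
            writer.writerow([header])
        for text in texts:
            writer.writerow([text])


def read_one_column_csv(path, header=True):
    """Read the texts back, skipping the first row if it is a header."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if header:
        rows = rows[1:]
    return [row[0] for row in rows if row]


texts = ['He said "hello"', "Line one\nLine two", "plain title"]
path = os.path.join(tempfile.gettempdir(), "textgenrnn_texts_demo.csv")
write_one_column_csv(texts, path)
print(read_one_column_csv(path) == texts)  # round trip preserves quotes and newlines
```

A file written this way could then be passed to `train_from_file` with `is_csv=True`.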

## Save and Load the Model

The model saves the weights automatically after each epoch, or you can call `save()` and give an HDF5 filename. Those weights can then be loaded into a new textgenrnn model by specifying a path to the weights on construction (or use `load()` for an existing textgenrnn object).

In [10]:
textgen_2 = textgenrnn('textgenrnn_weights.hdf5')
textgen_2.generate_samples()

####################
Temperature: 0.2
####################
Google is a broke and down

A Starter in Amazering in a Google Startup

The Web Defines Supports For Conternation

####################
Temperature: 0.5
####################
Good Dropbox Down to die

Playing Source Aid of Adding Program For Operate

Will Claim To So Formate For Founder

####################
Temperature: 1.0
####################
We was direction factiongend fine and GoDake

John no over relied from new Stude

Leaching A “Yeuari Drawn”



In [11]:
textgen.model.get_layer('rnn_1').get_weights()[0] == textgen_2.model.get_layer('rnn_1').get_weights()[0]

array([[ True, True, True, ..., True, True, True],
 [ True, True, True, ..., True, True, True],
 [ True, True, True, ..., True, True, True],
 ...,
 [ True, True, True, ..., True, True, True],
 [ True, True, True, ..., True, True, True],
 [ True, True, True, ..., True, True, True]])

Indeed, the compared weights of the original model and the new model are identical.

You can use this functionality to load models from others which have been trained on larger datasets with many more epochs (and the model weights are small enough to fit in an email!).

In [12]:
textgen = textgenrnn('../weights/hacker_news.hdf5')
textgen.generate_samples(temperatures=[0.2, 0.5, 1.0, 1.2, 1.5])

####################
Temperature: 0.2
####################
A startup’s Firebase bill suddenly increased from $25 to $1750 per month

A Sister’s Eulogy for Steve Jobs

The Website That Got Me Expelled

####################
Temperature: 0.5
####################
How to Sleep

Ask HN: What are some examples of successful single-person businesses?

Solar Roof

####################
Temperature: 1.0
####################
Germanā was asked a disappeared to dismay explosions to just work

“Gat Vueal Cities, Weakphad

The New Internet Talk

####################
Temperature: 1.2
####################
In Iraq, for jhilious backg program in a comparn

Solar Rook Point: The NSA-indiepatorogamerace

Ask.jecket shilk

####################
Temperature: 1.5
####################
Nodelover Misses wringless for ununticed processing on Ach destroyer is a watched sysemer 

Police Leeh was ruins to work RTA-a litrame related

A $10K tiny homepage in RSH galace



## Training a New Model

You can train a new model using any modern RNN architecture you want by calling `train_new_model` if supplying texts, or by adding a `new_model=True` parameter if training from a file. If you do, the model will save a `config` file and a `vocab` file in addition to the weights, and those must also be loaded into a new `textgenrnn` instance.

The config parameters available are:

* `word_level`: Whether to train the model at the word level (default: False)
* `rnn_layers`: Number of recurrent LSTM layers in the model (default: 2)
* `rnn_size`: Number of cells in each LSTM layer (default: 128)
* `rnn_bidirectional`: Whether to use Bidirectional LSTMs, which account for sequences both forwards and backwards. Recommended if the input text follows a specific schema. (default: False)
* `max_length`: Maximum number of previous characters/words to use before predicting the next token. This value should be reduced for word-level models (default: 40)
* `max_words`: Maximum number of words (by frequency) to consider for training (default: 10000)
* `dim_embeddings`: Dimensionality of the character/word embeddings (default: 100)
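
As a quick reference, the defaults listed above can be collected into a dictionary and merged with overrides. This is only an illustrative sketch: `DEFAULT_CONFIG` and `build_config` are hypothetical names, and the layout of the config file that textgenrnn actually saves is not shown in this notebook.

```python
import json

# Defaults as listed above; the real config file's exact layout is an assumption.
DEFAULT_CONFIG = {
    "word_level": False,
    "rnn_layers": 2,
    "rnn_size": 128,
    "rnn_bidirectional": False,
    "max_length": 40,
    "max_words": 10000,
    "dim_embeddings": 100,
}


def build_config(**overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


# A 64-cell bidirectional model with 300-dimensional embeddings.
config = build_config(rnn_bidirectional=True, rnn_size=64, dim_embeddings=300)
print(json.dumps(config, indent=2))
```

Rejecting unknown keys catches misspelled parameter names early instead of silently training with the defaults.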

You can also specify a `name` when creating a textgenrnn instance which will help name the output weights/config/vocab appropriately.

In [13]:
textgen = textgenrnn(name="new_model")

In [14]:
textgen.reset()
textgen.train_from_file('../datasets/hacker_news_2000.txt',
 new_model=True,
 rnn_bidirectional=True,
 rnn_size=64,
 dim_embeddings=300,
 num_epochs=1)

textgen.model.summary()  # summary() prints the table itself and returns None

2,000 texts collected.
Training new model w/ 2-layer, 64-cell Bidirectional LSTMs
Training on 83,491 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
A Stork HN: A Scroder the Introder and the for and for mage a the from dead sourcing from for croder a contion coner the from coner source

Chomer the Introder a decricting the and dercroder

Ander Crode Stor Start Staker Stor Storter

####################
Temperature: 0.5
####################
Google Deackic lame in I eried diter from from for dust for loce a coner the compler

A in Ancricting from ine gepleater

Engingractome to Beam Start the accent conertion from stict and conting honer and muniter sagers funce

####################
Temperature: 1.0
####################
I webshs FrrMack A PSRoerraubtal the Thinsed

Ja ComestoulR, mucting shespuled compernuctions iplionta sitok

Enot wart Daters


In [15]:
textgen_2 = textgenrnn(weights_path='new_model_weights.hdf5',
 vocab_path='new_model_vocab.json',
 config_path='new_model_config.json')

textgen_2.generate_samples()

####################
Temperature: 0.2
####################
An Corentict the wath and dester the and from deaction

A Inderer from conter died deand and deresiction

A Croder Introder a the wath a the wash and and and deacting gand a deard daser and deread and deand degring the source

####################
Temperature: 0.5
####################
Andorac Appler Fand Stolk Intricer Stor

Thot HN: Wak Cade to Stork Intreed the and a for darnes of frond the what a counting enerningers

Lict The Scrabe Stor New

####################
Temperature: 1.0
####################
SOvarmezSQLP Cutiders HTest Epane Bat Speare now

I CChlod's &tites Show HN: S halt Gefl, CPo”

V. from.0 act, a giment midiodup Snters you 19 peflity fram gocapes cnaninald for hnowser daing xpoict a frel legiens equm deer’s, ronet causter



## Train on Single Large Text

Although textgenrnn is intended to be trained on text documents, you can train it on a large text block using `train_from_largetext_file` (which loads the entire file and processes it as if it were a single document) and it should work fine. This is akin to more traditional char-rnn tutorials.

Training a new model is recommended (and is the default). When calling `generate`, you may want to increase the value of `max_gen_length`.

In [16]:
from keras.utils import get_file

fulltext_path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

textgen.reset()
textgen.train_from_largetext_file(fulltext_path, new_model=True, num_epochs=1)

Training new model w/ 2-layer, 128-cell LSTMs
Training on 600,852 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
self-desire of the stronger of the most of the strong successions of the such and so such a self--and which have any the strength of the strong and in the sense of the longer and which is a self-sense of the successions of the specilar of the specilar of the strong and sense of the strength, and sup

f-superious nature of the stronger and self-superious of the superious and so success of the specilar of the strength and stronge of the sacribility of the sense of the strength and in the strength and in the strength, and sense of the stronger success of the spections of the species and self-desire

ny success of the sacrifice of the strength and stronger of the stronger of the stronger to the sacrifice of the specilar of the sacrifice of the sense of the specilar of the specilar of the superious of the sense of the specilar of the str

Training on text at the word level works great, although it's strongly recommended to reduce the `max_length` and `max_gen_length` during training!

In [17]:
textgen.reset()
textgen.train_from_largetext_file(fulltext_path, new_model=True, num_epochs=1,
 word_level=True,
 max_length=10,
 max_gen_length=50,
 max_words=5000)

Training new model w/ 2-layer, 128-cell LSTMs
Training on 131,763 word sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
- the the of the truth . the

 has not a thing of the fact that which is
to be a thing of the " truth " " man " " - - " the " " man " " is a
" modern ideas "

- the the of the " soul " is a

 of the " modern ideas , " " and " "
" modern ideas , " " and " " modern ideas , " " and "
" modern ideas , " " and " "

- the of the soul . = - - the
most sense of the world , which is the most
of the " modern ideas , " " " the " " modern ideas , "
the " spirit " " and " " modern ideas ,

####################
Temperature: 0.5
####################
a man . = - - - - the first of the
sense of the value of the world , which must be
very much , the , in all , as the
of the " modern ideas , " " " - - " this ,

- the and that the old knowledge has the

 of his head and his own will to be
with the same time , as the problem is at
the same time that the most
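
To illustrate what `word_level=True` and `max_words` imply, here is a rough sketch of word-level tokenization with a frequency-capped vocabulary. It is not textgenrnn's tokenizer: the regular expression, the helper names, and the choice of id 0 for out-of-vocabulary tokens are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens, max_words):
    """Keep the `max_words` most frequent tokens and map each to an integer id."""
    most_common = Counter(tokens).most_common(max_words)
    # Id 0 is reserved here for out-of-vocabulary tokens (an illustrative choice).
    return {token: idx for idx, (token, _) in enumerate(most_common, start=1)}


def encode(tokens, vocab):
    """Replace each token with its id, or 0 if it is not in the vocabulary."""
    return [vocab.get(token, 0) for token in tokens]


sample = 'The "modern ideas" of the soul, and the spirit of the world.'
tokens = tokenize(sample)
vocab = build_vocab(tokens, max_words=5)
print(tokens)
print(vocab)
print(encode(tokens, vocab))
```

Because punctuation marks become tokens of their own, frequent ones such as quotation marks take up vocabulary slots, which may help explain the runs of `"` in the word-level samples above.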

# LICENSE

MIT License

Copyright (c) 2018 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.