{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3-final" }, "orig_nbformat": 2, "kernelspec": { "name": "Python 3.8.3 64-bit", "display_name": "Python 3.8.3 64-bit", "metadata": { "interpreter": { "hash": "0de36b31320ba4c88b4f85a74724f3d16c36a44df48581253710b1065e752d9e" } } } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Training a Nerual Network to generate Trump Nicknames\n", "\n", "The data was scraped directly from [Wikipedia from this link here](https://en.wikipedia.org/wiki/List_of_nicknames_used_by_Donald_Trump). The data was then cleaned and analyzed by me, [click this link here to see that analysis](https://github.com/Alexander-Kahanek/trump_nickname_gen/blob/main/intial_analysis.ipynb). \n", "\n", "This notebook is going to be a word-based approach to nickname generation. We will be going through the preprocessing required to create affix embeddings, tokenize, encode, decode, and finally train the Keras LSTM neural network. \n", "\n", "I also walk through this process [utilizing a character-based approach](https://github.com/Alexander-Kahanek/trump_nickname_gen/blob/main/2_character.nn.gen.ipynb); however, the results were hilariously bad.\n", "\n", "# Grabbing the data" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " fake name real name len fake len real \\\n", "0 dumbo randolph tex alles 1 3 \n", "1 wheres hunter hunter biden 2 2 \n", "2 1% joe joe biden 2 2 \n", "3 basement joe joe biden 3 2 \n", "4 beijing joe joe biden 3 2 \n", "\n", " category \\\n", "0 domestic political figures \n", "1 domestic political figures \n", "2 domestic political figures \n", "3 domestic political figures \n", "4 domestic political figures \n", "\n", " notes count \n", "0 director of the united states secret service 1 \n", "1 american lawyer and lobbyist who is the second... 1 \n", "2 47th vice president of the united states; form... 1 \n", "3 47th vice president of the united states; form... 1 \n", "4 47th vice president of the united states; form... 1 " ], "text/html": "
" }, "metadata": {}, "execution_count": 1 } ], "source": [ "# data manip\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# model building\n", "from keras.models import Sequential\n", "from keras.layers import LSTM, Dense\n", "from keras.callbacks import LambdaCallback\n", "\n", "# load model\n", "from tensorflow import keras\n", "\n", "raw = pd.read_csv('data/cleaned.nicknames.csv')\n", "\n", "raw.head()" ] }, { "source": [ "Here is the data we are working with. We have the nicknames (fake name) and the corrosponding real name of the individual the nickname was given to by trump. We also have a few other columns, however those will not matter for this task.\n", "\n", "# Creating affix embeddings and tokenizing\n", "\n", "Here we need to seperate each word and add a tag. We will be adding a few created tags:\n", "\n", "+ real names:\n", " - these will be given a tag for each word in the name, for example,\n", " + `\"joe biden\"` will become `\"joe biden \"`\n", "\n", "+ nicknames:\n", " - these follow the rule that if a real name is in the nickname, it will be replaced with the corrosponding real name tag.\n", " + ie, `\"basement joe hidin\"` will become `\"basement \"`\n", " - a `` tag will be added to every other word before the real name substitution, and a `` tag to the words after. A `` tag will be added to the substitution itself.\n", " - if there are no substitutions, then a `` tag will be added to each word.\n", " - the final nickname would be `\"basement \"`\n", "\n", "-----------------------\n", "\n", "These modifications will be made because we will use the generated `` tags to use from the user input to directly replace into the generated output. For example, if a user inputs `\"Joe Biden\"`, and the generated name follows `\" \"` the generation algorithm can substitute the users `` with `joe` in this example, so that all we need to predict is the `` tag. Although the model will need to predict which name index to pull from the name tag. One fun issue for us is the outputted suffix tag is not truly always a true suffix. (i.e, a set of embeddings can be predicted as `\" \"`, etc.) This will help with training, as we only need to predict the length of the nickname, then the set of tags that follow, then finally predict the tokens to convert the embeddings to a real word. For example, if we predict a length 4 name as the following embeddings `\" \"`, then we can generate the best real word for each category. Meaning our generated nickname will end up with two prefix tokens, a name tag, and a suffix token. Then we replace the name tag with the users corresponding name token. This helps to generalize our model and keep the outputs quite random, which gives us more interesting and varied generated nicknames." 
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "real name | nickname\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "[('joe biden ',\n", " 'sleepy creepy '),\n", " ('joe biden ', 'slow '),\n", " ('joe biden ', ' hiden '),\n", " ('joe biden ', 'obiden '),\n", " ('michael bloomberg ',\n", " 'little ')]" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "def tokenize(realname, nickname):\n", " '''tokenizes and adds affix embeddings to the real name and nickname, then returns a tuple of both names fully tokenized and embedded as strings'''\n", "\n", " # get a dictionary for the real name and corrospoinding token, ie input\n", " real2token = {word: f'' for X, word in enumerate(realname.split(' '))}\n", " # convert dictionary to word tokenized groups and join into single string\n", " real_tokenized = ' '.join([f'{word} {real2token[word]}' for word in real2token])\n", "\n", " # change nickname into single string with tokenization and substitution\n", " # grab names to substitute\n", " subs = [sub for sub in realname.split(' ') if sub in nickname]\n", "\n", " if len(subs) == 0:\n", " # then there are no splits, tokens are \n", " nick_tokenized = ' '.join([f'{word} ' for word in nickname.split(' ')])\n", " return (real_tokenized, nick_tokenized)\n", "\n", " substituted = ' '.join([word if word not in subs else f'{real2token[word]}' for word in nickname.split(' ')])\n", "\n", " tokens = ['', '', '']\n", " index = 0\n", " tokenized = []\n", "\n", " for word in substituted.split(' '):\n", " if 'name' in word:\n", " index = 2\n", "\n", " tokenized.append(f'{word} {tokens[index]}')\n", " if index == 2:\n", " index = 1\n", " \n", " nick_tokenized = ' '.join(tokenized)\n", "\n", " return (real_tokenized, nick_tokenized)\n", "\n", "\n", "tokenized_names = [tokenize(realname, nickname) for i, nickname, realname in raw[['fake name', 'real name']].itertuples()]\n", "\n", "print('real name | nickname')\n", "tokenized_names[10:15]" ] }, { "source": [ "Okay, not that everything is tokenized, we can start to vectorize it!\n", "\n", "# Vectorizing the tokenized names\n", "\n", "To vectorize this we need to define our vocabulary, then create a matrix for each name. The matrix will have the maximum token length as rows, and the total vocab words as columns. Our resulting input will be a list of matrices, where each matrix is a defined name. The name will have a 1 in the column of token it is (which is the index for the vocabulary tokens), on the row for the index of the token." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[0. 0. 0. ... 0. 0. 0.]\n [1. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]\n ...\n [0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 
0.]]\n" ] } ], "source": [ "##### get probabilities of name length #####\n", "name_matrix = raw.groupby(['len real', 'len fake']).agg({'count':'count'}).reset_index()\n", "name_matrix = name_matrix[name_matrix['len real'] == 2]\n", "p_gen_len = name_matrix[['len fake']]\n", "p_gen_len['p'] = name_matrix['count']/name_matrix['count'].sum()\n", "\n", "##### get vocab ######\n", "# flatten paired list to get all names\n", "flatten = [name for pair in tokenized_names for name in pair]\n", "# make sure the created tokens go first in the vocad dictionaries\n", "flatten[:0] = ['', '', '', '', '', '', '', '', '', '']\n", "\n", "###### get dictionaries #########\n", "# get dictionary of nicknames to\n", "nick2real = {nickname:realname for (realname, nickname) in tokenized_names}\n", "# get dictionaries for vocab\n", "uni_tokens = {token:0 for name in flatten for token in name.split(' ')}\n", "# dictionary for token to index\n", "token2i = {token:i for i, token in enumerate(uni_tokens)}\n", "# dictionary for index to token\n", "i2token = {token2i[token]: token for token in token2i}\n", "# dictionary for index to real name token\n", "i2name = {token2i[word]:word for pair in tokenized_names for word in pair[0].split(' ') if '<' not in word}\n", "\n", "######### math ##########\n", "# find total number of nicknames\n", "n = len(tokenized_names)\n", "# find max number of tokens, ie, rows in matrix\n", "max_tokens = max([len(name.split(' ')) for name in flatten])\n", "# find total number of tokens, ie columns in matrix\n", "token_dimensions = len(i2token)\n", "\n", "##### get matricies ########\n", "# set up vectors for output = nicknames\n", "output = np.zeros((n, max_tokens, token_dimensions))\n", "# set up vectors for label = real names\n", "input = np.zeros((n, max_tokens, token_dimensions))\n", "\n", "#### vectorize names #######\n", "for i, nickname in enumerate(nick2real):\n", " # input and output assignment\n", " for row, token in enumerate(nickname.split(' ')):\n", " output[i, row, token2i[token]] = 1\n", " input[i, row, token2i[token]] = 1\n", "\n", "\n", "print(output[1])\n", " \n", "\n" ] }, { "source": [ "Awesome, here we can see a vectorized word! It doesn't look like much because the dimensions are very large, but there is a single 1 at every token index for the row of the token. For example, row 2 in this matrix says the first word is a prefix.\n", "\n", "From here we need to build a few functions for the model!\n", "\n", "# Building functions for the model\n", "\n", "Since this isnt a traditional training, we need to evaluate our model by actually generating nicknames. To do this, I will build a generate function and generate a few names at every nth epoch.\n", "\n", "This generation function acts as our decoder. In essense I will use a few logical rules to ensure the quality of output from the model, as we had too few training data and vocabulary to build good distributions. 
The idea is to:\n", "\n", "+ predict the length of the nickname to generate (n)\n", "+ create a dictionary and tokenize the user's real-name input\n", "+ predict n affix embeddings\n", "+ predict real word tokens from those affix embeddings\n", "    - uses all the previous words and embeddings to make its prediction\n", "    - hardcoded to not predict two identical name tags consecutively\n", "        + i.e., no `\"<name_0> <name_0>\"`\n", "+ returns the full, de-embedded nickname" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def generate_name(model, input):\n", "    '''generates a nickname based on a sampled length and the real name input given'''\n", "\n", "    word_vec = np.zeros((1, max_tokens, token_dimensions))\n", "\n", "    p_len = p_gen_len['p']\n", "    gen_len = p_gen_len.loc[np.random.choice(range(len(p_len)), p=p_len),'len fake']\n", "    # print(gen_len)\n", "    # get a dictionary for the real name and corresponding token, ie input\n", "    token2input = {f'<name_{X}>': word for X, word in enumerate(input.split(' '))}\n", "    # convert dictionary to word tokenized groups and join into single string\n", "    input_tokenized = ' '.join([f'{token2input[token]} {token}' for token in token2input])\n", "    nName = 0 # tracks the number of name affixes given\n", "\n", "    affixs = []\n", "    for index in range(gen_len):\n", "        index = (index+1)*2 -1\n", "\n", "        # pull probabilities for each affix\n", "        p_affix = list(model.predict(word_vec)[0,index])[0:4]\n", "        # normalize probabilities\n", "        p_affix_norm = p_affix / np.sum(p_affix)\n", "        # guess an affix\n", "        guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm)\n", "        affix = i2token[guess]\n", "\n", "        # check affix to make sure names are not chosen more than the given user amount\n", "        if 'name' in affix:\n", "            if nName < len(token2input):\n", "                nName += 1\n", "            else:\n", "                while 'name' in affix:\n", "                    guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm) # choose an affix\n", "                    affix = i2token[guess]\n", "        # assign value to generated name vector\n", "        word_vec[0, index, guess] = 1\n", "        # print(f'p={p_affix_norm}, g={i2token[guess]}, i={index}')\n", "        affixs.append(affix)\n", "\n", "    def predict_token(index, affix):\n", "        '''generates a token based on the given affix'''\n", "        # print(f'g={guess}, i={index}')\n", "        if 'name' in affix:\n", "            # pull probabilities for all names: index 4 - 4+given amount (max ind 9)\n", "            p_name = list(model.predict(word_vec)[0,index])[4:4+len(token2input)]\n", "            p_name_norm = p_name / np.sum(p_name)\n", "            # generate name token\n", "            guess = np.random.choice(range(4, len(p_name_norm)+4), p=p_name_norm)\n", "\n", "            # prevent repeat name tokens\n", "            while word_vec[0,index-2,guess] == 1.0:\n", "                guess = np.random.choice(range(4, len(p_name_norm)+4), p=p_name_norm)\n", "                # print(guess, len(p_name_norm)+4, len(token2input))\n", "            # print(i2token[guess], input)\n", "            token = i2token[guess]\n", "            word = token2input[token]\n", "\n", "        else:\n", "            # affix is prefix, suffix, or nope\n", "            # grab probabilities of all non-affix, non-name-tag tokens: 10 - end\n", "            p_tokens = list(model.predict(word_vec)[0,index])[10:]\n", "            p_tokens_norm = p_tokens / np.sum(p_tokens)\n", "            # generate a guess from probabilities\n", "            guess = np.random.choice(range(10, len(p_tokens_norm)+10), p=p_tokens_norm)\n", "\n", "            # prevent named vocab from being chosen and repeat tokens\n", "            while guess in i2name or word_vec[0,index-2,guess] == 1.0:\n", "                # print(i2token[guess])\n", "                guess = 
np.random.choice(range(10, len(p_tokens_norm)+10), p=p_tokens_norm)\n", "            word = i2token[guess]\n", "\n", "        word_vec[0, index, guess] = 1\n", "        return word\n", "\n", "    # print(f'{input}: {affixs}')\n", "    tokens = [predict_token(i*2, affix) for i, affix in enumerate(affixs)]\n", "    return \" \".join(tokens)\n", "\n", "\n", "def generate_name_loop(epoch, _):\n", "    '''tells the model when to generate names during training'''\n", "    if epoch % (nEpochs//5) == 0 or epoch == nEpochs-1:\n", "\n", "        print(f'Nicknames generated on epoch {epoch}')\n", "\n", "        for i, name in enumerate(['karen','mitch mcconnel','dwayne the rock johnson']):\n", "            print(f'{name} | {generate_name(model, name)}')\n", "\n", "        print('-------------')" ] }, { "source": [ "Awesome, now that those functions are done, let's move on to building and training the model. I am going to use a single LSTM layer followed by a softmax output layer, trained with the Adam optimizer.\n", "\n", "# Training the model\n", "\n", "Let's create and train our model!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Nicknames generated on epoch 0\n", "karen | karen lockheed cutie flailer\n", "mitch mcconnel | sour pakistani tropics\n", "dwayne the rock johnson | flailer dwayne morning\n", "-------------\n", "Nicknames generated on epoch 200\n", "karen | schitt guy edge michigan\n", "mitch mcconnel | lying original half\n", "dwayne the rock johnson | pakistani favorite\n", "-------------\n", "Nicknames generated on epoch 400\n", "karen | sick puppy man michigan dumbo\n", "mitch mcconnel | wacky woman edge flunkie cbs\n", "dwayne the rock johnson | johnson original canada\n", "-------------\n", "Nicknames generated on epoch 600\n", "karen | 0% karen congresswoman half\n", "mitch mcconnel | 41 from\n", "dwayne the rock johnson | goofball woman guy\n", "-------------\n", "Nicknames generated on epoch 800\n", "karen | dicky karen\n", "mitch mcconnel | high mitch mcconnel\n", "dwayne the rock johnson | sick dwayne the\n", "-------------\n", "Nicknames generated on epoch 999\n", "karen | half karen\n", "mitch mcconnel | wheres mitch mcconnel\n", "dwayne the rock johnson | michigan dwayne the\n", "-------------\n", "INFO:tensorflow:Assets written to: models/word/model.bs60.e1000.nL300.output\\assets\n" ] } ], "source": [ "# model attributes\n", "nBatch = 60 # number of samples per training batch\n", "nEpochs = 1000 # number of epochs to train\n", "nUnits = 300 # number of units in the LSTM layer\n", "\n", "model = Sequential()\n", "# add LSTM layer to model\n", "model.add(LSTM(nUnits, input_shape=(max_tokens, token_dimensions), return_sequences=True))\n", "# add a dense softmax layer over the token vocabulary\n", "model.add(Dense(token_dimensions, activation='softmax'))\n", "model.compile(loss='categorical_crossentropy', optimizer='adam')\n", "# generate names on epochs\n", "name_generator = LambdaCallback(on_epoch_end = generate_name_loop)\n", "\n", "model.fit(output, input, batch_size=nBatch, epochs=nEpochs, callbacks=[name_generator], verbose=0)\n", "\n", "model.save(f\"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output\")" ] }, { "source": [ "Now that the model is trained, we can get a sense of how the name generation progressed at each epoch. 
However, we also want a broader look at how well the model does with other inputs and more generated names.\n", "\n", "# Evaluating the model\n", "\n", "Since there is no true measure for these nicknames, we will just need to manually look them over and decide which models to keep." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "model is loaded\n", "-----------------\n", "karen | texas karen moonbeam\n", "karen | gov karen wannabe\n", "karen | american karen bozo canada wannabe\n", "karen | karen apple canada\n", "karen | sneaky karen crazy\n", "-----------------------\n", "Link | china Link moonbeam\n", "Link | shady creepy\n", "Link | cheatin Link frankenstein sham wannabe\n", "Link | obiden Link\n", "Link | husband Link tropics\n", "-----------------------\n", "Donald trump | sick Donald trump lockheed\n", "Donald trump | and Donald trump\n", "Donald trump | leakin Donald\n", "Donald trump | pakistani Donald trump bozo\n", "Donald trump | wheres Donald\n", "-----------------------\n", "Joe Biden | guy Joe Biden flakey\n", "Joe Biden | sham Joe\n", "Joe Biden | no Biden Joe moonbeam\n", "Joe Biden | rocket Biden Joe\n", "Joe Biden | and professor flakey 44\n", "-----------------------\n", "mitch mcconnel | congresswoman mitch mcconnel\n", "mitch mcconnel | chin mitch mcconnel\n", "mitch mcconnel | low energy crazy\n", "mitch mcconnel | wheres mitch\n", "mitch mcconnel | slippery puppy tropics professor puppy\n", "-----------------------\n", "alexandria ocasio cortez | little cortez\n", "alexandria ocasio cortez | sick ocasio alexandria\n", "alexandria ocasio cortez | lying ocasio\n", "alexandria ocasio cortez | cryin alexandria\n", "alexandria ocasio cortez | psycho alexandria ocasio\n", "-----------------------\n", "dwayne the rock johnson | slippery dwayne the dwayne\n", "dwayne the rock johnson | little dwayne moonbeam the sham\n", "dwayne the rock johnson | a dwayne\n", "dwayne the rock johnson | fat dwayne the\n", "dwayne the rock johnson | deranged dwayne the rock\n", "-----------------------\n" ] } ], "source": [ "\n", "model = keras.models.load_model(f\"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output\")\n", "names = 'karen,Link,Donald trump,Joe Biden,mitch mcconnel,alexandria ocasio cortez,dwayne the rock johnson'.split(',')\n", "\n", "\n", "# model = keras.models.load_model(f\"models/word/keep/model.bs30.e500.nL50.output\")\n", "\n", "print('model is loaded')\n", "print('-----------------')\n", "print(f'model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output')\n", "for name in names:\n", "    for i in range(5):\n", "        print(f'{name} | {generate_name(model, name)}')\n", "    print('-----------------------')" ] }, { "source": [ "The results are pretty good! Nothing spectacular here, but that is part of the charm. This is a pretty unconventional method for text generation, especially with such a tiny amount of data. Without a good quantifiable metric, I would give this model a solid 7/10 on nickname generation, but a 9/10 on creativity.\n", "\n", "Let's save the model so it can later be imported into the Elixir Web Application.\n", "\n", "# Save the model\n", "\n", "Run the cell below if we want to save the model into our keep folder."
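, "\n", "\n", "For reference, reloading the kept model later only needs the same path pattern (a minimal sketch, assuming the variables from the training cell are still defined):\n", "\n", "```python\n", "kept_model = keras.models.load_model(f\"models/word/keep/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output\")\n", "print(generate_name(kept_model, 'karen'))\n", "```"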
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "INFO:tensorflow:Assets written to: models/word/keep/model.bs60.e1000.nL300.output\\assets\n" ] } ], "source": [ "# us this if want to save model in good\n", "model.save(f\"models/word/keep/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output\")" ] } ] }