{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
## Introduction
\n", "As part of a project course in my second semester, we were tasked with building a system of our chosing that encorporated or showcased any of the Computational Intelligence techniques we learned about in class. For our project, we decided to investigate the application of Recurrent Nueral Networks to the task of building a Subreddit recommender system for Reddit users. In this post, I outline some of the implementation details of the final system. A minimal webapp for the final model can be interacted with [here,](http://ponderinghydrogen.pythonanywhere.com/) The final research paper for the project can be found [here](http://cole-maclean.github.io/blog/files/subreddit-recommender.pdf) and my collaboraters on the project are Barbara Garza and Suren Oganesian. The github repo for the project can be found [here](https://github.com/cole-maclean/MAI-CI) with this jupyter notebook being [here.](https://github.com/cole-maclean/MAI-CI/blob/master/notebooks/blog%20post.ipynb)\n", "\n", "![title](spez.PNG)\n", "\n", "
## Model Hypothesis
\n", "\n", "The goal of the project is to utilize the sequence prediction power of RNN's to predict possibly interesting subreddits to a user based on their comment history. The hypothesis of the recommender model is, given an ordered sequence of user subreddit interactions, patterns will emerge that favour the discovery of paticular new subreddits given that historical user interaction sequence. Intuitively speaking, as users interact with the Reddit ecosystem, they discover new subreddits of interest, but these new discoveries are influenced by the communities they have previously been interacting with. We can then train a model to recognize these emergent subreddit discoveries based on users historical subreddit discovery patterns. When the model is presented with a new sequence of user interaction, it \"remembers\" other users that historically had similiar interaction habits and recommends their subreddits that the current user has yet to discover. \n", " \n", "This sequential view of user interaction/subreddit discovery is similiar in structure to other problems being solved with the use of Recurrent Neural Networks, such as [Character Level Language Modelling](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [Automatic Authorship Detection](http://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html). Due to the successes of these similiarly structured problems, we have decided to explore RNN models for the subbreddit Recommendator System.\n", "\n", "
## The Data
\n", "The secret sauce in any machine learning system, we need data. Reddit provides a convenient API for scrapping its public facing data, and the python package [PRAW](https://praw.readthedocs.io/en/latest/) is a popular and well documented wrapper that we used in this project. With the aim of developing sequences of user subreddit interactions, all we need for our raw data is a list of 3-tuples in the form [username,subreddit,utc timestamp]. The following script provides a helper function to collect and store random user comment data from Reddit's streaming 'all' comments. Note that PRAW authentication config data needs to be stored in a file named 'secret.ini' with: \n", "[reddit] \n", "api_key: key \n", "client_id: id \n", "client_api_key: client key \n", "redirect_url: redir url \n", "user_agent: subreddit-recommender by /u/upcmaici v 0.0.1 " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import praw\n", "import configparser\n", "import random\n", "import pandas as pd\n", "import numpy as np\n", "import sys\n", "\n", "#Import configuration parameters, user agent for PRAW Reddit object\n", "config = configparser.ConfigParser()\n", "config.read('secrets.ini')\n", "\n", "#load user agent string\n", "reddit_user_agent = config.get('reddit', 'user_agent')\n", "client_id = config.get('reddit', 'client_id')\n", "client_secret = config.get('reddit', 'client_api_key')\n", "\n", "#main data scrapping script\n", "def scrape_data(n_scrape_loops = 10):\n", " reddit_data = []\n", " #initialize the praw Reddit object\n", " r = praw.Reddit(user_agent=reddit_user_agent,client_id = client_id,client_secret=client_secret) \n", " for scrape_loop in range(n_scrape_loops):\n", " try:\n", " all_comments = r.get_comments('all')\n", " print (\"Scrape Loop \" + str(scrape_loop))\n", " for cmt in all_comments:\n", " user = cmt.author \n", " if user:\n", " for user_comment in user.get_comments(limit=None):\n", " reddit_data.append([user.name,user_comment.subreddit.display_name,\n", " user_comment.created_utc])\n", " except Exception as e:\n", " print(e)\n", " return reddit_data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Version 3.5.0 of praw is outdated. 
Version 4.3.0 was released Thursday January 19, 2017.\n", "Scrape Loop 0\n", "Scrape Loop 1\n", "Scrape Loop 2\n", "Scrape Loop 3\n", "Scrape Loop 4\n", "Scrape Loop 5\n", "Scrape Loop 6\n", "Scrape Loop 7\n", "Scrape Loop 8\n", "Scrape Loop 9\n" ] } ], "source": [ "raw_data = scrape_data(10)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collected 158914 comments\n" ] }, { "data": { "text/plain": [ "[['Illuminate1738', 'MapPorn', 1486680909.0],\n", " ['Illuminate1738', 'MapPorn', 1486471452.0],\n", " ['Illuminate1738', 'nova', 1486228887.0],\n", " ['Illuminate1738', 'nova', 1485554669.0],\n", " ['Illuminate1738', 'nova', 1485549461.0],\n", " ['Illuminate1738', 'MapPorn', 1485297397.0],\n", " ['Illuminate1738', 'ShitRedditSays', 1485261592.0],\n", " ['Illuminate1738', 'ShittyMapPorn', 1483836164.0],\n", " ['Illuminate1738', 'MapPorn', 1483798990.0],\n", " ['Illuminate1738', 'MapPorn', 1483503268.0]]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Collected \" + str(len(raw_data)) + \" comments\")\n", "raw_data[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## Data Munging
\n", "We need to parse the raw data into a structure consumpable by a supervised learning algorithm like RNN's. First we build a model vocabulary and ditribution of subreddit popularity from the collect raw data. We use this to build the training dataset, the subreddit interaction sequence for each user, ordered and then split into chunks representing different periods of Reddit interaction and discovery. From each chunk, we can randomly remove a single subreddit from the interaction as the \"discovered\" subreddit and use it as our training label for the interaction sequences. This formulation brings with it a hyperparameter that will require tuning, namely the sequence size of each chunk of user interaction periods. The proposed model utilizes the distribution of subreddits existing in the dataset to weight the random selection of a subreddit as the sequence label, which gives a higher probability of selection to rarer subreddits. This will smoothen the distribution of training labels across the models vocabulary of subreddits in the dataset. Also, each users interaction sequence has been compressed to only represent the sequence of non-repeating subreddits, to eliminate the repeatative structure of users constantly commenting in a single subreddit, while providing information of the users habits in the reddit ecosystem more generally, allowing the model to distinguish broader patterns from the compressed sequences." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def chunks(l, n):\n", " n = max(1, n)\n", " return (l[i:i+n] for i in range(0, len(l), n))\n", "\n", "def normalize(lst):\n", " s = sum(lst)\n", " normed = [itm/s for itm in lst]\n", " normed[-1] = (normed[-1] + (1-sum(normed)))#pad last value with what ever difference neeeded to make sum to exactly 1\n", " return normed" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocab size = 3546\n" ] } ], "source": [ "\"\"\"This routine develops the models vocabulary and vocab_probs is also built, representing the inverse probability \n", "of encounting a paticular subreddit in the given dataset, which is then used to bias the selection of rarer\n", "subreddits as labels to \n", "smooth the distribution of training labels across all subreddits in the vocabulary\"\"\"\n", "\n", "df = pd.DataFrame(raw_data,columns=['user','subreddit','utc_stamp'])\n", "train_data = None#free up train_data memory\n", "vocab_counts = df[\"subreddit\"].value_counts()\n", "tmp_vocab = list(vocab_counts.keys())\n", "total_counts = sum(vocab_counts.values)\n", "inv_prob = [total_counts/vocab_counts[sub] for sub in tmp_vocab]\n", "vocab = [\"Unseen-Sub\"] + tmp_vocab #build place holder, Unseen-Sub, for all subs not in vocab\n", "tmp_vocab_probs = normalize(inv_prob)\n", "#force probs sum to 1 by adding differenc to \"Unseen-sub\" probability\n", "vocab_probs = [1-sum(tmp_vocab_probs)] + tmp_vocab_probs\n", "print(\"Vocab size = \" + str(len(vocab)))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sequence_chunk_size = 15\n", "def remove_repeating_subs(raw_data):\n", " cache_data = {}\n", " prev_usr = None\n", " past_sub = None\n", " for comment_data in raw_data:\n", " current_usr = comment_data[0]\n", " if current_usr != prev_usr:#New user found in sorted comment data, begin sequence extraction for new user\n", " if prev_usr != 
None and prev_usr not in cache_data.keys():#dump sequences to cache for previous user if not in cache\n", " cache_data[prev_usr] = usr_sub_seq\n", " usr_sub_seq = [comment_data[1]] #initialize user sub sequence list with first sub for current user\n", " past_sub = comment_data[1]\n", " else:#if still iterating through the same user, add new sub to sequence if not a repeat\n", " if comment_data[1] != past_sub:#Check that next sub comment is not a repeat of the last interacted with sub,\n", " #filtering out repeated interactions\n", " usr_sub_seq.append(comment_data[1])\n", " past_sub = comment_data[1]\n", " prev_usr = current_usr #update previous user to being the current one before looping to next comment\n", " return cache_data\n", "\n", "def build_training_sequences(usr_data):\n", " train_seqs = []\n", " #split user sub sequences into provided chunks of size sequence_chunk_size\n", " for usr,usr_sub_seq in usr_data.items():\n", " comment_chunks = chunks(usr_sub_seq,sequence_chunk_size)\n", " for chnk in comment_chunks:\n", " #for each chunk, filter out potential labels to select as training label, filter by the top subs filter list\n", " filtered_subs = [vocab.index(sub) for sub in chnk]\n", " if filtered_subs:\n", " #randomly select the label from filtered subs, using the vocab probability distribution to smooth out\n", " #representation of subreddit labels\n", " filter_probs = normalize([vocab_probs[sub_indx] for sub_indx in filtered_subs])\n", " label = np.random.choice(filtered_subs,1,p=filter_probs)[0]\n", " #build sequence by ensuring users sub exists in models vocabulary and filtering out the selected\n", " #label for this subreddit sequence\n", " chnk_seq = [vocab.index(sub) for sub in chnk if sub in vocab and vocab.index(sub) != label] \n", " train_seqs.append([chnk_seq,label,len(chnk_seq)]) \n", " return train_seqs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We transform the munged-data into a pandas dataframe for easier manipulation. Note that the subreddits have been integer encoded, indexed by their order in the vocabulary." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seq_lengthsub_labelsub_seqs
013432[46, 157, 46, 483, 157, 46, 157, 856, 157, 856...
1946[157, 432, 157, 432, 157, 432, 157, 157, 157]
2746[432, 432, 432, 432, 856, 856, 157]
313432[46, 157, 46, 157, 46, 157, 856, 157, 46, 157,...
4131048[46, 157, 46, 157, 46, 157, 46, 157, 46, 157, ...
\n", "
" ], "text/plain": [ " seq_length sub_label sub_seqs\n", "0 13 432 [46, 157, 46, 483, 157, 46, 157, 856, 157, 856...\n", "1 9 46 [157, 432, 157, 432, 157, 432, 157, 157, 157]\n", "2 7 46 [432, 432, 432, 432, 856, 856, 157]\n", "3 13 432 [46, 157, 46, 157, 46, 157, 856, 157, 46, 157,...\n", "4 13 1048 [46, 157, 46, 157, 46, 157, 46, 157, 46, 157, ..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp_user_data = remove_repeating_subs(raw_data)\n", "train_data = build_training_sequences(pp_user_data)\n", "seqs,lbls,lngths = zip(*train_data)\n", "train_df = pd.DataFrame({'sub_seqs':seqs,\n", " 'sub_label':lbls,\n", " 'seq_length':lngths})\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## TensorFlow Model Architecture
\n", "\n", "Originally, we built the model directly on-top of tensorflow, using the fantastic tutorials from [R2RT](http://r2rt.com/) as reference. However, building and managing various nueral network architectures with Tensorflow can be cumbersome, and higher level wrapper packages exist to abstract away some of the more tedious variable and graph definition steps required for tensorflow models. We chose the [tflearn](http://tflearn.org/) python package, which has an API similiar to sklearn, which the team had more experience with. With tflearn, it's rather easy to plug and play with different layers, and we experimented with LSTM, GRU and multi-layered Bi-Directional RNN architectures." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\python35\\lib\\site-packages\\tensorflow\\python\\util\\deprecation.py:155: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead\n", " arg_spec = inspect.getargspec(func)\n", "c:\\python35\\lib\\site-packages\\tensorflow\\python\\util\\deprecation.py:155: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead\n", " arg_spec = inspect.getargspec(func)\n", "c:\\python35\\lib\\site-packages\\tensorflow\\contrib\\labeled_tensor\\python\\ops\\_typecheck.py:233: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead\n", " spec = inspect.getargspec(f)\n" ] } ], "source": [ "import tensorflow as tf\n", "import tflearn\n", "from tflearn.data_utils import to_categorical, pad_sequences\n", "import numpy as np\n", "\n", "def train_model(train,test,vocab_size,n_epoch=2,n_units=128,dropout=0.6,learning_rate=0.0001):\n", "\n", " trainX = train['sub_seqs']\n", " trainY = train['sub_label']\n", " testX = test['sub_seqs']\n", " testY = test['sub_label']\n", "\n", " # Sequence padding\n", " trainX = pad_sequences(trainX, maxlen=sequence_chunk_size, value=0.,padding='post')\n", " testX = pad_sequences(testX, maxlen=sequence_chunk_size, value=0.,padding='post')\n", "\n", " # Converting labels to binary vectors\n", " trainY = to_categorical(trainY, nb_classes=vocab_size)\n", " testY = to_categorical(testY, nb_classes=vocab_size)\n", "\n", " # Network building\n", " net = tflearn.input_data([None, 15])\n", " net = tflearn.embedding(net, input_dim=vocab_size, output_dim=128,trainable=True)\n", " net = tflearn.gru(net, n_units=n_units, dropout=dropout,weights_init=tflearn.initializations.xavier(),return_seq=False)\n", " net = tflearn.fully_connected(net, vocab_size, activation='softmax',weights_init=tflearn.initializations.xavier())\n", " net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate,\n", " loss='categorical_crossentropy')\n", "\n", " # Training\n", " model = tflearn.DNN(net, tensorboard_verbose=2)\n", "\n", " model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=False,\n", " batch_size=256,n_epoch=n_epoch)\n", " \n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## Model Training
\n", "We split the model into train/test sets and begin training. Here we use the default training parameters, but the model can be tuned for epochs, internal units, dropout, learning-rate and other hyperparameters of the chosen RNN structure." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training Step: 29 | total loss: \u001b[1m\u001b[32m8.17396\u001b[0m\u001b[0m | time: 1.104s\n", "| Adam | epoch: 002 | loss: 8.17396 -- iter: 3584/3775\n", "Training Step: 30 | total loss: \u001b[1m\u001b[32m8.17391\u001b[0m\u001b[0m | time: 2.222s\n", "| Adam | epoch: 002 | loss: 8.17391 | val_loss: 8.17437 -- iter: 3775/3775\n", "--\n" ] } ], "source": [ "split_perc=0.8\n", "train_len, test_len = np.floor(len(train_df)*split_perc), np.floor(len(train_df)*(1-split_perc))\n", "train, test = train_df.ix[:train_len-1], train_df.ix[train_len:train_len + test_len]\n", "model = train_model(train,test,len(vocab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be difficult to tell how well the model is performing simply by staring at the flipping numbers above, but tensorflow provides a visualization tool called [tensorboard](https://www.tensorflow.org/how_tos/summaries_and_tensorboard/) and tflearn has different prebuilt dashboards which can be changed using the tensorboard_verbose option of the DNN layer. \n", "\n", "![title](tensorboard.PNG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## Visualizing the Model
\n", "As part of the model, a high dimension embedding space is learnt representing the subreddits in the vocabulary as vectors that can be reasoned about with \"distance\" from each other in the embedding space, and visualized with dimensionality reduction techniques, similiar to the concepts used in [word2vec.](http://www.deeplearningweekly.com/blog/demystifying-word2vec) The tutorial by Arthur Juliani [here](https://medium.com/@awjuliani/visualizing-deep-learning-with-t-sne-tutorial-and-video-e7c59ee4080c#.xdlzpd34w) was used to build the embedding visualization." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "#retrieve the embedding layer fro mthe model by default name 'Embedding'\n", "embedding = tflearn.get_layer_variables_by_name(\"Embedding\")[0]\n", "finalWs = model.get_weights(embedding)\n", "tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)\n", "lowDWeights = tsne.fit_transform(finalWs)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\python35\\lib\\site-packages\\bokeh\\core\\json_encoder.py:52: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future\n", " NP_EPOCH = np.datetime64('1970-01-01T00:00:00Z')\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force === true) {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " if (window.Bokeh !== undefined) {\n", " document.getElementById(\"d2adfea4-106b-4ad7-afef-620b2772cb31\").textContent = \"BokehJS successfully loaded.\";\n", " } else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", " function run_callbacks() {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " delete window._bokeh_onload_callbacks\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };var element = document.getElementById(\"d2adfea4-106b-4ad7-afef-620b2772cb31\");\n", " if (element == null) {\n", " console.log(\"Bokeh: ERROR: autoload.js configured with elementid 'd2adfea4-106b-4ad7-afef-620b2772cb31' but no matching script tag was found. 
\")\n", " return false;\n", " }\n", "\n", " var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.4.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.4.min.js\"];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " \n", " document.getElementById(\"d2adfea4-106b-4ad7-afef-620b2772cb31\").textContent = \"BokehJS is loading...\";\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.4.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.4.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.4.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.4.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if ((window.Bokeh !== undefined) || (force === true)) {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }if (force === true) {\n", " display_loaded();\n", " }} else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!window._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " window._bokeh_failed_load = true;\n", " } else if (force !== true) {\n", " var cell = $(document.getElementById(\"d2adfea4-106b-4ad7-afef-620b2772cb31\")).parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bokeh.plotting import figure, show, output_notebook,output_file\n", "from bokeh.models import ColumnDataSource, LabelSet\n", "\n", "#control the number of labelled subreddits to display\n", "sparse_labels = [lbl if random.random() <=0.01 else '' for lbl in vocab]\n", "source = ColumnDataSource({'x':lowDWeights[:,0],'y':lowDWeights[:,1],'labels':sparse_labels})\n", "\n", "\n", "TOOLS=\"hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select,\"\n", "\n", "p = figure(tools=TOOLS)\n", "\n", "p.scatter(\"x\", \"y\", radius=0.1, fill_alpha=0.6,\n", " line_color=None,source=source)\n", "\n", "labels = LabelSet(x=\"x\", y=\"y\", text=\"labels\", y_offset=8,\n", " text_font_size=\"10pt\", text_color=\"#555555\", text_align='center',\n", " source=source)\n", "p.add_layout(labels)\n", "\n", "#output_file(\"embedding.html\")\n", "output_notebook()\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## Saving the Model
\n", "To save the model for use in making real-world predictions, potentially as part of a webapp, we need to freeze the tensorflow graph and transform the variables into constants to maintain the final network. The tutorial [here](https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc#.eopkd8pys) walks us through how to accomplish this." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from tensorflow.python.framework import graph_util\n", "def freeze_graph(model):\n", " # We precise the file fullname of our freezed graph\n", " output_graph = \"/tmp/frozen_model.pb\"\n", "\n", " # Before exporting our graph, we need to precise what is our output node\n", " # This is how TF decides what part of the Graph he has to keep and what part it can dump\n", " # NOTE: this variable is plural, because you can have multiple output nodes\n", " output_node_names = \"InputData/X,FullyConnected/Softmax\"\n", "\n", " # We clear devices to allow TensorFlow to control on which device it will load operations\n", " clear_devices = True\n", " \n", " # We import the meta graph and retrieve a Saver\n", " #saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=clear_devices)\n", "\n", " # We retrieve the protobuf graph definition\n", " graph = model.net.graph\n", " input_graph_def = graph.as_graph_def()\n", "\n", " # We start a session and restore the graph weights\n", " # We use a built-in TF helper to export variables to constants\n", " sess = model.session\n", " output_graph_def = graph_util.convert_variables_to_constants(\n", " sess, # The session is used to retrieve the weights\n", " input_graph_def, # The graph_def is used to retrieve the nodes \n", " output_node_names.split(\",\") # The output node names are used to select the usefull nodes\n", " ) \n", "\n", " # Finally we serialize and dump the output graph to the filesystem\n", " with tf.gfile.GFile(output_graph, \"wb\") as f:\n", " f.write(output_graph_def.SerializeToString())\n", " print(\"%d ops in the final graph.\" % len(output_graph_def.node))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Froze 152 variables.\n", "Converted 8 variables to const ops.\n", "607 ops in the final graph.\n" ] } ], "source": [ "freeze_graph(model)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def load_graph(frozen_graph_filename):\n", " # We load the protobuf file from the disk and parse it to retrieve the \n", " # unserialized graph_def\n", " with tf.gfile.GFile(frozen_graph_filename, \"rb\") as f:\n", " graph_def = tf.GraphDef()\n", " graph_def.ParseFromString(f.read())\n", "\n", " # Then, we can use again a convenient built-in function to import a graph_def into the \n", " # current default Graph\n", " with tf.Graph().as_default() as graph:\n", " tf.import_graph_def(\n", " graph_def, \n", " input_map=None, \n", " return_elements=None, \n", " name=\"prefix\", \n", " op_dict=None, \n", " producer_op_list=None\n", " )\n", " return graph" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grph = load_graph(\"/tmp/frozen_model.pb\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"[[ 0.00028193 0.00028239 0.00028253 ..., 0.00028406 0.00028214\n", " 0.00028208]]\n" ] } ], "source": [ "x = grph.get_tensor_by_name('prefix/InputData/X:0')\n", "y = grph.get_tensor_by_name(\"prefix/FullyConnected/Softmax:0\")\n", "\n", "# We launch a Session\n", "with tf.Session(graph=grph) as sess:\n", " # Note: we didn't initialize/restore anything, everything is stored in the graph_def\n", " y_out = sess.run(y, feed_dict={\n", " x: [[1]*sequence_chunk_size] \n", " })\n", " print(y_out) # [[ False ]] Yay, it works!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## Final Recommender
\n", "Using the frozen model, we can predict the most likely subreddits to be of interest to a user by collecting Reddit data for a specific user and provide final recommendations based on the most common subreddits with the highest probabilities from the RNN predictions for each of the subreddit sequence chunks of the user." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import Counter\n", "def collect_user_data(user):\n", " print(user)\n", " #Import configuration parameters, user agent for PRAW Reddit object\n", " config = configparser.ConfigParser()\n", " config.read('secrets.ini')\n", "\n", " #load user agent string\n", " reddit_user_agent = config.get('reddit', 'user_agent')\n", " client_id = config.get('reddit', 'client_id')\n", " client_secret = config.get('reddit', 'client_api_key')\n", " #initialize the praw Reddit object\n", " r = praw.Reddit(user_agent=reddit_user_agent,client_id = client_id,client_secret=client_secret) \n", " praw_user = r.get_redditor(user)\n", " user_data = [(user_comment.subreddit.display_name,\n", " user_comment.created_utc) for user_comment in praw_user.get_comments(limit=None)]\n", " return sorted(user_data,key=lambda x: x[1]) #sort by ascending utc timestamp\n", "\n", "def user_recs(user,n_recs=10,chunk_size=sequence_chunk_size):\n", " user_data = collect_user_data(user)\n", " user_sub_seq = [vocab.index(data[0]) if data[0] in vocab else 0 for data in user_data]\n", " non_repeating_subs = []\n", " for i,sub in enumerate(user_sub_seq):\n", " if i == 0:\n", " non_repeating_subs.append(sub)\n", " elif sub != user_sub_seq[i-1]:\n", " non_repeating_subs.append(sub)\n", " user_subs = set([vocab[sub_index] for sub_index in non_repeating_subs])\n", " sub_chunks = list(chunks(non_repeating_subs,chunk_size))\n", " user_input = pad_sequences(sub_chunks, maxlen=chunk_size, value=0.,padding='post')\n", " x = grph.get_tensor_by_name('prefix/InputData/X:0')\n", " y = grph.get_tensor_by_name(\"prefix/FullyConnected/Softmax:0\")\n", " with tf.Session(graph=grph) as sess:\n", " sub_probs = sess.run(y, feed_dict={\n", " x: user_input\n", " })\n", " #select the subreddit with highest prediction prob for each of the input subreddit sequences of the user\n", " recs = [np.argmax(probs) for probs in sub_probs]\n", " filtered_recs = [filt_rec for filt_rec in recs if filt_rec not in user_sub_seq]\n", " top_x_recs,cnt = zip(*Counter(filtered_recs).most_common(n_recs))\n", " sub_recs = [vocab[sub_index] for sub_index in top_x_recs]\n", " return sub_recs" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['fantasyfootball', 'PS3']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_recs(\"ponderinghydrogen\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
## The Web App
\n", "Those are all the pieces required to build a functioning subreddit recommender system that users can try! Using Flask, a simple web app can be made taking as input any valid reddit user name and outputting recommendations for that user. A minimal web app doing just that can be interacted with [here](http://ponderinghydrogen.pythonanywhere.com/)\n", "\n", "![title](wepapp.PNG)\n", "\n", "
## Final Thoughts
\n", "The model being served in the above webapp is an under-tuned and under-dataed proof-of-concept single layer RNN, but it is still surprisingly capable of suggesting interesting subreddits to some testers I've had use the app. Nueral Networks really are powerful methods for tackling difficult problems, and with better and better Machine Learning research and tooling being released daily, and increasingly powerful computers, the pool of potential problems solvable by a determined engineer keeps getting larger. I'm looking forward to tackling the next one." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }