{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "listwise-movie-recommendations-using-rl-methods.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true, "mount_file_id": "1Guok-lDpcc5mJhz6W4eaFNERAvU0FgRI", "authorship_tag": "ABX9TyOTdHQQ3hPyWAyakOXAfBNw" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "WlyEOQzwccnK" }, "source": [ "# Movie List Recommender using Actor-critic based RL method\n", "> Training a list-wise movie recommender using actor-critic policy and evaluating offline using experience replay method\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [RL, Movie, Tensorflow 1x]\n", "- image:" ] }, { "cell_type": "markdown", "metadata": { "id": "cvYLG7qBvB6W" }, "source": [ "### Introduction" ] }, { "cell_type": "markdown", "metadata": { "id": "URFAhP2vvD4T" }, "source": [ "We will model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users’ feedbacks.\n", "\n", "Efforts have been made on utilizing reinforcement learning for recommender systems, such as POMDP and Q-learning. However, these methods may become inflexible with the increasing number of items for recommendations. This prevents them to be adopted by practical recommender systems." ] }, { "cell_type": "markdown", "metadata": { "id": "TkzYH9X2vlOp" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "qepaUtlovTsw" }, "source": [ "\n", "Generally, there exist two Deep Q-learning architectures, shown in the above figure. Traditional deep Q-learning adopts the first architecture as shown in (a), which inputs only the state space and outputs Q-values of all actions. This architecture is suitable for the scenario with high state space and small action space, like playing Atari. However, one drawback is that it cannot handle large and dynamic action space scenario, like recommender systems. The second Q-learning architecture, shown (b), treats the state and the action as the input of Neural Networks and outputs the Q-value corresponding to this action. This architecture does not need to store each Q-value in memory and thus can deal with large action space or even continuous action space. A challenging problem of leveraging the second architecture is temporal complexity, i.e., this architecture computes Q-value for all potential actions, separately.\n", "\n", "Recommending a list of items is more desirable (especially for sellers) than recommending a single item. To achieve this, one option is to score the items seperately and select the top ones. For example, DQN can calculate Q-values of all recalled items separately, and recommend a list of items with highest Q-values. But this strategy do not consider the relation between items. e.g. If the next best item is egg, all kind of different eggs will get high scores, like white eggs, brown eggs, farm eggs etc. But these are similar items, not complimentary. And the whole purpose of list-wise recommendation is to recommend complimentary items. That's where DQN fails. \n", "\n", "One option to resolve this issue is by adding a simple rule - select only 1 top-scoring item from each category. 
"This rule improves the list quality, but it also comes with some missed opportunities. Say we recommend a 12-brown-eggs 🥚 tray and a small brown bread 🍞. It is possible that the 24-brown-eggs tray scores higher than the 12-brown-eggs tray, while in the bread category the small brown bread is still the highest-scoring item. As per business sense, we should recommend a large brown bread together with a 24-brown-eggs tray. This is what we miss: either the customer manually selects the large bread (a lost opportunity for higher customer satisfaction) or just buys the small one (a lost opportunity for higher revenue).\n", "\n", "In this tutorial, our goal is to fill this gap. We will evaluate the RL agent offline using the experience replay method. Also, it is possible that productionizing this model costs more than the benefit it brings, especially for small businesses: a 1% revenue gain on $1M might not be sufficient, while on $1B the same model becomes one of the highest-priority models to productionize 💵." ] }, { "cell_type": "markdown", "metadata": { "id": "xO7879psvqw6" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "GnQebSD9vYXL" }, "source": [ "To tackle this problem, our recommending policy builds upon the Actor-Critic framework. We model this problem as a Markov Decision Process (MDP), which includes a sequence of states, actions and rewards. More formally, the MDP consists of a tuple of five elements $(\\mathcal{S}, \\mathcal{A}, \\mathcal{P}, \\mathcal{R}, \\gamma)$ as follows:\n", "\n", "- State space $\\mathcal{S}$: A state $s_t \\in \\mathcal{S}$ is defined as the browsing history of a user, i.e., the previous $N$ items that the user browsed before time $t$. The items in $s_t$ are sorted in chronological order.\n", "- Action space $\\mathcal{A}$: An action $a_t \\in \\mathcal{A}$ is to recommend a list of items to a user at time $t$ based on the current state $s_t$.\n", "- Reward $\\mathcal{R}$: After the recommender agent (RA) takes an action $a_t$ at state $s_t$, i.e., recommends a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives an immediate reward $r(s_t,a_t)$ according to the user's feedback.\n", "- Transition probability $\\mathcal{P}$: The transition probability defines the probability of the state transition from $s_t$ to $s_{t+1}$ when the RA takes action $a_t$. If the user skips all the recommended items, then the next state $s_{t+1} = s_t$; if the user clicks/orders some of the items, then the next state $s_{t+1}$ is updated.\n", "- Discount factor $\\gamma$: $\\gamma \\in [0,1]$ defines the discount factor when we measure the present value of future rewards. In particular, when $\\gamma = 0$, the RA only considers the immediate reward; when $\\gamma = 1$, all future rewards are counted fully into the value of the current action.\n", "\n", "With the notations and definitions above, the problem of listwise item recommendation can be formally defined as follows: given the historical MDP, i.e., $(\\mathcal{S}, \\mathcal{A}, \\mathcal{P}, \\mathcal{R}, \\gamma)$, the goal is to find a recommendation policy $\\pi : \\mathcal{S} \\to \\mathcal{A}$ that maximizes the cumulative reward for the recommender system.\n", "\n",
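"The same discounting idea is used later by the simulator to combine the per-item rewards of one recommended list into an overall reward (the `overall_reward` helper inside the `Environment` class below). Here is a toy calculation with made-up rewards:\n",
"\n",
"```python\n",
"gamma = 0.9                   # within-list discount (Equation (4) in the code comments)\n",
"item_rewards = [0, 1, 0, 5]   # e.g. skip, click, skip, order\n",
"\n",
"overall = sum(gamma ** k * r for k, r in enumerate(item_rewards))\n",
"print(round(overall, 3))      # 0 + 0.9*1 + 0 + 0.729*5 = 4.545\n",
"```\n",
"\n",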
"According to collaborative filtering techniques, users with similar interests will make similar decisions on the same item. With this intuition, we match the current state and action to existing historical state-action pairs and stochastically generate a simulated reward. To be more specific, we first build a memory $M = \\{m_1, m_2, \\cdots\\}$ to store users' historical browsing history, where $m_i$ is a user-agent interaction triplet $((s_i, a_i) \\to r_i)$. The procedure to build the online simulator memory is illustrated in the following figure. Given a historical recommendation session $\\{a_1, \\cdots, a_L\\}$, we can observe the initial state $s_0 = \\{s_0^1, \\cdots, s_0^N\\}$ from the previous sessions (line 2). Each time we observe $K$ items in temporal order (line 3), which means that at each iteration we move forward a window of $K$ items. We can observe the current state (line 4), the current $K$ items (line 5), and the user's feedback for these items (line 6). Then we store the triplet $((s_i, a_i) \\to r_i)$ in memory (line 7). Finally, we update the state (lines 8-13) and move to the next $K$ items. Since we keep a fixed-length state $s = \\{s_1, \\cdots, s_N\\}$, each time the user clicks/orders some items in the recommended list, we add these items to the end of the state and remove the same number of items from the top of the state. For example, if the RA recommends a list of five items $\\{a_1, \\cdots, a_5\\}$ to a user and the user clicks $a_1$ and orders $a_5$, then the state is updated to $s = \\{s_3, \\cdots, s_N, a_1, a_5\\}$." ] }, { "cell_type": "markdown", "metadata": { "id": "FCYiZWs4vvPT" }, "source": [ "*(Procedure for building the online simulator memory from historical sessions; the line numbers above refer to its steps.)*" ] }, { "cell_type": "markdown", "metadata": { "id": "f8NJ1rPavbi3" }, "source": [ "Then we calculate the similarity of the current state-action pair, say $p_t(s_t,a_t)$, to each existing historical state-action pair in the memory. In this work, we adopt the cosine similarity:" ] }, { "cell_type": "markdown", "metadata": { "id": "LlkT0ah2vyYQ" }, "source": [ "$$\\text{similarity}\\big((s_t, a_t), (s_i, a_i)\\big) = \\alpha \\, \\frac{s_t s_i^T}{\\|s_t\\| \\, \\|s_i\\|} + (1 - \\alpha) \\, \\frac{a_t a_i^T}{\\|a_t\\| \\, \\|a_i\\|} \\tag{1}$$" ] }, { "cell_type": "markdown", "metadata": { "id": "eded1f-OvdnJ" }, "source": [ "where the first term measures the state similarity and the second term evaluates the action similarity. The parameter $\\alpha$ controls the balance between the two similarities.\n", "\n", "The proposed framework is as follows:" ] }, { "cell_type": "markdown", "metadata": { "id": "yDst-u-dv1sk" }, "source": [ "*(Figure: the proposed Actor-Critic recommendation framework.)*" ] }, { "cell_type": "markdown", "metadata": { "id": "CvHJ6duyvgSA" }, "source": [ "The framework works like this:\n", "\n", "**Input**: current state $s_t$, item space $\\mathcal{I}$, the length of the recommendation list $K$.\n", "\n", "**Output**: recommendation list $a_t$.\n", "\n", "1. Generate $w_t = \\{w_t^1, \\cdots, w_t^K\\}$ according to $f_{\\theta^\\pi} : s_t \\to w_t$, where $f_{\\theta^\\pi}$ is a function parametrized by $\\theta^\\pi$ that maps the state space to the weight representation space\n", "2. For $k = 1:K$ do\n", " 1. Score items in $\\mathcal{I}$ according to $score_i = w_t^k e_i^T$\n", " 2. Select the item with the highest score as $a_t^k$\n", " 3. Add item $a_t^k$ to the bottom of $a_t$\n", " 4. Remove item $a_t^k$ from $\\mathcal{I}$\n", "3. end for\n", "4. 
return $a_t$" ] }, { "cell_type": "markdown", "metadata": { "id": "n5mSyGfby5iE" }, "source": [ "### Setup" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_7FcQ8w7tIVf", "outputId": "a3bd33ed-ef59-403c-8db0-e3989ef68d89" }, "source": [ "%tensorflow_version 1.x" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "TensorFlow 1.x selected.\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6DvK7jdgtLpr", "outputId": "ab5fddba-53ed-433a-874a-42535e2483af" }, "source": [ "import itertools\n", "import pandas as pd\n", "import numpy as np\n", "import random\n", "import csv\n", "import time\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import tensorflow as tf\n", "\n", "import keras.backend as K\n", "from keras import Sequential\n", "from keras.layers import Dense, Dropout" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ], "name": "stderr" } ] }, { "cell_type": "markdown", "metadata": { "id": "NPnJFRj8y8Mm" }, "source": [ "### Download data" ] }, { "cell_type": "markdown", "metadata": { "id": "8c2XxhlZy_ZD" }, "source": [ "Downloading Movielens dataset from official source" ] }, { "cell_type": "code", "metadata": { "id": "SlRThma7tMNq" }, "source": [ "!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip\n", "!unzip -q ml-100k.zip" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "-YBKO19ZzFmo" }, "source": [ "### DataGenerator `class`\n", "1. Load the data into pandas dataframe\n", "2. List down user's rating history in chronological order\n", "3. Generate a sample of state-action pair\n", "4. Split the data into train/test\n", "5. Store the data back into csv file format" ] }, { "cell_type": "code", "metadata": { "id": "BvX6KQirtgJV" }, "source": [ "#collapse-hide\n", "class DataGenerator():\n", " def __init__(self, datapath, itempath):\n", " '''\n", " Load data from the DB MovieLens\n", " List the users and the items\n", " List all the users histories\n", " '''\n", " self.data = self.load_data(datapath, itempath)\n", " self.users = self.data['userId'].unique() #list of all users\n", " self.items = self.data['itemId'].unique() #list of all items\n", " self.histo = self.generate_history()\n", " self.train = []\n", " self.test = []\n", "\n", " def load_data(self, datapath, itempath):\n", " '''\n", " Load the data and merge the name of each movie. 
\n", " A row corresponds to a rate given by a user to a movie.\n", "\n", " Parameters\n", " ----------\n", " datapath : string\n", " path to the data 100k MovieLens\n", " contains usersId;itemId;rating \n", " itempath : string\n", " path to the data 100k MovieLens\n", " contains itemId;itemName\n", " Returns\n", " -------\n", " result : DataFrame\n", " Contains all the ratings \n", " '''\n", " data = pd.read_csv(datapath, sep='\\t', \n", " names=['userId', 'itemId', 'rating', 'timestamp'])\n", " movie_titles = pd.read_csv(itempath, sep='|', names=['itemId', 'itemName'],\n", " usecols=range(2), encoding='latin-1')\n", " return data.merge(movie_titles,on='itemId', how='left')\n", "\n", "\n", " def generate_history(self):\n", " '''\n", " Group all rates given by users and store them from older to most recent.\n", " \n", " Returns\n", " -------\n", " result : List(DataFrame)\n", " List of the historic for each user\n", " '''\n", " historic_users = []\n", " for i, u in enumerate(self.users):\n", " temp = self.data[self.data['userId'] == u]\n", " temp = temp.sort_values('timestamp').reset_index()\n", " temp.drop('index', axis=1, inplace=True)\n", " historic_users.append(temp)\n", " return historic_users\n", "\n", " def sample_history(self, user_histo, action_ratio=0.8, max_samp_by_user=5, max_state=100, max_action=50, nb_states=[], nb_actions=[]):\n", " '''\n", " For a given history, make one or multiple sampling.\n", " If no optional argument given for nb_states and nb_actions, then the sampling\n", " is random and each sample can have differents size for action and state.\n", " To normalize sampling we need to give list of the numbers of states and actions\n", " to be sampled.\n", "\n", " Parameters\n", " ----------\n", " user_histo : DataFrame\n", " historic of user\n", " delimiter : string, optional\n", " delimiter for the csv\n", " action_ratio : float, optional\n", " ratio form which movies in history will be selected\n", " max_samp_by_user: int, optional\n", " Nulber max of sample to make by user\n", " max_state : int, optional\n", " Number max of movies to take for the 'state' column\n", " max_action : int, optional\n", " Number max of movies to take for the 'action' action\n", " nb_states : array(int), optional\n", " Numbers of movies to be taken for each sample made on user's historic\n", " nb_actions : array(int), optional\n", " Numbers of rating to be taken for each sample made on user's historic\n", " \n", " Returns\n", " -------\n", " states : List(String)\n", " All the states sampled, format of a sample: itemId&rating\n", " actions : List(String)\n", " All the actions sampled, format of a sample: itemId&rating\n", " \n", "\n", " Notes\n", " -----\n", " States must be before(timestamp<) the actions.\n", " If given, size of nb_states is the numbller of sample by user\n", " sizes of nb_states and nb_actions must be equals\n", " '''\n", "\n", " n = len(user_histo)\n", " sep = int(action_ratio * n)\n", " nb_sample = random.randint(1, max_samp_by_user)\n", " if not nb_states:\n", " nb_states = [min(random.randint(1, sep), max_state) for i in range(nb_sample)]\n", " if not nb_actions:\n", " nb_actions = [min(random.randint(1, n - sep), max_action) for i in range(nb_sample)]\n", " assert len(nb_states) == len(nb_actions), 'Given array must have the same size'\n", " \n", " states = []\n", " actions = []\n", " # SELECT SAMPLES IN HISTORY\n", " for i in range(len(nb_states)):\n", " sample_states = user_histo.iloc[0:sep].sample(nb_states[i])\n", " sample_actions = user_histo.iloc[-(n - 
sep):].sample(nb_actions[i])\n", " \n", " sample_state = []\n", " sample_action = []\n", " for j in range(nb_states[i]):\n", " row = sample_states.iloc[j]\n", " # FORMAT STATE\n", " state = str(row.loc['itemId']) + '&' + str(row.loc['rating'])\n", " sample_state.append(state)\n", " \n", " for j in range(nb_actions[i]):\n", " row = sample_actions.iloc[j]\n", " # FORMAT ACTION\n", " action = str(row.loc['itemId']) + '&' + str(row.loc['rating'])\n", " sample_action.append(action)\n", "\n", " states.append(sample_state)\n", " actions.append(sample_action)\n", " return states, actions\n", "\n", " def gen_train_test(self, test_ratio, seed=None):\n", " '''\n", " Shuffle the historic of users and separate it in a train and a test set.\n", " Store the ids for each set.\n", " An user can't be in both set.\n", "\n", " Parameters\n", " ----------\n", " test_ratio : float\n", " Ratio to control the sizes of the sets\n", " seed : float\n", " Seed on the shuffle\n", " '''\n", " n = len(self.histo)\n", "\n", " if seed is not None:\n", " random.Random(seed).shuffle(self.histo)\n", " else:\n", " random.shuffle(self.histo)\n", "\n", " self.train = self.histo[:int((test_ratio * n))]\n", " self.test = self.histo[int((test_ratio * n)):]\n", " self.user_train = [h.iloc[0,0] for h in self.train]\n", " self.user_test = [h.iloc[0,0] for h in self.test]\n", " \n", "\n", " def write_csv(self, filename, histo_to_write, delimiter=';', action_ratio=0.8, max_samp_by_user=5, max_state=100, max_action=50, nb_states=[], nb_actions=[]):\n", " '''\n", " From a given historic, create a csv file with the format:\n", " columns : state;action_reward;n_state\n", " rows : itemid&rating1 | itemid&rating2 | ... ; itemid&rating3 | ... | itemid&rating4; itemid&rating1 | itemid&rating2 | itemid&rating3 | ... 
| item&rating4\n", " at filename location.\n", "\n", " Parameters\n", " ----------\n", " filename : string\n", " path to the file to be produced\n", " histo_to_write : List(DataFrame)\n", " List of the historic for each user\n", " delimiter : string, optional\n", " delimiter for the csv\n", " action_ratio : float, optional\n", " ratio form which movies in history will be selected\n", " max_samp_by_user: int, optional\n", " Nulber max of sample to make by user\n", " max_state : int, optional\n", " Number max of movies to take for the 'state' column\n", " max_action : int, optional\n", " Number max of movies to take for the 'action' action\n", " nb_states : array(int), optional\n", " Numbers of movies to be taken for each sample made on user's historic\n", " nb_actions : array(int), optional\n", " Numbers of rating to be taken for each sample made on user's historic\n", "\n", " Notes\n", " -----\n", " if given, size of nb_states is the numbller of sample by user\n", " sizes of nb_states and nb_actions must be equals\n", "\n", " '''\n", " with open(filename, mode='w') as file:\n", " f_writer = csv.writer(file, delimiter=delimiter)\n", " f_writer.writerow(['state', 'action_reward', 'n_state'])\n", " for user_histo in histo_to_write:\n", " states, actions = self.sample_history(user_histo, action_ratio, max_samp_by_user, max_state, max_action, nb_states, nb_actions)\n", " for i in range(len(states)):\n", " # FORMAT STATE\n", " state_str = '|'.join(states[i])\n", " # FORMAT ACTION\n", " action_str = '|'.join(actions[i])\n", " # FORMAT N_STATE\n", " n_state_str = state_str + '|' + action_str\n", " f_writer.writerow([state_str, action_str, n_state_str])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VQbGG-QC0ZeJ" }, "source": [ "### EmbeddingsGenerator `class`\n", "1. Load the data\n", "2. Build a keras sequential model\n", "3. Convert train and test set into required format\n", "4. Train and evaluate the model\n", "5. Generate item embeddings for each movie id\n", "6. 
Save the embeddings into a csv file" ] }, { "cell_type": "code", "metadata": { "id": "O45OFa8ytgGG" }, "source": [ "#collapse-hide\n", "class EmbeddingsGenerator:\n", " def __init__(self, train_users, data):\n", " self.train_users = train_users\n", "\n", " #preprocess\n", " self.data = data.sort_values(by=['timestamp'])\n", " #make them start at 0\n", " self.data['userId'] = self.data['userId'] - 1\n", " self.data['itemId'] = self.data['itemId'] - 1\n", " self.user_count = self.data['userId'].max() + 1\n", " self.movie_count = self.data['itemId'].max() + 1\n", " self.user_movies = {} #list of rated movies by each user\n", " for userId in range(self.user_count):\n", " self.user_movies[userId] = self.data[self.data.userId == userId]['itemId'].tolist()\n", " self.m = self.model()\n", "\n", " def model(self, hidden_layer_size=100):\n", " m = Sequential()\n", " m.add(Dense(hidden_layer_size, input_shape=(1, self.movie_count)))\n", " m.add(Dropout(0.2))\n", " m.add(Dense(self.movie_count, activation='softmax'))\n", " m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n", " return m\n", " \n", " def generate_input(self, user_id):\n", " '''\n", " Returns a context and a target for the user_id\n", " context: user's history with one random movie removed\n", " target: id of random removed movie\n", " '''\n", " user_movies_count = len(self.user_movies[user_id])\n", " #picking random movie\n", " random_index = np.random.randint(0, user_movies_count-1) # -1 avoids taking the last movie\n", " #setting target\n", " target = np.zeros((1, self.movie_count))\n", " target[0][self.user_movies[user_id][random_index]] = 1\n", " #setting context\n", " context = np.zeros((1, self.movie_count))\n", " context[0][self.user_movies[user_id][:random_index] + self.user_movies[user_id][random_index+1:]] = 1\n", " return context, target\n", "\n", " def train(self, nb_epochs = 300, batch_size = 10000):\n", " '''\n", " Trains the model from train_users's history\n", " '''\n", " for i in range(nb_epochs):\n", " print('%d/%d' % (i+1, nb_epochs))\n", " batch = [self.generate_input(user_id=np.random.choice(self.train_users) - 1) for _ in range(batch_size)]\n", " X_train = np.array([b[0] for b in batch])\n", " y_train = np.array([b[1] for b in batch])\n", " self.m.fit(X_train, y_train, epochs=1, validation_split=0.5)\n", "\n", " def test(self, test_users, batch_size = 100000):\n", " '''\n", " Returns [loss, accuracy] on the test set\n", " '''\n", " batch_test = [self.generate_input(user_id=np.random.choice(test_users) - 1) for _ in range(batch_size)]\n", " X_test = np.array([b[0] for b in batch_test])\n", " y_test = np.array([b[1] for b in batch_test])\n", " return self.m.evaluate(X_test, y_test)\n", "\n", " def save_embeddings(self, file_name):\n", " '''\n", " Generates a csv file containg the vector embedding for each movie.\n", " '''\n", " inp = self.m.input # input placeholder\n", " outputs = [layer.output for layer in self.m.layers] # all layer outputs\n", " functor = K.function([inp, K.learning_phase()], outputs ) # evaluation function\n", "\n", " #append embeddings to vectors\n", " vectors = []\n", " for movie_id in range(self.movie_count):\n", " movie = np.zeros((1, 1, self.movie_count))\n", " movie[0][0][movie_id] = 1\n", " layer_outs = functor([movie])\n", " vector = [str(v) for v in layer_outs[0][0][0]]\n", " vector = '|'.join(vector)\n", " vectors.append([movie_id, vector])\n", "\n", " #saves as a csv file\n", " embeddings = pd.DataFrame(vectors, columns=['item_id', 
'vectors']).astype({'item_id': 'int32'})\n", " embeddings.to_csv(file_name, sep=';', index=False)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GPDw4iPw1N3U" }, "source": [ "### Embeddings `helper class`" ] }, { "cell_type": "code", "metadata": { "id": "Pkl50IeYtgCE" }, "source": [ "#collapse-hide\n", "class Embeddings:\n", " def __init__(self, item_embeddings):\n", " self.item_embeddings = item_embeddings\n", " \n", " def size(self):\n", " return self.item_embeddings.shape[1]\n", " \n", " def get_embedding_vector(self):\n", " return self.item_embeddings\n", " \n", " def get_embedding(self, item_index):\n", " return self.item_embeddings[item_index]\n", "\n", " def embed(self, item_list):\n", " return np.array([self.get_embedding(item) for item in item_list])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5e1gKdTW1ezk" }, "source": [ "### read_file `helper function`\n", "\n", "This function will read the stored data csv files into pandas dataframe" ] }, { "cell_type": "code", "metadata": { "id": "eQMNYF2R1XmX" }, "source": [ "#collapse-hide\n", "def read_file(data_path):\n", " ''' Load data from train.csv or test.csv. '''\n", "\n", " data = pd.read_csv(data_path, sep=';')\n", " for col in ['state', 'n_state', 'action_reward']:\n", " data[col] = [np.array([[np.int(k) for k in ee.split('&')] for ee in e.split('|')]) for e in data[col]]\n", " for col in ['state', 'n_state']:\n", " data[col] = [np.array([e[0] for e in l]) for l in data[col]]\n", "\n", " data['action'] = [[e[0] for e in l] for l in data['action_reward']]\n", " data['reward'] = [tuple(e[1] for e in l) for l in data['action_reward']]\n", " data.drop(columns=['action_reward'], inplace=True)\n", "\n", " return data" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jG9pMXDf1sYS" }, "source": [ "### read_embeddings `helper function`\n", "\n", "This function will read the stored embedding csv file into pandas dataframe and return as multi-dimensional array" ] }, { "cell_type": "code", "metadata": { "id": "jIa8hSX0uZWx" }, "source": [ "def read_embeddings(embeddings_path):\n", " ''' Load embeddings (a vector for each item). '''\n", " \n", " embeddings = pd.read_csv(embeddings_path, sep=';')\n", "\n", " return np.array([[np.float64(k) for k in e.split('|')]\n", " for e in embeddings['vectors']])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VrPAuw7W2QYx" }, "source": [ "### Environment `class`\n", "\n", "This is the simulator. It will help orchestrating the whole process of learning list recommendations by our actor-critic based MDP agent. 
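\n", "\n",
"Its core quantity is the $\\alpha$-weighted state/action cosine similarity (Equation (1)), implemented in `cosine_state_action` below. A minimal standalone sketch, with random vectors standing in for the flattened state and action embeddings (sizes chosen to match the hyperparameters used later: a 12-item history, 4-item lists, 100-dimensional embeddings):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"s_t, s_i = rng.rand(1, 1200), rng.rand(1, 1200)   # states: 12 items x 100-dim embeddings\n",
"a_t, a_i = rng.rand(1, 400), rng.rand(1, 400)     # actions: 4 items x 100-dim embeddings\n",
"alpha = 0.5\n",
"\n",
"cos_state = float(s_t @ s_i.T) / (np.linalg.norm(s_t) * np.linalg.norm(s_i))\n",
"cos_action = float(a_t @ a_i.T) / (np.linalg.norm(a_t) * np.linalg.norm(a_i))\n",
"similarity = alpha * cos_state + (1 - alpha) * cos_action\n",
"```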
" ] }, { "cell_type": "code", "metadata": { "id": "ImgTOp-kujq7" }, "source": [ "#collapse-hide\n", "class Environment():\n", " def __init__(self, data, embeddings, alpha, gamma, fixed_length):\n", " self.embeddings = embeddings\n", "\n", " self.embedded_data = pd.DataFrame()\n", " self.embedded_data['state'] = [np.array([embeddings.get_embedding(item_id) \n", " for item_id in row['state']]) for _, row in data.iterrows()]\n", " self.embedded_data['action'] = [np.array([embeddings.get_embedding(item_id) \n", " for item_id in row['action']]) for _, row in data.iterrows()]\n", " self.embedded_data['reward'] = data['reward']\n", "\n", " self.alpha = alpha # α (alpha) in Equation (1)\n", " self.gamma = gamma # Γ (Gamma) in Equation (4)\n", " self.fixed_length = fixed_length\n", " self.current_state = self.reset()\n", " self.groups = self.get_groups()\n", "\n", " def reset(self):\n", " self.init_state = self.embedded_data['state'].sample(1).values[0]\n", " return self.init_state\n", "\n", " def step(self, actions):\n", " '''\n", " Compute reward and update state.\n", " Args:\n", " actions: embedded chosen items.\n", " Returns:\n", " cumulated_reward: overall reward.\n", " current_state: updated state.\n", " '''\n", "\n", " # '18: Compute overall reward r_t according to Equation (4)'\n", " simulated_rewards, cumulated_reward = self.simulate_rewards(self.current_state.reshape((1, -1)), actions.reshape((1, -1)))\n", "\n", " # '11: Set s_t+1 = s_t' <=> self.current_state = self.current_state\n", "\n", " for k in range(len(simulated_rewards)): # '12: for k = 1, K do'\n", " if simulated_rewards[k] > 0: # '13: if r_t^k > 0 then'\n", " # '14: Add a_t^k to the end of s_t+1'\n", " self.current_state = np.append(self.current_state, [actions[k]], axis=0)\n", " if self.fixed_length: # '15: Remove the first item of s_t+1'\n", " self.current_state = np.delete(self.current_state, 0, axis=0)\n", "\n", " return cumulated_reward, self.current_state\n", "\n", " def get_groups(self):\n", " ''' Calculate average state/action value for each group. Equation (3). 
'''\n", "\n", " groups = []\n", " for rewards, group in self.embedded_data.groupby(['reward']):\n", " size = group.shape[0]\n", " states = np.array(list(group['state'].values))\n", " actions = np.array(list(group['action'].values))\n", " groups.append({\n", " 'size': size, # N_x in article\n", " 'rewards': rewards, # U_x in article (combination of rewards)\n", " 'average state': (np.sum(states / np.linalg.norm(states, 2, axis=1)[:, np.newaxis], axis=0) / size).reshape((1, -1)), # s_x^-\n", " 'average action': (np.sum(actions / np.linalg.norm(actions, 2, axis=1)[:, np.newaxis], axis=0) / size).reshape((1, -1)) # a_x^-\n", " })\n", " return groups\n", "\n", " def simulate_rewards(self, current_state, chosen_actions, reward_type='grouped cosine'):\n", " '''\n", " Calculate simulated rewards.\n", " Args:\n", " current_state: history, list of embedded items.\n", " chosen_actions: embedded chosen items.\n", " reward_type: from ['normal', 'grouped average', 'grouped cosine'].\n", " Returns:\n", " returned_rewards: most probable rewards.\n", " cumulated_reward: probability weighted rewards.\n", " '''\n", "\n", " # Equation (1)\n", " def cosine_state_action(s_t, a_t, s_i, a_i):\n", " cosine_state = np.dot(s_t, s_i.T) / (np.linalg.norm(s_t, 2) * np.linalg.norm(s_i, 2))\n", " cosine_action = np.dot(a_t, a_i.T) / (np.linalg.norm(a_t, 2) * np.linalg.norm(a_i, 2))\n", " return (self.alpha * cosine_state + (1 - self.alpha) * cosine_action).reshape((1,))\n", "\n", " if reward_type == 'normal':\n", " # Calculate simulated reward in normal way: Equation (2)\n", " probabilities = [cosine_state_action(current_state, chosen_actions, row['state'], row['action'])\n", " for _, row in self.embedded_data.iterrows()]\n", " elif reward_type == 'grouped average':\n", " # Calculate simulated reward by grouped average: Equation (3)\n", " probabilities = np.array([g['size'] for g in self.groups]) *\\\n", " [(self.alpha * (np.dot(current_state, g['average state'].T) / np.linalg.norm(current_state, 2))\\\n", " + (1 - self.alpha) * (np.dot(chosen_actions, g['average action'].T) / np.linalg.norm(chosen_actions, 2)))\n", " for g in self.groups]\n", " elif reward_type == 'grouped cosine':\n", " # Calculate simulated reward by grouped cosine: Equations (1) and (3)\n", " probabilities = [cosine_state_action(current_state, chosen_actions, g['average state'], g['average action'])\n", " for g in self.groups]\n", "\n", " # Normalize (sum to 1)\n", " probabilities = np.array(probabilities) / sum(probabilities)\n", "\n", " # Get most probable rewards\n", " if reward_type == 'normal':\n", " returned_rewards = self.embedded_data.iloc[np.argmax(probabilities)]['reward']\n", " elif reward_type in ['grouped average', 'grouped cosine']:\n", " returned_rewards = self.groups[np.argmax(probabilities)]['rewards']\n", "\n", " # Equation (4)\n", " def overall_reward(rewards, gamma):\n", " return np.sum([gamma**k * reward for k, reward in enumerate(rewards)])\n", "\n", " if reward_type in ['normal', 'grouped average']:\n", " # Get cumulated reward: Equation (4)\n", " cumulated_reward = overall_reward(returned_rewards, self.gamma)\n", " elif reward_type == 'grouped cosine':\n", " # Get probability weighted cumulated reward\n", " cumulated_reward = np.sum([p * overall_reward(g['rewards'], self.gamma)\n", " for p, g in zip(probabilities, self.groups)])\n", "\n", " return returned_rewards, cumulated_reward" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "sqaLWAyQ3Eep" }, "source": [ "### Actor `class`\n", 
"\n", "This is the policy approximator actor " ] }, { "cell_type": "code", "metadata": { "id": "o2oyBorPul4U" }, "source": [ "#collapse-hide\n", "class Actor():\n", " ''' Policy function approximator. '''\n", " \n", " def __init__(self, sess, state_space_size, action_space_size, batch_size, ra_length, history_length, embedding_size, tau, learning_rate, scope='actor'):\n", " self.sess = sess\n", " self.state_space_size = state_space_size\n", " self.action_space_size = action_space_size\n", " self.batch_size = batch_size\n", " self.ra_length = ra_length\n", " self.history_length = history_length\n", " self.embedding_size = embedding_size\n", " self.tau = tau\n", " self.learning_rate = learning_rate\n", " self.scope = scope\n", "\n", " with tf.variable_scope(self.scope):\n", " # Build Actor network\n", " self.action_weights, self.state, self.sequence_length = self._build_net('estimator_actor')\n", " self.network_params = tf.trainable_variables()\n", "\n", " # Build target Actor network\n", " self.target_action_weights, self.target_state, self.target_sequence_length = self._build_net('target_actor')\n", " self.target_network_params = tf.trainable_variables()[len(self.network_params):] # TODO: why sublist [len(x):]? Maybe because its equal to network_params + target_network_params\n", "\n", " # Initialize target network weights with network weights (θ^π′ ← θ^π)\n", " self.init_target_network_params = [self.target_network_params[i].assign(self.network_params[i])\n", " for i in range(len(self.target_network_params))]\n", " \n", " # Update target network weights (θ^π′ ← τθ^π + (1 − τ)θ^π′)\n", " self.update_target_network_params = [self.target_network_params[i].assign(\n", " tf.multiply(self.tau, self.network_params[i]) +\n", " tf.multiply(1 - self.tau, self.target_network_params[i]))\n", " for i in range(len(self.target_network_params))]\n", "\n", " # Gradient computation from Critic's action_gradients\n", " self.action_gradients = tf.placeholder(tf.float32, [None, self.action_space_size])\n", " gradients = tf.gradients(tf.reshape(self.action_weights, [self.batch_size, self.action_space_size], name='42222222222'),\n", " self.network_params,\n", " self.action_gradients)\n", " params_gradients = list(map(lambda x: tf.div(x, self.batch_size * self.action_space_size), gradients))\n", " \n", " # Compute ∇_a.Q(s, a|θ^µ).∇_θ^π.f_θ^π(s)\n", " self.optimizer = tf.train.AdamOptimizer(self.learning_rate).apply_gradients(\n", " zip(params_gradients, self.network_params))\n", "\n", " def _build_net(self, scope):\n", " ''' Build the (target) Actor network. 
'''\n", "\n", " def gather_last_output(data, seq_lens):\n", " def cli_value(x, v):\n", " y = tf.constant(v, shape=x.get_shape(), dtype=tf.int64)\n", " x = tf.cast(x, tf.int64)\n", " return tf.where(tf.greater(x, y), x, y)\n", "\n", " batch_range = tf.range(tf.cast(tf.shape(data)[0], dtype=tf.int64), dtype=tf.int64)\n", " tmp_end = tf.map_fn(lambda x: cli_value(x, 0), seq_lens - 1, dtype=tf.int64)\n", " indices = tf.stack([batch_range, tmp_end], axis=1)\n", " return tf.gather_nd(data, indices)\n", "\n", " with tf.variable_scope(scope):\n", " # Inputs: current state, sequence_length\n", " # Outputs: action weights to compute the score Equation (6)\n", " state = tf.placeholder(tf.float32, [None, self.state_space_size], 'state')\n", " state_ = tf.reshape(state, [-1, self.history_length, self.embedding_size])\n", " sequence_length = tf.placeholder(tf.int32, [None], 'sequence_length')\n", " cell = tf.nn.rnn_cell.GRUCell(self.embedding_size,\n", " activation=tf.nn.relu,\n", " kernel_initializer=tf.initializers.random_normal(),\n", " bias_initializer=tf.zeros_initializer())\n", " outputs, _ = tf.nn.dynamic_rnn(cell, state_, dtype=tf.float32, sequence_length=sequence_length)\n", " last_output = gather_last_output(outputs, sequence_length) # TODO: replace by h\n", " x = tf.keras.layers.Dense(self.ra_length * self.embedding_size)(last_output)\n", " action_weights = tf.reshape(x, [-1, self.ra_length, self.embedding_size])\n", "\n", " return action_weights, state, sequence_length\n", "\n", " def train(self, state, sequence_length, action_gradients):\n", " ''' Compute ∇_a.Q(s, a|θ^µ).∇_θ^π.f_θ^π(s). '''\n", " self.sess.run(self.optimizer,\n", " feed_dict={\n", " self.state: state,\n", " self.sequence_length: sequence_length,\n", " self.action_gradients: action_gradients})\n", "\n", " def predict(self, state, sequence_length):\n", " return self.sess.run(self.action_weights,\n", " feed_dict={\n", " self.state: state,\n", " self.sequence_length: sequence_length})\n", "\n", " def predict_target(self, state, sequence_length):\n", " return self.sess.run(self.target_action_weights,\n", " feed_dict={\n", " self.target_state: state,\n", " self.target_sequence_length: sequence_length})\n", "\n", " def init_target_network(self):\n", " self.sess.run(self.init_target_network_params)\n", "\n", " def update_target_network(self):\n", " self.sess.run(self.update_target_network_params)\n", " \n", " def get_recommendation_list(self, ra_length, noisy_state, embeddings, target=False):\n", " '''\n", " Algorithm 2\n", " Args:\n", " ra_length: length of the recommendation list.\n", " noisy_state: current/remembered environment state with noise.\n", " embeddings: Embeddings object.\n", " target: boolean to use Actor's network or target network.\n", " Returns:\n", " Recommendation List: list of embedded items as future actions.\n", " '''\n", "\n", " def get_score(weights, embedding, batch_size):\n", " '''\n", " Equation (6)\n", " Args:\n", " weights: w_t^k shape=(embedding_size,).\n", " embedding: e_i shape=(embedding_size,).\n", " Returns:\n", " score of the item i: score_i=w_t^k.e_i^T shape=(1,).\n", " '''\n", " ret = np.dot(weights, embedding.T)\n", " return ret\n", "\n", " batch_size = noisy_state.shape[0]\n", "\n", " # '1: Generate w_t = {w_t^1, ..., w_t^K} according to Equation (5)'\n", " method = self.predict_target if target else self.predict\n", " weights = method(noisy_state, [ra_length] * batch_size)\n", "\n", " # '3: Score items in I according to Equation (6)'\n", " scores = np.array([[[get_score(weights[i][k], 
embedding, batch_size)\n", " for embedding in embeddings.get_embedding_vector()]\n", " for k in range(ra_length)]\n", " for i in range(batch_size)])\n", "\n", " # '8: return a_t'\n", " return np.array([[embeddings.get_embedding(np.argmax(scores[i][k]))\n", " for k in range(ra_length)]\n", " for i in range(batch_size)])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7O8wgnzS3PIU" }, "source": [ "### Critic `class`\n", "\n", "This is the value approximator critic" ] }, { "cell_type": "code", "metadata": { "id": "wjl-YYbOup2V" }, "source": [ "#collapse-hide\n", "class Critic():\n", " ''' Value function approximator. '''\n", " \n", " def __init__(self, sess, state_space_size, action_space_size, history_length, embedding_size, tau, learning_rate, scope='critic'):\n", " self.sess = sess\n", " self.state_space_size = state_space_size\n", " self.action_space_size = action_space_size\n", " self.history_length = history_length\n", " self.embedding_size = embedding_size\n", " self.tau = tau\n", " self.learning_rate = learning_rate\n", " self.scope = scope\n", "\n", " with tf.variable_scope(self.scope):\n", " # Build Critic network\n", " self.critic_Q_value, self.state, self.action, self.sequence_length = self._build_net('estimator_critic')\n", " self.network_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='estimator_critic')\n", "\n", " # Build target Critic network\n", " self.target_Q_value, self.target_state, self.target_action, self.target_sequence_length = self._build_net('target_critic')\n", " self.target_network_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target_critic')\n", "\n", " # Initialize target network weights with network weights (θ^µ′ ← θ^µ)\n", " self.init_target_network_params = [self.target_network_params[i].assign(self.network_params[i])\n", " for i in range(len(self.target_network_params))]\n", "\n", " # Update target network weights (θ^µ′ ← τθ^µ + (1 − τ)θ^µ′)\n", " self.update_target_network_params = [self.target_network_params[i].assign(\n", " tf.multiply(self.tau, self.network_params[i]) +\n", " tf.multiply(1 - self.tau, self.target_network_params[i]))\n", " for i in range(len(self.target_network_params))]\n", "\n", " # Minimize MSE between Critic's and target Critic's outputed Q-values\n", " self.expected_reward = tf.placeholder(tf.float32, [None, 1])\n", " self.loss = tf.reduce_mean(tf.squared_difference(self.expected_reward, self.critic_Q_value))\n", " self.optimizer = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)\n", "\n", " # Compute ∇_a.Q(s, a|θ^µ)\n", " self.action_gradients = tf.gradients(self.critic_Q_value, self.action)\n", "\n", " def _build_net(self, scope):\n", " ''' Build the (target) Critic network. 
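\n", "\n",
" The state is encoded with a GRU, its last relevant output is concatenated with\n",
" the flattened action, and a small dense head (32 -> 16 -> 1) outputs the Q-value.\n",
" 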
'''\n", "\n", " def gather_last_output(data, seq_lens):\n", " def cli_value(x, v):\n", " y = tf.constant(v, shape=x.get_shape(), dtype=tf.int64)\n", " return tf.where(tf.greater(x, y), x, y)\n", "\n", " this_range = tf.range(tf.cast(tf.shape(seq_lens)[0], dtype=tf.int64), dtype=tf.int64)\n", " tmp_end = tf.map_fn(lambda x: cli_value(x, 0), seq_lens - 1, dtype=tf.int64)\n", " indices = tf.stack([this_range, tmp_end], axis=1)\n", " return tf.gather_nd(data, indices)\n", "\n", " with tf.variable_scope(scope):\n", " # Inputs: current state, current action\n", " # Outputs: predicted Q-value\n", " state = tf.placeholder(tf.float32, [None, self.state_space_size], 'state')\n", " state_ = tf.reshape(state, [-1, self.history_length, self.embedding_size])\n", " action = tf.placeholder(tf.float32, [None, self.action_space_size], 'action')\n", " sequence_length = tf.placeholder(tf.int64, [None], name='critic_sequence_length')\n", " cell = tf.nn.rnn_cell.GRUCell(self.history_length,\n", " activation=tf.nn.relu,\n", " kernel_initializer=tf.initializers.random_normal(),\n", " bias_initializer=tf.zeros_initializer())\n", " predicted_state, _ = tf.nn.dynamic_rnn(cell, state_, dtype=tf.float32, sequence_length=sequence_length)\n", " predicted_state = gather_last_output(predicted_state, sequence_length)\n", "\n", " inputs = tf.concat([predicted_state, action], axis=-1)\n", " layer1 = tf.layers.Dense(32, activation=tf.nn.relu)(inputs)\n", " layer2 = tf.layers.Dense(16, activation=tf.nn.relu)(layer1)\n", " critic_Q_value = tf.layers.Dense(1)(layer2)\n", " return critic_Q_value, state, action, sequence_length\n", "\n", " def train(self, state, action, sequence_length, expected_reward):\n", " ''' Minimize MSE between expected reward and target Critic's Q-value. '''\n", " return self.sess.run([self.critic_Q_value, self.loss, self.optimizer],\n", " feed_dict={\n", " self.state: state,\n", " self.action: action,\n", " self.sequence_length: sequence_length,\n", " self.expected_reward: expected_reward})\n", "\n", " def predict(self, state, action, sequence_length):\n", " ''' Returns Critic's predicted Q-value. '''\n", " return self.sess.run(self.critic_Q_value,\n", " feed_dict={\n", " self.state: state,\n", " self.action: action,\n", " self.sequence_length: sequence_length})\n", "\n", " def predict_target(self, state, action, sequence_length):\n", " ''' Returns target Critic's predicted Q-value. '''\n", " return self.sess.run(self.target_Q_value,\n", " feed_dict={\n", " self.target_state: state,\n", " self.target_action: action,\n", " self.target_sequence_length: sequence_length})\n", "\n", " def get_action_gradients(self, state, action, sequence_length):\n", " ''' Returns ∇_a.Q(s, a|θ^µ). '''\n", " return np.array(self.sess.run(self.action_gradients,\n", " feed_dict={\n", " self.state: state,\n", " self.action: action,\n", " self.sequence_length: sequence_length})[0])\n", "\n", " def init_target_network(self):\n", " self.sess.run(self.init_target_network_params)\n", "\n", " def update_target_network(self):\n", " self.sess.run(self.update_target_network_params)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "EpWire7m3cnB" }, "source": [ "### ReplayMemory `class`" ] }, { "cell_type": "code", "metadata": { "id": "N9b_is_Su_C-" }, "source": [ "#collapse-hide\n", "class ReplayMemory():\n", " ''' Replay memory D. 
'''\n", " \n", " def __init__(self, buffer_size):\n", " self.buffer_size = buffer_size\n", " # self.buffer = [[row['state'], row['action'], row['reward'], row['n_state']] for _, row in data.iterrows()][-self.buffer_size:] TODO: empty or not?\n", " self.buffer = []\n", "\n", " def add(self, state, action, reward, n_state):\n", " self.buffer.append([state, action, reward, n_state])\n", " if len(self.buffer) > self.buffer_size:\n", " self.buffer.pop(0)\n", "\n", " def size(self):\n", " return len(self.buffer)\n", "\n", " def sample_batch(self, batch_size):\n", " return random.sample(self.buffer, batch_size)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5cOXOxdV3lde" }, "source": [ "### experience_replay `function`" ] }, { "cell_type": "code", "metadata": { "id": "9foofY8zvHYE" }, "source": [ "#collapse-hide\n", "def experience_replay(replay_memory, batch_size, actor, critic, embeddings, ra_length, state_space_size, action_space_size, discount_factor):\n", " '''\n", " Experience replay.\n", " Args:\n", " replay_memory: replay memory D in article.\n", " batch_size: sample size.\n", " actor: Actor network.\n", " critic: Critic network.\n", " embeddings: Embeddings object.\n", " state_space_size: dimension of states.\n", " action_space_size: dimensions of actions.\n", " Returns:\n", " Best Q-value, loss of Critic network for printing/recording purpose.\n", " '''\n", "\n", " # '22: Sample minibatch of N transitions (s, a, r, s′) from D'\n", " samples = replay_memory.sample_batch(batch_size)\n", " states = np.array([s[0] for s in samples])\n", " actions = np.array([s[1] for s in samples])\n", " rewards = np.array([s[2] for s in samples])\n", " n_states = np.array([s[3] for s in samples]).reshape(-1, state_space_size)\n", "\n", " # '23: Generate a′ by target Actor network according to Algorithm 2'\n", " n_actions = actor.get_recommendation_list(ra_length, states, embeddings, target=True).reshape(-1, action_space_size)\n", "\n", " # Calculate predicted Q′(s′, a′|θ^µ′) value\n", " target_Q_value = critic.predict_target(n_states, n_actions, [ra_length] * batch_size)\n", "\n", " # '24: Set y = r + γQ′(s′, a′|θ^µ′)'\n", " expected_rewards = rewards + discount_factor * target_Q_value\n", " \n", " # '25: Update Critic by minimizing (y − Q(s, a|θ^µ))²'\n", " critic_Q_value, critic_loss, _ = critic.train(states, actions, [ra_length] * batch_size, expected_rewards)\n", " \n", " # '26: Update the Actor using the sampled policy gradient'\n", " action_gradients = critic.get_action_gradients(states, n_actions, [ra_length] * batch_size)\n", " actor.train(states, [ra_length] * batch_size, action_gradients)\n", "\n", " # '27: Update the Critic target networks'\n", " critic.update_target_network()\n", "\n", " # '28: Update the Actor target network'\n", " actor.update_target_network()\n", "\n", " return np.amax(critic_Q_value), critic_loss" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "8uRr4bJg36El" }, "source": [ "### OrnsteinUhlenbeckNoise `class`" ] }, { "cell_type": "code", "metadata": { "id": "-VYMfzSJvNHI" }, "source": [ "#collapse-hide\n", "class OrnsteinUhlenbeckNoise:\n", " ''' Noise for Actor predictions. 
'''\n", " def __init__(self, action_space_size, mu=0, theta=0.5, sigma=0.2):\n", " self.action_space_size = action_space_size\n", " self.mu = mu\n", " self.theta = theta\n", " self.sigma = sigma\n", " self.state = np.ones(self.action_space_size) * self.mu\n", "\n", " def get(self):\n", " self.state += self.theta * (self.mu - self.state) + self.sigma * np.random.rand(self.action_space_size)\n", " return self.state\n", "\n", "def train(sess, environment, actor, critic, embeddings, history_length, ra_length, buffer_size, batch_size, discount_factor, nb_episodes, filename_summary):\n", " ''' Algorithm 3 in article. '''\n", "\n", " # Set up summary operators\n", " def build_summaries():\n", " episode_reward = tf.Variable(0.)\n", " tf.summary.scalar('reward', episode_reward)\n", " episode_max_Q = tf.Variable(0.)\n", " tf.summary.scalar('max_Q_value', episode_max_Q)\n", " critic_loss = tf.Variable(0.)\n", " tf.summary.scalar('critic_loss', critic_loss)\n", "\n", " summary_vars = [episode_reward, episode_max_Q, critic_loss]\n", " summary_ops = tf.summary.merge_all()\n", " return summary_ops, summary_vars\n", "\n", " summary_ops, summary_vars = build_summaries()\n", " sess.run(tf.global_variables_initializer())\n", " writer = tf.summary.FileWriter(filename_summary, sess.graph)\n", "\n", " # '2: Initialize target network f′ and Q′'\n", " actor.init_target_network()\n", " critic.init_target_network()\n", "\n", " # '3: Initialize the capacity of replay memory D'\n", " replay_memory = ReplayMemory(buffer_size) # Memory D in article\n", " replay = False\n", "\n", "\n", " start_time = time.time()\n", " for i_session in range(nb_episodes): # '4: for session = 1, M do'\n", " session_reward = 0\n", " session_Q_value = 0\n", " session_critic_loss = 0\n", "\n", " # '5: Reset the item space I' is useless because unchanged.\n", "\n", " states = environment.reset() # '6: Initialize state s_0 from previous sessions'\n", " \n", " if (i_session + 1) % 10 == 0: # Update average parameters every 10 episodes\n", " environment.groups = environment.get_groups()\n", " \n", " exploration_noise = OrnsteinUhlenbeckNoise(history_length * embeddings.size())\n", "\n", " for t in range(nb_rounds): # '7: for t = 1, T do'\n", " # '8: Stage 1: Transition Generating Stage'\n", "\n", " # '9: Select an action a_t = {a_t^1, ..., a_t^K} according to Algorithm 2'\n", " actions = actor.get_recommendation_list(\n", " ra_length,\n", " states.reshape(1, -1), # TODO + exploration_noise.get().reshape(1, -1),\n", " embeddings).reshape(ra_length, embeddings.size())\n", "\n", " # '10: Execute action a_t and observe the reward list {r_t^1, ..., r_t^K} for each item in a_t'\n", " rewards, next_states = environment.step(actions)\n", "\n", " # '19: Store transition (s_t, a_t, r_t, s_t+1) in D'\n", " replay_memory.add(states.reshape(history_length * embeddings.size()),\n", " actions.reshape(ra_length * embeddings.size()),\n", " [rewards],\n", " next_states.reshape(history_length * embeddings.size()))\n", "\n", " states = next_states # '20: Set s_t = s_t+1'\n", "\n", " session_reward += rewards\n", " \n", " # '21: Stage 2: Parameter Updating Stage'\n", " if replay_memory.size() >= batch_size: # Experience replay\n", " replay = True\n", " replay_Q_value, critic_loss = experience_replay(replay_memory, batch_size,\n", " actor, critic, embeddings, ra_length, history_length * embeddings.size(),\n", " ra_length * embeddings.size(), discount_factor)\n", " session_Q_value += replay_Q_value\n", " session_critic_loss += critic_loss\n", "\n", " summary_str = 
sess.run(summary_ops,\n", " feed_dict={summary_vars[0]: session_reward,\n", " summary_vars[1]: session_Q_value,\n", " summary_vars[2]: session_critic_loss})\n", " \n", " writer.add_summary(summary_str, i_session)\n", "\n", " '''\n", " print(state_to_items(embeddings.embed(data['state'][0]), actor, ra_length, embeddings),\n", " state_to_items(embeddings.embed(data['state'][0]), actor, ra_length, embeddings, True))\n", " '''\n", "\n", " str_loss = str('Loss=%0.4f' % session_critic_loss)\n", " print(('Episode %d/%d Reward=%d Time=%ds ' + (str_loss if replay else 'No replay')) % (i_session + 1, nb_episodes, session_reward, time.time() - start_time))\n", " start_time = time.time()\n", "\n", " writer.close()\n", " tf.train.Saver().save(sess, 'models.h5', write_meta_graph=False)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "J8v1uNNs3-eU" }, "source": [ "### Hyperparameters" ] }, { "cell_type": "code", "metadata": { "id": "2YWoY788vTvE" }, "source": [ "# Hyperparameters\n", "history_length = 12 # N in article\n", "ra_length = 4 # K in article\n", "discount_factor = 0.99 # Gamma in Bellman equation\n", "actor_lr = 0.0001\n", "critic_lr = 0.001\n", "tau = 0.001 # τ in Algorithm 3\n", "batch_size = 64\n", "nb_episodes = 100\n", "nb_rounds = 50\n", "filename_summary = 'summary.txt'\n", "alpha = 0.5 # α (alpha) in Equation (1)\n", "gamma = 0.9 # Γ (Gamma) in Equation (4)\n", "buffer_size = 1000000 # Size of replay memory D in article\n", "fixed_length = True # Fixed memory length" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "UfjpsYgE4Ait" }, "source": [ "### Data generation" ] }, { "cell_type": "code", "metadata": { "id": "ZboYOTAmyPBg" }, "source": [ "dg = DataGenerator('ml-100k/u.data', 'ml-100k/u.item')\n", "dg.gen_train_test(0.8, seed=42)\n", "\n", "dg.write_csv('train.csv', dg.train, nb_states=[history_length], nb_actions=[ra_length])\n", "dg.write_csv('test.csv', dg.test, nb_states=[history_length], nb_actions=[ra_length])\n", "\n", "data = read_file('train.csv')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "vLfbBCivyTCf", "outputId": "57164d2a-3d21-44f1-e419-106619690c4c" }, "source": [ "data.head()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
| \n", " | state | \n", "n_state | \n", "action | \n", "reward | \n", "
|---|---|---|---|---|
| 0 | \n", "[732, 257, 507, 602, 481, 568, 1286, 50, 501, ... | \n", "[732, 257, 507, 602, 481, 568, 1286, 50, 501, ... | \n", "[731, 525, 80, 88] | \n", "(3, 4, 3, 3) | \n", "
| 1 | \n", "[1226, 855, 339, 124, 16, 147, 59, 827, 323, 2... | \n", "[1226, 855, 339, 124, 16, 147, 59, 827, 323, 2... | \n", "[52, 1005, 347, 70] | \n", "(4, 5, 4, 3) | \n", "
| 2 | \n", "[316, 286, 313, 748, 258, 272, 300, 302, 347, ... | \n", "[316, 286, 313, 748, 258, 272, 300, 302, 347, ... | \n", "[751, 271, 689, 289] | \n", "(4, 4, 4, 5) | \n", "
| 3 | \n", "[235, 433, 96, 117, 429, 7, 471, 201, 276, 55,... | \n", "[235, 433, 96, 117, 429, 7, 471, 201, 276, 55,... | \n", "[31, 198, 724, 654] | \n", "(3, 5, 3, 4) | \n", "
| 4 | \n", "[77, 241, 98, 423, 71, 157, 955, 186, 121, 421... | \n", "[77, 241, 98, 423, 71, 157, 955, 186, 121, 421... | \n", "[316, 427, 313, 959] | \n", "(4, 5, 4, 5) | \n", "