{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "9aiFD8Z_U9h8"
},
"source": [
"# Recurrent neural networks\n",
"\n",
"The goal is to learn to use LSTM layers in keras for sentiment analysis and time series prediction. The code for sentiment analysis is adapted from . The code for time series prediction is adapted from ."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "vMCNVPC3U9h_"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q09VlxB2U9iA"
},
"source": [
"## Sentiment analysis\n",
"\n",
"The goal to use recurrent neural networks (LSTM) to perform **sentiment analysis** on short sentences, i.e. to predict whether the sentence has a positive or negative meaning.\n",
"\n",
"The following cells represent your training and test data. They are lists of lists, where the first element is the sentence as a string, and the second a boolean, with `True` for positive sentences, `False` for negative ones.\n",
"\n",
"Notice how some sentences are ambiguous (if you do not notice the \"not\", the sentiment might be very different)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "g9e62tfqU9iB"
},
"outputs": [],
"source": [
"train_data = [\n",
" ['good', True],\n",
" ['bad', False],\n",
" ['happy', True],\n",
" ['sad', False],\n",
" ['not good', False],\n",
" ['not bad', True],\n",
" ['not happy', False],\n",
" ['not sad', True],\n",
" ['very good', True],\n",
" ['very bad', False],\n",
" ['very happy', True],\n",
" ['very sad', False],\n",
" ['i am happy', True],\n",
" ['this is good', True],\n",
" ['i am bad', False],\n",
" ['this is bad', False],\n",
" ['i am sad', False],\n",
" ['this is sad', False],\n",
" ['i am not happy', False],\n",
" ['this is not good', False],\n",
" ['i am not bad', True],\n",
" ['this is not sad', True],\n",
" ['i am very happy', True],\n",
" ['this is very good', True],\n",
" ['i am very bad', False],\n",
" ['this is very sad', False],\n",
" ['this is very happy', True],\n",
" ['i am good not bad', True],\n",
" ['this is good not bad', True],\n",
" ['i am bad not good', False],\n",
" ['i am good and happy', True],\n",
" ['this is not good and not happy', False],\n",
" ['i am not at all good', False],\n",
" ['i am not at all bad', True],\n",
" ['i am not at all happy', False],\n",
" ['this is not at all sad', True],\n",
" ['this is not at all happy', False],\n",
" ['i am good right now', True],\n",
" ['i am bad right now', False],\n",
" ['this is bad right now', False],\n",
" ['i am sad right now', False],\n",
" ['i was good earlier', True],\n",
" ['i was happy earlier', True],\n",
" ['i was bad earlier', False],\n",
" ['i was sad earlier', False],\n",
" ['i am very bad right now', False],\n",
" ['this is very good right now', True],\n",
" ['this is very sad right now', False],\n",
" ['this was bad earlier', False],\n",
" ['this was very good earlier', True],\n",
" ['this was very bad earlier', False],\n",
" ['this was very happy earlier', True],\n",
" ['this was very sad earlier', False],\n",
" ['i was good and not bad earlier', True],\n",
" ['i was not good and not happy earlier', False],\n",
" ['i am not at all bad or sad right now', True],\n",
" ['i am not at all good or happy right now', False],\n",
" ['this was not happy and not good earlier', False],\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "55E2a6QzU9iD"
},
"outputs": [],
"source": [
"test_data = [\n",
" ['this is happy', True],\n",
" ['i am good', True],\n",
" ['this is not happy', False],\n",
" ['i am not good', False],\n",
" ['this is not bad', True],\n",
" ['i am not sad', True],\n",
" ['i am very good', True],\n",
" ['this is very bad', False],\n",
" ['i am very sad', False],\n",
" ['this is bad not good', False],\n",
" ['this is good and happy', True],\n",
" ['i am not good and not happy', False],\n",
" ['i am not at all sad', True],\n",
" ['this is not at all good', False],\n",
" ['this is not at all bad', True],\n",
" ['this is good right now', True],\n",
" ['this is sad right now', False],\n",
" ['this is very bad right now', False],\n",
" ['this was good earlier', True],\n",
" ['i was not happy and not good earlier', False],\n",
" ['earlier i was good and not bad', True],\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AdJBGX0rU9iE",
"outputId": "1dc753ca-ade2-4723-ae3b-e2cd75bff941"
},
"outputs": [],
"source": [
"N_train = len(train_data)\n",
"N_test = len(test_data)\n",
"print(N_train, \"training sentences.\")\n",
"print(N_test, \"test sentences.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KjWNlqg_U9iF"
},
"source": [
"### Data preparation\n",
"\n",
"The most boring part when training LSTMs on text is to prepare the data correctly. Sentences are sequences of words (possibly a huge number of words), with a variable length (some sentences are shorter than others).\n",
"\n",
"What neural networks expect as input is a fixed-length sequence of numerical vectors $\\{\\mathbf{x}_t\\}_{t=0}^T$, i.e. they must have a fixed size. So we need to transform each sentence into this format.\n",
"\n",
"The first thing to do is to identify the vocabulary, i.e. the **unique** words in the training set (fortunately, the test set uses the same exact words) as well as the maximal number of words in each sentence (again, the test set does not have longer sentences).\n",
"\n",
"**Q:** Create a list `vocabulary` of unique words in the training set and compute the maximal length `nb_words` of a sentence.\n",
"\n",
"To extract the words in each sentence, the `split()` method of Python strings might come handy:\n",
"\n",
"```python\n",
"sentence = \"I fear this exercise will be difficult\"\n",
"print(sentence.split(\" \"))\n",
"```\n",
"\n",
"You will also find the `set` Python object useful to identify unique works. Check the doc. But there are many ways to do that (for loops), just do it the way you prefer."
]
},
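{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach, sketched here on a hypothetical three-sentence list `toy_data` rather than on the real training set, is to collect the words of all sentences in a `set` while tracking the longest sentence. The names `toy_data`, `toy_vocabulary` and `max_len` are made up for this illustration.\n",
"\n",
"```python\n",
"# Hypothetical toy data in the same [sentence, label] format as train_data\n",
"toy_data = [\n",
"    ['good', True],\n",
"    ['not good', False],\n",
"    ['this is not bad', True],\n",
"]\n",
"\n",
"words = set()  # unique words seen so far\n",
"max_len = 0    # length of the longest sentence, in words\n",
"for sentence, label in toy_data:\n",
"    tokens = sentence.split(' ')\n",
"    words.update(tokens)\n",
"    max_len = max(max_len, len(tokens))\n",
"\n",
"# A set has no defined order, so sort it to get a reproducible list\n",
"toy_vocabulary = sorted(words)\n",
"print(toy_vocabulary, max_len)\n",
"```\n",
"\n",
"Sorting is only one way to fix the word order; any order works as long as the same one is used for the training and test sets."
]
},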
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LhX_HdPSU9iG",
"outputId": "0f8ec1f4-7f7e-43e9-ecef-d8a07e822e7d"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2nBAOO0UU9iH"
},
"source": [
"Now that we have found our list of 18 unique words, we need to able to perform **one-hot encoding** of each word, i.e. write a method `def one_hot_encoding(word, vocabulary)` that takes a word (e.g. \"good\") and the vocabulary, and returns a vector of size 18, with mostly zeros, except for a `1.0` at the location of the word in the vocabulary.\n",
"\n",
"For example, if your vocabulary is `[\"I\", \"love\", \"you\"]`, the one-hot encoding of \"I\" should be `np.array([1., 0., 0.])`, the one of \"love\" is `np.array([0., 1., 0.])`, etc.\n",
"\n",
"**Q:** Implement the `one_hot_encoding()` method for single words.\n",
"\n",
"*Hint:* you might find the method `index()` of list objects interesting."
]
},
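{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a method, assuming the vocabulary is a plain Python list and using the three-word example vocabulary from above (stored in the made-up variable `toy_vocabulary`):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def one_hot_encoding(word, vocabulary):\n",
"    # Vector of zeros with a single 1.0 at the position of the word\n",
"    r = np.zeros(len(vocabulary))\n",
"    r[vocabulary.index(word)] = 1.0\n",
"    return r\n",
"\n",
"toy_vocabulary = ['I', 'love', 'you']\n",
"print(one_hot_encoding('love', toy_vocabulary))\n",
"```\n",
"\n",
"Note that `index()` raises a `ValueError` for a word that is not in the vocabulary. This is acceptable here because the test set only uses words from the training set."
]
},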
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qzWPJV_AU9iI"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xlZAe7XyU9iJ"
},
"source": [
"**Q:** You can now create the training set `X_train, T_train` and the test set `X_test, T_test`.\n",
"\n",
"The training input data `X_train` should be a numpy array with 3 dimensions:\n",
"\n",
"```python\n",
" X_train = np.zeros((N_train, nb_words, len(vocabulary)))\n",
"```\n",
"\n",
"The first index corresponds to each sentence. The second index represents the index of each word in the sentence (maximally `nb_words=10`). The third index is for the one-hot encoding (18 elements).\n",
"\n",
"**Beware:** most sentences are shorter than `nb_words=10`. In that case, the words should be set **at the end of the sequence**, i.e. you prepend zero vectors. \n",
"\n",
"For example, \"I love you\" should be encoded as:\n",
"\n",
"```python\n",
"\"\", \"\", \"\", \"\", \"\", \"\", \"\", \"I\", \"love\", \"you\"\n",
"```\n",
"\n",
"not as:\n",
"\n",
"```python\n",
"\"I\", \"love\", \"you\", \"\", \"\", \"\", \"\", \"\", \"\", \"\"\n",
"```\n",
"\n",
"The reason for that is that the LSTM will get the words one by one and only respond \"positive\" or \"negative\" after the last word has been seen. If the words are provided at the beginning of the sequence, vanishing gradients might delete them.\n",
"\n",
"The same holds for the test set, it only has less sentences."
]
},
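{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pre-padding scheme described above can be sketched as follows. The vocabulary, the sentences and the maximal length of 5 are hypothetical toy values, chosen only to keep the arrays small enough to inspect by eye.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"toy_vocabulary = ['I', 'love', 'you']   # hypothetical vocabulary\n",
"toy_sentences = ['I love you', 'love']  # hypothetical sentences\n",
"toy_nb_words = 5                        # hypothetical maximal length\n",
"\n",
"X = np.zeros((len(toy_sentences), toy_nb_words, len(toy_vocabulary)))\n",
"for i, sentence in enumerate(toy_sentences):\n",
"    tokens = sentence.split(' ')\n",
"    offset = toy_nb_words - len(tokens)  # number of leading zero vectors\n",
"    for j, word in enumerate(tokens):\n",
"        # Set the one-hot entry of this word at its shifted time step\n",
"        X[i, offset + j, toy_vocabulary.index(word)] = 1.0\n",
"\n",
"print(X[0])\n",
"```\n",
"\n",
"The first rows of `X[0]` stay at zero and the last three rows hold the one-hot vectors of the three words, so the last word always sits at the final time step."
]
},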
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dTNX0OROU9iK"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IgQFJr8jU9iM"
},
"source": [
"### Training the LSTM\n",
"\n",
"Now we just have to provide the data to a recurrent network. The problem is not very complicated, so we will need a single LSTM layer, followed by a single output neuron (with the logistic transfer function) whose role is to output 1 for the positive class, 0 for the negative one.\n",
"\n",
"**Q:** Check the documentation for the LSTM layer of `keras`: . It has many parameters:\n",
"\n",
"```python\n",
"tf.keras.layers.LSTM(\n",
" units, \n",
" activation='tanh', \n",
" recurrent_activation='sigmoid', \n",
" use_bias=True, \n",
" kernel_initializer='glorot_uniform', \n",
" recurrent_initializer='orthogonal', \n",
" bias_initializer='zeros', \n",
" unit_forget_bias=True, \n",
" kernel_regularizer=None, \n",
" recurrent_regularizer=None, bias_regularizer=None, \n",
" activity_regularizer=None, kernel_constraint=None, \n",
" recurrent_constraint=None, bias_constraint=None, \n",
" dropout=0.0, recurrent_dropout=0.0, \n",
" implementation=2, \n",
" return_sequences=False, return_state=False, \n",
" go_backwards=False, stateful=False, unroll=False)\n",
"```\n",
"\n",
"The default value for the parameters is the vanilla LSTM seen in the lectures, but you have the possibility to change the activation functions for the inputs and outputs (not the gates, it must be a sigmoid!), initialize the weights differently, add regularization or dropout, use biases or not, etc. That's a lot to play with. For this exercise, stick to the default parameters at the beginning. The only thing you need to define is the number of neurons `units` of the layer. \n",
"\n",
"```python\n",
"tf.keras.layers.LSTM(units=N)\n",
"```\n",
"\n",
"Note that an important parameter is `return_sequences`. When set to False (the default), the LSTM layer will process the complete sequence of 10 word vectors, and output a single vector of $N$ values (the number of units). When set to True, the layer would return a sequence of 10 vectors of size $N$.\n",
"\n",
"Here, we only want the LSTM layer to encode the sentence and feed a single vector to the output layer, so we can leave it to False. If we wanted to stack two LSTM layers on top of each other, we would need to set `return_sequences` to True for the first layer and False for the second one (you can try that later):\n",
"\n",
"```python\n",
"tf.keras.layers.LSTM(N, return_sequences=True)\n",
"tf.keras.layers.LSTM(M, return_sequences=False)\n",
"```\n",
"\n",
"**Q:** Create a model with one LSTM layer (with enough units) and one output layer with one neuron (`'sigmoid'` activation function). Choose an optimizer (SGD, RMSprop, Adam, etc) and a good learning rate. When compiling the model, use the `'binary_crossentropy'` loss function as it is a binary classification.\n",
"\n",
"The input layer of the network must take a `(nb_words, len(vocabulary))` matrix as input, i.e. (window, nb_features).\n",
"\n",
"```python\n",
"tf.keras.layers.Input((nb_words, len(vocabulary)))\n",
"```\n",
"\n",
"When training the model with `model.fit()`, you can pass the test set as validation data, as we do not have too many examples:\n",
"\n",
"```python\n",
"model.fit(X_train, T_train, validation_data=(X_test, T_test), ...)\n",
"```\n",
"\n",
"Train the model for enough epochs, using a batch size big enough but not too big. In other terms: do the hyperparameter search yourself ;). \n",
" "
]
},
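{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a starting point, the architecture described above could be assembled as in the sketch below. The 20 units, the Adam optimizer and the learning rate of 0.01 are arbitrary illustrative choices, not tuned or recommended values, and the input sizes are the ones stated earlier (10 words, 18 features).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"nb_words, nb_features = 10, 18  # sizes stated in the text above\n",
"\n",
"model = tf.keras.Sequential([\n",
"    tf.keras.layers.Input((nb_words, nb_features)),\n",
"    tf.keras.layers.LSTM(20),  # 20 units: arbitrary choice\n",
"    tf.keras.layers.Dense(1, activation='sigmoid'),\n",
"])\n",
"model.compile(\n",
"    loss='binary_crossentropy',\n",
"    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),  # arbitrary choice\n",
"    metrics=['accuracy'],\n",
")\n",
"model.summary()\n",
"\n",
"# Sanity check on dummy inputs: one output value per sentence\n",
"dummy = np.zeros((2, nb_words, nb_features))\n",
"print(model.predict(dummy, verbose=0).shape)\n",
"```\n",
"\n",
"Because `return_sequences` keeps its default value `False`, the LSTM layer passes a single vector per sentence to the output neuron. The number of epochs and the batch size given to `model.fit()` are still yours to choose."
]
},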
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wyUSAGVPU9iO"
},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XH9l639TQyJE"
},
"source": [
"**Q.** Once you have been able to successfully train the network, vary the different parts of the model to understand their influence: learning rate, number of units, optimizer, etc. Add another LSTM layer to see what happens. Exchange the LSTM layer with the GRU layer."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jpdgHD7PU9iR"
},
"source": [
"## Time series prediction\n",
"\n",
"Another useful function of RNNs is forecasting, i.e. predicting the rest of a sequence (financial markets, weather, etc.) based on its history.\n",
"\n",
"Let's generate a dummy one-dimensional signal with 10000 points:\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 374
},
"id": "-jEpsMPyU9iR",
"outputId": "c882753f-9b17-45a8-c91b-b6f226e76122"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"