{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "colab": { "name": "WordVectorsTuning.ipynb", "provenance": [] }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "6z77Evxmaoby", "colab_type": "text" }, "source": [ "# Instructions\n", "1. Go to https://colab.research.google.com and choose the \\\"Upload\\\" option to upload this notebook file.\n", "1. In the Edit menu, choose \\\"Notebook Settings\\\" and then set the \\\"Hardware Accelerator\\\" dropdown to GPU.\n", "1. Read through the code in the following sections:\n", " * [IMDB Dataset](#scrollTo=NPa7eLiaaof0)\n", " * [Define Model](#scrollTo=ihsQ5xEoaog6)\n", " * [Train Model](#scrollTo=OlXYR7KNaohE)\n", " * [Assess Model](#scrollTo=LkS3AAQraohK)\n", "1. Complete at least one of these exercises. Remember to keep notes about what you do!\n", " * [Exercise Option #1 - Standard Difficulty](#scrollTo=VU4-GCUxaohS)\n", " * [Exercise Option #2 - Advanced Difficulty](#scrollTo=VU4-GCUxaohS)" ] }, { "cell_type": "markdown", "metadata": { "id": "fCSZ2tLNaocV", "colab_type": "text" }, "source": [ "## Documentation/Sources\n", "* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general\n", "* _Blog post has been removed_ [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) for information about using it to create embedding layers for neural networks.\n", "* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras\n", "* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax they use for the model layers is different than most other tutorials).\n", "* [https://keras.io/](https://keras.io/) Keras API documentation" ] }, { "cell_type": "code", "metadata": { "id": "sSArlh7obXaG", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "00dfdeb3-9e7d-4d44-8188-c7e6c1df0ac9" }, "source": [ "# upgrade tensorflow to tensorflow 2\n", "%tensorflow_version 2.x\n", "# display matplotlib plots\n", "%matplotlib inline" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "TensorFlow 2.x selected.\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "NPa7eLiaaof0", "colab_type": "text" }, "source": [ "# IMDB Dataset\n", "The [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) consists of movie reviews that have been marked as positive or negative. 
(There is also a built-in dataset of [Reuters newswires](https://keras.io/datasets/#reuters-newswire-topics-classification) that have been classified by topic.)" ] }, { "cell_type": "code", "metadata": { "id": "aQR2EqyHaof1", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "outputId": "eddb6a85-9735-4f59-e77b-d4a0c6005546" }, "source": [ "from keras.datasets import imdb\n", "(x_train, y_train), (x_test, y_test) = imdb.load_data()" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz\n", "17465344/17464789 [==============================] - 0s 0us/step\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "IMyq3IcQaof-", "colab_type": "text" }, "source": [ "It looks like our labels consist of 0 or 1, which makes sense for positive and negative." ] }, { "cell_type": "code", "metadata": { "id": "2JxnVYvqaogF", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "outputId": "22a4e2e5-f60a-4d40-84e5-6a22a255226f" }, "source": [ "print(y_train[0:9])\n", "print(max(y_train))\n", "print(min(y_train))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[1 0 0 1 0 0 1 0 1]\n", "1\n", "0\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "2mr-2B8BaogJ", "colab_type": "text" }, "source": [ "But x is a bit more trouble. The words have already been converted to numbers -- numbers that have nothing to do with the word embeddings we spent time learning!" ] }, { "cell_type": "code", "metadata": { "id": "w9aIWApiaogL", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "10f3483b-bed8-46ac-e0bb-ba98ac881b4e" }, "source": [ "x_train[0]" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[1,\n", " 14,\n", " 22,\n", " 16,\n", " 43,\n", " 530,\n", " 973,\n", " 1622,\n", " 1385,\n", " 65,\n", " 458,\n", " 4468,\n", " 66,\n", " 3941,\n", " 4,\n", " 173,\n", " 36,\n", " 256,\n", " 5,\n", " 25,\n", " 100,\n", " 43,\n", " 838,\n", " 112,\n", " 50,\n", " 670,\n", " 22665,\n", " 9,\n", " 35,\n", " 480,\n", " 284,\n", " 5,\n", " 150,\n", " 4,\n", " 172,\n", " 112,\n", " 167,\n", " 21631,\n", " 336,\n", " 385,\n", " 39,\n", " 4,\n", " 172,\n", " 4536,\n", " 1111,\n", " 17,\n", " 546,\n", " 38,\n", " 13,\n", " 447,\n", " 4,\n", " 192,\n", " 50,\n", " 16,\n", " 6,\n", " 147,\n", " 2025,\n", " 19,\n", " 14,\n", " 22,\n", " 4,\n", " 1920,\n", " 4613,\n", " 469,\n", " 4,\n", " 22,\n", " 71,\n", " 87,\n", " 12,\n", " 16,\n", " 43,\n", " 530,\n", " 38,\n", " 76,\n", " 15,\n", " 13,\n", " 1247,\n", " 4,\n", " 22,\n", " 17,\n", " 515,\n", " 17,\n", " 12,\n", " 16,\n", " 626,\n", " 18,\n", " 19193,\n", " 5,\n", " 62,\n", " 386,\n", " 12,\n", " 8,\n", " 316,\n", " 8,\n", " 106,\n", " 5,\n", " 4,\n", " 2223,\n", " 5244,\n", " 16,\n", " 480,\n", " 66,\n", " 3785,\n", " 33,\n", " 4,\n", " 130,\n", " 12,\n", " 16,\n", " 38,\n", " 619,\n", " 5,\n", " 25,\n", " 124,\n", " 51,\n", " 36,\n", " 135,\n", " 48,\n", " 25,\n", " 1415,\n", " 33,\n", " 6,\n", " 22,\n", " 12,\n", " 215,\n", " 28,\n", " 77,\n", " 52,\n", " 5,\n", " 14,\n", " 407,\n", " 16,\n", " 82,\n", " 10311,\n", " 8,\n", " 4,\n", " 107,\n", " 117,\n", " 5952,\n", " 15,\n", " 256,\n", " 4,\n", " 31050,\n", " 7,\n", " 3766,\n", " 5,\n", " 
723,\n", " 36,\n", " 71,\n", " 43,\n", " 530,\n", " 476,\n", " 26,\n", " 400,\n", " 317,\n", " 46,\n", " 7,\n", " 4,\n", " 12118,\n", " 1029,\n", " 13,\n", " 104,\n", " 88,\n", " 4,\n", " 381,\n", " 15,\n", " 297,\n", " 98,\n", " 32,\n", " 2071,\n", " 56,\n", " 26,\n", " 141,\n", " 6,\n", " 194,\n", " 7486,\n", " 18,\n", " 4,\n", " 226,\n", " 22,\n", " 21,\n", " 134,\n", " 476,\n", " 26,\n", " 480,\n", " 5,\n", " 144,\n", " 30,\n", " 5535,\n", " 18,\n", " 51,\n", " 36,\n", " 28,\n", " 224,\n", " 92,\n", " 25,\n", " 104,\n", " 4,\n", " 226,\n", " 65,\n", " 16,\n", " 38,\n", " 1334,\n", " 88,\n", " 12,\n", " 16,\n", " 283,\n", " 5,\n", " 16,\n", " 4472,\n", " 113,\n", " 103,\n", " 32,\n", " 15,\n", " 16,\n", " 5345,\n", " 19,\n", " 178,\n", " 32]" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "cell_type": "markdown", "metadata": { "id": "r1K2hwGEaogR", "colab_type": "text" }, "source": [ "Looking at the help page for imdb, it appears there is a way to get the word back. Phew." ] }, { "cell_type": "code", "metadata": { "id": "-PQ0HAPWaogT", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 992 }, "outputId": "d72dfd54-56c1-4efe-eb05-d7589e04c58a" }, "source": [ "help(imdb)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Help on module keras.datasets.imdb in keras.datasets:\n", "\n", "NAME\n", " keras.datasets.imdb - IMDB sentiment classification dataset.\n", "\n", "FUNCTIONS\n", " get_word_index(path='imdb_word_index.json')\n", " Retrieves the dictionary mapping words to word indices.\n", " \n", " # Arguments\n", " path: where to cache the data (relative to `~/.keras/dataset`).\n", " \n", " # Returns\n", " The word index dictionary.\n", " \n", " load_data(path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113, start_char=1, oov_char=2, index_from=3, **kwargs)\n", " Loads the IMDB dataset.\n", " \n", " # Arguments\n", " path: where to cache the data (relative to `~/.keras/dataset`).\n", " num_words: max number of words to include. 
Words are ranked\n", " by how often they occur (in the training set) and only\n", " the most frequent words are kept\n", " skip_top: skip the top N most frequently occurring words\n", " (which may not be informative).\n", " maxlen: sequences longer than this will be filtered out.\n", " seed: random seed for sample shuffling.\n", " start_char: The start of a sequence will be marked with this character.\n", " Set to 1 because 0 is usually the padding character.\n", " oov_char: words that were cut out because of the `num_words`\n", " or `skip_top` limit will be replaced with this character.\n", " index_from: index actual words with this index and higher.\n", " \n", " # Returns\n", " Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.\n", " \n", " # Raises\n", " ValueError: in case `maxlen` is so low\n", " that no input sequence could be kept.\n", " \n", " Note that the 'out of vocabulary' character is only used for\n", " words that were present in the training set but are not included\n", " because they're not making the `num_words` cut here.\n", " Words that were not seen in the training set but are in the test set\n", " have simply been skipped.\n", "\n", "DATA\n", " absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n", " division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n", " print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n", "\n", "FILE\n", " /usr/local/lib/python3.6/dist-packages/keras/datasets/imdb.py\n", "\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "ZUEWtPOUaogX", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "outputId": "3a7b360e-affb-429d-cb9f-fea1ed420775" }, "source": [ "imdb_offset = 3\n", "imdb_map = dict((index + imdb_offset, word) for (word, index) in imdb.get_word_index().items())\n", "imdb_map[0] = 'PADDING'\n", "imdb_map[1] = 'START'\n", "imdb_map[2] = 'UNKNOWN'" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json\n", "1646592/1641221 [==============================] - 0s 0us/step\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "6lw05emzaoga", "colab_type": "text" }, "source": [ "The knowledge about the initial indices and offset came from [this stack overflow post](https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset) after I got gibberish when I tried to translate the first review, below. It looks coherent now!" 
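, "\n", "\n", "A quick way to convince yourself the offset is right is to round-trip a word through the raw index (a minimal sketch; `'the'` is just an arbitrary example word):\n", "\n", "```python\n", "# Round-trip check: a raw index from get_word_index(), shifted by imdb_offset,\n", "# should map back to the same word through imdb_map.\n", "raw_index = imdb.get_word_index()['the']   # 'the' is just an example word\n", "print(raw_index, imdb_map[raw_index + imdb_offset])\n", "```"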
] }, { "cell_type": "code", "metadata": { "id": "xhE_s9MEaogc", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "outputId": "c1234fbb-573b-47cf-992e-45a9902dd380" }, "source": [ "' '.join([imdb_map[word_index] for word_index in x_train[0]])" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "\"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all\"" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "cell_type": "markdown", "metadata": { "id": "MhVB73mBaogs", "colab_type": "text" }, "source": [ "For this exercise, we're going to keep all inputs the same length (we'll see how to do variable-length later). This means we need to choose a maximum length for the review, cutting off longer ones and adding padding to shorter ones. What should we make the length? Let's understand our data." ] }, { "cell_type": "code", "metadata": { "id": "lFpDGoqnaogt", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "13dd08d9-ad63-4de4-a4d3-012462387e11" }, "source": [ "lengths = [len(review) for review in x_train + x_test]\n", "print('Longest review: {} Shortest review: {}'.format(max(lengths), min(lengths)))\n" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Longest review: 2697 Shortest review: 70\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "GcZ5Mlk3aogx", "colab_type": "text" }, "source": [ "2697 words! Wow. Well, let's see how many reviews would get cut off at a particular cutoff." 
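, "\n", "\n", "Before settling on a cutoff, it can help to see what fraction of reviews each candidate value would truncate. A minimal sketch that scans a few arbitrary candidate cutoffs over the combined train and test reviews (`list()` is used so that `+` concatenates the two datasets rather than adding the NumPy object arrays element-wise):\n", "\n", "```python\n", "# Fraction of reviews longer than each candidate cutoff, over train + test combined\n", "review_lengths = [len(review) for review in list(x_train) + list(x_test)]\n", "for candidate in (200, 300, 500, 800):\n", "    truncated = sum(1 for length in review_lengths if length > candidate)\n", "    print('cutoff {}: {:.1%} of reviews would be truncated'.format(candidate, truncated / len(review_lengths)))\n", "```"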
] }, { "cell_type": "code", "metadata": { "id": "MxU71oxpaogx", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "afecbebe-766a-403e-bbdf-805ad69efe03" }, "source": [ "cutoff = 500\n", "print('{} reviews out of {} are over {}.'.format(\n", " sum([1 for length in lengths if length > cutoff]), \n", " len(lengths), \n", " cutoff))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "8485 reviews out of 25000 are over 500.\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "F8f7lJnLaog1", "colab_type": "code", "colab": {} }, "source": [ "from keras.preprocessing import sequence\n", "x_train_padded = sequence.pad_sequences(x_train, maxlen=cutoff)\n", "x_test_padded = sequence.pad_sequences(x_test, maxlen=cutoff)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ihsQ5xEoaog6", "colab_type": "text" }, "source": [ "# Define Model" ] }, { "cell_type": "code", "metadata": { "id": "D8gypoeaaog7", "colab_type": "code", "colab": {} }, "source": [ "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Embedding, Conv1D, Dense, Flatten\n", "from tensorflow import test\n", "from tensorflow import device" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "oHCuR7fcaog-", "colab_type": "text" }, "source": [ "The embedding layer here learns the 100-dimensional vector embedding within the overall classification problem training. That is usually what we want, unless we have a bunch of un-tagged data that could be used to train word vectors but not a classification model." ] }, { "cell_type": "code", "metadata": { "id": "jBqJ-tvhaohA", "colab_type": "code", "colab": {} }, "source": [ "not_pretrained_model = Sequential()\n", "not_pretrained_model.add(Embedding(input_dim=len(imdb_map), output_dim=100, input_length=cutoff))\n", "not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))\n", "not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))\n", "not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))\n", "not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))\n", "not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))\n", "not_pretrained_model.add(Flatten())\n", "not_pretrained_model.add(Dense(units=128, activation='relu'))\n", "not_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer\n", "not_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OlXYR7KNaohE", "colab_type": "text" }, "source": [ "# Train Model" ] }, { "cell_type": "code", "metadata": { "id": "UiloJX2ZaohF", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "outputId": "7f215d3b-ea82-4f8c-8430-e53672755704" }, "source": [ "# Train using GPU acceleration\n", "# (see https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=Y04m-jvKRDsJ)\n", "device_name = test.gpu_device_name()\n", "if device_name != '/device:GPU:0':\n", " print(\n", " '\\n\\nThis error most likely means that this notebook is not '\n", " 'configured to use a GPU. 
Change this in Notebook Settings via the '\n", " 'command palette (cmd/ctrl-shift-P) or the Edit menu.\\n\\n')\n", " raise SystemError('GPU device not found')\n", "\n", "with device('/device:GPU:0'):\n", " not_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Train on 25000 samples\n", "25000/25000 [==============================] - 48s 2ms/sample - loss: 0.4890 - binary_accuracy: 0.7079\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "LkS3AAQraohK", "colab_type": "text" }, "source": [ "# Assess Model" ] }, { "cell_type": "code", "metadata": { "id": "ESSw58cFaohM", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "outputId": "4588e8ba-820e-4b83-af07-46e417cda5a6" }, "source": [ "with device('/device:GPU:0'):\n", " not_pretrained_scores = not_pretrained_model.evaluate(x_test_padded, y_test)\n", "print('loss: {} accuracy: {}'.format(*not_pretrained_scores))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "25000/25000 [==============================] - 4s 168us/sample - loss: 0.2966 - binary_accuracy: 0.8779\n", "loss: 0.2965693319511414 accuracy: 0.8778799772262573\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "VU4-GCUxaohS", "colab_type": "text" }, "source": [ "## For any model that you try in these exercises, take notes about the performance you see and anything you notice about the differences between the models.\n", "\n", "## Exercise Option #1 - Standard Difficulty\n", "Try changing different hyperparameters of the not_pretrained model. Keep notes on how the performance changes.\n", "\n", "## Exercise Option #2 - Advanced Difficulty\n", "Make a model for the reuters classification problem, using the not_pretrained model above as a reference." ] }, { "cell_type": "code", "metadata": { "id": "TB-PieR3aohT", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1EigSVrxaohY", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "QFtsrlp6aohb", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "BvCgIqS7aohf", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nvo0B2Z2aohj", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ukqX2slYaohn", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] } ] }