{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Sequence based Text Classification .ipynb", "version": "0.3.2", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "metadata": { "id": "ryq1YEA-tYEo", "colab_type": "text" }, "cell_type": "markdown", "source": [ "In this notebook I tried to explore the basic sequence model i.e using RNN(and its variants) to build a N:1 sequence model. Here N:1 means, the model takes a sequence (a natural language text) as input and makes a prediction based on the training data. In NLP literature these models are called RNN acceptors but Andrej karpathy made the notion of calling RNN's as N:1 (or N:N or or 1:N) in his famous blog post, [The unreasonable effectiveness of RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)\n", "\n", "**This is tested on tensorflow-gpu=1.13.1**" ] }, { "metadata": { "id": "C0a5YxGSsrUe", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "outputId": "167f9eab-a979-4b4f-cfbf-370fd9a82776" }, "cell_type": "code", "source": [ "train_data_raw.head()" ], "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hourinwordsminute
016Time flies its thirty one minutes past four in the after noon now31
10Time flies its nine minutes past mid night now9
213fifty five minutes past one after noon, was the time on the clock when I entered the house55
38four minutes past eight in the morning, was the time on the clock when I entered the house4
416ten minutes past four in the after noon, was the time on the clock when I entered the house10
\n", "
" ], "text/plain": [ " hour \\\n", "0 16 \n", "1 0 \n", "2 13 \n", "3 8 \n", "4 16 \n", "\n", " inwords \\\n", "0 Time flies its thirty one minutes past four in the after noon now \n", "1 Time flies its nine minutes past mid night now \n", "2 fifty five minutes past one after noon, was the time on the clock when I entered the house \n", "3 four minutes past eight in the morning, was the time on the clock when I entered the house \n", "4 ten minutes past four in the after noon, was the time on the clock when I entered the house \n", "\n", " minute \n", "0 31 \n", "1 9 \n", "2 55 \n", "3 4 \n", "4 10 " ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "metadata": { "id": "6yuHK0sZcZdz", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "d2b1590d-1397-40a1-d755-b4201638f30e" }, "cell_type": "code", "source": [ "import tensorflow as tf\n", "from keras import backend\n", "import logging\n", "import numpy as np\n", "import pandas as pd\n", "import random \n", "pd.set_option('display.max_colwidth', -1)\n", "\n", "\n", "#To have reproducability: Set all the seeds, make sure multithreading is off, if possible don't use GPU. \n", "tf.set_random_seed(7)\n", "np.random.seed(7)\n", "session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)\n", "backend.set_session(tf.Session(graph=tf.get_default_graph(), config=session_conf))\n", "\n", "\n" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ], "name": "stderr" } ] }, { "metadata": { "id": "a7EKHNYBBJMV", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "# Synthetic Training and Test data\n", "\n", "def generate_data(hour, minute, sentence=''):\n", " \n", " special = [15,30]\n", " suffix = \"\"\n", "\n", " #print(hour, minute)\n", "\n", " dictionary = {1:\"one\", 2:\"two\", 3:\"three\", 4:\"four\", 5:\"five\", 6:\"six\", 7:\"seven\", 8:\"eight\", 9:\"nine\", 10:\"ten\", 11:\"eleven\", 12:\"twelve\", 13:\"thirteen\", \n", " 14:\"fourteen\", 16:\"sixteen\", 17:\"seventeen\", 18:\"eighteen\", 19:\"nineteen\", 20:\"twenty\", 30:\"thirty\",\n", " 40:\"forty\", 50:\"fifty\"}\n", " result = \"\"\n", " if minute == 15:\n", " result= \"quarter past\"\n", " elif minute == 30:\n", " result= \"half past\" \n", " elif minute == 0:\n", " pass\n", " else:\n", "\n", " if minute in dictionary:\n", " result = dictionary[minute] + \" minutes past\"\n", " else: \n", " minute1 = int(str(minute // 10 ) + \"0\") \n", " minute2 = minute % 10\n", " result = dictionary[minute1] + ' ' + dictionary[minute2] + \" minutes past\"\n", "\n", " if hour == 0:\n", " suffix = \"mid night\"\n", " elif hour >= 1 and hour <= 11:\n", " suffix = \"morning\"\n", " elif hour == 12:\n", " suffix = \"noon\"\n", " elif hour > 12 and hour <=16: \n", " suffix = \"after noon\"\n", " elif hour > 16 and hour <=19: \n", " suffix = \"evening\"\n", " elif hour > 20 and hour <=23: \n", " suffix = \"night\"\n", "\n", " save_hour = hour \n", " if hour > 12:\n", " hour = hour - 12\n", " \n", " if hour > 0:\n", " # Lets introduce some variation in the way how hours an sufffixes are formed, just for randomness\n", " if hour % 2 == 0:\n", " result = result + \" \" + dictionary[hour]+ \" in the \" + suffix \n", " else: \n", " result = result + \" \" + dictionary[hour]+ \" \" + suffix \n", " else:\n", " result = result + \" \" + suffix \n", " \n", " if sentence != '':\n", " result = sentence.replace('#@#', result)\n", " \n", " return save_hour, minute, result\n", "\n", " \n", " \n", "# Random sentence templates to shove our time compnents into to form propert english sentences\n", "sentence=[\n", " 'The murder happened exactly #@#',\n", " '#@#, was the time on the clock when I entered the house',\n", " 'Time flies its #@# now',\n", " 'Really was it #@# twice in a row?'\n", "]\n", "\n", "\n", "def train(): \n", " data = []\n", " i = 0\n", " while i < 200000:\n", " hour = random.randint(0,23)\n", " minute = random.randint(0,59)\n", " sent = random.randint(0,3)\n", " hour, minute, result = generate_data(hour, minute, sentence[sent])\n", " inwords = result\n", " data.append({\"inwords\":inwords, \"hour\": hour, \"minute\":minute})\n", " i += 1\n", " df = pd.DataFrame(data)\n", " #df.columns = ['inwords', 'hour', 'minute']\n", " return df\n", "\n", "def test(): \n", " data = []\n", " i = 0\n", " while i < 20000:\n", " hour = random.randint(10,15)\n", " minute = random.randint(0,59)\n", " sent = random.randint(0,3)\n", " hour, minute, result = generate_data(hour, minute, sentence[sent])\n", " inwords = result\n", " data.append({\"inwords\":inwords, \"hour\": hour, \"minute\":minute})\n", " i += 1\n", " df = pd.DataFrame(data) \n", " #df.columns = ['inwords', 'hour', 'minute']\n", " return df\n", " \n", " \n", "\n", "train_data_raw = train()\n", "test_data_raw = test()\n", "\n", " " ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "Hq0FlIWSgdrO", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "# import os\n", "# from google.colab import drive\n", "# drive.mount('/content/drive')\n", "# print(os.listdir(\"/content/drive/My Drive\"))\n" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "w0oNY3PaBVkc", "colab_type": "code", "outputId": "ed53577a-dc20-4395-a3fa-7977ff3ba13b", "colab": { "base_uri": "https://localhost:8080/", "height": 204 } }, "cell_type": "code", "source": [ "train_data_raw.head()" ], "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hourinwordsminute
016Time flies its thirty one minutes past four in the after noon now31
10Time flies its nine minutes past mid night now9
213fifty five minutes past one after noon, was the time on the clock when I entered the house55
38four minutes past eight in the morning, was the time on the clock when I entered the house4
416ten minutes past four in the after noon, was the time on the clock when I entered the house10
\n", "
" ], "text/plain": [ " hour \\\n", "0 16 \n", "1 0 \n", "2 13 \n", "3 8 \n", "4 16 \n", "\n", " inwords \\\n", "0 Time flies its thirty one minutes past four in the after noon now \n", "1 Time flies its nine minutes past mid night now \n", "2 fifty five minutes past one after noon, was the time on the clock when I entered the house \n", "3 four minutes past eight in the morning, was the time on the clock when I entered the house \n", "4 ten minutes past four in the after noon, was the time on the clock when I entered the house \n", "\n", " minute \n", "0 31 \n", "1 9 \n", "2 55 \n", "3 4 \n", "4 10 " ] }, "metadata": { "tags": [] }, "execution_count": 3 } ] }, { "metadata": { "id": "Ax7yhuUvBb3-", "colab_type": "code", "outputId": "583d7492-37e2-4779-91ec-3fe0ba6405b1", "colab": { "base_uri": "https://localhost:8080/", "height": 85 } }, "cell_type": "code", "source": [ "from keras.preprocessing import text, sequence\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "\n", "\n", "vocab_size = 5000 # based on words in the entire corpus\n", "max_len = 25 # based on word count in phrases\n", "\n", "train_phrases = list(train_data_raw['inwords'].values) \n", "test_phrases = list(test_data_raw['inwords'].values) \n", "train_target = pd.get_dummies(train_data_raw['hour'].values)\n", "\n", "#Vocabulary-Indexing of the train and test phrases, make sure \"filters\" parm doesn't clean out punctuations which you we dont intend to\n", "\n", "tokenizer = Tokenizer(num_words=vocab_size, lower=True, filters=',?.\\n\\t')\n", "tokenizer.fit_on_texts(train_phrases + test_phrases)\n", "encoded_train_phrases = tokenizer.texts_to_sequences(train_phrases)\n", "encoded_test_phrases = tokenizer.texts_to_sequences(test_phrases)\n", "\n", "\n", "#Watch for a POST padding, as opposed to the default PRE padding\n", "X_train_words = sequence.pad_sequences(encoded_train_phrases, maxlen=max_len, padding='post')\n", "X_test_words = sequence.pad_sequences(encoded_test_phrases, maxlen=max_len, padding='post')\n", "\n", "\n", "print (X_train_words.shape)\n", "print (X_test_words.shape)\n", "print (train_target.shape)\n", "\n", "print ('Done Tokenizing and indexing phrases based on the vocabulary learned from the entire Train and Test corpus')" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "text": [ "(200000, 25)\n", "(20000, 25)\n", "(200000, 24)\n", "Done Tokenizing and indexing phrases based on the vocabulary learned from the entire Train and Test corpus\n" ], "name": "stdout" } ] }, { "metadata": { "id": "LrFPxWMIBt-e", "colab_type": "code", "outputId": "d05617ec-2253-4116-9a1a-9c8e368a271d", "colab": { "base_uri": "https://localhost:8080/", "height": 513 } }, "cell_type": "code", "source": [ "from keras.callbacks import EarlyStopping\n", "from keras.layers import Dense, Input, Embedding, Dropout, CuDNNLSTM, CuDNNGRU, Flatten, TimeDistributed, RepeatVector\n", "from keras.layers import Bidirectional\n", "from keras.models import Model\n", "\n", "\n", "\n", "print(\"Building layers\") \n", "\n", "print('starting to stitch and compile model')\n", "\n", "# Embedding layer for text inputs\n", "input_words = Input((max_len,))\n", "x_words = Embedding(vocab_size, 300, input_length=max_len)(input_words)\n", "x_words = Bidirectional(CuDNNLSTM(128))(x_words)\n", "x_words = Dropout(0.2)(x_words)\n", "x_words = Dense(32, activation=\"relu\")(x_words)\n", "predictions = Dense(24, activation=\"softmax\")(x_words)\n", "model = Model(inputs=input_words, outputs=predictions)\n", "model.compile(optimizer='rmsprop' ,loss='categorical_crossentropy', metrics=['accuracy'])\n", "print(model.summary())\n", "\n" ], "execution_count": 6, "outputs": [ { "output_type": "stream", "text": [ "Building layers\n", "starting to stitch and compile model\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Colocations handled automatically by placer.\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "input_1 (InputLayer) (None, 25) 0 \n", "_________________________________________________________________\n", "embedding_1 (Embedding) (None, 25, 300) 1500000 \n", "_________________________________________________________________\n", "bidirectional_1 (Bidirection (None, 256) 440320 \n", "_________________________________________________________________\n", "dropout_1 (Dropout) (None, 256) 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 32) 8224 \n", "_________________________________________________________________\n", "dense_2 (Dense) (None, 24) 792 \n", "=================================================================\n", "Total params: 1,949,336\n", "Trainable params: 1,949,336\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ], "name": "stdout" } ] }, { "metadata": { "id": "BoJfVd6WCUrz", "colab_type": "code", "outputId": "b3a3b179-ec24-4477-bab7-9b3bec6c13c4", "colab": { "base_uri": "https://localhost:8080/", "height": 479 } }, "cell_type": "code", "source": [ "early_stop = EarlyStopping(monitor = \"val_loss\", mode=\"min\", patience = 3, verbose=1)\n", "#fit the model\n", "nb_epoch = 10\n", "history = model.fit(X_train_words, train_target, epochs=nb_epoch, verbose=1, batch_size = 256, callbacks=[early_stop], validation_split = 0.2, shuffle=True)\n", "train_loss = np.mean(history.history['loss'])\n", "val_loss = np.mean(history.history['val_loss'])\n", "print('Train loss: %f' % (train_loss*100))\n", "print('Validation loss: %f' % (val_loss*100))" ], "execution_count": 7, "outputs": [ { "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.cast instead.\n", "Train on 160000 samples, validate on 40000 samples\n", "Epoch 1/10\n", "160000/160000 [==============================] - 11s 69us/step - loss: 0.3757 - acc: 0.8870 - val_loss: 0.0012 - val_acc: 1.0000\n", "Epoch 2/10\n", "160000/160000 [==============================] - 9s 54us/step - loss: 0.0039 - acc: 0.9991 - val_loss: 8.9929e-06 - val_acc: 1.0000\n", "Epoch 3/10\n", "160000/160000 [==============================] - 9s 54us/step - loss: 4.1068e-04 - acc: 0.9999 - val_loss: 2.3119e-07 - val_acc: 1.0000\n", "Epoch 4/10\n", "160000/160000 [==============================] - 9s 54us/step - loss: 4.9124e-06 - acc: 1.0000 - val_loss: 1.2814e-07 - val_acc: 1.0000\n", "Epoch 5/10\n", "160000/160000 [==============================] - 8s 53us/step - loss: 3.2206e-07 - acc: 1.0000 - val_loss: 1.2507e-07 - val_acc: 1.0000\n", "Epoch 6/10\n", "160000/160000 [==============================] - 8s 52us/step - loss: 1.2597e-07 - acc: 1.0000 - val_loss: 1.1973e-07 - val_acc: 1.0000\n", "Epoch 7/10\n", "160000/160000 [==============================] - 8s 52us/step - loss: 1.2542e-07 - acc: 1.0000 - val_loss: 1.1957e-07 - val_acc: 1.0000\n", "Epoch 8/10\n", "160000/160000 [==============================] - 8s 52us/step - loss: 1.2197e-07 - acc: 1.0000 - val_loss: 1.1935e-07 - val_acc: 1.0000\n", "Epoch 9/10\n", "160000/160000 [==============================] - 8s 53us/step - loss: 1.2094e-07 - acc: 1.0000 - val_loss: 1.1927e-07 - val_acc: 1.0000\n", "Epoch 10/10\n", "160000/160000 [==============================] - 8s 52us/step - loss: 1.2126e-07 - acc: 1.0000 - val_loss: 1.1924e-07 - val_acc: 1.0000\n", "Train loss: 3.799861\n", "Validation loss: 0.012004\n" ], "name": "stdout" } ] }, { "metadata": { "id": "bDOwoLFJ4rfU", "colab_type": "code", "outputId": "2494d8ef-37bd-48a2-9265-538d8828366b", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "cell_type": "code", "source": [ "pred_test = model.predict(X_test_words, batch_size=128, verbose = 0)\n", "print (pred_test.shape) \n", "max_pred = np.floor(np.argmax(pred_test, axis=1)).astype(int)\n", "submission = pd.DataFrame({'Inwords':test_data_raw['inwords'],'Predicted': max_pred, 'Truth': test_data_raw['hour']})\n", "submission = submission[['Inwords', 'Truth','Predicted']]\n" ], "execution_count": 8, "outputs": [ { "output_type": "stream", "text": [ "(20000, 24)\n" ], "name": "stdout" } ] }, { "metadata": { "id": "w1zVQDoP58tI", "colab_type": "code", "outputId": "f7129544-9232-459a-aa12-02cd2bcb936a", "colab": { "base_uri": "https://localhost:8080/", "height": 204 } }, "cell_type": "code", "source": [ "submission.head()" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InwordsTruthPredicted
0Time flies its forty four minutes past ten in the morning now1010
1The murder happened exactly thirteen minutes past one after noon1313
2twenty nine minutes past one after noon, was the time on the clock when I entered the house1313
3Time flies its twenty five minutes past three after noon now1515
4Really was it thirty nine minutes past two in the after noon twice in a row?1414
\n", "
" ], "text/plain": [ " Inwords \\\n", "0 Time flies its forty four minutes past ten in the morning now \n", "1 The murder happened exactly thirteen minutes past one after noon \n", "2 twenty nine minutes past one after noon, was the time on the clock when I entered the house \n", "3 Time flies its twenty five minutes past three after noon now \n", "4 Really was it thirty nine minutes past two in the after noon twice in a row? \n", "\n", " Truth Predicted \n", "0 10 10 \n", "1 13 13 \n", "2 13 13 \n", "3 15 15 \n", "4 14 14 " ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "metadata": { "id": "Sf28pxY3ie9Z", "colab_type": "code", "outputId": "cd45144e-a40e-40d9-f48a-7b6fb9f392b0", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "cell_type": "code", "source": [ "unseen = [\"Lets say, we meet three morning tommorrow ?\"]\n", "tokenizer.fit_on_texts(unseen)\n", "encoded_unseen_phrases = tokenizer.texts_to_sequences(unseen)\n", "X_unseen_words = sequence.pad_sequences(encoded_unseen_phrases, maxlen=max_len, padding='post')\n", "pred_unseen = model.predict(X_unseen_words, batch_size=128, verbose = 0)\n", "max_pred_unseen = np.floor(np.argmax(pred_unseen, axis=1)).astype(int)\n", "print(max_pred_unseen)" ], "execution_count": 10, "outputs": [ { "output_type": "stream", "text": [ "[3]\n" ], "name": "stdout" } ] } ] }