{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Seminar: simple question answering\n", "\n", "\n", "Today we're going to build a retrieval-based question answering model with metric learning models.\n", "\n", "_this seminar is based on original notebook by [Oleg Vasilev](https://github.com/Omrigan/)_\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "Today's data is Stanford Question Answering Dataset (SQuAD). Given a paragraph of text and a question, our model's task is to select a snippet that answers the question.\n", "\n", "We are not going to solve the full task today. Instead, we'll train a model to __select the sentence containing answer__ among several options.\n", "\n", "As usual, you are given an utility module with data reader and some helper functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import utils\n", "!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad-v2.0.json 2> log\n", "# backup download link: https://www.dropbox.com/s/q4fuihaerqr0itj/squad.tar.gz?dl=1\n", "train, test = utils.build_dataset('./squad-v2.0.json')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pid, question, options, correct_indices, wrong_indices = train.iloc[40]\n", "print('QUESTION', question, '\\n')\n", "for i, cand in enumerate(options):\n", " print(['[ ]', '[v]'][i in correct_indices], cand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Universal Sentence Encoder\n", "\n", "We've already solved quite a few tasks from scratch, training our own embeddings and convolutional/recurrent layers. However, one can often achieve higher quality by using pre-trained models. So today we're gonna use pre-trained Universal Sentence Encoder from [Tensorflow Hub](https://tfhub.dev/google/universal-sentence-encoder/2).\n", "\n", "\n", "[__Universal Sentence Encoder__](https://arxiv.org/abs/1803.11175) is a model that encoders phrases, sentences or short paragraphs into a fixed-size vector. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "import keras.layers as L\n", "import tensorflow_hub as hub\n", "tf.reset_default_graph()\n", "sess = tf.InteractiveSession()\n", "\n", "universal_sentence_encoder = hub.Module(\"https://tfhub.dev/google/universal-sentence-encoder/2\",\n", "                                        trainable=False)\n", "# consider as well:\n", "# * lite: https://tfhub.dev/google/universal-sentence-encoder-lite/2\n", "# * large: https://tfhub.dev/google/universal-sentence-encoder-large/2\n", "\n", "sess.run([tf.global_variables_initializer(), tf.tables_initializer()]);" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the tfhub implementation does tokenization for you\n", "dummy_ph = tf.placeholder(tf.string, shape=[None])\n", "dummy_vectors = universal_sentence_encoder(dummy_ph)\n", "\n", "dummy_lines = [\n", "    \"How old are you?\",                                                # 0\n", "    \"In what mythology do two canines watch over the Chinvat Bridge?\", # 1\n", "    \"I'm sorry, okay, I'm not perfect, but I'm trying.\",               # 2\n", "    \"What is your age?\",                                               # 3\n", "    \"Beware, for I am fearless, and therefore powerful.\",              # 4\n", "]\n", "\n", "dummy_vectors_np = sess.run(dummy_vectors, {\n", "    dummy_ph: dummy_lines\n", "})\n", "\n", "plt.title('phrase similarity')\n", "plt.imshow(dummy_vectors_np.dot(dummy_vectors_np.T), interpolation='none', cmap='gray')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, __the strongest similarity is between lines 0 and 3__. Indeed, they correspond to \"How old are you?\" and \"What is your age?\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Model (2 points)\n", "\n", "Our goal for today is to build a model that measures similarity between a question and an answer. In particular, it maps both questions and answers into fixed-size vectors.\n", "\n", "Our model is a pair of networks, $V_q(q)$ and $V_a(a)$, that turn phrases into vectors.\n", "\n", "__Objective:__ the question vector $V_q(q)$ should be __closer__ to correct answer vectors $V_a(a^+)$ than to incorrect ones $V_a(a^-)$.\n", "\n", "Both vectorizers can be anything you wish. For starters, let's use a couple of dense layers on top of the Universal Sentence Encoder.\n", "\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import keras.layers as L\n", "\n", "class Vectorizer:\n", "    def __init__(self, output_size=256, hid_size=256, universal_sentence_encoder=universal_sentence_encoder):\n", "        \"\"\" A small feedforward network on top of the universal sentence encoder. 2-3 layers should be enough \"\"\"\n", "        self.universal_sentence_encoder = universal_sentence_encoder\n", "\n", "        # define a few layers to be applied on top of u.s.e.\n", "        # note: please make sure your final layer comes with _linear_ activation\n", "        <YOUR CODE HERE>\n", "\n", "    def __call__(self, input_phrases, is_train=True):\n", "        \"\"\"\n", "        Apply the vectorizer. Use dropout and any other hacks at will.\n", "        :param input_phrases: [batch_size] of tf.string\n", "        :param is_train: if True, apply dropouts and other ops in train mode,\n", "                         if False - evaluation mode\n", "        :returns: predicted phrase vectors, [batch_size, output_size]\n", "        \"\"\"\n", "        <YOUR CODE>\n", "        return <...>" ] },
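{ "cell_type": "markdown", "metadata": {}, "source": [ "If you are unsure where to start, below is one possible sketch - by no means the only valid architecture. The hidden size and the 0.5 dropout rate are arbitrary choices, not part of the assignment; the class is named `VectorizerSketch` so it does not overwrite your own solution." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a minimal reference sketch, NOT the only valid solution:\n", "# hidden ReLU layer -> dropout -> linear output layer\n", "class VectorizerSketch:\n", "    def __init__(self, output_size=256, hid_size=256,\n", "                 universal_sentence_encoder=universal_sentence_encoder):\n", "        self.universal_sentence_encoder = universal_sentence_encoder\n", "        self.hidden = L.Dense(hid_size, activation='relu')\n", "        self.dropout = L.Dropout(0.5)       # the rate is an arbitrary choice\n", "        self.output = L.Dense(output_size)  # linear activation, as required\n", "\n", "    def __call__(self, input_phrases, is_train=True):\n", "        x = self.universal_sentence_encoder(input_phrases)\n", "        x = self.hidden(x)\n", "        x = self.dropout(x, training=is_train)  # dropout is active only in train mode\n", "        return self.output(x)" ] },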
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "question_vectorizer = Vectorizer()\n", "answer_vectorizer = Vectorizer()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummy_v_q = question_vectorizer(dummy_ph, is_train=True)\n", "dummy_v_q_det = question_vectorizer(dummy_ph, is_train=False)\n", "utils.initialize_uninitialized()\n", "assert sess.run(dummy_v_q, {dummy_ph: dummy_lines}).shape == (5, 256)\n", "assert np.allclose(\n", "    sess.run(dummy_v_q_det, {dummy_ph: dummy_lines}),\n", "    sess.run(dummy_v_q_det, {dummy_ph: dummy_lines})\n", "), \"make sure your model doesn't use dropout/noise or non-determinism if is_train=False\"\n", "\n", "print(\"Well done!\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Training: minibatches\n", "\n", "Our model learns on triples $(q, a^+, a^-)$:\n", "* q - __q__uestion\n", "* (a+) - correct __a__nswer\n", "* (a-) - wrong __a__nswer\n", "\n", "Below you will find a generator that samples such triples from the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "def iterate_minibatches(data, batch_size, shuffle=True, cycle=False):\n", "    \"\"\"\n", "    Generates minibatches of triples: {questions, correct answers, wrong answers}\n", "    If there are several wrong (or correct) answers, picks one at random.\n", "    \"\"\"\n", "    indices = np.arange(len(data))\n", "    while True:\n", "        if shuffle:\n", "            indices = np.random.permutation(indices)\n", "        for batch_start in range(0, len(indices), batch_size):\n", "            batch_indices = indices[batch_start: batch_start + batch_size]\n", "            batch = data.iloc[batch_indices]\n", "            questions = batch['question'].values\n", "            correct_answers = np.array([\n", "                row['options'][random.choice(row['correct_indices'])]\n", "                for i, row in batch.iterrows()\n", "            ])\n", "            wrong_answers = np.array([\n", "                row['options'][random.choice(row['wrong_indices'])]\n", "                for i, row in batch.iterrows()\n", "            ])\n", "\n", "            yield {\n", "                'questions': questions,\n", "                'correct_answers': correct_answers,\n", "                'wrong_answers': wrong_answers,\n", "            }\n", "        if not cycle:\n", "            break" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummy_batch = next(iterate_minibatches(train.sample(3), 3))\n", "print(dummy_batch)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Training: loss function (2 points)\n", "We want our vectorizers to put correct answers closer to question vectors and incorrect answers farther away from them. One way to express this is the Pairwise Hinge Loss _(aka Triplet Loss)_:\n", "\n", "$$ L = \frac 1N \sum_{q, a^+, a^-} \max(0, \space \delta - sim[V_q(q), V_a(a^+)] + sim[V_q(q), V_a(a^-)]) $$\n", "\n", "where\n", "* sim[a, b] is some similarity function: dot product, cosine or negative distance\n", "* δ is a loss hyperparameter, e.g. δ = 1.0. If sim[a, b] is linear in b, all δ > 0 are equivalent.\n", "\n", "\n", "This reads as __correct answers must be closer to the question than wrong ones by at least δ.__\n", "\n", "\n", "<center>_image: question vector is green, correct answers are blue, incorrect answers are red_</center>\n", "\n", "\n", "Note: in effect, we train a Deep Semantic Similarity Model [DSSM](https://www.microsoft.com/en-us/research/project/dssm/)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def similarity(a, b):\n", "    \"\"\" Dot product as a similarity function \"\"\"\n", "    <YOUR CODE>\n", "    return <...>\n", "\n", "def compute_loss(question_vectors, correct_answer_vectors, wrong_answer_vectors, delta=1.0):\n", "    \"\"\"\n", "    Compute the triplet loss as per the formula above.\n", "    Use the similarity function above for sim[a, b]\n", "    :param question_vectors: float32[batch_size, vector_size]\n", "    :param correct_answer_vectors: float32[batch_size, vector_size]\n", "    :param wrong_answer_vectors: float32[batch_size, vector_size]\n", "    :returns: loss for every row in batch, float32[batch_size]\n", "    Hint: DO NOT use tf.reduce_max, it's the wrong kind of maximum :)\n", "    \"\"\"\n", "    <YOUR CODE>\n", "    return <...>" ] },
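{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is one possible implementation, assuming plain dot-product similarity; it should pass the asserts in the next cell. The functions are named `*_sketch` so they do not overwrite your own solution." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a reference sketch assuming dot-product similarity; not the only valid option\n", "def similarity_sketch(a, b):\n", "    \"\"\" Row-wise dot product of two [batch_size, vector_size] tensors \"\"\"\n", "    return tf.reduce_sum(a * b, axis=-1)\n", "\n", "def compute_loss_sketch(question_vectors, correct_answer_vectors, wrong_answer_vectors, delta=1.0):\n", "    \"\"\" Per-sample triplet loss: max(0, delta - sim[q, a+] + sim[q, a-]) \"\"\"\n", "    correct_sim = similarity_sketch(question_vectors, correct_answer_vectors)\n", "    wrong_sim = similarity_sketch(question_vectors, wrong_answer_vectors)\n", "    # element-wise maximum with zero, NOT a reduction over the batch axis\n", "    return tf.maximum(0.0, delta - correct_sim + wrong_sim)" ] },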
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummy_v1 = tf.constant([[0.1, 0.2, -1], [-1.2, 0.6, 1.0]], dtype=tf.float32)\n", "dummy_v2 = tf.constant([[0.9, 2.1, -6.6], [0.1, 0.8, -2.2]], dtype=tf.float32)\n", "dummy_v3 = tf.constant([[-4.1, 0.1, 1.2], [0.3, -1, -2]], dtype=tf.float32)\n", "\n", "assert np.allclose(similarity(dummy_v1, dummy_v2).eval(), [7.11, -1.84])\n", "assert np.allclose(compute_loss(dummy_v1, dummy_v2, dummy_v3, delta=5.0).eval(), [0.0, 3.88])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the loss is working, let's train our model by the usual means." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "placeholders = {\n", "    key: tf.placeholder(tf.string, [None]) for key in dummy_batch.keys()\n", "}\n", "\n", "v_q = question_vectorizer(placeholders['questions'], is_train=True)\n", "v_a_correct = answer_vectorizer(placeholders['correct_answers'], is_train=True)\n", "v_a_wrong = answer_vectorizer(placeholders['wrong_answers'], is_train=True)\n", "\n", "loss = tf.reduce_mean(compute_loss(v_q, v_a_correct, v_a_wrong))\n", "step = tf.train.AdamOptimizer().minimize(loss)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we also compute recall: the probability that a^+ is closer to q than a^-\n", "test_v_q = question_vectorizer(placeholders['questions'], is_train=False)\n", "test_v_a_correct = answer_vectorizer(placeholders['correct_answers'], is_train=False)\n", "test_v_a_wrong = answer_vectorizer(placeholders['wrong_answers'], is_train=False)\n", "\n", "correct_is_closer = tf.greater(similarity(test_v_q, test_v_a_correct),\n", "                               similarity(test_v_q, test_v_a_wrong))\n", "recall = tf.reduce_mean(tf.to_float(correct_is_closer))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Training loop\n", "\n", "Just as we always do, we can now train the DSSM on minibatches and periodically measure recall on validation data.\n", "\n", "\n", "__Note 1:__ DSSM training may be very sensitive to the choice of batch size. A small batch size may decrease model quality.\n", "\n", "__Note 2:__ here we use the same dataset as __\"test set\"__ and __\"validation (dev) set\"__.\n", "\n", "In any serious scientific experiment, those must be two separate sets: validation is for hyperparameter tuning, and the test set is for the final evaluation only.\n" ] },
\n", "\n", "In any serious scientific experiment, those must be two separate sets. Validation is for hyperparameter tuning and testr is for final eval only.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from IPython.display import clear_output\n", "from tqdm import tqdm\n", "\n", "ewma = lambda x, span: pd.DataFrame({'x': x})['x'].ewm(span=span).mean().values\n", "dev_batches = iterate_minibatches(test, batch_size=256, cycle=True)\n", "loss_history = []\n", "dev_recall_history = []\n", "utils.initialize_uninitialized()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# infinite training loop. Stop it manually or implement early stopping\n", "\n", "for batch in iterate_minibatches(train, batch_size=256, cycle=True):\n", " feed = {placeholders[key] : batch[key] for key in batch}\n", " loss_t, _ = sess.run([loss, step], feed)\n", " loss_history.append(loss_t)\n", " if len(loss_history) % 50 == 0:\n", " # measure dev recall = P(correct_is_closer_than_wrong | q, a+, a-)\n", " dev_batch = next(dev_batches)\n", " recall_t = sess.run(recall, {placeholders[key] : dev_batch[key] for key in dev_batch})\n", " dev_recall_history.append(recall_t)\n", " \n", " if len(loss_history) % 50 == 0:\n", " clear_output(True)\n", " plt.figure(figsize=[12, 6])\n", " plt.subplot(1, 2, 1), plt.title('train loss (hinge)'), plt.grid()\n", " plt.scatter(np.arange(len(loss_history)), loss_history, alpha=0.1)\n", " plt.plot(ewma(loss_history, span=100))\n", " plt.subplot(1, 2, 2), plt.title('dev recall (1 correct vs 1 wrong)'), plt.grid()\n", " dev_time = np.arange(1, len(dev_recall_history) + 1) * 100\n", " plt.scatter(dev_time, dev_recall_history, alpha=0.1)\n", " plt.plot(dev_time, ewma(dev_recall_history, span=10))\n", " plt.show()\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Mean recall:\", np.mean(dev_recall_history[-10:]))\n", "assert np.mean(dev_recall_history[-10:]) > 0.85, \"Please train for at least 85% recall on test set. \"\\\n", " \"You may need to change vectorizer model for that.\"\n", "print(\"Well done!\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Final evaluation (1 point)\n", "\n", "Let's see how well does our model perform on actual question answering. \n", "\n", "Given a question and a set of possible answers, pick answer with highest similarity to estimate accuracy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optional: build tf graph required for select_best_answer\n", "# <...>\n", "\n", "def select_best_answer(question, possible_answers):\n", " \"\"\"\n", " Predicts which answer best fits the question\n", " :param question: a single string containing a question\n", " :param possible_answers: a list of strings containing possible answers\n", " :returns: integer - the index of best answer in possible_answer\n", " \"\"\"\n", " <YOUR CODE>\n", " return <...>\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predicted_answers = [\n", " select_best_answer(question, possible_answers)\n", " for i, (question, possible_answers) in tqdm(test[['question', 'options']].iterrows(), total=len(test))\n", "]\n", "\n", "accuracy = np.mean([\n", " answer in correct_ix\n", " for answer, correct_ix in zip(predicted_answers, test['correct_indices'].values)\n", "])\n", "print(\"Accuracy: %0.5f\" % accuracy)\n", "assert accuracy > 0.65, \"we need more accuracy!\"\n", "print(\"Great job!\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def draw_results(question, possible_answers, predicted_index, correct_indices):\n", " print(\"Q:\", question, end='\\n\\n')\n", " for i, answer in enumerate(possible_answers):\n", " print(\"#%i: %s %s\" % (i, '[*]' if i == predicted_index else '[ ]', answer))\n", " \n", " print(\"\\nVerdict:\", \"CORRECT\" if predicted_index in correct_indices else \"INCORRECT\", \n", " \"(ref: %s)\" % correct_indices, end='\\n' * 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "for i in [1, 100, 1000, 2000, 3000, 4000, 5000]:\n", " draw_results(test.iloc[i].question, test.iloc[i].options,\n", " predicted_answers[i], test.iloc[i].correct_indices)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "question = \"What is my name?\" # your question here!\n", "possible_answers = [\n", " <...> \n", " # ^- your options. \n", "]\n", "predicted answer = select_best_answer(question, possible_answers)\n", "\n", "draw_results(question, possible_answers,\n", " predicted_answer, [0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bonus tasks\n", "\n", "There are many ways to improve our question answering model. Here's a bunch of things you can do to increase your understanding and get bonus points.\n", "\n", "### 1. Hard Negatives (3+ pts)\n", "\n", "Not all wrong answers are equally wrong. As the training progresses, _most negative examples $a^-$ will be to easy._ So easy in fact, that loss function and gradients on such negatives is exactly __0.0__. To improve training efficiency, one can __mine hard negative samples__.\n", "\n", "Given a list of answers,\n", "* __Hard negative__ is the wrong answer with highest similarity with question,\n", "\n", "$$a^-_{hard} = \\underset {a^-} {argmax} \\space sim[V_q(q), V_a(a^-)]$$\n", "\n", "* __Semi-hard negative__ is the one with highest similarity _among wrong answers that are farther than positive one. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Bring Your Own Model (3+ pts)\n", "In addition to the Universal Sentence Encoder, one can also train a new model.\n", "* You name it: convolutions, RNN, self-attention\n", "* Use pre-trained ELMo or FastText embeddings\n", "* Monitor overfitting and use dropout / word dropout to improve performance\n", "\n", "__Note:__ if you use ELMo, please note that it requires tokenized text, while USE can deal with raw strings. You can tokenize data manually or use tokenized=True when reading the dataset.\n", "\n", "More ideas:\n", "* hard negatives (strategies: hardest, hardest farther than current, randomized)\n", "* train the model on the full dataset to see if it can mine answers to new questions over the entire Wikipedia. Use approximate nearest neighbor search for fast lookup.\n", "\n", "\n", "### 3. Search engine (3+ pts)\n", "\n", "Our basic model only selects answers from the 2-5 available sentences in a paragraph. You can extend it to search over __the whole dataset__. All sentences in all other paragraphs are viable answers.\n", "\n", "The goal is to train such a model and use it to __quickly find the top-10 answers from the whole set__.\n", "\n", "* You can ask such a model a question of your own making - to see which answers it can find among the entire training dataset or even the entire Wikipedia.\n", "* Searching for top-K neighbors is easier if you use specialized methods: [KD-Tree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html) or [HNSW](https://github.com/nmslib/hnswlib).\n", "* This task is much easier to train if you use hard or semi-hard negatives. You can even find hard negatives for one question among the correct answers to other questions in the batch - do so in-graph for maximum efficiency. See [1.] for more details.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" }, "widgets": { "state": { "69ee5b52104d471ca7bfb32ba4309743": { "views": [ { "cell_index": 11 } ] }, "7b18f460e231498eaafa7653026e98e0": { "views": [ { "cell_index": 4 } ] } }, "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 2 }