{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pl0KyQomBQ7l" }, "source": [ "# Evaluating QA: the Retriever & the Full QA System\n", "> A review of Information Retrieval and the role it plays in an IR QA system\n", "\n", "- title: \"Evaluating QA: the Retriever & the Full QA System\"\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- use_math: true\n", "- categories: [elasticsearch, mean average precision, recall for IRQA, QA system design]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Np3wkqy7lTrb" }, "source": [ "In our last post, [Evaluating QA: Metrics, Predictions, and the Null Response](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html), we took a deep dive into how to assess the quality of a BERT-like Reader for Question Answering (QA) using the Hugging Face framework. In this post, we'll focus on the other component of a modern Information Retrieval-based (IR) QA system: the Retriever. Specifically, we'll introduce Elasticsearch as a powerful and efficient IR tool that can be used to scour through large corpora and retrieve relevant documents. We'll explain how to implement and evaluate a Retriever in the context of Question Answering and demonstrate its impact on an IR QA system." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "np9SZdgqBQ7m" }, "source": [ "### Prerequisites\n", "* a basic understanding of Information Retrieval & Search\n", "* a basic understanding of IR based QA systems (see our [previous posts](https://qa.fastforwardlabs.com/))\n", "* a basic understanding of Transformers and PyTorch\n", "* a basic understanding of the SQuAD2.0 dataset" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "7pWWsT7nlTrc" }, "source": [ "# Retrieving the right document is important" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-q2l9AkZlTrd" }, "source": [ "As we've discussed throughout this series, many modern QA systems take a two-staged approach to answering questions. In the first stage, a document retriever selects *N* potentially relevant documents from a given corpus. Subsequently, a machine comprehension model processes each of the *N* documents to determine an answer to the input question. \n", "\n", "Because of recent advances in NLP and deep learning (i.e., flashy Transformers), the machine comprehension component has typically been the main focus of evaluation and performance enhancement. Retrievers have received limited attention in the context of QA, despite their obvious importance: stage two of an IR QA system is bounded by the performance of stage one. Let's get more specific.\n", "\n", "We [recently explained methods](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html) that - given a question and context passage - enable BERT-like models to produce robust answers by selectively processing predictions and by refraining from answering certain questions at all. While the ability to properly comprehend a passage and produce a correct answer is a critical feature of any QA tool, the success of the overall system is highly dependent on first providing a correct passage to read through. 
Without being fed a context passage that actually contains the ground-truth answer, the overall system's performance is limited to how well it can predict no-answer questions. \n", "\n", "To demonstrate, we'll revisit an example from our [second blog post](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html), in which we asked three questions of a Wikipedia search engine-based QA system:" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "73yug9x4lTrd" }, "source": [ "```\n", "**Example 1: Incorrect**\n", "Question: When was Barack Obama born?\n", "Top wiki result: \n", "Answer: 18 June 1936 / February 2 , 1961 / \n", "\n", "**Example 2: Correct**\n", "Question: Why is the sky blue?\n", "Top wiki result: \n", "Answer: Rayleigh scattering / \n", "\n", "**Example 3: Correct**\n", "Question: How many sides does a pentagon have?\n", "Top wiki result: \n", "Answer: five / \n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YL4_stz3lTre" }, "source": [ "In Example 1, the Reader had no chance of producing the correct answer because of its outright absence from the context served up by the Retriever. Namely, the Retriever erroneously provided a page about Barack Obama Sr. instead of his son, the former US President. In this case, the only way the Reader could have possibly produced the correct answer was if the correct answer was actually not to answer at all. \n", "\n", "On the flip side, in Example 3, the Retriever did not identify the globally \"correct\" document - it returned an article about \"The Pentagon\" instead of a page about geometry - but nonetheless, it provided enough context for the Reader to succeed.\n", "\n", "These quick examples illustrate why an effective Retriever is crucial for an end-to-end QA system. Now let's take a deeper look at a classic tool used for information retrieval - Elasticsearch." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "sa6UyWz0lTre" }, "source": [ "# Elasticsearch as an IR Tool" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "v1MjzT9zlTrf" }, "source": [ "![](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elasticsearch-logo.png?raw=1)\n", "\n", "Modern QA systems employ a variety of techniques for the task of information retrieval, ranging from traditional sparse vector word matching (e.g., Elasticsearch) to [novel approaches](https://arxiv.org/pdf/2004.04906.pdf) using dense representations of encoded passages combined with [efficient search capabilities](https://github.com/facebookresearch/faiss). Despite the flurry of contemporary research efforts in this area, the traditional sparse vector approach performs very well overall, and has only recently been overtaken by embedding-based systems for QA retrieval tasks. For that reason, we'll explore Elasticsearch as an easy-to-use framework for document retrieval. So, what exactly is Elasticsearch?\n", "\n", "Elasticsearch is a powerful open-source search and analytics engine built on the [Apache Lucene](https://lucene.apache.org/) library that is capable of handling all types of data - including textual, numerical, geospatial, structured, and unstructured data. It is built to scale with a robust set of features, rich ecosystem, and diverse list of client libraries, making it easy to integrate and use. 
In the context of information retrieval for automated question answering, we are keenly interested in the features surrounding full-text search. \n", "\n", "Elasticsearch provides a convenient way to index documents so they can quickly be queried for nearest neighbor search using a similarity metric based on TF-IDF. Specifically, it uses [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/) term weighting to represent question and context passages as high-dimensional, sparse vectors that are efficiently searched in an inverted index. For more information on how an inverted index works under the hood, we recommend this quick and concise [blog post](https://codingexplained.com/coding/elasticsearch/understanding-the-inverted-index-in-elasticsearch)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "aGSmeXNtlTrh" }, "source": [ "## Using Elasticsearch with SQuAD2.0" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KxL9yAI1lTri" }, "source": [ "With this basic understanding of how Elasticsearch works, let's dive in and build our own Document Retrieval system by indexing a set of Wikipedia article paragraphs that support questions and answers from the SQuAD2.0 dataset. Before we get started, we'll need to download and prepare data from SQuAD2.0." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "IiwrK5s6lTri" }, "source": [ " **Download and Prepare SQUAD2.0**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 411 }, "colab_type": "code", "id": "LR4iRtkKlTrj", "outputId": "45e3845e-dd88-40d1-fa3a-feb54038347a", "scrolled": true }, "outputs": [], "source": [ "# collapse-hide\n", "\n", "# Download the SQuAD2.0 train & dev sets\n", "!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json\n", "!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json\n", " \n", "import json" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "WSxO51NglTrn" }, "source": [ "A common practice in IR for QA is to segment large articles into smaller passages before indexing, for two main reasons:\n", "1. Transformer-based Readers are slow; providing an entire Wikipedia article to BERT for processing can take 5 - 30 seconds, even with a decent GPU!\n", "2. Smaller passages reduce noise; by identifying a more concise context passage for BERT to read through, we reduce the chance of BERT getting lost.\n", "\n", "Of course, the chunking method proposed here doesn't come without a cost. Larger documents contain more information on which to retrieve. By reducing passage size, we are potentially trading off system recall for speed - although, as we will discuss later in this post, there are techniques to alleviate this.\n", "\n", "With our chunking approach, each article paragraph will be prepended with the article title, and collectively serve as the corpus of documents over which our Elasticsearch Retriever will search. In practice, open-domain QA systems sit atop massive collections of documents (think: all of Wikipedia) to provide a breadth of information from which to answer general-knowledge questions. 
For the purposes of demonstrating Elasticsearch functionality, we will limit our corpus to only the Wikipedia articles supporting SQuAD2.0 questions.\n", "\n", "The following `parse_qa_records` function will extract question/answer examples, as well as paragraph content from the SQuAD2.0 data set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "FNYE9t_5lTrn" }, "outputs": [], "source": [ "# collapse-hide\n", "\n", "def parse_qa_records(data):\n", " '''\n", " Loop through SQuAD2.0 dataset and parse out question/answer examples and unique article paragraphs\n", " \n", " Returns:\n", " qa_records (list) - Question/answer examples as list of dictionaries\n", " wiki_articles (list) - Unique Wikipedia titles and article paragraphs recreated from SQuAD data\n", " \n", " '''\n", " num_with_ans = 0\n", " num_without_ans = 0\n", " qa_records = []\n", " wiki_articles = {}\n", " \n", " for article in data:\n", " \n", " for i, paragraph in enumerate(article['paragraphs']):\n", " \n", " wiki_articles[article['title']+f'_{i}'] = article['title'] + ' ' + paragraph['context']\n", " \n", " for questions in paragraph['qas']:\n", " \n", " qa_record = {}\n", " qa_record['example_id'] = questions['id']\n", " qa_record['document_title'] = article['title']\n", " qa_record['question_text'] = questions['question']\n", " \n", " try: \n", " qa_record['short_answer'] = questions['answers'][0]['text']\n", " num_with_ans += 1\n", " except:\n", " qa_record['short_answer'] = \"\"\n", " num_without_ans += 1\n", " \n", " qa_records.append(qa_record)\n", " \n", " \n", " wiki_articles = [{'document_title':title, 'document_text': text}\\\n", " for title, text in wiki_articles.items()]\n", " \n", " print(f'Data contains {num_with_ans} question/answer pairs with a short answer, and {num_without_ans} without.'+\n", " f'\\nThere are {len(wiki_articles)} unique wikipedia article paragraphs.')\n", " \n", " return qa_records, wiki_articles" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "colab_type": "code", "id": "dZHrwJBblTrq", "outputId": "5be3286a-0b51-4075-a027-b45deba70a4d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data contains 86821 question/answer pairs with a short answer, and 43498 without.\n", "There are 19035 unique wikipedia article paragraphs.\n", "Data contains 5928 question/answer pairs with a short answer, and 5945 without.\n", "There are 1204 unique wikipedia article paragraphs.\n" ] } ], "source": [ "# load and parse data\n", "train_file = \"data/squad/train-v2.0.json\"\n", "dev_file = \"data/squad/dev-v2.0.json\"\n", "\n", "train = json.load(open(train_file, 'rb'))\n", "dev = json.load(open(dev_file, 'rb'))\n", "\n", "qa_records, wiki_articles = parse_qa_records(train['data'])\n", "qa_records_dev, wiki_articles_dev = parse_qa_records(dev['data'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Fyk_WZn6lTrt", "outputId": "3d23397f-830d-47e3-8649-9591e59ab1b1" }, "outputs": [ { "data": { "text/plain": [ "{'example_id': '56d43c5f2ccc5a1400d830ab',\n", " 'document_title': 'Beyoncé',\n", " 'question_text': 'What was the first album Beyoncé released as a solo artist?',\n", " 'short_answer': 'Dangerously in Love'}" ] }, "execution_count": 90, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# parsed record example\n", "qa_records[10]" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "GJkzpTrElTry", "outputId": "f495bb9a-d411-47aa-9e0d-48a0e8e9cd01" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'document_title': 'Beyoncé_10', 'document_text': 'Beyoncé Beyoncé\\'s first solo recording was a feature on Jay Z\\'s \"\\'03 Bonnie & Clyde\" that was released in October 2002, peaking at number four on the U.S. Billboard Hot 100 chart. Her first solo album Dangerously in Love was released on June 24, 2003, after Michelle Williams and Kelly Rowland had released their solo efforts. The album sold 317,000 copies in its first week, debuted atop the Billboard 200, and has since sold 11 million copies worldwide. The album\\'s lead single, \"Crazy in Love\", featuring Jay Z, became Beyoncé\\'s first number-one single as a solo artist in the US. The single \"Baby Boy\" also reached number one, and singles, \"Me, Myself and I\" and \"Naughty Girl\", both reached the top-five. The album earned Beyoncé a then record-tying five awards at the 46th Annual Grammy Awards; Best Contemporary R&B Album, Best Female R&B Vocal Performance for \"Dangerously in Love 2\", Best R&B Song and Best Rap/Sung Collaboration for \"Crazy in Love\", and Best R&B Performance by a Duo or Group with Vocals for \"The Closer I Get to You\" with Luther Vandross.'}\n" ] } ], "source": [ "# parsed wiki paragraph example\n", "print(wiki_articles[10])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OahFS1vhlTr0" }, "source": [ "**Download Elasticsearch**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V_nOLSOXlTr0" }, "source": [ "With our data ready to go, let's download, install, and configure Elasticsearch. We recommend opening this post as a Colab notebook and executing the following code snippet to set up Elasticsearch. Alternatively, you can install and launch Elasticsearch on your local machine by following the instructions [here](https://www.elastic.co/downloads/elasticsearch).\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": {}, "colab_type": "code", "id": "KHmyF5RklTr1" }, "outputs": [], "source": [ "# collapse-hide\n", "\n", "# if using Colab - start Elasticsearch from source\n", "! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q\n", "! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz\n", "! chown -R daemon:daemon elasticsearch-7.6.2\n", "\n", "import os\n", "from subprocess import Popen, PIPE, STDOUT\n", "es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],\n", " stdout=PIPE, stderr=STDOUT,\n", " preexec_fn=lambda: os.setuid(1) # as daemon\n", " )\n", "# wait until ES has started\n", "! sleep 30" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "wy8Tzay-lTr3" }, "source": [ "**Load Data into Elasticsearch**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9cespYnhlTr4" }, "source": [ "We'll use the [official low-level Python client library](https://elasticsearch-py.readthedocs.io/en/master/) for interacting with Elasticsearch." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "BMIa-bJHlTr4", "outputId": "95a011a7-4970-42ba-9d66-5f9b5581a87a" }, "outputs": [], "source": [ "# collapse-hide\n", "!pip install elasticsearch\n", "!pip install tqdm" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Bpb2NmWClTr6" }, "source": [ "By default, Elasticsearch is launched locally on port 9200. We first need to instantiate an Elasticsearch client object and connect to the service." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "DXvAmLwllTr7", "outputId": "dc593d7b-264e-468c-ff7e-77b18a4fc047" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "from elasticsearch import Elasticsearch\n", "\n", "config = {'host':'localhost', 'port':9200}\n", "es = Elasticsearch([config])\n", "\n", "# test connection\n", "es.ping()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fLI_1deTlTr9" }, "source": [ "Before we go further, let's introduce a few concepts that are specific to Elasticsearch and the process of indexing data. An _index_ is a collection of documents that have common characteristics (similar to a database schema in an RDBMS). _Documents_ are JSON objects having their own set of key-value pairs consisting of various data types (similar to rows/fields in RDBMS). \n", "\n", "When we add a document into an index, the document's text fields undergo analysis prior to being indexed. This means that when executing a search query against an index, we are actually searching against the post-processed representation that is stored in the inverted index, not the raw input document itself.\n", "\n", "![Elasticsearch Index Process](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elastic_index_process.png?raw=1)\n", "[Image Credit](https://codingexplained.com/coding/elasticsearch/understanding-analysis-in-elasticsearch-analyzers#:~:text=A%20Closer%20Look%20at%20Analyzers,documents%20when%20they%20are%20indexed.&text=An%20analyzer%20consists%20of%20three,them%20changing%20the%20input%20stream.)\n", "\n", "The analysis process is a customizable pipeline carried out by an _Analyzer_. Elasticsearch analyzer pipelines are composed of three sequential steps: _character filters_, a _tokenizer_, and _token filters._ Each of these components modifies the input stream of text according to some configurable settings.\n", "\n", "- **Character filters** have the ability to add, remove, or replace characters. A common application is to strip `html` markup from the raw input. \n", "- The character-filtered text is passed to a **tokenizer** which breaks up the input string into individual tokens. The default (`standard`) tokenizer splits tokens on whitespace, and most symbols (like commas, periods, semicolons, etc.)\n", "- The token stream is passed to a **token filter** which adds, removes, or modifies tokens. Typical token filters include converting all text to `lowercase`, and removing `stop` words. \n", "\n", "Elasticsearch comes with several built-in Analyzers that satisfy common use cases and defaults to the `Standard Analyzer`. 
The Standard Analyzer doesn't contain any character filters, uses a `standard` tokenizer, and applies a `lowercase` token filter. Let's take a look at an example sentence as it's passed through this pipeline:\n", "\n", "> \"I'm in the mood for drinking semi-dry red wine!\"\n", "\n", "![Elasticsearch Analyzer Pipeline](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elasticsearch_standard_analyzer.png?raw=1)\n", "[Image Credit](https://codingexplained.com/coding/elasticsearch/understanding-analysis-in-elasticsearch-analyzers#:~:text=A%20Closer%20Look%20at%20Analyzers,documents%20when%20they%20are%20indexed.&text=An%20analyzer%20consists%20of%20three,them%20changing%20the%20input%20stream.)\n", "\n", "Crafting analyzers to your use case requires domain knowledge of the problem and dataset at hand, and doing so properly is key to optimizing relevance scoring for your search application. We found [this blog series](https://medium.com/elasticsearch/contents-cebdc419c8c9) very useful in explaining the importance of analysis in Elasticsearch." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_oaKhW1flTr9" }, "source": [ "**_Create an Index_**\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hrlsaFgslTr-" }, "source": [ "Let's create a new index and add our Wikipedia articles to it. To do so, we provide a name and optionally some index configurations. Here we are specifying a set of `mappings` that indicate our anticipated index schema, data types, and how the text fields should be processed. If no `body` is passed, Elasticsearch will automatically infer fields and data types from incoming documents, as well as apply the `Standard Analyzer` to any text fields." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "vu6imTr6lTr-", "outputId": "6bfb7bfe-db9a-4ad5-878c-b23a56fc5960" }, "outputs": [ { "data": { "text/plain": [ "{'acknowledged': True,\n", " 'index': 'squad-standard-index',\n", " 'shards_acknowledged': True}" ] }, "execution_count": 7, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "index_config = {\n", " \"settings\": {\n", " \"analysis\": {\n", " \"analyzer\": {\n", " \"standard_analyzer\": {\n", " \"type\": \"standard\"\n", " }\n", " }\n", " }\n", " },\n", " \"mappings\": {\n", " \"dynamic\": \"strict\", \n", " \"properties\": {\n", " \"document_title\": {\"type\": \"text\", \"analyzer\": \"standard_analyzer\"},\n", " \"document_text\": {\"type\": \"text\", \"analyzer\": \"standard_analyzer\"}\n", " }\n", " }\n", " }\n", "\n", "index_name = 'squad-standard-index'\n", "es.indices.create(index=index_name, body=index_config, ignore=400)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3vqk73V3lTsB" }, "source": [ "**_Populate the Index_**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "A11K4dsPlTsB" }, "source": [ "We can then loop through our list of Wikipedia titles and articles and add them to our newly created Elasticsearch index." 
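, "\n",
 "\n",
 "The `populate_index` helper below indexes records one at a time, which is simple and works fine at this scale (~20k paragraphs). For a much larger corpus, a reasonable alternative (a sketch we haven't benchmarked here) is the client's `bulk` helper, which sends documents to the index in batches:\n",
 "\n",
 "```python\n",
 "from elasticsearch.helpers import bulk\n",
 "\n",
 "def bulk_populate_index(es_obj, index_name, evidence_corpus):\n",
 "    # one indexing action per record; assumes the whole corpus fits in memory\n",
 "    actions = [{'_index': index_name, '_id': i, '_source': rec}\n",
 "               for i, rec in enumerate(evidence_corpus)]\n",
 "    n_success, _ = bulk(es_obj, actions)\n",
 "    print(f'Indexed {n_success} documents into {index_name}')\n",
 "```"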
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": {}, "colab_type": "code", "id": "Pnd12zSHlTsB" }, "outputs": [], "source": [ "# collapse-hide\n", "from tqdm.notebook import tqdm\n", "\n", "def populate_index(es_obj, index_name, evidence_corpus):\n", " '''\n", " Loads records into an existing Elasticsearch index\n", "\n", " Args:\n", " es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object\n", " index_name (str) - Name of index\n", " evidence_corpus (list) - List of dicts containing data records\n", "\n", " '''\n", "\n", " for i, rec in enumerate(tqdm(evidence_corpus)):\n", " \n", " try:\n", " index_status = es_obj.index(index=index_name, id=i, body=rec)\n", " except:\n", " print(f'Unable to load document {i}.')\n", " \n", " n_records = es_obj.count(index=index_name)['count']\n", " print(f'Succesfully loaded {n_records} into {index_name}')\n", "\n", "\n", " return" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 83, "referenced_widgets": [ "a493a4c07a2743899571330cf6476b74", "e94a6f98578743c096c9b045d9dfdf81", "28a305d13e584615b9ba3cd9a37bdd56", "a299f8fa902348b7807b4d97ebc6027d", "a44c2693aee14b89a4c6512c85142f51", "a8cebe4ccf004d84bdc37ce44d1192e8", "b04d963baf9a40789305d1372ffe1aab", "dca5b0a8c1014a66bc8c5ba988878a85" ] }, "colab_type": "code", "id": "k44Drj_HlTsE", "outputId": "7e54768d-c113-4a39-ed76-b37a5b4d3578" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a493a4c07a2743899571330cf6476b74", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=20239.0), HTML(value='')))" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Succesfully loaded 20239 into squad-standard-index\n" ] } ], "source": [ "all_wiki_articles = wiki_articles + wiki_articles_dev\n", "\n", "populate_index(es_obj=es, index_name='squad-standard-index', evidence_corpus=all_wiki_articles)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hqCpXHL6lTsG" }, "source": [ "**_Search the Index_**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KsLqs2THlTsH" }, "source": [ "Wahoo! We now have some documents loaded into an index. Elasticsearch provides a rich [query language](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) that supports a diverse range of query types. For this example, we'll use the standard query for performing full text search called a `match` query. By default, Elasticsearch sorts and returns a JSON response of search results based on a computed [relevance score](https://qbox.io/blog/practical-guide-elasticsearch-scoring-relevancy#:~:text=Together%2C%20these%20combine%20into%20a,number%20known%20as%20the%20_score.), which indicates how well a given document matches the query. 
In addition, the search response also includes the amount of time the query took to run.\n", "\n", "Let's look at a simple `match` query used to search the `document_text` field in our newly created index.\n", "\n", "> Important: As previously mentioned, all documents in the index have gone through an analysis process prior to indexing; this is called _index time analysis._ To maintain consistency in matching text queries against the post-processed index tokens, the same Analyzer used on a given field at index time is automatically applied to the query text at search time. _Search time analysis_ is applied depending on which query type is used; `match` queries apply search time analysis by default." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": {}, "colab_type": "code", "id": "UfbzT03mlTsH" }, "outputs": [], "source": [ "# collapse-hide\n", "def search_es(es_obj, index_name, question_text, n_results):\n", " '''\n", " Execute an Elasticsearch query on a specified index\n", " \n", " Args:\n", " es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object\n", " index_name (str) - Name of index to query\n", " query (dict) - Query DSL\n", " n_results (int) - Number of results to return\n", " \n", " Returns\n", " res - Elasticsearch response object\n", " \n", " '''\n", " \n", " # construct query\n", " query = {\n", " 'query': {\n", " 'match': {\n", " 'document_text': question_text\n", " }\n", " }\n", " }\n", " \n", " res = es_obj.search(index=index_name, body=query, size=n_results)\n", " \n", " return res" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": {}, "colab_type": "code", "id": "y1pePxDFlTsK" }, "outputs": [], "source": [ "question_text = 'Who was the first president of the Republic of China?'\n", "\n", "# execute query\n", "res = search_es(es_obj=es, index_name='squad-standard-index', question_text=question_text, n_results=10)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "colab_type": "code", "id": "vEw77SallTsM", "outputId": "567623ea-649d-4d66-cd93-9a3cde84c28a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Question: Who was the first president of the Republic of China?\n", "Query Duration: 74 milliseconds\n", "Title, Relevance Score:\n" ] }, { "data": { "text/plain": [ "[('Modern_history_54', 23.131157),\n", " ('Nanjing_18', 17.076923),\n", " ('Republic_of_the_Congo_10', 16.840765),\n", " ('Prime_minister_16', 16.137493),\n", " ('Korean_War_29', 15.801523),\n", " ('Korean_War_43', 15.586578),\n", " ('Qing_dynasty_52', 15.291815),\n", " ('Chinese_characters_55', 14.773873),\n", " ('Korean_War_23', 14.736045),\n", " ('2008_Sichuan_earthquake_48', 14.417962)]" ] }, "execution_count": 12, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "print(f'Question: {question_text}')\n", "print(f'Query Duration: {res[\"took\"]} milliseconds')\n", "print('Title, Relevance Score:')\n", "[(hit['_source']['document_title'], hit['_score']) for hit in res['hits']['hits']]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JloR3u_nlTsO" }, "source": [ "## Evaluating Retriever Performance" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "UsM5mdnHlTsP" }, "source": [ "Ok, we now have a basic understanding of how to use Elasticsearch as an IR tool to return some results for a given question, but how do we know if it's working? 
How do we evaluate what a good IR tool looks like?\n", "\n", "We'll need two things to evaluate our Retriever: some labeled examples (i.e., SQuAD2.0 question/answer pairs) and some performance metrics. In the conventional world of information retrieval, there are [many metrics](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)) used to quantify the relevance of query results, largely centered around the concepts of precision and recall. For IR in the context of QA, these ideas are adapted into two commonly used evaluation metrics: *recall* and *mean average precision (mAP)*. Additionally, we consider the amount of time required to execute a query, since the main point of having a two-stage QA system is to efficiently narrow the large search space for our Reader.\n", "\n", "**Recall**\n", "\n", "Traditionally, _recall_ in IR indicates the fraction of all relevant documents that are retrieved. In this case, we are less concerned with finding *all* of the passages containing the answer and more concerned with the binary presence of a passage containing the correct answer being returned. In that light, a Retriever's recall is defined across a set of questions as *the percentage of questions for which the answer segment appears in one of the top N pages returned by the search method.*\n", "\n", "**Mean Average Precision**\n", "\n", "While the _recall_ metric focuses on the minimum viable result set to enable a Reader for success, we do still care about the composition of that result set. We want a metric that rewards a Retriever for: a) returning a lot of answer-containing documents in the result set (i.e., the traditional meaning of precision), and b) returning those answer-containing documents higher up in the result set than non-answer-containing documents (i.e., ranking them correctly). This is precisely (🙃) what _mean average precision_ (mAP) does for us. \n", "\n", "To explain mAP further, let's first break down the concept of average precision for information retrieval. If our Retriever is asked to return _N_ documents and _m_ of those documents contain the true answer, then average precision (AP) is defined as:\n", "\n", "![](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/map_equation.png?raw=1)\n", "\n", "where *rel(k)* is just a binary indication of whether the kth passage contains the correct answer or not. Using a concrete example, consider retrieving _N_=3 documents, of which only one contains the correct answer. Here are three scenarios for how this could happen:\n", "\n", "![](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/map_example.png?raw=1)\n", " \n", "Scenario A is rewarded with the highest score because it was able to correctly rank the ground truth document relative to the others returned. Since average precision is calculated on a per-query basis, the mean average precision is simply *the average AP across all queries*. \n", "\n", "Now, using our Wikipedia passage index, let's define a function called `evaluate_retriever` to loop through all question/answer examples from the SQuAD2.0 train set and see how well our Elasticsearch Retriever performs in terms of recall, mAP, and average query duration when retrieving *N=3* passages."
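, "\n",
 "\n",
 "Before running the full evaluation, here is a quick sanity check of the AP formula (reading the figure above as Scenarios A, B, and C placing the single answer-containing passage first, second, and third, respectively):\n",
 "\n",
 "```python\n",
 "# AP for N=3 retrieved passages, exactly one of which contains the answer\n",
 "def avg_precision(rel):\n",
 "    m = sum(rel)\n",
 "    return sum(sum(rel[:k + 1]) / (k + 1) for k, r in enumerate(rel) if r) / m if m else 0\n",
 "\n",
 "for name, rel in [('A', [1, 0, 0]), ('B', [0, 1, 0]), ('C', [0, 0, 1])]:\n",
 "    print(name, round(avg_precision(rel), 2))  # A 1.0, B 0.5, C 0.33\n",
 "```\n",
 "\n",
 "The `average_precision` helper in the next cell implements this same calculation."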
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "yiakwwmklTsP" }, "outputs": [], "source": [ "# collapse-hide\n", "import numpy as np\n", "import pandas as pd\n", "\n", "def average_precision(binary_results):\n", " \n", " ''' Calculates the average precision for a list of binary indicators '''\n", " \n", " m = 0\n", " precs = []\n", "\n", " for i, val in enumerate(binary_results):\n", " if val == 1:\n", " m += 1\n", " precs.append(sum(binary_results[:i+1])/(i+1))\n", " \n", " ap = (1/m)*np.sum(precs) if m else 0\n", " \n", " return ap\n", "\n", "\n", "def evaluate_retriever(es_obj, index_name, qa_records, n_results):\n", " '''\n", " This function loops through a set of question/answer examples from SQuAD2.0 and \n", " evaluates Elasticsearch as a information retrieval tool in terms of recall, mAP, and query duration.\n", " \n", " Args:\n", " es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object\n", " index_name (str) - name of index to query\n", " qa_records (list) - list of qa_records from preprocessing steps\n", " n_results (int) - the number of results ElasticSearch should return for a given query\n", " \n", " Returns:\n", " test_results_df (pd.DataFrame) - a dataframe recording search results info for every example in qa_records\n", " \n", " '''\n", " \n", " results = []\n", " \n", " for i, qa in enumerate(tqdm(qa_records)):\n", " \n", " ex_id = qa['example_id']\n", " question = qa['question_text']\n", " answer = qa['short_answer']\n", " \n", " # execute query\n", " res = search_es(es_obj=es_obj, index_name=index_name, question_text=question, n_results=n_results)\n", " \n", " # calculate performance metrics from query response info\n", " duration = res['took']\n", " binary_results = [int(answer.lower() in doc['_source']['document_text'].lower()) for doc in res['hits']['hits']]\n", " ans_in_res = int(any(binary_results))\n", " ap = average_precision(binary_results)\n", "\n", " rec = (ex_id, question, answer, duration, ans_in_res, ap)\n", " results.append(rec)\n", " \n", " # format results dataframe\n", " cols = ['example_id', 'question', 'answer', 'query_duration', 'answer_present', 'average_precision']\n", " results_df = pd.DataFrame(results, columns=cols)\n", " \n", " # format results dict\n", " metrics = {'Recall': results_df.answer_present.value_counts(normalize=True)[1],\n", " 'Mean Average Precision': results_df.average_precision.mean(),\n", " 'Average Query Duration':results_df.query_duration.mean()}\n", " \n", " return results_df, metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "referenced_widgets": [ "c2e8a34b484b47eca6cd2010052a6b44" ] }, "colab_type": "code", "id": "N9jLAXDdlTsR", "outputId": "7c4aa880-a287-47a2-d6e9-ca03adbaa034" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c2e8a34b484b47eca6cd2010052a6b44", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=92749.0), HTML(value='')))" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# combine train/dev examples and filter out SQuAD records that\n", "# do not have a short answer for the given question\n", "all_qa_records = qa_records+qa_records_dev\n", "qa_records_answerable = [record for record in all_qa_records if record['short_answer'] != '']\n", "\n", "# run evaluation\n", "results_df, metrics = 
evaluate_retriever(es_obj=es, index_name='squad-standard-index', qa_records=qa_records_answerable, n_results=3)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "tOxeFQqFlTsT", "outputId": "6a8bf424-bb46-447d-b386-c1d7b9be2146" }, "outputs": [ { "data": { "text/plain": [ "{'Recall': 0.8226180336176131,\n", " 'Mean Average Precision': 0.7524133234140888,\n", " 'Average Query Duration': 3.0550841518506937}" ] }, "execution_count": 135, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "metrics" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hU2Klvn-lTsU" }, "source": [ "## Improving Search Results with a Custom Analyzer" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "S-8WrpnZlTsV" }, "source": [ "Identifying a correct passage in the Top 3 results for 82% of the SQuAD questions in ~3 milliseconds per question is not too bad! But that means that we've effectively limited our overall QA system to an 82% upper bound on performance. How can we improve upon this?\n", "\n", "One simple and obvious way to increase recall would be to just retrieve more passages. The following figure shows the effects of varying corpus size and result size on Elasticsearch retriever recall. As expected, we see that the number of passages retrieved (i.e. *Top N*) has a dramatic impact on recall; a ~10-15 point jump from 1 to 3 passages returned, and a ~5 point jump for each of the other tiers. We also see a gradual decrease in recall as corpus size increases, which isn't surprising.\n", "\n", "![Recall vs. Corpus Size](https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/recall_v_corpussize.png?raw=1 \"Experimental results demonstrating the impact of increasing corpus size and number of results retrieved on Elasticsearch recall.\")\n", "\n", "While increasing the number of passages retrieved is effective, it also has implications for overall system performance, as the (already slow) Reader now has to reason over more text. Instead, we can lean on best practices in the well-explored domain of information retrieval.\n", "\n", "Optimizing full text search is a battle between precision (returning as few irrelevant documents as possible) and recall (returning as many relevant documents as possible). Matching only exact words in the question results in high precision; however, it misses out on many passages that could be relevant. We can cast a wider net by searching for terms that are not _exactly_ the same as those in the question, but are related in some way. Here, Elasticsearch Analyzers can help. Earlier in this post, we described how Analyzers provide a flexible and extensible method to tailor search for a given dataset. Two analysis components that can help cast a wider net are _stop word removal_ and _stemming._\n", "\n", "- **Stop words:** Stop words are the most frequently occurring words in the English language (for example: \"and,\" \"the,\" \"to,\" etc.) and add minimal semantic value to a piece of text. It is common practice to remove them in order to decrease the size of the index and increase the relevance of search results. \n", "\n", "- **Stemming:** The English language is inflected; words can alter their written form to express different meanings. For example, “sing,” “sings,” “sang,” and “singing” are written with slight differences, but all really mean the same thing (albeit with varying tenses). 
Stemming algorithms exploit the fact that search intent is _usually_ word-form agnostic, and attempt to reduce inflected words to their root form: consequently improving retrievability. We'll implement the [Snowball](https://snowballstem.org/) stemming algorithm as a token filter in our custom Analyzer." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 284, "referenced_widgets": [ "f8d18feb30b24de5bbbb869cc9a31ae8", "15f44b8cf87d459bb6a40b5e6b9db470", "f4d90214c67749168472a2ba3cb0d72a", "95923a713e324d18bc9fc82a466703c4", "40b3ea36c5ca407597e8ce6c738c9786", "9061d4497e4443029cbb21b77281cf31", "139cd5c5c43f436495dfddbdfc7d35c3", "ef2c1a3506c549b7878e3ea4a5cc565f" ] }, "colab_type": "code", "id": "dPMz6_6ZlTsV", "outputId": "e4a16e49-387b-4913-adc6-db5def10d339" }, "outputs": [], "source": [ "# create new index\n", "index_config = {\n", " \"settings\": {\n", " \"analysis\": {\n", " \"analyzer\": {\n", " \"stop_stem_analyzer\": {\n", " \"type\": \"custom\",\n", " \"tokenizer\": \"standard\",\n", " \"filter\":[\n", " \"lowercase\",\n", " \"stop\",\n", " \"snowball\"\n", " ]\n", " \n", " }\n", " }\n", " }\n", " },\n", " \"mappings\": {\n", " \"dynamic\": \"strict\", \n", " \"properties\": {\n", " \"document_title\": {\"type\": \"text\", \"analyzer\": \"stop_stem_analyzer\"},\n", " \"document_text\": {\"type\": \"text\", \"analyzer\": \"stop_stem_analyzer\"}\n", " }\n", " }\n", " }\n", "\n", "es.indices.create(index='squad-stop-stem-index', body=index_config, ignore=400)\n", "\n", "# populate the index\n", "populate_index(es_obj=es, index_name='squad-stop-stem-index', evidence_corpus=all_wiki_articles)\n", "\n", "# evaluate retriever performance\n", "stop_stem_results_df, stop_stem_metrics = evaluate_retriever(es_obj=es, index_name='squad-stop-stem-index',\\\n", " qa_records=qa_records_answerable, n_results=3)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "13ETw8oBlTsY", "outputId": "918b6bb6-1eff-4138-940c-e5784a39300b" }, "outputs": [ { "data": { "text/plain": [ "{'Recall': 0.8501115914996388,\n", " 'Mean Average Precision': 0.7800892731997112,\n", " 'Average Query Duration': 0.7684287701215108}" ] }, "execution_count": 137, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "stop_stem_metrics" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4jt2vmgelTsa" }, "source": [ "Awesome! We've increased recall and mAP by about 3 points _and_ reduced our average query duration by nearly 4 times, through simple preprocessing steps that just scratch the surface of tailored analysis in Elasticsearch. \n", "\n", "There is no \"one-size-fits-all\" recipe for optimizing search relevance, and every implementation will be different. In addition to custom analysis, there are many other methods for increasing search recall - for example, query expansion, which introduces additional tokens/phrases into a query at search time. We'll save that topic for another post. Instead, let’s take a look at how the Retriever's performance affects a QA system." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "M4YOvX-ilTsb" }, "source": [ "# The Full IR QA System" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RAFJgQcjnbky" }, "source": [ "We used the questions from the train set to evaluate the stand-alone retriever, in order to provide as large a collection as possible. 
However, BERT has been trained on those questions and would return inflated performance values if we used them for full-system evaluation. So let’s resort to our trusty SQuAD2.0 dev set. \n", "\n", "> Note: This section focuses on a discussion. The code to reproduce our results can be found at the end. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ah5HxO2Gs7Gf" }, "source": [ "## Connecting the retriever to the reader\n", "\n", "In our last post, we evaluated a BERT-like model on the SQuAD2.0 dev set by providing the model with a paragraph that perfectly aligned with the question. This time, the retriever will serve up a collection of relevant documents. We created a reader class that leverages the Hugging Face (HF) question-answering [`pipeline`](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.QuestionAnsweringPipeline) to do the brunt of the work for us (loading models and tokenizers, converting text to features, chunking, prediction, etc.), but how should it process multiple documents from the retriever? And how should it determine which document contains the best answer? \n", "\n", "This turns out to be one of the thornier subtleties in building a full QA system. There are several ways to approach this problem. Here are two that we tried: \n", "\n", "1. Pass each document to the reader individually, then aggregate the resulting scores.\n", "2. Concatenate all documents into one long passage and pass to the reader simultaneously.\n", "\n", "\n", "Both methods have pros and cons. Let's take a look at them. \n", "\n", "**Pass each document individually**\n", "\n", "In Option 1, the reader returns answers and scores for each document. A series of heuristics must be developed to determine which answer is the best, and when the null answer should be returned. For this post, we chose a simple but reasonable heuristic: \"Only return null if the highest scoring answer in each document is null; otherwise return the highest scoring non-null answer.\"\n", "\n", "Unfortunately, a direct comparison of answer scores between documents is not technically possible. The reason lies in the type of score returned by the HF pipeline: a softmax probability over all the tokens _in that document_. This means that the only meaningful comparisons are between answers _from the same document_ whose probabilities will sum to 1. Comparing an answer with a score of 0.78 from one document is not guaranteed to be better than an answer with a score of 0.70 from another document!\n", "\n", "Finally, this option is slower (in our current implementation) because each article is passed individually, leading to multiple BERT calls. \n", "\n", "**Pass all documents together as one long context**\n", "\n", "Option 2 circumvents many of these challenges but leads to other problems. The pros are:\n", "1. all candidate answers are scored on the same probability scale,\n", "2. handling the null answer is more straightforward (we did that [last time](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html)), and\n", "3. we can take advantage of faster compute, since HF will chunk long documents for us and pass them through BERT in a batch.\n", "\n", "\n", "On the other hand, when concatenating multiple passages, there's a good chance that BERT will see a mixed context: the end of one paragraph grafted onto the beginning of another, for example. 
This could make it more difficult for the model to correctly identify an answer in a potentially confusing context. Another drawback is that it's more difficult to trace which of the input documents the answer ultimately came from. \n", "\n", "Our reader class has two methods: `predict` and `predict_combine`, corresponding to Option 1 and Option 2, respectively. We tested each of them over 1000 examples from the SQuAD2.0 dev set while increasing the number of retrieved documents.\n", "\n", "![](my_icons/qa_eval_combined_vs_individual.png)\n", "\n", "There are two take-aways here. First, we see that the concatenation method (blue bars) outperforms passing documents individually and applying heuristics to the outputs (orange bars). While more sophisticated heuristics can be developed, for short documents (paragraphs in this case), we find that the concatenation method is the most straightforward approach.\n", "\n", "The second thing to notice is that, as the number of retrieved documents increases, the overall performance decreases for both methods. What's going on? When we evaluated the retriever, we found that increasing the number of retrieved documents _increased_ the likelihood that the correct answer was contained in at least one of them. So why does reader performance degrade?" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "AOTKaSZ4uRWu" }, "source": [ "## Evaluating the system\n", "\n", "Standard evaluation on the SQuAD2.0 dev set considers only the best overall answer for each question, but we can create another evaluation metric that mirrors our IR recall metric from earlier. Specifically, we compute the _percent of examples in which the correct answer is found in **at least one** of the top k documents provided by the retriever_. \n", "\n", "![](my_icons/qa_eval_top1_vs_any_topk.png)\n", "\n", "The blue bars are the same as the blue bars in the previous figure, but this time the orange bars represent our new recall-esque metric. What a difference! This demonstrates that when the model is provided with more documents, the correct answer truly is present more often. However, trying to predict which one of those answers is the right one is challenging: this task is not achieved by a simple heuristic, and becomes harder with more documents.\n", "\n", "It may seem counterintuitive, but this behavior does make sense. Let’s imagine a simple system that performs ranked document retrieval and random answer selection. Ranked document retrieval, in this case, means that the correct answer is most likely to be found in the top-most ranked document, with some decreasing probability of being contained in the second- or third-ranked document, and so on. As we retrieve more and more documents, the probability _increases_ that the correct answer is contained in the resulting set. However, as the number of documents increases, so too does the number of possible answers from which to choose - one from each document. Random answer selection over an increasing number of answer choices results in a _decrease_ in performance. Obviously, BERT is not random, but it’s also not _perfect,_ so the trait persists. \n", "\n", "Does this mean we shouldn't use QA systems like this? Of course not! There are several factors to consider: \n", "\n", "1. **Use Case:** If your QA system seeks to provide enhanced search capabilities, then it might not be necessary to predict a single answer with high confidence. 
It might be sufficient to provide answers from several documents for the user to peruse. On the other hand, if your use case seeks to augment a chatbot, then predicting a high confidence answer might be more important for user experience. \n", "\n", "2. **Better heuristics:** While our simple heuristic didn't perform as well as concatenating all the input documents into one long context, there is research into developing heuristics that work. In particular, [one promising approach](https://arxiv.org/abs/1902.01718) develops a combined answer score that considers both the retriever's document ranking and the reader's answer score. \n", "\n", "3. **Document length:** Our concatenation method works reasonably well compared to other methods, but the documents are short. If the document length becomes considerably longer, this method's performance can degrade significantly. \n", "\n", "\n", "**Impact of a retriever on a QA system** \n", "\n", "Considering all that we've learned so far, what is the overall impact of the retriever on a full QA system? Using our concatenation method and returning only the best answer for all questions in the SQuAD2.0 dev set, we can compare results with [our previous blog post](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html) in which we evaluated only the reader. \n", "\n", "![](my_icons/qa_system_vs_reader_only.png)\n", "\n", "As expected, adding a retriever to supply documents to the reader reduces the system's ability to identify the correct answer. This motivates approaches for enhancing the retriever, in order to supply the reader with the best documents possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Final Thoughts\n", "\n", "We did it! We built a full QA system with off-the-shelf parts using Elasticsearch and Hugging Face Transformers. \n", "\n", "We made a series of design choices in building our full QA system, including the choice to index over Wikipedia paragraphs rather than full articles. This allowed us to more easily replicate SQuAD evaluation methods, but it isn't always practical. In the real world, a QA system will need to work with existing indexes, which are typically built over full documents (not paragraphs). In addition to architectural constraints, indexing over full documents provides the retriever with the best chance of returning a relevant document.\n", "\n", "However, passing multiple long documents to a Transformer model is a recipe for boredom -- it will take forever and it likely won't be highly informative. Transformers work best with smaller passages. Thus, extracting a few highly relevant paragraphs from the most relevant document is a better recipe for a practical implementation. This is exactly the approach we'll take next time when we (hopefully) address the biggest question of all: \n", "\n", "> How do I apply a QA system to **my** data? \n", "\n", "Stay tuned!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Post Script: The Code\n", "If you open this notebook in Colab, you'll find several cells below that step through the experiments we ran for the final section. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# hide\n", "!pip install transformers==2.11.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "from transformers.data.processors import SquadV2Processor\n", "processor = SquadV2Processor()\n", "examples = processor.get_dev_examples(data_dir='data/squad/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "!curl -L -O https://raw.githubusercontent.com/melaniebeck/question_answering/8b5200223a4808921233b408d2f744dbd4a1e427/src/readers.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "from readers import Reader\n", "reader = Reader()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "def query(prediction_function, question, index_name, topk):\n", " \"\"\"\n", " Extracts answers from a full QA system: \n", " 1) Constructs query and retrieves the topk relevant documents \n", " 2) Passes those documents to the reader's prediction_function\n", " 3) Returns the topk answers for each of the k documents\n", "\n", " Inputs:\n", " prediction_function: either reader.predict or reader.predict_combine\n", " question: str, question string\n", " index_name: str, name of the index for the retriever\n", " topk: int, number of documents to retrieve from retriever\n", "\n", " Outputs: \n", " answers: dict with format:\n", " {\n", " 'question': question string,\n", " 'answers': list of answer dicts from reader \n", " }\n", " \"\"\"\n", " retriever_results = search_es(es_obj=es, \n", " index_name=index_name, \n", " question_text=question, \n", " n_results=topk)\n", "\n", " passages = retriever_results['hits']['hits']\n", " docs = []\n", " for passage in passages:\n", " doc = {\n", " 'id': passage['_id'],\n", " 'score': passage['_score'],\n", " 'text': passage['_source']['document_text'], \n", " 'title': passage['_source']['document_title'],\n", " }\n", " docs.append(doc)\n", "\n", " answers = prediction_function(question, docs, topk)\n", " return answers" ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "colab": {}, "colab_type": "code", "id": "Zi9757o_qqC6" }, "outputs": [], "source": [ "# hide\n", "from transformers.data.metrics.squad_metrics import squad_evaluate\n", "from transformers.data.metrics.squad_metrics import compute_exact, normalize_answer\n", "\n", "def evaluate_qasystem_squadstyle(\n", " prediction_function,\n", " examples, \n", " index_name, \n", " topk, \n", " output_path='data/',\n", " save_output=True \n", " ):\n", " \"\"\"\n", " Squad-style evaluation requires that only the best answer aggregated from all\n", " top k documentsis is provided as the prediction during evaluation. 
\n", " \n", " Inputs\n", " prediction_function: either reader.predict or reader.predict_combine\n", " examples: list, the SQuAD2.0 dev set as loaded by the squad processors\n", " index_name: str, name of the index in ElasticSearch\n", " topk: int, number of documents to retrieve\n", " output_path: str, directory to store prediction output\n", " save_output: bool, whether to save prediction output\n", " \n", " Outputs\n", " Saved to disk \n", " predictions: Best answer for each SQuAD2.0 question \n", " meta_predictions: Top N answers for each SQuAD question\n", " \n", " Returns\n", " results: OrderedDict of results from the HF squad_evaluate method\n", " \"\"\"\n", " import pickle\n", "\n", " outfile = output_path+f\"predictions_{index_name}_{topk}.pkl\"\n", "\n", " # if we've already computed predictions, load them for evaluation\n", " if os.path.exists(outfile):\n", " predictions = pickle.load(open(outfile, \"rb\"))\n", " else:\n", " predictions = {}\n", " meta_predictions = {}\n", " \n", " for example in tqdm(examples):\n", " # retrieve top N relevant documents from retriever\n", " reader_results = query(prediction_function, \n", " example.question_text, \n", " index_name, \n", " topk\n", " )\n", " # add best answer to predictions\n", " answers = reader_results['answers']\n", " predictions[example.qas_id] = answers[0]['answer_text']\n", "\n", " # for debugging/explainability - save the full answer \n", " # (not just text answer from top hit)\n", " meta_predictions[example.qas_id] = answers\n", "\n", " if save_output:\n", " pickle.dump(predictions, open(outfile, \"wb\"))\n", "\n", " meta_outfile = os.path.splitext(outfile)[0]+\"_meta.pkl\"\n", " pickle.dump(meta_predictions, open(meta_outfile, \"wb\"))\n", " \n", " # compute evaluation with HF \n", " results = squad_evaluate(examples, predictions)\n", " return results\n", "\n", "def evaluate_qasystem_recallstyle(\n", " prediction_function,\n", " examples, \n", " index_name, \n", " topk, \n", " output_path='data/',\n", " save_output=True \n", " ):\n", " \"\"\"\n", " Recall-style evaluation computes the % of all examples that contain the \n", " correct answer in at least one of the documents. No prediction output is \n", " saved in this version because I got tired. \n", "\n", " This version only computes Exact Match because it's really hard to determine\n", " whether the correct answer is within the top k answers when computing F1 \n", " which is continuous between 0 and 1. \n", " \n", " Inputs\n", " prediction_function: either reader.predict or reader.predict_combine\n", " examples: list, the SQuAD2.0 dev set as loaded by the squad processors\n", " index_name: str, name of the index in ElasticSearch\n", " topk: int, number of documents to retrieve\n", " \n", " Outputs \n", " Returns\n", " results: OrderedDict of exact match results\n", " \n", " Notes: \n", " This function utilizes several HF methods and other functionality. \n", " Supporting code is in the cell below. 
\n", " \"\"\"\n", "\n", " qas_id_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}\n", " has_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if has_answer]\n", " no_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if not has_answer]\n", "\n", " has_correct_answer = {}\n", " for example in tqdm(examples):\n", " # retrieve top N relevant documents from retriever\n", " reader_results = query(prediction_function, \n", " example.question_text, \n", " index_name, \n", " topk\n", " )\n", "\n", " # pull up all gold answers for this example\n", " qas_id = example.qas_id\n", " gold_answers = [answer[\"text\"] for answer in example.answers if normalize_answer(answer[\"text\"])]\n", "\n", " if not gold_answers:\n", " # For unanswerable questions, only correct answer is empty string\n", " gold_answers = [\"\"]\n", "\n", " # check if any of the gold answers is contained in any of the current predictions \n", " exact_scores = []\n", " predictions = reader_results['answers']\n", " for prediction in predictions:\n", " exact_scores.append(max(compute_exact(a, prediction['answer_text']) for a in gold_answers))\n", " \n", " has_correct_answer[qas_id] = int(any(exact_scores))\n", "\n", " evaluation = make_eval_dict(has_correct_answer)\n", "\n", " if has_answer_qids:\n", " has_ans_eval = make_eval_dict(has_correct_answer, qid_list=has_answer_qids)\n", " merge_eval(evaluation, has_ans_eval, \"HasAns\")\n", "\n", " if no_answer_qids:\n", " no_ans_eval = make_eval_dict(has_correct_answer, qid_list=no_answer_qids)\n", " merge_eval(evaluation, no_ans_eval, \"NoAns\")\n", "\n", " return evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "\n", "# supporting functions for evaluation \n", "import collections\n", "def merge_eval(main_eval, new_eval, prefix):\n", " for k in new_eval:\n", " main_eval[\"%s_%s\" % (prefix, k)] = new_eval[k]\n", "\n", "def make_eval_dict(exact_scores, qid_list=None):\n", " if not qid_list:\n", " total = len(exact_scores)\n", " return collections.OrderedDict(\n", " [\n", " (\"exact\", 100.0 * sum(exact_scores.values()) / total),\n", " (\"total\", total),\n", " ]\n", " )\n", " else:\n", " total = len(qid_list)\n", " return collections.OrderedDict(\n", " [\n", " (\"exact\", 100.0 * sum(exact_scores[k] for k in qid_list) / total),\n", " (\"total\", total),\n", " ]\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "\n", "# Experiment 1: Compute evaluation over top k = 1 to 5 using the method in which\n", "# documents are passed invidivudally to the reader and a heuristic applied to the results\n", "os.makedirs('data/individual_results/')\n", "all_results = []\n", "for i in range(1,6):\n", " results = evaluate_qasystem_squadstyle(reader.predict,\n", " examples[:1000], \n", " index_name='squad-stop-stem-index', \n", " topk=i, \n", " output_path='data/individual_results/'\n", " )\n", " results['topk'] = i\n", " all_results.append(results)\n", "\n", "pickle.dump(all_results, open(\"qa_eval_1000squad_individual.pkl\", \"wb\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "\n", "# Experiment 2: Compute evaluation over top k = 1 to 5 using the method in which\n", "# documents concatenated and passed the reader all at once\n", "os.makedirs('data/combined_results/')\n", "all_results = []\n", "for i in range(1,6):\n", " 
results = evaluate_qasystem_squadstyle(reader.predict_combine,\n", "                                           examples[:1000], \n", "                                           index_name='squad-stop-stem-index', \n", "                                           topk=i, \n", "                                           output_path='data/combined_results/'\n", "                                           )\n", "    results['topk'] = i\n", "    all_results.append(results)\n", "\n", "pickle.dump(all_results, open(\"qa_eval_1000squad_combined.pkl\", \"wb\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "\n", "# Experiment 3: Compute recall-style evaluation over top k = 1 to 5 using the\n", "# \"individual\" method \n", "all_results = []\n", "for i in range(1,6):\n", "    results = evaluate_qasystem_recallstyle(reader.predict,\n", "                                            examples[:1000], \n", "                                            index_name='squad-stop-stem-index', \n", "                                            topk=i\n", "                                            )\n", "    results['topk'] = i\n", "    all_results.append(results)\n", "\n", "pickle.dump(all_results, open(\"qa_eval_1000squad_topkEM.pkl\", \"wb\"))" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 287, "referenced_widgets": [ "66e0efa9685248629e009c1e255bff6e", "f0858954bbfd447eb95a04a3caabf8a4", "f6c75626f2d54bfbaa7bad761af5b9c2", "62fb51c849b04bc78c629dd42f811deb", "26b9259fecf4445b8fdcb753ed6d09ef", "e91c7ba9a6314f0d905d08d0f822cbc1", "3d3ce5e897b84a218237582372bca2bb", "0f9413092ffe496d8f31c374732e870d" ] }, "colab_type": "code", "id": "Eaiz_d3eDTkR", "outputId": "778f92a4-3c0c-4ffd-afc5-be2621cf8b1c" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "66e0efa9685248629e009c1e255bff6e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=11873.0), HTML(value='')))" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "OrderedDict([('exact', 57.16331171565737),\n", "             ('f1', 60.41835572408481),\n", "             ('total', 11873),\n", "             ('HasAns_exact', 50.64102564102564),\n", "             ('HasAns_f1', 57.16044829825517),\n", "             ('HasAns_total', 5928),\n", "             ('NoAns_exact', 63.666947014297726),\n", "             ('NoAns_f1', 63.666947014297726),\n", "             ('NoAns_total', 5945),\n", "             ('best_exact', 57.16331171565737),\n", "             ('best_exact_thresh', 0.0),\n", "             ('best_f1', 60.41835572408505),\n", "             ('best_f1_thresh', 0.0)])" ] }, "execution_count": 104, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# hide\n", "\n", "# Experiment 4: Full SQuAD-style evaluation\n", "# Running the cell below requires a GPU, so execute with caution or use Colab. 
\n", "# Evaluation over the top 1 retrieved documents takes about 35 minutes to run.\n", "\n", "full_squad_eval_results = evaluate_qasystem_squadstyle(reader.predict,\n", " examples, \n", " index_name='squad-stop-stem-index', \n", " topk=1\n", " )\n", "full_squad_eval_results" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 287, "referenced_widgets": [ "be3cf3e6c6ba40d087f8dd27dd4f5e64", "d193d10e1ad94119a849d123c093f3cc", "705fa1214c044cbcae6f5d109b22802d", "3c0ffe3eeca741899ddbe1306e60ce39", "e30ff795eb1c4016be3967f190764399", "f04b80c35efb44b5bec078f4d7a57f3b", "7a1bbfb20bd84d4b8995584a37dabae1", "26ef4a7e1f76465ead7e51c1d9866c6f" ] }, "colab_type": "code", "id": "Uv7CAhHuYbWw", "outputId": "128381b6-e6a1-4bd4-9f6f-133a38649b37" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "be3cf3e6c6ba40d087f8dd27dd4f5e64", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=11873.0), HTML(value='')))" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "OrderedDict([('exact', 39.15606839046576),\n", " ('f1', 42.27172639000713),\n", " ('total', 11873),\n", " ('HasAns_exact', 49.42645074224021),\n", " ('HasAns_f1', 55.666701657988256),\n", " ('HasAns_total', 5928),\n", " ('NoAns_exact', 28.915054667788056),\n", " ('NoAns_f1', 28.915054667788056),\n", " ('NoAns_total', 5945),\n", " ('best_exact', 50.08843594710688),\n", " ('best_exact_thresh', 0.0),\n", " ('best_f1', 50.09054156489514),\n", " ('best_f1_thresh', 0.0)])" ] }, "execution_count": 105, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# hide\n", "evaluate_qasystem(examples, index_name='squad-stop-stem-index', topk=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hide\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from collections import OrderedDict\n", "import numpy as np\n", "\n", "invidual_top1 = pickle.load(open(\"qa_eval_1000squad_individual.pkl\", \"rb\"))\n", "top1_df = pd.DataFrame(data=invidual_top1)\n", "\n", "individual_top5 = pickle.load(open(\"qa_eval_1000squad_topkEM.pkl\",\"rb\"))\n", "top5_df = pd.DataFrame(data=individual_top5)\n", "\n", "combined_top1 = pickle.load(open(\"qa_eval_1000squad_combined.pkl\",\"rb\"))\n", "combo_df = pd.DataFrame(data=combined_top1)\n", "\n", "width = .25\n", "\n", "# Compare combined vs individual method for passing to reader\n", "plt.figure(figsize=(10,7))\n", "plt.bar(combo_df['topk'] - width/2, combo_df['exact'], width=width, label='Concatenated')\n", "plt.bar(top1_df['topk'] + width/2, top1_df['exact'], width=width, label='Individual')\n", "\n", "#plt.text(0, 40, 'No Retriever', rotation=90, ha='center', va='top', color='white', fontsize=18)\n", "plt.xticks(top1_df['topk'])\n", "plt.legend(fontsize=16, frameon=False)\n", "\n", "plt.ylabel('Exact Match Score', fontsize=16)\n", "plt.xlabel('Top k retrieved documents', fontsize=16)\n", "plt.title(\"Concatenated vs Individual Eval for 1000 SQuAD2.0 examples\", fontsize=18)\n", "\n", "plt.savefig('qa_eval_combined_vs_individual.png')\n", "\n", "# Compare Top 1 vs ANY Top k eval methods\n", "plt.figure(figsize=(10,7))\n", "plt.bar(combo_df['topk'] - width/2, combo_df['exact'], width=width, label='Top 1')\n", "plt.bar(top5_df['topk'] + width/2, top5_df['exact'], width=width, 
label='Top k')\n", "\n", "plt.xticks(top5_df['topk'])\n", "plt.legend(fontsize=16, frameon=False)\n", "\n", "plt.ylabel('Exact Match Score', fontsize=16)\n", "plt.xlabel('Top k retrieved documents', fontsize=16)\n", "plt.title(\"Answer in Top 1 vs. any of Top k for 1000 SQuAD2.0 examples\", fontsize=18)\n", "\n", "plt.savefig('qa_eval_top1_vs_any_topk.png')\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlAAAAGgCAYAAABlmFnBAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3dfXRU5bn38d9lEkJAIUAjIgGCgYoKldoA1iKoCJFQAUUt5UXAsEApFahFre3hhOIR1FZBTzkNFSWUGopCJDZWoTxiodJKrFBR6AKVlxiFqKCgvCXczx9MpoQkZm7yMkPy/aw1K5k9e2ZfyWINX/beszHnnAAAABC6c8I9AAAAwNmGgAIAAPBEQAEAAHgioAAAADwRUAAAAJ6i63Jj3/jGN1xSUlJdbhIAAOCMvPnmm5845xIqeqxOAyopKUn5+fl1uUkAAIAzYma7KnuMQ3gAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBUp5cxACLBF198oX379un48ePhHgXwFh0drcaNGyshIUGNGzcO9zhAg0VAoUH54osvtHfvXrVt21ZxcXEys3CPBITMOafi4mIdOnRIu3fvVuvWrdW8efNwjwU0SAQUGpR9+/apbdu2atKkSbhHAbyZmWJiYtSiRQvFxsbq448/JqCAMOEcKDQox48fV1xcXLjHAKotLi5OR48eDfcYQINFQKHB4bAd6gP+HAPhRUABAAB4qjKgzOxiM9t0yu0LM5tqZi3NbLWZbQ98bVEXAwMAAIRblSeRO+f+Lam7JJlZlKQPJeVIul/SGufcHDO7P3D/vlqcFahVSffnhXX7O+cMCuv2AQCh8z2E10/Se865XZKGSMoKLM+SNLQmBwNQtUWLFsnMgrdGjRopOTlZDzzwgI4cORKWmZKSkjR27Ng6297SpUvVt29fxcfHq0mTJurWrZseeughHT58+Ixfs65/BgBnH9+AGi4pO/B9a+fcR5IU+Hp+RU8wswlmlm9m+UVFRWc+KYBKPffcc9qwYYPy8vKUmpqq2bNna/r06eEeq9ZNnDhRI0aMUHJysv7whz8oLy9Pt9xyi2bPnq2+ffvqiy++CPeIAOqpkK8DZWaNJA2W9DOfDTjnFkhaIEkpKSnOazoAIenevbs6deokSerfv7+2b9+uhQsXat68eTrnnLP3syJHjx5VbGxshY8tWrRICxYs0Ny5czVlypTg8muvvVZpaWnq3bu3pkyZomeeeaauxq03wn04uz7iEH394/POOlDSP51zewP395pZG0kKfN1X08MBODNXXHGFDh8+rE8++aTM8g8++EAjR45UQkKCYmNj1b17d+Xk5JRZZ8eOHRo9erQ6duyouLg4XXTRRbrrrru0f//+ctuZN2+ekpKS1LhxY6WkpGjdunUVzhPKdjMyMmRm2rJli1JTU3Xuuefqtttuq/RnfPjhh3XZZZfp7rvvLvdYjx49lJ6ert///vcqLCyUJO3cuVNmpszMTM2YMUNt2rRRfHy8brzxRhUUFFS6nTfffFNmppUrV5Z7bOzYsUpMTFRJSUmlzwdQP/kE1A/1n8N3kpQraUzg+zGSyr+7AAiLnTt3qnnz5mrVqlVw2Z49e9SrVy9t3rxZjz/+uHJzc3XFFVdo2LBhys3NDa5XWFioxMREzZ07V6+88opmzJihNWvWKC0trcw2Fi5cqKlTp+raa6/VCy+8oLFjx+qHP/xhudAKdbulhgwZor59+yo3N1fTpk2r8OcrLCzUtm3bdOONN1Z6PaTBgwerpKREr732Wpnls2fP1o4dO/T0009r3rx52rBhg0aOHFnp7/I73/mOevTooczMzDLLDxw4oGXLlmn8+PGKioqq9PkA6qeQDuGZWRNJ/SVNPGXxHEnLzCxd0m5Jt9b8eABCUVJSouLiYh08eFA5OTlavny55s6dW+Yv9oyMDDnn9NprrwXDKjU1VXv27NGMGTM0ePBgSVKfPn3Up0+f4POuuuoqderUSVdffbXeeustffvb39aJEyeUkZGh1NTUMofIEhISNHz48DKzhbrdUnfffXeZQ3IV2bNnj6STJ3tXpvSx0nVLdejQQc8++2zwflFRkaZPn67CwkJdeOGFFb7WpEmTlJ6erl27dqlDhw6SpMWLF+vYsWMaP378184KoH4KaQ+Uc+4r51wr59znpyz71DnXzznXOfD1s9obE8DX6dKli2JiYtSyZUulp6dr4sSJmjx5cpl1Xn75ZaWlpal58+YqLi4O3lJTU7V58+bgCdfHjh3TQw89pC5duiguLk4xMTG6+uqrJUn//ve/JUkFBQUqKCgod4ht2LBhio6OPqPtlrrpppuq/Hmdq/p0ysrWGTSo7Lko3bp1kyTt3r270tcaPny44uPj9bvf/S64LDMzU4MGDVJiYmKVswCof87es0sBBOXk5Gjjxo166aWXdP3112v+/PlavHhxmXX27dunxYsXKyYmpsyt9NN6n376qSTpZz/7mTIyMjRq1Cjl5eXpjTfe0IoVKyQpeGmEjz76SJLUunXrMtuIjo4uc9jQZ7ul2rRpU+XP265dO0knD1VWZteuXWXWLdWyZcsy90tPUv+6yz40btxY48aN08KFC1VcXKx169bp3Xff1Z133lnlrADqp5A/hQcgcnXt2jX4KbzrrrtO3/rWtzR9+nQNGzZMTZs2lSS1atVKV199te67r+Lr3ZYevlq6dKluv/12/eIXvwg+dujQoTLrlkbO3r17yywvLi4uF0ShbrdUKP/HW9u2bXXxxRfrxRdf1EMPPVThc3JzcxUVFaW+fftW+XqhuOuuu/TYY49p5cqVysnJUVJSklJTU2vktQGcfQgooJ6JjY3Vo48+qiFDhmj+/PnBPT033HCDNmzYoMsuu0xxcXGVPv+rr75STExMmWWnXwogMTFR7dq107Jly3THHXcEly9fvlzFxcVl1g11u76mT5+u8ePH68knnyz3SbyNGzdq4cKFGjlyZKXnNflKTk7WgAED9Oijj2rTpk2aMWPGWX2JCADVQ0AB9dDgwYPVo0cP/epXv9LkyZMVFxen
X/7yl+rZs6f69OmjyZMnKykpSfv379eWLVv0/vvv6+mnn5Z0MniysrLUrVs3derUSStWrNDrr79e5vXPOecc/fd//7fGjx+vcePGafjw4dqxY4dmz56tZs2alVk31O36Sk9P1+uvv66pU6dq8+bNGjZsmOLi4rRu3Tr96le/UteuXTVv3rwz+wVWYtKkSRoyZIhiYmLKhCOAhoeAAgLq24XuHnzwQaWmpuq3v/2tpk2bpvbt2ys/P18ZGRl64IEHVFRUpFatWqlr164aM2ZM8HlPPvmknHP6+c9/LklKS0tTdna2evbsWeb109PTdejQIT322GPKzs5W165dtXTpUo0aNarMeqFu90wsXLhQ/fr1029/+1sNHz5cx48fV3Jysu6991795Cc/UZMmTar1+qcbNGiQmjRporS0NF1wwQU1+toAzi4WyqdZakpKSorLz8+vs+0Bp9u6dasuueSScI+Bs9Tq1as1YMAA/eUvf1G/fv3CPU6t/XnmSuQ1r779A62hMLM3nXMpFT3GHigAqMJ7772n999/X9OmTdMVV1wREfEEILw4AxIAqjBr1iwNHDhQsbGx5S4PAaBhYg8UAFRh0aJFWrRoUbjHABBB2AMFAADgiYACAADwREABAAB4IqAAAAA81buTyLl+Sc3j+iUAAJTFHigAAABP9W4PFHDGMpqHefufh3f7AICQsQcKOIstWrRIZlbh7S9/+Yv365mZMjIygvczMjJkZiE99/Dhw5o9e7Yuv/xyNWnSRM2bN1efPn20dOnSr33egw8+KDPTzTff7DXrpk2bNGzYMLVv316xsbFq06aNrr32Wj3xxBNerxOKtWvXKiMjQydOnKjx1wZwdmIPFFAPPPfcc0pMTCyz7NJLL62z7X/++efq37+/tm7dqnvuuUd9+vTRkSNHtGLFCo0YMUJ//etfNX/+/AqfW3pl77y8PH366adq1apVldvbuHGjrr76avXq1UuPPPKILrjgAhUUFGj9+vXKycnR3XffXaM/39q1azVz5kz94he/0Dnn8O9OAAQUUC90795dnTp1Ctv2p0yZos2bN2v9+vXq0aNHcHlaWpq6deumqVOnqnfv3hoxYkSZ573++uvavn270tLS9NJLLyk7O1uTJ0+ucntPPvmk4uPjtWrVKsXGxgaXjxo1ir1EAOoE/5QC6rnSw3w7d+4ss9zn8NzXKSws1JIlSzR+/Pgy8VTq7rvv1qWXXqo5c+aUeywrK0tRUVH63e9+p3bt2oX8/8x99tlnatGiRZl4KlW6h+jo0aNKSEjQtGnTyq1T+jvZtm2bpJN7tPr3769WrVqpSZMmuuiiizRp0iRJJ39PM2fOlCTFxMQED5GW+uqrr3TfffepY8eOatSokTp27Kj/+Z//KRNya9eulZnphRde0MSJE9WyZUu1aNFC06ZNU0lJiTZu3KjevXuradOmuuyyy/TKK6+E9HsAED4EFFAPlJSUqLi4OHgrKSmps22vXbtWJSUlGjx4cIWPm5luvPFGvf3229q7d29w+ZEjR7Rs2TL1799fF154oUaNGqWNGzdq69atVW6zZ8+e2rZtm+6880698cYbKi4uLrdObGysxo0bp6ysLB05cqTMY5mZmerbt6+6dOmiQ4cOKTU1VVFRUVq0aJFeeuklzZgxI/ia48ePV3p6uiRp/fr12rBhgzZs2CBJKi4uVmpqqp566ilNmTJFf/7znzV+/HjNmjVL06dPLzfT1KlT1bRpU/3xj3/U5MmTNXfuXE2dOlW333677rjjDq1YsUItW7bUzTffrE8++aTK3wOA8OEQHlAPdOnSpcz9733ve1q/fn2dbHvPnj2SpKSkpErXKX1s9+7dat26tSRp5cqVOnDggG6//XZJ0pgxYzR79mxlZWVVuLfqVNOnT9dbb72lzMxMZWZmKi4uTr1799att96qO+64Q1FRUZKku+66S7/+9a/13HPPafTo0ZKkf/3rX/r73/+u7OxsSdK2bdu0f/9+PfLII/rWt74V3MbYsWMlSYmJicHzy3r16qXo6P+8bWZnZ2v9+vV67bXX1KdPH0lSv379JEkzZ87Ufffdp/PPPz+4/nXXXafHHntMktS/f3/l5eXpf//3f7Vu3Tr17t1bktSmTRtdfvnlysvL05gxY7729wAgfNgDBdQDOTk52rhxY/C2cOHCOtu2cy7kdU49ATsrK0vNmjXT0KFDJUkXX3yxevXqpSVLllR5HlNcXJxycnL0zjvv6NFHH9XAgQOVn5+vCRMmKC0tLbi9jh07KjU1VZmZmcHnZmZmKiEhIfipv86dOys+Pl4TJ07UkiVLgkEYipdfflkdOnTQVVddVWYP4IABA3T8+HH9/e9/L7P+wIEDy9zv0qWLmjZtGoyn0mWSvOYAUPcIKKAe6Nq1q1JSUoK3iy++uM623a5dO0kqd47VqXbt2iVJatu2rSTp448/1qpVqzRo0CAdPXpUBw4c0IEDBzRs2DB9+OGHWrNmTUjbvvTSS/XTn/5Uy5cvV2FhoUaNGqVVq1YpL+8//yPBpEmT9Le//U1btmzRl19+qSVLlmjcuHFq1KiRJKl58+Z69dVXdeGFF2rSpElq3769unbtquXLl1e5/X379mnXrl2KiYkpc+vZs6ck6dNPPy2zfosWLcrcb9SokeLj48stk1TusCOAyEJAAfVc48aNJUnHjh0rs/z0v9zP1DXXXKOoqCjl5uZW+LhzTi+++KK++c1v6oILLpAkLVmyRCUlJcrOzlaLFi2Ct3vvvVfSyb1Tvho3bhw87+jdd98NLk9LS1NSUpIyMzOVnZ2tgwcPasKECWWe2717dy1fvlyfffaZNmzYoOTkZN12223asmXL126zVatW6tixY5m9f6febrzxRu+fA8DZgXOggHquQ4cOkqQtW7bom9/8pqSTJz+vWrWqRl6/bdu2GjFihJ566imNHTu23CfxnnjiCb377rt6/PHHg8sWL16sDh06aNGiReVe7+GHH1ZOTo4OHjyo8847r8JtFhQUlLvulaTgp+ratGkTXHbOOedo4sSJmjNnjtatW6frr79eycnJFb5udHS0rrzySs2aNUu5ubnaunWrunbtGvy03+HDh8vMdMMNN2j58uU699xzy52HBqB+I6CAeq5Hjx5KTk7W9OnTdeLECcXGxmr+/Pk6evRojW3jiSee0DvvvKPrrrtOP/3pT4MX0ly+fLmefvppDRo0KHhxy3/+8596++23lZGRoWuuuabcax05ckQvv/yynn/+eY0bN67C7d15553au3evRo8era5duwYvBfDII48oOTlZN910U5n109PTlZGRoc2bN5c7NPenP/1JCxYs0NChQ9WxY0d9+eWXeuKJJ3Teeefpu9/9rqT/XJT017/+tQYOHKioqCilpKRo5MiReuaZZ9SvXz/dc889uvzyy3Xs2DG99957ys3N1QsvvKAmTZpU99cLIAIRUECpevp/0UVHR2vlypX60Y9+pLFjx6ply5aaOnWqevXqFby+UXXFx8dr3bp1mjt3rrKzszVnzpzgOTw///nPNXPmzOAJ5Fl
ZWTKz4KfcTjdgwAC1a9dOWVlZlQbUj3/8Yz377LP6zW9+o8LCQh07dkyJiYkaNWqU/uu//kvnnntumfUTEhLUt29fvf322+Uut9C5c2fFxcVp1qxZ+uijj3TeeeepR48eWr16dXAv1/e//31NmjRJ8+fP1y9/+Us55+ScU0xMjF555RXNmTNHCxYs0AcffKCmTZsqOTlZgwYNCp7PBKD+sVA+QVNTUlJSXH5+fq1uI+n+vKpXgpedcwaFe4Qas3XrVl1yySXhHqNB2L17t6688kpddNFFWr16teLi4sI2y/79+9W+fXtNnTpVs2bNCtscNa22/jzzPlrz6tP7aENiZm8651IqeoyTyAHUivbt2+vFF1/Upk2b9IMf/KDCi13WtqKiIq1fv14TJkzQiRMnglcXB4Dq4hAegFrzne98R4cOHQrb9vPy8jRu3Di1b99eWVlZZU4uB4DqIKAA1Ftjx46t9FwrAKgODuEBAAB4IqDQ4NTlByeA2sKfYyC8CCg0KDExMTp8+HC4xwCq7fDhw8ELfAKoewQUGpTzzz9fH374ob766iv+BY+zjnNOx48f12effaaCggK1atUq3CMBDRYnkaNBadasmSSpsLBQx48fD/M0gL/o6Gg1btxY7du3D/4/hwDqHgGFBqdZs2bBkAIA4ExwCA8AAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADyFFFBmFm9mz5vZNjPbambfNbOWZrbazLYHvrao7WEBAAAiQaiXMZgn6WXn3C1m1khSE0kPSFrjnJtjZvdLul/SfbU0JwAAZ6+M5uGeoH7J+DzcE1S9B8rMmknqI2mhJDnnjjnnDkgaIikrsFqWpKG1NSQAAEAkCeUQ3kWSiiQ9Y2ZvmdlTZtZUUmvn3EeSFPh6fkVPNrMJZpZvZvlFRUU1NjgAAEC4hBJQ0ZKukPR/zrlvS/pSJw/XhcQ5t8A5l+KcS0lISDjDMQEAACJHKAFVIKnAOfePwP3ndTKo9ppZG0kKfN1XOyMCAABElioDyjn3saQ9ZnZxYFE/Se9KypU0JrBsjKSVtTIhAABAhAn1U3g/lvSHwCfw3pc0Tifja5mZpUvaLenW2hkRAAAgsoQUUM65TZJSKnioX82OAwAAEPm4EjkAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE+hXokcDVlG83BPUL9kfB7uCQAA1cQeKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAp+hQVjKznZIOSiqRVOycSzGzlpL+KClJ0k5Jtznn9tfOmAAAAJHDZw/Utc657s65lMD9+yWtcc51lrQmcB8AAKDeq84hvCGSsgLfZ0kaWv1xAAAAIl+oAeUkrTKzN81sQmBZa+fcR5IU+Hp+RU80swlmlm9m+UVFRdWfGAAAIMxCOgdK0vecc4Vmdr6k1Wa2LdQNOOcWSFogSSkpKe4MZgQAAIgoIe2Bcs4VBr7uk5QjqaekvWbWRpICX/fV1pAAAACRpMqAMrOmZnZe6feSBkjaIilX0pjAamMkraytIQEAACJJKIfwWkvKMbPS9Z91zr1sZhslLTOzdEm7Jd1ae2MCAABEjioDyjn3vqTLK1j+qaR+tTEUAABAJONK5AAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAICnkAPKzKLM7C0z+1Pgfkcz+4eZbTezP5pZo9obEwAAIHL47IGaImnrKfcflvS4c66zpP2S0mtyMAAAgEgVUkCZWaKkQZKeCtw3SddJej6wSpakobUxIAAAQKQJdQ/UXEn3SjoRuN9K0gHnXHHgfoGkthU90cwmmFm+meUXFRVVa1gAAIBIUGVAmdn3Je1zzr156uIKVnUVPd85t8A5l+KcS0lISDjDMQEAACJHdAjrfE/SYDNLk9RYUjOd3CMVb2bRgb1QiZIKa29MAACAyFHlHijn3M+cc4nOuSRJwyX9P+fcSEmvSrolsNoYSStrbUoAAIAIUp3rQN0n6SdmtkMnz4laWDMjAQAARLZQDuEFOefWSlob+P59ST1rfiQAAIDIxpXIAQAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE9VBpSZNTazN8xss5m9Y2YzA8s7mtk/zGy7mf3RzBrV/rgAAADhF8oeqKOSrnPOXS6pu6QbzOxKSQ9Letw511nSfknptTcmAABA5KgyoNxJhwJ3YwI3J+k6Sc8HlmdJGlorEwIAAESYkM6BMrMoM9skaZ+k1ZLek3TAOVccWKVAUttKnjvBzPLNLL+oqKgmZgYAAAirkALKOVfinOsuKVFST0mXVLRaJc9d4JxLcc6lJCQknPmkAAAAEcLrU3jOuQOS1kq6UlK8mUUHHkqUVFizowEAAESmUD6Fl2Bm8YHv4yRdL2mrpFcl3RJYbYyklbU1JAAAQCSJrnoVtZGUZWZROhlcy5xzfzKzdyUtNbMHJb0laWEtzgkAABAxqgwo59y/JH27guXv6+T5UAAAAA0KVyIHAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAA
CeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPFUZUGbWzsxeNbOtZvaOmU0JLG9pZqvNbHvga4vaHxcAACD8QtkDVSzpHufcJZKulPQjM7tU0v2S1jjnOktaE7gPAABQ71UZUM65j5xz/wx8f1DSVkltJQ2RlBVYLUvS0NoaEgAAIJJ4nQNlZkmSvi3pH5JaO+c+kk5GlqTzK3nOBDPLN7P8oqKi6k0LAAAQAUIOKDM7V9JySVOdc1+E+jzn3ALnXIpzLiUhIeFMZgQAAIgoIQWUmcXoZDz9wTm3IrB4r5m1CTzeRtK+2hkRAAAgsoTyKTyTtFDSVufcY6c8lCtpTOD7MZJW1vx4AAAAkSc6hHW+J2m0pLfNbFNg2QOS5khaZmbpknZLurV2RgQAAIgsVQaUc269JKvk4X41Ow4AAEDk40rkAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgKcqA8rMnjazfWa25ZRlLc1stZltD3xtUbtjAgAARI5Q9kAtknTDacvul7TGOddZ0prAfQAAgAahyoByztwioBoAAASoSURBVP1V0menLR4iKSvwfZakoTU8FwAAQMQ603OgWjvnPpKkwNfzK1vRzCaYWb6Z5RcVFZ3h5gAAACJHrZ9E7pxb4JxLcc6lJCQk1PbmAAAAat2ZBtReM2sjSYGv+2puJAAAgMh2pgGVK2lM4PsxklbWzDgAAACRL5TLGGRL2iDpYjMrMLN0SXMk9Tez7ZL6B+4DAAA0CNFVreCc+2ElD/Wr4VkAAADOClyJHAAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ4IKAAAAE8EFAAAgCcCCgAAwBMBBQAA4ImAAgAA8ERAAQAAeCKgAAAAPBFQAAAAnggoAAAATwQUAACAJwIKAADAEwEFAADgiYACAADwREABAAB4IqAAAAA8EVAAAACeCCgAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBUrYAysxvM7N9mtsPM7q+poQAAACLZGQeUmUVJ+o2kgZIulfRDM7u0pgYDAACIVNXZA9VT0g7n3PvOuWOSlkoaUjNjAQAARK7oajy3raQ9p9wvkNTr9JXMbIKkCYG7h8zs39XYJsLApG9I+iTcc9QbMy3cEwCoY7yP1rC6ex/tUNkD1QmoiqZ35RY4t0DSgmpsB2FmZvnOuZRwzwEAZyveR+uf6hzCK5DU7pT7iZIKqzcOAABA5KtOQG2U1NnMOppZI0nDJeXWzFgAAACR64wP4Tnnis1ssqRXJEVJeto5906NTYZIwiFYAKge3kfrGXOu3GlLAAAA+BpciRwAAMATAQUAAOCJgGqgzGysmblKbgcC61xzyrIBFbxGkpmdCDw+vu5/CgCIDFW8p14fWOchM1tlZp8Glo8N89iohupcBwr1w606eUmKUxWfdv+gpNGSVp22/HZJhySdVzujAcBZp6L31HcDX38saZOkP+nk+yfOYgQUNjnndlSxzgpJt5hZU+fcl6csHy1puaSxtTUcAJxlvu49tblz7oSZdRIBddbjEB5CsUInrzJ/c+kCM7tKUrKk34drKAA4mzjnToR7BtQcAgpRZhZ92u30Pxdf6eSeptGnLLtd0t8kvV9XgwLAWeD099SocA+E2kFAYZuk46fdKrqi/GJJ/cysrZnFSrotsAwA8B+nv6e+Ft5xUFs4Bwo3qfwJjwcqWO/VwHojJH0gKU7SMkktanU6ADi7nP6eejBcg6B2EVDYEsJJ5HLOOTP7g04extslKdc597mZEVAA8B8hvafi7MchPPhYLKmbpDRx+A4A0ICxBwohc85tM7PfSErQyf9EGgCABomAQncz+0YFy/MrWtk5N7mW5wGAesnM+urkP0AvCCxKMbNDkuScez5sg+GMEFB4rpLlCXU6BQDUfzMl9T3l/o8CN0myuh8H1WHOuXDPAAAAcFbhJHIAAABPBBQAAIAnAgoAAMATAQUAAOCJgAIAAPBEQAEAAHgioAAAADwRUAAAAJ7+P5fMeaOUoyJwAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# hide\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from collections import OrderedDict\n", "import numpy as np\n", "import pickle\n", "\n", "# from previous blog post\n", "no_retriever = OrderedDict([('exact', 66.25958056093658),\n", " ('f1', 69.66994428499025),\n", " ('total', 11873),\n", " ('HasAns_exact', 68.91025641025641),\n", " ('HasAns_f1', 75.74076391627662),\n", " ('HasAns_total', 5928),\n", " ('NoAns_exact', 63.61648444070648),\n", " ('NoAns_f1', 63.61648444070648),\n", " ('NoAns_total', 5945),\n", " ('best_exact', 68.36519834919565),\n", " ('best_exact_thresh', -4.189256191253662),\n", " ('best_f1', 71.1144383018176),\n", " ('best_f1_thresh', -3.767639636993408)])\n", "\n", "# from output of cell above\n", "retriever = OrderedDict([('exact', 57.16331171565737),\n", " ('f1', 60.41835572408481),\n", " ('total', 11873),\n", " ('HasAns_exact', 50.64102564102564),\n", " ('HasAns_f1', 57.16044829825517),\n", " ('HasAns_total', 5928),\n", " ('NoAns_exact', 63.666947014297726),\n", " ('NoAns_f1', 63.666947014297726),\n", " ('NoAns_total', 5945),\n", " ('best_exact', 57.16331171565737),\n", " ('best_exact_thresh', 0.0),\n", " ('best_f1', 60.41835572408505),\n", " ('best_f1_thresh', 0.0)])\n", "\n", "plt.figure(figsize=(10,7))\n", "width = 0.25\n", "\n", "plt.bar(np.array([1, 2]) - width/2, [no_retriever['exact'], no_retriever['f1']], width=width, label='Reader Only')\n", "plt.bar(np.array([1, 2]) + width/2, [retriever['exact'], retriever['f1']], width=width, label='Full QA System')\n", "plt.xticks([1,2], ['EM', 'F1'], fontsize=16)\n", "plt.legend(loc='upper center', bbox_to_anchor=(0.45, 1), fontsize=16)\n", "\n", "plt.savefig(\"qa_system_vs_reader_only.png\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "In0YUapuBQ8A", "pKVJVU5hBQ8U", "t4s4Bx3LBQ8p" ], "machine_shape": "hm", "name": "2020-06-09-Evaluating_BERT_on_SQuAD.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "0f9413092ffe496d8f31c374732e870d": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, 
"right": null, "top": null, "visibility": null, "width": null } }, "139cd5c5c43f436495dfddbdfc7d35c3": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "15f44b8cf87d459bb6a40b5e6b9db470": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "26b9259fecf4445b8fdcb753ed6d09ef": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial" } }, "26ef4a7e1f76465ead7e51c1d9866c6f": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "28a305d13e584615b9ba3cd9a37bdd56": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", 
"description_tooltip": null, "layout": "IPY_MODEL_a8cebe4ccf004d84bdc37ce44d1192e8", "max": 20239, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_a44c2693aee14b89a4c6512c85142f51", "value": 20239 } }, "3c0ffe3eeca741899ddbe1306e60ce39": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_26ef4a7e1f76465ead7e51c1d9866c6f", "placeholder": "​", "style": "IPY_MODEL_7a1bbfb20bd84d4b8995584a37dabae1", "value": " 11873/11873 [6:21:18<00:00, 1.93s/it]" } }, "3d3ce5e897b84a218237582372bca2bb": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "40b3ea36c5ca407597e8ce6c738c9786": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial" } }, "62fb51c849b04bc78c629dd42f811deb": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_0f9413092ffe496d8f31c374732e870d", "placeholder": "​", "style": "IPY_MODEL_3d3ce5e897b84a218237582372bca2bb", "value": " 11873/11873 [7:01:01<00:00, 2.13s/it]" } }, "66e0efa9685248629e009c1e255bff6e": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_f6c75626f2d54bfbaa7bad761af5b9c2", "IPY_MODEL_62fb51c849b04bc78c629dd42f811deb" ], "layout": "IPY_MODEL_f0858954bbfd447eb95a04a3caabf8a4" } }, "705fa1214c044cbcae6f5d109b22802d": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_f04b80c35efb44b5bec078f4d7a57f3b", "max": 11873, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_e30ff795eb1c4016be3967f190764399", "value": 11873 } }, "7a1bbfb20bd84d4b8995584a37dabae1": { "model_module": "@jupyter-widgets/controls", 
"model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "9061d4497e4443029cbb21b77281cf31": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "95923a713e324d18bc9fc82a466703c4": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_ef2c1a3506c549b7878e3ea4a5cc565f", "placeholder": "​", "style": "IPY_MODEL_139cd5c5c43f436495dfddbdfc7d35c3", "value": " 20239/20239 [02:38<00:00, 127.56it/s]" } }, "a299f8fa902348b7807b4d97ebc6027d": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_dca5b0a8c1014a66bc8c5ba988878a85", "placeholder": "​", "style": "IPY_MODEL_b04d963baf9a40789305d1372ffe1aab", "value": " 20239/20239 [04:14<00:00, 79.57it/s]" } }, "a44c2693aee14b89a4c6512c85142f51": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial" } }, "a493a4c07a2743899571330cf6476b74": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_28a305d13e584615b9ba3cd9a37bdd56", 
"IPY_MODEL_a299f8fa902348b7807b4d97ebc6027d" ], "layout": "IPY_MODEL_e94a6f98578743c096c9b045d9dfdf81" } }, "a8cebe4ccf004d84bdc37ce44d1192e8": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "b04d963baf9a40789305d1372ffe1aab": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "be3cf3e6c6ba40d087f8dd27dd4f5e64": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_705fa1214c044cbcae6f5d109b22802d", "IPY_MODEL_3c0ffe3eeca741899ddbe1306e60ce39" ], "layout": "IPY_MODEL_d193d10e1ad94119a849d123c093f3cc" } }, "d193d10e1ad94119a849d123c093f3cc": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "dca5b0a8c1014a66bc8c5ba988878a85": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": 
"@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e30ff795eb1c4016be3967f190764399": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial" } }, "e91c7ba9a6314f0d905d08d0f822cbc1": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e94a6f98578743c096c9b045d9dfdf81": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "ef2c1a3506c549b7878e3ea4a5cc565f": { "model_module": 
"@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "f04b80c35efb44b5bec078f4d7a57f3b": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "f0858954bbfd447eb95a04a3caabf8a4": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "f4d90214c67749168472a2ba3cb0d72a": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, 
"_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_9061d4497e4443029cbb21b77281cf31", "max": 20239, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_40b3ea36c5ca407597e8ce6c738c9786", "value": 20239 } }, "f6c75626f2d54bfbaa7bad761af5b9c2": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_e91c7ba9a6314f0d905d08d0f822cbc1", "max": 11873, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_26b9259fecf4445b8fdcb753ed6d09ef", "value": 11873 } }, "f8d18feb30b24de5bbbb869cc9a31ae8": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_f4d90214c67749168472a2ba3cb0d72a", "IPY_MODEL_95923a713e324d18bc9fc82a466703c4" ], "layout": "IPY_MODEL_15f44b8cf87d459bb6a40b5e6b9db470" } } } } }, "nbformat": 4, "nbformat_minor": 1 }