{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building an End-to-End Question-Answering System With BERT\n", "\n", "In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code. We will treat a corpus of text documents as a knowledge base to which we can ask questions and retrieve exact answers using [BERT](https://arxiv.org/abs/1810.04805). This goes beyond simplistic keyword searches.\n", "\n", "For this example, we will use the [20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) as the text corpus. As a collection of newsgroup postings that contains an abundance of opinions and debates, the corpus is not ideal as a knowledge base. It is better to use fact-based documents such as Wikipedia articles or even news articles. However, this dataset will suffice for this example.\n", "\n", "Let us begin by loading the dataset into a Python list using **scikit-learn** and importing *ktrain* modules." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# load 20newsgroups dataset into a list\n", "from sklearn.datasets import fetch_20newsgroups\n", "remove = ('headers', 'footers', 'quotes')\n", "newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)\n", "newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)\n", "docs = newsgroups_train.data + newsgroups_test.data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import ktrain\n", "from ktrain.text.qa import SimpleQA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 1: Index the Documents\n", "\n", "We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist. \n", "\n", "Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be written until the end of indexing. If you experience issues, you can lower this value." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "INDEXDIR = '/tmp/myindex'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SimpleQA.initialize_index(INDEXDIR)\n", "SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),\n", " multisegment=True, procs=4, # these args speed up indexing\n", " breakup_docs=True # this slows indexing but speeds up answer retrieval\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g., `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the [textract](https://textract.readthedocs.io/en/stable/) package to extract text from different file types and index this text into the search engine for answer retrieval. You can also manually convert them to `.txt` files with `ktrain.text.textutils.extract_copy` or tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/). \n", "\n", "#### Speeding Up Indexing\n", "By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`) with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed to speed up indexing by passing them as arguments to `index_from_list` or `index_from_folder`. See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speed up indexing. 
In this case, we've used `multisegment=True` and `procs=4`.\n", "\n", "#### Speeding Up Answer Retrieval\n", "\n", "Note that larger documents will cause inference in STEP 3 (see below) to be very slow. If your dataset consists of larger documents (e.g., long articles), we recommend breaking them up into pages (e.g., splitting the original PDF using something like `pdfseparate`) or splitting them into paragraphs (paragraphs are probably preferable). The latter can be done with *ktrain* using:\n", "```python\n", "ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)\n", "```\n", "If you supply `breakup_docs=True` in the cell above, this will be done automatically. Note that `breakup_docs=True` will slightly **slow indexing** (i.e., STEP 1), but **speed up answer retrieval** (i.e., STEP 3 below). A second way to speed up answer retrieval is to increase `batch_size` in STEP 3 if using a GPU, which will be discussed later.\n", "\n", "\n", "The above steps need to be performed only once. Once an index has been created, you can skip this step and proceed directly to **STEP 2** to begin using your system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 2: Create a QA instance\n", "\n", "Next, we create a QA instance. This step will automatically download the BERT SQuAD model if it does not already exist on your system." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "qa = SimpleQA(INDEXDIR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! In roughly **3 lines of code**, we have built an end-to-end QA system that can now be used to generate answers to questions. Let's ask our system some questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 3: Ask Questions\n", "\n", "We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers. 
We will also use the `qa.display` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model fine-tuned on the SQuAD dataset. The model will comb through paragraphs and sentences to find candidate answers. By default, `ask` currently uses a `batch_size` of 8, but, if necessary, you can experiment with lowering it by setting the `batch_size` parameter. On a CPU, for instance, you may want to try `batch_size=1`.\n", "\n", "Note also that the 20 Newsgroup Dataset covers events in the early to mid 1990s, so references to recent events will not exist.\n", "\n", "#### Space Question" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.819032 | 59 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.151229 | 8525 |
| 2 | - 10 / 06 / 97 | key scheduled dates for the cassini mission (vvejga trajectory)-------------------------------------------------------------10 / 06 / 97-titan iv / centaur launch 04 / 21 / 98-venus 1 gravity assist 06 / 20 / 99-venus 2 gravity assist 08 / 16 / 99-earth gravity assist 12 / 30 / 00-jupiter gravity assist 06 / 25 / 04-saturn arrival 01 / 09 / 05-titan probe release 01 / 30 / 05-titan probe entry 06 / 25 / 08-end of primary mission (schedule last updated 7 / 22 / 92) - 10 / 06 / 97 | 0.029694 | 59 |
| 3 | * 98 | cassini * * * * * * * * * * * * * * * * * * 98 ,115 * * * * | 0.000026 | 5356 |
| 4 | the latter part of the 1990s | scheduled for launch in the latter part of the 1990s , the craf and cassini missions are a collaborative project of nasa, the european space agency and the federal space agencies of germany and italy, as well as the united states air force and the department of energy. | 0.000017 | 18684 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "answers = qa.ask('When did the Cassini probe launch?')\n", "qa.display_answers(answers[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct. The correct answer will not always be the top answer, but it is in this case. \n", "\n", "Note that, since we used `index_from_list` to index documents, the last column (i.e., **Document Reference**) shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer. If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document. The **Document Reference** values can be customized by supplying a `references` parameter to `index_from_list`.\n", "\n", "To see the text of the document that contains the top answer, uncomment and execute the following line (it's a comparatively long post)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "#print(docs[59])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 20 Newsgroup dataset contains lots of posts discussing and debating religions like Christianity and Islam, as well. 
Let's ask a question on this subject.\n", "\n", "#### Religious Question" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Candidate AnswerContextConfidenceDocument Reference
0the holy prophet of islam
just a small reminder to all my muslim brothers, did _ ever _ the holy prophet of islam (muhammad pbuh), say to anyone who called himself a muslim :
0.7624641278
1the messenger of allah
muhammad is the messenger of allah , and those who are with him are firm against the unbelievers and merciful among each other.
0.2018624876
2is the last prophet of islam
muhammad peace and blessings of allah be upon him (saw) is the last prophet of islam .
0.0351214640
3either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was
the book says that muhammad was either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was .
0.0002934934
4[ mahound ' s
muhammad ' s [ mahound ' s ] integrity is not really impugned in this part of the story, and there ' s no reason to think this was rushdie ' s intent : gibreel, as the archangel, produces the verses (divine and satanic), though he does not know their provenance.
0.00013815852
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "answers = qa.ask('Who was Muhammad?')\n", "qa.display_answers(answers[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we see different views on who Muhammad, the founder of Islam, as debated and discussed in this document set. \n", "\n", "Finally, the 20 Newsgroup dataset also contains many groups about computing hardware and software. Let's ask a technical support question.\n", "\n", "#### Technical Question" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.937990 | 13873 |
| 1 | is gamma correction | this, is gamma correction (or the lack of it). | 0.045165 | 13873 |
| 2 | so if you just dump your nice linear image out to a crt | so if you just dump your nice linear image out to a crt , the image will look much too dark. | 0.010337 | 13873 |
| 3 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 6987 |
| 4 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 12344 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "answers = qa.ask('What causes computer images to be too dark?')\n", "qa.display_answers(answers[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, a lack of *gamma correction* is the top answer.\n", "\n", "### The `batch_size` Argument to `ask`\n", "\n", "As of **ktrain v0.22.x**, the `ask` method uses `batch_size=8` by default, which means 8 question-document pairs are fed to the model at a time. Older versions of **ktrain** used a `batch_size` of 1. A `batch_size` of 8 speeds up answer retrieval. If you experience an Out-of-Memory (OOM) error, you can reduce the batch size by setting the `batch_size` argument to `ask` (e.g., `batch_size=1`). Reducing `batch_size` may also be beneficial if `ask` is being invoked using a **CPU** instead of a **GPU**.\n", "\n", "\n", "\n", "### Deploying the QA System\n", "\n", "To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in **STEP 1**. Once a search index is initialized and populated, one can simply re-run from **STEP 2**.\n", "\n", "\n", "### Using `SimpleQA` as a Simple Search Engine\n", "Once an index is created, `SimpleQA` can also be used as a conventional search engine to perform keyword searches using the `search` method:\n", "\n", "```python\n", "qa.search(' \"solar orbit\" AND \"battery power\" ') # find documents that contain both these phrases\n", "```\n", "See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/querylang.html) for more information on query syntax.\n", "\n", "### The `index_from_folder` method\n", "\n", "Earlier, we mentioned that the `index_from_folder` method could be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.pptx`, etc.). 
Here is a brief code example:\n", "\n", "```python\n", "# index documents of different types into a built-in search engine\n", "from ktrain.text.qa import SimpleQA\n", "INDEXDIR = '/tmp/myindex'\n", "SimpleQA.initialize_index(INDEXDIR)\n", "corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files\n", "SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction\n", "                           multisegment=True, procs=4, # these args speed up indexing\n", "                           breakup_docs=True) # speeds up answer retrieval\n", "\n", "# ask questions (setting higher batch size can further speed up answer retrieval)\n", "qa = SimpleQA(INDEXDIR)\n", "answers = qa.ask('What is ktrain?', batch_size=8)\n", "\n", "# top answer snippet extracted from https://arxiv.org/abs/2004.10703:\n", "# \"ktrain is a low-code platform for machine learning\"\n", "\n", "\n", "```\n", "\n", "\n", "### Connecting the QA System to an Existing Search Engine\n", "\n", "In this notebook, we created and populated our own search index of documents. As mentioned above, **ktrain** uses [whoosh](https://whoosh.readthedocs.io/en/latest/querylang.html) internally for this. It is relatively easy to use the **ktrain** `qa` module with an existing, pre-populated search engine like [Apache Solr](https://lucene.apache.org/solr/) or [Elasticsearch](https://github.com/elastic/elasticsearch). You can simply subclass the `QA` class and override the `search` method:\n", "\n", "```python\n", "from ktrain.text.qa import QA\n", "class MyCustomQA(QA):\n", "    \"\"\"\n", "    Custom QA Module\n", "    \"\"\"\n", "    def __init__(self,\n", "                 bert_squad_model='bert-large-uncased-whole-word-masking-finetuned-squad',\n", "                 bert_emb_model='bert-base-uncased'):\n", "        \"\"\"\n", "        MyCustomQA constructor. 
Include other parameters as needed.\n", "        Args:\n", "          bert_squad_model(str): name of BERT SQUAD model to use\n", "          bert_emb_model(str): BERT model to use to generate embeddings for semantic similarity\n", "\n", "        \"\"\"\n", "        super().__init__(bert_squad_model=bert_squad_model, bert_emb_model=bert_emb_model)\n", "\n", "\n", "    def search(self, query, limit=10, min_context_length=50):\n", "        \"\"\"\n", "        search index for query\n", "        Args:\n", "          query(str): search query\n", "          limit(int): number of top search results to return\n", "        Returns:\n", "          list of dicts with keys: reference, rawtext\n", "        \"\"\"\n", "\n", "        # ADD CODE HERE TO QUERY YOUR SEARCH ENGINE\n", "        # The query is the text of the question being asked.\n", "        # Your code should find documents that match words in the question.\n", "        raise NotImplementedError('replace this line with a call to your search engine')\n", "\n", "```\n", "\n", "If the back-end search engine is already populated with documents, you can now simply instantiate a `MyCustomQA` object and invoke `ask` normally:\n", "\n", "```python\n", "qa = MyCustomQA()\n", "qa.ask('What is the best search engine?')\n", "```\n", "\n", "Note that, as mentioned above, this will work best when documents stored in the search engine are broken into smaller contexts (e.g., paragraphs), if they are not already." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }