{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "import os\n",
    "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zero Shot Learning Using Natural Language Inference\n",
    "\n",
    "In this notebook, we will demonstrate **zero-shot** topic classification.  **Zero-Shot Learning (ZSL)** is being able to solve a task despite not having received any training examples of that task.  The `ZeroShotClassifier` class in *ktrain* can be used to perform topic classification with no training examples.  The technique is based on **Natural Language Inference (or NLI)** as described in [this interesting blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) by Joe Davison.\n",
    "\n",
    "## STEP 1: Setup the Zero Shot Classifier and Describe Topics\n",
    "\n",
    "We first instantiate the zero-shot-classifier and then describe the topic labels for our classifier with strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ktrain.text.zsl import ZeroShotClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "zsl = ZeroShotClassifier()\n",
    "labels=['politics', 'elections', 'sports', 'films', 'television']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 2: Predict\n",
    "\n",
    "There is no training involved here, as we are using **zero-shot-learning**.  We will simply supply the document that is being classified and the `topic_strings` defined earlier. The `predict` method uses Natural Language Inference (NLI) to infer the topic probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('politics', 0.979189932346344),\n",
       " ('elections', 0.9874580502510071),\n",
       " ('sports', 0.0005765462410636246),\n",
       " ('films', 0.0022924456279724836),\n",
       " ('television', 0.0010546103585511446)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
    "zsl.predict(doc, labels=labels, include_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, our model correctly assigned the highest probabilities to `politics` and `elections`, as the text supplied pertains to both these topics.\n",
    "\n",
    "Let's try some other examples.\n",
    "#### document about `television`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('politics', 0.00015667644038330764),\n",
       " ('elections', 0.00032881161314435303),\n",
       " ('sports', 0.00013884963118471205),\n",
       " ('films', 0.07557642459869385),\n",
       " ('television', 0.9813269376754761)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = 'What is your favorite sitcom of all time?'\n",
    "zsl.predict(doc, labels=labels, include_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### document about both `politics` and `television`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('politics', 0.8049427270889282),\n",
       " ('elections', 0.01889326609671116),\n",
       " ('sports', 0.005504833068698645),\n",
       " ('films', 0.05876927077770233),\n",
       " ('television', 0.8776823878288269)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"\"\"\n",
    "President Donald Trump's senior adviser and son-in-law, Jared Kushner, praised \n",
    "the administration's response to the coronavirus pandemic as a \\\"great success story\\\" on Wednesday -- \n",
    "less than a day after the number of confirmed coronavirus cases in the United States topped 1 million. \n",
    "Kushner painted a rosy picture for \\\"Fox and Friends\\\" Wednesday morning, \n",
    "saying that \\\"the federal government rose to the challenge and \n",
    "this is a great success story and I think that that's really what needs to be told.\\\"\n",
    "\"\"\"\n",
    "zsl.predict(doc, labels=labels, include_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### document about `sports`, `television`, and `film`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('politics', 0.0005349867278710008),\n",
       " ('elections', 0.0007852867711335421),\n",
       " ('sports', 0.9848827123641968),\n",
       " ('films', 0.9576993584632874),\n",
       " ('television', 0.941143274307251)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"The Last Dance is a 2020 American basketball documentary miniseries co-produced by ESPN Films and Netflix.\"\n",
    "zsl.predict(doc, labels=labels, include_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customizing the Classifier for Zero-Shot Sentiment Analysis\n",
    "\n",
    "As stated above, the `ZeroShotClassifier` is implemented using Natural Language Inference (NLI).  That is, the document is treated as a **premise**, and each label is treated as a **hypothesis**.  To predict labels, an NLI model is used to predict whether or not each label is entailed by the premise.  By default, the template used for the hypothesis is of the form `\"This text is about <label>.\"`, where `<label>` is replaced with a candidate label (e.g., `politics`, `sports`, etc.).  Although this works well for many text classification problems such as the topic classification examples above, we can customize the template with the `nli_template` parameter if necessary.  For instance, if predicting sentiment of movie reviews, we might change the template as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('negative', 0.9995395541191101), ('positive', 0.011613081209361553)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"I will definitely not be seeing this movie again.\"\n",
    "zsl.predict(doc, labels=['negative', 'positive'], include_labels=True,\n",
    "            nli_template=\"The sentiment of this movie review is {}.\")"
   ]
  },
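  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the premise/hypothesis pairing concrete, the sketch below fills both templates with each candidate label and then scores the same review with the default template by omitting `nli_template`.  The `templates` dictionary, the `sentiment_labels` name, and the loop are introduced here for illustration only and are not part of *ktrain*; the default template string is taken from the description above, written with `{}` in place of `<label>` by analogy with the custom template, which is an assumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: show the hypothesis string each template yields per label.\n",
    "sentiment_labels = ['negative', 'positive']\n",
    "templates = {\n",
    "    'default': 'This text is about {}.',  # assumed placeholder form of the default\n",
    "    'custom': 'The sentiment of this movie review is {}.',\n",
    "}\n",
    "for name, template in templates.items():\n",
    "    for label in sentiment_labels:\n",
    "        print(name, '->', template.format(label))\n",
    "\n",
    "# Score the same review with the default template (nli_template omitted).\n",
    "zsl.predict(doc, labels=sentiment_labels, include_labels=True)"
   ]
  },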
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you compare with the default template, you'll see the negative score is higher with the custom template.\n",
    "\n",
    "Let's now consider a more ambiguous review:\n",
    "> I will definitely not be seeing this movie again, but the acting was good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('negative', 0.8110149502754211), ('positive', 0.5280577540397644)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"I will definitely not be seeing this movie again, but the acting was good.\"\n",
    "zsl.predict(doc, labels=['negative', 'positive'], include_labels=True,\n",
    "            nli_template=\"The sentiment of this movie review is {}.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the output above, we see that the results do **NOT** sum to one and both labels are above a standard threshold of `0.5`.  By default, `ZeroShotClassifier` treats the task as a multilabel problem, which allows multiple labels to be true.  Since the review is both negative and positive, both scores are above the `0.5` threshold (although the `positive` class is only above slightly when using the custom template).\n",
    "\n",
    "If the labels are to be treated as mutually-exclusive, we can set `multilabel=False` in which case the scores will sum to 1 we will classify the review as negative overall:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('negative', 0.6576023101806641), ('positive', 0.34239766001701355)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"I will definitely not be seeing this movie again, but the acting was good.\"\n",
    "zsl.predict(doc, labels=['negative', 'positive'], include_labels=True,\n",
    "            nli_template=\"The sentiment of this movie review is {}.\",\n",
    "             multilabel=False)"
   ]
  },
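  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In multilabel mode, a decision rule is still needed to turn scores into labels.  The following is a minimal sketch, assuming the `0.5` threshold mentioned above.  The `scores` list is copied from the multilabel output shown earlier for the ambiguous review, and `threshold` is a name introduced here for illustration rather than a *ktrain* parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scores copied from the multilabel output above for the ambiguous review.\n",
    "scores = [('negative', 0.8110149502754211), ('positive', 0.5280577540397644)]\n",
    "threshold = 0.5  # illustrative cutoff, not a ktrain parameter\n",
    "\n",
    "# Multilabel decision: keep every label whose score clears the threshold.\n",
    "print([label for label, score in scores if score > threshold])\n",
    "\n",
    "# Single-label decision: keep only the highest-scoring label.\n",
    "print(max(scores, key=lambda pair: pair[1])[0])"
   ]
  },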
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction Time and Batch Size\n",
    "\n",
    "The `predict` method can accept a large list of documents.  Documents are automatically split into batches based on the `batch_size` parameter, which can be increased to speed up predictions.\n",
    "\n",
    "Note also that the `predict` method of `ZeroShotClassifier` generates a separate NLI prediction for each label included in the `labels` parameter.  As `len(labels)` and the number of documents fed to `predict` increases, the prediction time will also increase.  **You can speed up predictions by increasing the `batch_size`.**  The default `batch_size` is currently set conservatively at 8:\n",
    "\n",
    "#### Predicting 800 topics for a single document takes ~26 seconds on a TITAN V GPU using `batch_size=1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 53.2 s, sys: 728 ms, total: 53.9 s\n",
      "Wall time: 26 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
    "labels=['politics', 'elections', 'sports', 'films', 'television']\n",
    "predictions = zsl.predict(doc, labels=labels*160, include_labels=True, batch_size=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, 26 seconds is slow.  We can speed things up by increasing `batch_size`:\n",
    "\n",
    "#### Predicting 800 topics for a single document takes less than 2 seconds on a TITAN V GPU using `batch_size=64`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.74 s, sys: 480 ms, total: 2.22 s\n",
      "Wall time: 1.67 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
    "predictions = zsl.predict(doc, labels=labels*160, include_labels=True, batch_size=64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predicting 5 topics for 1000 documents takes ~10 seconds on a TITAN V GPU using `batch_size=64`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 11.1 s, sys: 2.57 s, total: 13.7 s\n",
      "Wall time: 10.3 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
    "predictions = zsl.predict([doc]*1000, labels=labels, include_labels=True, batch_size=64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With 1000 documents and 5 topics, we are essentially making 5000 predictions in ~10 seconds with a `batch_size=64`. Lower batch sizes would be much slower given this many predictions.  The `batch_size` should be set based on available memory.\n",
    "\n",
    "Finally, there is a `max_length` parameter that is set to 512 as default."
   ]
  },
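  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged sketch, `max_length` can be passed to `predict` alongside the other arguments used above.  The value `128` is an arbitrary choice for illustration, not a recommendation.  The assumption here is that a smaller limit shortens the tokenized input, which may reduce prediction time for long documents at the risk of discarding text beyond the limit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumption: max_length is accepted by predict in the same way as batch_size above.\n",
    "# 128 is an arbitrary illustrative value, not a recommendation.\n",
    "predictions = zsl.predict([doc]*1000, labels=labels, include_labels=True,\n",
    "                          batch_size=64, max_length=128)"
   ]
  },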
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}