{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Keyphrase Extraction in `ktrain`\n",
    "\n",
    "Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:\n",
    "```\n",
    "pip install textblob textract\n",
    "python -m textblob.download_corpora\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ktrain.text.kw import KeywordExtractor\n",
    "from ktrain.text.textextractor import TextExtractor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download a Paper from ArXiv and Extract Text\n",
    "For our test document, let's download the ktrain ArXiv paper and use the `TextExtractor` module to extract text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n",
    "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of words in downloaded paper: 4551\n"
     ]
    }
   ],
   "source": [
    "print(f\"# of words in downloaded paper: {len(text.split())}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using N-Grams as the candidate generator\n",
    "\n",
    "Let's first use `ngrams` as the candidate generator, which is comparatively fast:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "kwe = KeywordExtractor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 396 ms, sys: 19.8 ms, total: 416 ms\n",
      "Wall time: 415 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('machine learning', 0.10548523206751055),\n",
       " ('step', 0.06751054852320675),\n",
       " ('learning rate', 0.046413502109704644),\n",
       " ('arxiv preprint', 0.046413502109704644),\n",
       " ('text classification', 0.03375527426160337),\n",
       " ('augmented machine', 0.02531645569620253),\n",
       " ('open-domain question-answering', 0.02531645569620253),\n",
       " ('augmented machine learning', 0.02531645569620253),\n",
       " ('bert', 0.02109704641350211),\n",
       " ('low-code library', 0.02109704641350211)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "kwe.extract_keywords(text, candidate_generator='ngrams')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Noun Phrases as the candidate generator\n",
    "\n",
    "\n",
    "If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the expense of a longer running time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s\n",
      "Wall time: 1.04 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('machine learning', 0.0784313725490196),\n",
       " ('text classification', 0.049019607843137254),\n",
       " ('image classification', 0.049019607843137254),\n",
       " ('exact answers', 0.0392156862745098),\n",
       " ('augmented machine learning', 0.0392156862745098),\n",
       " ('graph data', 0.029411764705882353),\n",
       " ('node classification', 0.029411764705882353),\n",
       " ('entity recognition', 0.029411764705882353),\n",
       " ('code example', 0.029411764705882353),\n",
       " ('index documents', 0.029411764705882353)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "kwe.extract_keywords(text, candidate_generator='noun_phrases')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other Parameters\n",
    "The `extract_keywords` method has many other parameters to control the output.  For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('augmented machine learning', 0.07017543859649122),\n",
       " ('a. s. maiya', 0.05263157894736842),\n",
       " ('optimal learning rate', 0.03508771929824561),\n",
       " ('natural language questions', 0.03508771929824561),\n",
       " ('support text data', 0.017543859649122806),\n",
       " ('learning rate schedules', 0.017543859649122806),\n",
       " ('machine learning model', 0.017543859649122806),\n",
       " ('unsupervised topic modeling', 0.017543859649122806),\n",
       " ('large text corpus', 0.017543859649122806),\n",
       " ('social media accounts', 0.017543859649122806)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining All the Steps:  Low-Code Keyphrase Extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('machine learning', 0.0784313725490196),\n",
       " ('text classification', 0.049019607843137254),\n",
       " ('image classification', 0.049019607843137254),\n",
       " ('exact answers', 0.0392156862745098),\n",
       " ('augmented machine learning', 0.0392156862745098),\n",
       " ('graph data', 0.029411764705882353),\n",
       " ('node classification', 0.029411764705882353),\n",
       " ('entity recognition', 0.029411764705882353),\n",
       " ('code example', 0.029411764705882353),\n",
       " ('index documents', 0.029411764705882353)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ktrain.text.kw import KeywordExtractor\n",
    "from ktrain.text.textextractor import TextExtractor\n",
    "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n",
    "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')\n",
    "kwe = KeywordExtractor()\n",
    "kwe.extract_keywords(text, candidate_generator='noun_phrases')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-English Keyphrase Extraction\n",
    "\n",
    "Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. For simplified or traditional Chinese, use `zh`.\n",
    "\n",
    "#### Chinese"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('监督 学习', 0.06),\n",
       " ('训练 数据', 0.06),\n",
       " ('学习 算法', 0.04),\n",
       " ('机器 学习', 0.02),\n",
       " ('学习 任务', 0.02),\n",
       " ('样本 输入', 0.02),\n",
       " ('输入 输出', 0.02),\n",
       " ('输入 映射', 0.02),\n",
       " ('自由 一组', 0.02),\n",
       " ('一组 训练', 0.02)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"\"\"\n",
    "监督学习是学习一个函数的机器学习任务\n",
    "         根据样本输入-输出对将输入映射到输出。他推导出一个\n",
    "         函数来自由一组训练示例组成的标记训练数据。\n",
    "         在监督学习中，每个示例都是由一个输入对象组成的对\n",
    "         （通常是一个向量）和一个期望的输出值（也称为监控信号）。\n",
    "         监督学习算法分析训练数据并产生推断函数，\n",
    "         可用于映射新示例。最佳方案将允许\n",
    "         算法来正确确定不可见实例的类标签。这需要\n",
    "         学习算法从训练数据泛化到新情况\n",
    "         “合理”的方式（见归纳偏差）。\n",
    "\"\"\"\n",
    "kwe = KeywordExtractor(lang='zh')\n",
    "kwe.extract_keywords(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### French"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(\"données d'entraînement\", 0.0392156862745098),\n",
       " (\"l'apprentissage supervisé\", 0.0196078431372549),\n",
       " (\"tâche d'apprentissage\", 0.0196078431372549),\n",
       " (\"d'apprentissage automatique\", 0.0196078431372549),\n",
       " ('automatique consistant', 0.0196078431372549),\n",
       " (\"base d'exemples\", 0.0196078431372549),\n",
       " ('paires entrée-sortie', 0.0196078431372549),\n",
       " (\"d'entraînement étiquetées\", 0.0196078431372549),\n",
       " ('étiquetées constituées', 0.0196078431372549),\n",
       " (\"constituées d'un\", 0.0196078431372549)]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"\"\"L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui\n",
    "         mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une\n",
    "         fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.\n",
    "         En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée\n",
    "         (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).\n",
    "         Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,\n",
    "         qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra\n",
    "         algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. Cela nécessite\n",
    "         l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un\n",
    "         manière « raisonnable » (voir biais inductif).\"\"\"\n",
    "\n",
    "kwe = KeywordExtractor(lang='fr')\n",
    "kwe.extract_keywords(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following languages are supported:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "en english\n",
      "ar arabic\n",
      "az azerbaijani\n",
      "da danish\n",
      "nl dutch\n",
      "fi finnish\n",
      "fr french\n",
      "de german\n",
      "el greek\n",
      "hu hungarian\n",
      "id indonesian\n",
      "it italian\n",
      "kk kazakh\n",
      "ne nepali\n",
      "no norwegian\n",
      "pt portuguese\n",
      "ro romanian\n",
      "ru russian\n",
      "sl slovene\n",
      "es spanish\n",
      "sv swedish\n",
      "tg tajik\n",
      "tr turkish\n",
      "zh chinese\n"
     ]
    }
   ],
   "source": [
    "from ktrain.text.kw.core import SUPPORTED_LANGS\n",
    "for k,v in SUPPORTED_LANGS.items():\n",
    "    print(k,v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scalability\n",
    "The `KeywordExtractor` is a already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"\n",
    "         Supervised learning is the machine learning task of learning a function that\n",
    "         maps an input to an output based on example input-output pairs. It infers a\n",
    "         function from labeled training data consisting of a set of training examples.\n",
    "         In supervised learning, each example is a pair consisting of an input object\n",
    "         (typically a vector) and a desired output value (also called the supervisory signal). \n",
    "         A supervised learning algorithm analyzes the training data and produces an inferred function, \n",
    "         which can be used for mapping new examples. An optimal scenario will allow for the \n",
    "         algorithm to correctly determine the class labels for unseen instances. This requires \n",
    "         the learning algorithm to generalize from the training data to unseen situations in a \n",
    "         'reasonable' way (see inductive bias).\n",
    "\n",
    "\"\"\"\n",
    "docs = [text] * 10000\n",
    "kwe = KeywordExtractor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can process these 10,000 documents using 8 processors in only a few seconds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.19 s, sys: 225 ms, total: 2.42 s\n",
      "Wall time: 9.51 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "from joblib import Parallel, delayed\n",
    "results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of results is 10000\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('supervised learning', 0.07317073170731707),\n",
       " ('training data', 0.07317073170731707),\n",
       " ('learning algorithm', 0.04878048780487805),\n",
       " ('machine learning', 0.024390243902439025),\n",
       " ('learning task', 0.024390243902439025),\n",
       " ('output based', 0.024390243902439025),\n",
       " ('example input-output', 0.024390243902439025),\n",
       " ('input-output pairs', 0.024390243902439025),\n",
       " ('labeled training', 0.024390243902439025),\n",
       " ('data consisting', 0.024390243902439025)]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f'# of results is {len(results)}')\n",
    "results[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}