{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keyphrase Extraction in `ktrain`\n", "\n", "Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:\n", "```\n", "pip install textblob tika\n", "python -m textblob.download_corpora\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download a Paper from ArXiv and Extract Text\n", "For our test document, let's download the ktrain ArXiv paper and use the `TextExtractor` module to extract text." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of words in downloaded paper: 4316\n" ] } ], "source": [ "print(f\"# of words in downloaded paper: {len(text.split())}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using N-Grams as the candidate generator\n", "\n", "Let's first use `ngrams` as the candidate generator, which is comparatively fast:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "kwe = KeywordExtractor()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 341 ms, sys: 16.9 ms, total: 358 ms\n", "Wall time: 357 ms\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.5503444817814314),\n", " ('augmented machine', 0.5123881190828152),\n", " ('augmented machine learning', 0.5123881190828152),\n", " ('low-code library', 0.5107922072149182),\n", " ('step', 0.5092460272048237),\n", " ('text classification', 0.5044526957819503),\n", " ('open-domain question-answering', 0.4996712653266335),\n", " ('learning rate', 0.4894264238049616),\n", " ('bert', 0.424790141017796),\n", " ('arxiv preprint', 0.16264098705836771)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='ngrams')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Noun Phrases as the candidate generator\n", "\n", "\n", "If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the expense of a longer running time." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 855 ms, sys: 103 µs, total: 856 ms\n", "Wall time: 855 ms\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.5341716824761019),\n", " ('augmented machine learning', 0.5208544167057394),\n", " ('text classification', 0.5134074336523509),\n", " ('image classification', 0.5071170746851726),\n", " ('node classification', 0.4973034499292447),\n", " ('tabular data', 0.49645958463369566),\n", " ('entity recognition', 0.45195059648705926),\n", " ('exact answers', 0.4462502183477142),\n", " ('import ktrain', 0.32891369271775894),\n", " ('load model', 0.32052348289886556)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Parameters\n", "The `extract_keywords` method has many other parameters to control the output. For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('augmented machine learning', 0.541435342459079),\n", " ('machine learning model', 0.4982195592681719),\n", " ('support text data', 0.49549171563837363),\n", " ('learning rate schedules', 0.47765279578595193),\n", " ('a. s. maiya', 0.4612715229636928),\n", " ('unsupervised topic modeling', 0.44648865417358047),\n", " ('large text corpus', 0.4374416332143215),\n", " ('optimal learning rate', 0.42667304584617965),\n", " ('non-supervised ml tasks', 0.2330746472277638),\n", " ('natural language questions', 0.21662908635171388)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining All the Steps: Low-Code Keyphrase Extraction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('machine learning', 0.5341716824761019),\n", " ('augmented machine learning', 0.5208544167057394),\n", " ('text classification', 0.5134074336523509),\n", " ('image classification', 0.5071170746851726),\n", " ('node classification', 0.4973034499292447),\n", " ('tabular data', 0.49645958463369566),\n", " ('entity recognition', 0.45195059648705926),\n", " ('exact answers', 0.4462502183477142),\n", " ('import ktrain', 0.32891369271775894),\n", " ('load model', 0.32052348289886556)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor\n", "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')\n", "kwe = KeywordExtractor()\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-English Keyphrase Extraction\n", "\n", "Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. 
For simplified or traditional Chinese, use `zh`.\n", "\n", "#### Chinese" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "Loading model cost 0.669 seconds.\n", "Prefix dict has been built successfully.\n" ] }, { "data": { "text/plain": [ "[('监督 学习', 0.53),\n", " ('机器 学习', 0.48103658536585364),\n", " ('学习 任务', 0.4764634146341463),\n", " ('样本 输入', 0.4627439024390244),\n", " ('输入 映射', 0.4398780487804878),\n", " ('自由 一组', 0.39719512195121953),\n", " ('一组 训练', 0.3926219512195122),\n", " ('训练 数据', 0.38670731707317074),\n", " ('学习 算法', 0.22731707317073171),\n", " ('输入 输出', 0.01152439024390244)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"\n", "监督学习是学习一个函数的机器学习任务\n", " 根据样本输入-输出对将输入映射到输出。他推导出一个\n", " 函数来自由一组训练示例组成的标记训练数据。\n", " 在监督学习中,每个示例都是由一个输入对象组成的对\n", " (通常是一个向量)和一个期望的输出值(也称为监控信号)。\n", " 监督学习算法分析训练数据并产生推断函数,\n", " 可用于映射新示例。最佳方案将允许\n", " 算法来正确确定不可见实例的类标签。这需要\n", " 学习算法从训练数据泛化到新情况\n", " “合理”的方式(见归纳偏差)。\n", "\"\"\"\n", "kwe = KeywordExtractor(lang='zh')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### French" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(\"l'apprentissage supervisé\", 0.5098039215686274),\n", " (\"tâche d'apprentissage\", 0.4928634698232476),\n", " (\"d'apprentissage automatique\", 0.489783387687724),\n", " ('automatique consistant', 0.4815698353263277),\n", " (\"base d'exemples\", 0.43588195031606075),\n", " ('paires entrée-sortie', 0.4261283568869026),\n", " (\"données d'entraînement\", 0.4051314571002939),\n", " (\"d'entraînement étiquetées\", 0.39122075935096834),\n", " ('étiquetées constituées', 0.3835205540121593),\n", " (\"constituées d'un\", 0.37787373676369934)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui\n", " mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une\n", " fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.\n", " En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée\n", " (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).\n", " Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,\n", " qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra\n", " algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. 
Cela nécessite\n", " l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un\n", " manière « raisonnable » (voir biais inductif).\"\"\"\n", "\n", "kwe = KeywordExtractor(lang='fr')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following languages are supported:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "en english\n", "ar arabic\n", "az azerbaijani\n", "da danish\n", "nl dutch\n", "fi finnish\n", "fr french\n", "de german\n", "el greek\n", "hu hungarian\n", "id indonesian\n", "it italian\n", "kk kazakh\n", "ne nepali\n", "no norwegian\n", "pt portuguese\n", "ro romanian\n", "ru russian\n", "sl slovene\n", "es spanish\n", "sv swedish\n", "tg tajik\n", "tr turkish\n", "zh chinese\n" ] } ], "source": [ "from ktrain.text.kw.core import SUPPORTED_LANGS\n", "for k,v in SUPPORTED_LANGS.items():\n", " print(k,v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalability\n", "The `KeywordExtractor` is already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"\n", " Supervised learning is the machine learning task of learning a function that\n", " maps an input to an output based on example input-output pairs. It infers a\n", " function from labeled training data consisting of a set of training examples.\n", " In supervised learning, each example is a pair consisting of an input object\n", " (typically a vector) and a desired output value (also called the supervisory signal). \n", " A supervised learning algorithm analyzes the training data and produces an inferred function, \n", " which can be used for mapping new examples. An optimal scenario will allow for the \n", " algorithm to correctly determine the class labels for unseen instances. 
This requires \n", " the learning algorithm to generalize from the training data to unseen situations in a \n", " 'reasonable' way (see inductive bias).\n", "\n", "\"\"\"\n", "docs = [text] * 10000\n", "kwe = KeywordExtractor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can process these 10,000 documents using 8 processors in only a few seconds:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.94 s, sys: 95 ms, total: 4.04 s\n", "Wall time: 9.36 s\n" ] } ], "source": [ "%%time\n", "from joblib import Parallel, delayed\n", "results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of results is 10000\n" ] }, { "data": { "text/plain": [ "[('supervised learning', 0.5357142857142857),\n", " ('machine learning', 0.4946192305347235),\n", " ('learning task', 0.4894975916102677),\n", " ('output based', 0.44980488994573503),\n", " ('example input-output', 0.4395616120968234),\n", " ('input-output pairs', 0.43443997317236754),\n", " ('training data', 0.4236784342418145),\n", " ('labeled training', 0.40499054935674655),\n", " ('data consisting', 0.3941070666422779),\n", " ('learning algorithm', 0.2632461435278337)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f'# of results is {len(results)}')\n", "results[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }