{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keyphrase Extraction in `ktrain`\n", "\n", "Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:\n", "```\n", "pip install textblob textract\n", "python -m textblob.download_corpora\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download a Paper from ArXiv and Extract Text\n", "For our test document, let's download the ktrain ArXiv paper and use the `TextExtractor` module to extract text." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of words in downloaded paper: 4551\n" ] } ], "source": [ "print(f\"# of words in downloaded paper: {len(text.split())}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using N-Grams as the candidate generator\n", "\n", "Let's first use `ngrams` as the candidate generator, which is comparatively fast:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "kwe = KeywordExtractor()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 396 ms, sys: 19.8 ms, total: 416 ms\n", "Wall time: 415 ms\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.10548523206751055),\n", " ('step', 0.06751054852320675),\n", " ('learning rate', 0.046413502109704644),\n", " ('arxiv preprint', 0.046413502109704644),\n", " ('text classification', 0.03375527426160337),\n", " ('augmented machine', 0.02531645569620253),\n", " ('open-domain question-answering', 0.02531645569620253),\n", " ('augmented machine learning', 0.02531645569620253),\n", " ('bert', 0.02109704641350211),\n", " ('low-code library', 0.02109704641350211)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='ngrams')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Noun Phrases as the candidate generator\n", "\n", "\n", "If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the expense of a longer running time." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s\n", "Wall time: 1.04 s\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.0784313725490196),\n", " ('text classification', 0.049019607843137254),\n", " ('image classification', 0.049019607843137254),\n", " ('exact answers', 0.0392156862745098),\n", " ('augmented machine learning', 0.0392156862745098),\n", " ('graph data', 0.029411764705882353),\n", " ('node classification', 0.029411764705882353),\n", " ('entity recognition', 0.029411764705882353),\n", " ('code example', 0.029411764705882353),\n", " ('index documents', 0.029411764705882353)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Parameters\n", "The `extract_keywords` method has many other parameters to control the output. For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('augmented machine learning', 0.07017543859649122),\n", " ('a. s. maiya', 0.05263157894736842),\n", " ('optimal learning rate', 0.03508771929824561),\n", " ('natural language questions', 0.03508771929824561),\n", " ('support text data', 0.017543859649122806),\n", " ('learning rate schedules', 0.017543859649122806),\n", " ('machine learning model', 0.017543859649122806),\n", " ('unsupervised topic modeling', 0.017543859649122806),\n", " ('large text corpus', 0.017543859649122806),\n", " ('social media accounts', 0.017543859649122806)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining All the Steps: Low-Code Keyphrase Extraction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('machine learning', 0.0784313725490196),\n", " ('text classification', 0.049019607843137254),\n", " ('image classification', 0.049019607843137254),\n", " ('exact answers', 0.0392156862745098),\n", " ('augmented machine learning', 0.0392156862745098),\n", " ('graph data', 0.029411764705882353),\n", " ('node classification', 0.029411764705882353),\n", " ('entity recognition', 0.029411764705882353),\n", " ('code example', 0.029411764705882353),\n", " ('index documents', 0.029411764705882353)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor\n", "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')\n", "kwe = KeywordExtractor()\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-English Keyphrase Extraction\n", "\n", "Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. For simplified or traditional Chinese, use `zh`.\n", "\n", "#### Chinese" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('监督 学习', 0.06),\n", " ('训练 数据', 0.06),\n", " ('学习 算法', 0.04),\n", " ('机器 学习', 0.02),\n", " ('学习 任务', 0.02),\n", " ('样本 输入', 0.02),\n", " ('输入 输出', 0.02),\n", " ('输入 映射', 0.02),\n", " ('自由 一组', 0.02),\n", " ('一组 训练', 0.02)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"\n", "监督学习是学习一个函数的机器学习任务\n", " 根据样本输入-输出对将输入映射到输出。他推导出一个\n", " 函数来自由一组训练示例组成的标记训练数据。\n", " 在监督学习中,每个示例都是由一个输入对象组成的对\n", " (通常是一个向量)和一个期望的输出值(也称为监控信号)。\n", " 监督学习算法分析训练数据并产生推断函数,\n", " 可用于映射新示例。最佳方案将允许\n", " 算法来正确确定不可见实例的类标签。这需要\n", " 学习算法从训练数据泛化到新情况\n", " “合理”的方式(见归纳偏差)。\n", "\"\"\"\n", "kwe = KeywordExtractor(lang='zh')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### French" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(\"données d'entraînement\", 0.0392156862745098),\n", " (\"l'apprentissage supervisé\", 0.0196078431372549),\n", " (\"tâche d'apprentissage\", 0.0196078431372549),\n", " (\"d'apprentissage automatique\", 0.0196078431372549),\n", " ('automatique consistant', 0.0196078431372549),\n", " (\"base d'exemples\", 0.0196078431372549),\n", " ('paires entrée-sortie', 0.0196078431372549),\n", " (\"d'entraînement étiquetées\", 0.0196078431372549),\n", " ('étiquetées constituées', 0.0196078431372549),\n", " (\"constituées d'un\", 0.0196078431372549)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui\n", " mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une\n", " fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.\n", " En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée\n", " (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).\n", " Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,\n", " qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra\n", " algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. Cela nécessite\n", " l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un\n", " manière « raisonnable » (voir biais inductif).\"\"\"\n", "\n", "kwe = KeywordExtractor(lang='fr')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following languages are supported:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "en english\n", "ar arabic\n", "az azerbaijani\n", "da danish\n", "nl dutch\n", "fi finnish\n", "fr french\n", "de german\n", "el greek\n", "hu hungarian\n", "id indonesian\n", "it italian\n", "kk kazakh\n", "ne nepali\n", "no norwegian\n", "pt portuguese\n", "ro romanian\n", "ru russian\n", "sl slovene\n", "es spanish\n", "sv swedish\n", "tg tajik\n", "tr turkish\n", "zh chinese\n" ] } ], "source": [ "from ktrain.text.kw.core import SUPPORTED_LANGS\n", "for k,v in SUPPORTED_LANGS.items():\n", " print(k,v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalability\n", "The `KeywordExtractor` is a already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"\n", " Supervised learning is the machine learning task of learning a function that\n", " maps an input to an output based on example input-output pairs. It infers a\n", " function from labeled training data consisting of a set of training examples.\n", " In supervised learning, each example is a pair consisting of an input object\n", " (typically a vector) and a desired output value (also called the supervisory signal). \n", " A supervised learning algorithm analyzes the training data and produces an inferred function, \n", " which can be used for mapping new examples. An optimal scenario will allow for the \n", " algorithm to correctly determine the class labels for unseen instances. This requires \n", " the learning algorithm to generalize from the training data to unseen situations in a \n", " 'reasonable' way (see inductive bias).\n", "\n", "\"\"\"\n", "docs = [text] * 10000\n", "kwe = KeywordExtractor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can process these 10,000 documents using 8 processors in only a few seconds:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.19 s, sys: 225 ms, total: 2.42 s\n", "Wall time: 9.51 s\n" ] } ], "source": [ "%%time\n", "from joblib import Parallel, delayed\n", "results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of results is 10000\n" ] }, { "data": { "text/plain": [ "[('supervised learning', 0.07317073170731707),\n", " ('training data', 0.07317073170731707),\n", " ('learning algorithm', 0.04878048780487805),\n", " ('machine learning', 0.024390243902439025),\n", " ('learning task', 0.024390243902439025),\n", " ('output based', 0.024390243902439025),\n", " ('example input-output', 0.024390243902439025),\n", " ('input-output pairs', 0.024390243902439025),\n", " ('labeled training', 0.024390243902439025),\n", " ('data consisting', 0.024390243902439025)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f'# of results is {len(results)}')\n", "results[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }