{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keyphrase Extraction in `ktrain`\n", "\n", "Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:\n", "```\n", "pip install textblob tika\n", "python -m textblob.download_corpora\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download a Paper from ArXiv and Extract Text\n", "For our test document, let's download the ktrain ArXiv paper and use the `TextExtractor` module to extract text." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of words in downloaded paper: 4316\n" ] } ], "source": [ "print(f\"# of words in downloaded paper: {len(text.split())}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using N-Grams as the candidate generator\n", "\n", "Let's first use `ngrams` as the candidate generator, which is comparatively fast:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "kwe = KeywordExtractor()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 341 ms, sys: 16.9 ms, total: 358 ms\n", "Wall time: 357 ms\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.5503444817814314),\n", " ('augmented machine', 0.5123881190828152),\n", " ('augmented machine learning', 0.5123881190828152),\n", " ('low-code library', 0.5107922072149182),\n", " ('step', 0.5092460272048237),\n", " ('text classification', 0.5044526957819503),\n", " ('open-domain question-answering', 0.4996712653266335),\n", " ('learning rate', 0.4894264238049616),\n", " ('bert', 0.424790141017796),\n", " ('arxiv preprint', 0.16264098705836771)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='ngrams')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Noun Phrases as the candidate generator\n", "\n", "\n", "If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the expense of a longer running time." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 855 ms, sys: 103 µs, total: 856 ms\n", "Wall time: 855 ms\n" ] }, { "data": { "text/plain": [ "[('machine learning', 0.5341716824761019),\n", " ('augmented machine learning', 0.5208544167057394),\n", " ('text classification', 0.5134074336523509),\n", " ('image classification', 0.5071170746851726),\n", " ('node classification', 0.4973034499292447),\n", " ('tabular data', 0.49645958463369566),\n", " ('entity recognition', 0.45195059648705926),\n", " ('exact answers', 0.4462502183477142),\n", " ('import ktrain', 0.32891369271775894),\n", " ('load model', 0.32052348289886556)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Parameters\n", "The `extract_keywords` method has many other parameters to control the output. For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('augmented machine learning', 0.541435342459079),\n", " ('machine learning model', 0.4982195592681719),\n", " ('support text data', 0.49549171563837363),\n", " ('learning rate schedules', 0.47765279578595193),\n", " ('a. s. maiya', 0.4612715229636928),\n", " ('unsupervised topic modeling', 0.44648865417358047),\n", " ('large text corpus', 0.4374416332143215),\n", " ('optimal learning rate', 0.42667304584617965),\n", " ('non-supervised ml tasks', 0.2330746472277638),\n", " ('natural language questions', 0.21662908635171388)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining All the Steps: Low-Code Keyphrase Extraction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('machine learning', 0.5341716824761019),\n", " ('augmented machine learning', 0.5208544167057394),\n", " ('text classification', 0.5134074336523509),\n", " ('image classification', 0.5071170746851726),\n", " ('node classification', 0.4973034499292447),\n", " ('tabular data', 0.49645958463369566),\n", " ('entity recognition', 0.45195059648705926),\n", " ('exact answers', 0.4462502183477142),\n", " ('import ktrain', 0.32891369271775894),\n", " ('load model', 0.32052348289886556)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from ktrain.text.kw import KeywordExtractor\n", "from ktrain.text.textextractor import TextExtractor\n", "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n", "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')\n", "kwe = KeywordExtractor()\n", "kwe.extract_keywords(text, candidate_generator='noun_phrases')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-English Keyphrase Extraction\n", "\n", "Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. 
For simplified or traditional Chinese, use `zh`.\n", "\n", "#### Chinese" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "Loading model cost 0.669 seconds.\n", "Prefix dict has been built successfully.\n" ] }, { "data": { "text/plain": [ "[('监督 学习', 0.53),\n", " ('机器 学习', 0.48103658536585364),\n", " ('学习 任务', 0.4764634146341463),\n", " ('样本 输入', 0.4627439024390244),\n", " ('输入 映射', 0.4398780487804878),\n", " ('自由 一组', 0.39719512195121953),\n", " ('一组 训练', 0.3926219512195122),\n", " ('训练 数据', 0.38670731707317074),\n", " ('学习 算法', 0.22731707317073171),\n", " ('输入 输出', 0.01152439024390244)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"\n", "监督学习是学习一个函数的机器学习任务\n", " 根据样本输入-输出对将输入映射到输出。他推导出一个\n", " 函数来自由一组训练示例组成的标记训练数据。\n", " 在监督学习中,每个示例都是由一个输入对象组成的对\n", " (通常是一个向量)和一个期望的输出值(也称为监控信号)。\n", " 监督学习算法分析训练数据并产生推断函数,\n", " 可用于映射新示例。最佳方案将允许\n", " 算法来正确确定不可见实例的类标签。这需要\n", " 学习算法从训练数据泛化到新情况\n", " “合理”的方式(见归纳偏差)。\n", "\"\"\"\n", "kwe = KeywordExtractor(lang='zh')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### French" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(\"l'apprentissage supervisé\", 0.5098039215686274),\n", " (\"tâche d'apprentissage\", 0.4928634698232476),\n", " (\"d'apprentissage automatique\", 0.489783387687724),\n", " ('automatique consistant', 0.4815698353263277),\n", " (\"base d'exemples\", 0.43588195031606075),\n", " ('paires entrée-sortie', 0.4261283568869026),\n", " (\"données d'entraînement\", 0.4051314571002939),\n", " (\"d'entraînement étiquetées\", 0.39122075935096834),\n", " ('étiquetées constituées', 0.3835205540121593),\n", " (\"constituées d'un\", 0.37787373676369934)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui\n", " mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une\n", " fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.\n", " En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée\n", " (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).\n", " Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,\n", " qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra\n", " algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. 
Cela nécessite\n", " l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un\n", " manière « raisonnable » (voir biais inductif).\"\"\"\n", "\n", "kwe = KeywordExtractor(lang='fr')\n", "kwe.extract_keywords(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following languages are supported:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "en english\n", "ar arabic\n", "az azerbaijani\n", "da danish\n", "nl dutch\n", "fi finnish\n", "fr french\n", "de german\n", "el greek\n", "hu hungarian\n", "id indonesian\n", "it italian\n", "kk kazakh\n", "ne nepali\n", "no norwegian\n", "pt portuguese\n", "ro romanian\n", "ru russian\n", "sl slovene\n", "es spanish\n", "sv swedish\n", "tg tajik\n", "tr turkish\n", "zh chinese\n" ] } ], "source": [ "from ktrain.text.kw.core import SUPPORTED_LANGS\n", "for k,v in SUPPORTED_LANGS.items():\n", " print(k,v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalability\n", "The `KeywordExtractor` is already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"\n", " Supervised learning is the machine learning task of learning a function that\n", " maps an input to an output based on example input-output pairs. It infers a\n", " function from labeled training data consisting of a set of training examples.\n", " In supervised learning, each example is a pair consisting of an input object\n", " (typically a vector) and a desired output value (also called the supervisory signal). \n", " A supervised learning algorithm analyzes the training data and produces an inferred function, \n", " which can be used for mapping new examples. An optimal scenario will allow for the \n", " algorithm to correctly determine the class labels for unseen instances. 
This requires \n", " the learning algorithm to generalize from the training data to unseen situations in a \n", " 'reasonable' way (see inductive bias).\n", "\n", "\"\"\"\n", "docs = [text] * 10000\n", "kwe = KeywordExtractor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can process these 10,000 documents using 8 processors in only a few seconds:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.94 s, sys: 95 ms, total: 4.04 s\n", "Wall time: 9.36 s\n" ] } ], "source": [ "%%time\n", "from joblib import Parallel, delayed\n", "results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of results is 10000\n" ] }, { "data": { "text/plain": [ "[('supervised learning', 0.5357142857142857),\n", " ('machine learning', 0.4946192305347235),\n", " ('learning task', 0.4894975916102677),\n", " ('output based', 0.44980488994573503),\n", " ('example input-output', 0.4395616120968234),\n", " ('input-output pairs', 0.43443997317236754),\n", " ('training data', 0.4236784342418145),\n", " ('labeled training', 0.40499054935674655),\n", " ('data consisting', 0.3941070666422779),\n", " ('learning algorithm', 0.2632461435278337)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f'# of results is {len(results)}')\n", "results[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }