In [1]:
%reload_ext autoreload
%autoreload 2

## Keyphrase Extraction in `ktrain`

Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:
```
pip install textblob tika
python -m textblob.download_corpora
```

In [2]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor

### Download a Paper from ArXiv and Extract Text
For our test document, let's download the ktrain ArXiv paper and use the `TextExtractor` module to extract text.

In [3]:
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')

In [4]:
print(f"# of words in downloaded paper: {len(text.split())}")

# of words in downloaded paper: 4316


### Using N-Grams as the candidate generator

Let's first use `ngrams` as the candidate generator, which is comparatively fast:

In [5]:
kwe = KeywordExtractor()

In [6]:
%%time
kwe.extract_keywords(text, candidate_generator='ngrams')

CPU times: user 341 ms, sys: 16.9 ms, total: 358 ms
Wall time: 357 ms


[('machine learning', 0.5503444817814314),
 ('augmented machine', 0.5123881190828152),
 ('augmented machine learning', 0.5123881190828152),
 ('low-code library', 0.5107922072149182),
 ('step', 0.5092460272048237),
 ('text classification', 0.5044526957819503),
 ('open-domain question-answering', 0.4996712653266335),
 ('learning rate', 0.4894264238049616),
 ('bert', 0.424790141017796),
 ('arxiv preprint', 0.16264098705836771)]
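
To make the idea of n-gram candidates concrete, here is a minimal sketch of how contiguous word n-grams can be enumerated from a text. It is not ktrain's actual implementation: the function name `ngram_candidates`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration, and a real extractor would also filter stopwords and score each candidate.

```python
def ngram_candidates(text, ngram_range=(1, 3)):
    """Return every contiguous word n-gram in `text` (illustrative only)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    candidates = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


sample = "ktrain is a low-code library for augmented machine learning"
cands = ngram_candidates(sample, ngram_range=(2, 3))
print(len(cands))
print(cands[:3])
```

Because every window of words becomes a candidate, generation is cheap, but many candidates are not meaningful phrases, which is why the scoring step matters.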

### Using Noun Phrases as the candidate generator


If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the cost of a longer running time (roughly 855 ms versus 357 ms of wall time in the runs shown here).

In [8]:
%%time
kwe.extract_keywords(text, candidate_generator='noun_phrases')

CPU times: user 855 ms, sys: 103 µs, total: 856 ms
Wall time: 855 ms


[('machine learning', 0.5341716824761019),
 ('augmented machine learning', 0.5208544167057394),
 ('text classification', 0.5134074336523509),
 ('image classification', 0.5071170746851726),
 ('node classification', 0.4973034499292447),
 ('tabular data', 0.49645958463369566),
 ('entity recognition', 0.45195059648705926),
 ('exact answers', 0.4462502183477142),
 ('import ktrain', 0.32891369271775894),
 ('load model', 0.32052348289886556)]

### Other Parameters
The `extract_keywords` method has many other parameters to control the output. For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:

In [9]:
kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))

[('augmented machine learning', 0.541435342459079),
 ('machine learning model', 0.4982195592681719),
 ('support text data', 0.49549171563837363),
 ('learning rate schedules', 0.47765279578595193),
 ('a. s. maiya', 0.4612715229636928),
 ('unsupervised topic modeling', 0.44648865417358047),
 ('large text corpus', 0.4374416332143215),
 ('optimal learning rate', 0.42667304584617965),
 ('non-supervised ml tasks', 0.2330746472277638),
 ('natural language questions', 0.21662908635171388)]
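
Since `extract_keywords` returns a plain list of `(phrase, score)` tuples, the output can also be post-processed with ordinary Python. The sketch below is one possible approach, not part of ktrain: `filter_keyphrases`, `min_score`, and `exclude` are hypothetical names, and the data is the output above with scores rounded to four decimals.

```python
def filter_keyphrases(keyphrases, min_score=0.0, exclude=()):
    """Keep (phrase, score) pairs scoring at least `min_score` and not in `exclude`."""
    excluded = {phrase.lower() for phrase in exclude}
    return [
        (phrase, score)
        for phrase, score in keyphrases
        if score >= min_score and phrase.lower() not in excluded
    ]


# The 3-word keyphrases shown above (scores rounded to 4 decimals).
keyphrases = [
    ('augmented machine learning', 0.5414),
    ('machine learning model', 0.4982),
    ('support text data', 0.4955),
    ('learning rate schedules', 0.4777),
    ('a. s. maiya', 0.4613),
    ('unsupervised topic modeling', 0.4465),
    ('large text corpus', 0.4374),
    ('optimal learning rate', 0.4267),
    ('non-supervised ml tasks', 0.2331),
    ('natural language questions', 0.2166),
]

# Drop low-scoring phrases and an author name picked up from the references.
filtered = filter_keyphrases(keyphrases, min_score=0.4, exclude=['a. s. maiya'])
for phrase, score in filtered:
    print(f"{score:.4f}  {phrase}")
```

The threshold of 0.4 is arbitrary here; a suitable cutoff depends on the document and the downstream use.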

### Combining All the Steps: Low-Code Keyphrase Extraction

In [10]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')
kwe = KeywordExtractor()
kwe.extract_keywords(text, candidate_generator='noun_phrases')

[('machine learning', 0.5341716824761019),
 ('augmented machine learning', 0.5208544167057394),
 ('text classification', 0.5134074336523509),
 ('image classification', 0.5071170746851726),
 ('node classification', 0.4973034499292447),
 ('tabular data', 0.49645958463369566),
 ('entity recognition', 0.45195059648705926),
 ('exact answers', 0.4462502183477142),
 ('import ktrain', 0.32891369271775894),
 ('load model', 0.32052348289886556)]

### Non-English Keyphrase Extraction

Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. For simplified or traditional Chinese, use `zh`.

#### Chinese

In [11]:
text = """
监督学习是学习一个函数的机器学习任务
 根据样本输入-输出对将输入映射到输出。他推导出一个
 函数来自由一组训练示例组成的标记训练数据。
 在监督学习中,每个示例都是由一个输入对象组成的对
 (通常是一个向量)和一个期望的输出值(也称为监控信号)。
 监督学习算法分析训练数据并产生推断函数,
 可用于映射新示例。最佳方案将允许
 算法来正确确定不可见实例的类标签。这需要
 学习算法从训练数据泛化到新情况
 “合理”的方式(见归纳偏差)。
"""
kwe = KeywordExtractor(lang='zh')
kwe.extract_keywords(text)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.669 seconds.
Prefix dict has been built successfully.


[('监督 学习', 0.53),
 ('机器 学习', 0.48103658536585364),
 ('学习 任务', 0.4764634146341463),
 ('样本 输入', 0.4627439024390244),
 ('输入 映射', 0.4398780487804878),
 ('自由 一组', 0.39719512195121953),
 ('一组 训练', 0.3926219512195122),
 ('训练 数据', 0.38670731707317074),
 ('学习 算法', 0.22731707317073171),
 ('输入 输出', 0.01152439024390244)]

#### French

In [12]:
text = """L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui
 mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une
 fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.
 En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée
 (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).
 Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,
 qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra
 algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. Cela nécessite
 l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un
 manière « raisonnable » (voir biais inductif)."""

kwe = KeywordExtractor(lang='fr')
kwe.extract_keywords(text)

[("l'apprentissage supervisé", 0.5098039215686274),
 ("tâche d'apprentissage", 0.4928634698232476),
 ("d'apprentissage automatique", 0.489783387687724),
 ('automatique consistant', 0.4815698353263277),
 ("base d'exemples", 0.43588195031606075),
 ('paires entrée-sortie', 0.4261283568869026),
 ("données d'entraînement", 0.4051314571002939),
 ("d'entraînement étiquetées", 0.39122075935096834),
 ('étiquetées constituées', 0.3835205540121593),
 ("constituées d'un", 0.37787373676369934)]

The following languages are supported:

In [13]:
from ktrain.text.kw.core import SUPPORTED_LANGS
for k, v in SUPPORTED_LANGS.items():
    print(k, v)

en english
ar arabic
az azerbaijani
da danish
nl dutch
fi finnish
fr french
de german
el greek
hu hungarian
id indonesian
it italian
kk kazakh
ne nepali
no norwegian
pt portuguese
ro romanian
ru russian
sl slovene
es spanish
sv swedish
tg tajik
tr turkish
zh chinese
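
When the language code comes from user input or file metadata, it can help to check it against this list before constructing an extractor. The following is a minimal sketch under stated assumptions: `normalize_lang` is a hypothetical helper, and the dictionary is copied from the listing above (rather than imported from ktrain) so that the snippet runs on its own.

```python
# Copied from the SUPPORTED_LANGS listing printed above (assumption: the
# listing is complete for the installed ktrain version).
SUPPORTED = {
    'en': 'english', 'ar': 'arabic', 'az': 'azerbaijani', 'da': 'danish',
    'nl': 'dutch', 'fi': 'finnish', 'fr': 'french', 'de': 'german',
    'el': 'greek', 'hu': 'hungarian', 'id': 'indonesian', 'it': 'italian',
    'kk': 'kazakh', 'ne': 'nepali', 'no': 'norwegian', 'pt': 'portuguese',
    'ro': 'romanian', 'ru': 'russian', 'sl': 'slovene', 'es': 'spanish',
    'sv': 'swedish', 'tg': 'tajik', 'tr': 'turkish', 'zh': 'chinese',
}


def normalize_lang(code):
    """Return a cleaned 2-character code, or raise ValueError if unsupported."""
    key = code.strip().lower()
    if key not in SUPPORTED:
        options = ', '.join(sorted(SUPPORTED))
        raise ValueError(f"Unsupported language code {code!r}; choose one of: {options}")
    return key


lang = normalize_lang(' FR ')
print(lang, SUPPORTED[lang])
```

The validated code could then be passed as the `lang` argument shown in the Chinese and French examples.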


### Scalability
The `KeywordExtractor` is already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents.

In [14]:
text = """
 Supervised learning is the machine learning task of learning a function that
 maps an input to an output based on example input-output pairs. It infers a
 function from labeled training data consisting of a set of training examples.
 In supervised learning, each example is a pair consisting of an input object
 (typically a vector) and a desired output value (also called the supervisory signal). 
 A supervised learning algorithm analyzes the training data and produces an inferred function, 
 which can be used for mapping new examples. An optimal scenario will allow for the 
 algorithm to correctly determine the class labels for unseen instances. This requires 
 the learning algorithm to generalize from the training data to unseen situations in a 
 'reasonable' way (see inductive bias).

"""
docs = [text] * 10000
kwe = KeywordExtractor()

We can process these 10,000 documents using 8 processors in under 10 seconds:

In [16]:
%%time
from joblib import Parallel, delayed
results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)

CPU times: user 3.94 s, sys: 95 ms, total: 4.04 s
Wall time: 9.36 s


In [17]:
print(f'# of results is {len(results)}')
results[0]

# of results is 10000


[('supervised learning', 0.5357142857142857),
 ('machine learning', 0.4946192305347235),
 ('learning task', 0.4894975916102677),
 ('output based', 0.44980488994573503),
 ('example input-output', 0.4395616120968234),
 ('input-output pairs', 0.43443997317236754),
 ('training data', 0.4236784342418145),
 ('labeled training', 0.40499054935674655),
 ('data consisting', 0.3941070666422779),
 ('learning algorithm', 0.2632461435278337)]
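
Once keyphrases have been extracted for many documents, a common next step is to summarize them across the corpus. The sketch below shows one way to do this with the standard library only; `aggregate_keyphrases` is a hypothetical helper, the ranking rule (document count first, then mean score) is an arbitrary choice, and `toy_results` is a small hand-made stand-in with the same shape as `results`.

```python
from collections import defaultdict


def aggregate_keyphrases(results, top_n=5):
    """Rank keyphrases by number of documents containing them, then by mean score."""
    doc_counts = defaultdict(int)
    score_sums = defaultdict(float)
    for doc_keyphrases in results:
        seen = set()
        for phrase, score in doc_keyphrases:
            if phrase in seen:
                continue  # count each phrase at most once per document
            seen.add(phrase)
            doc_counts[phrase] += 1
            score_sums[phrase] += score
    ranked = sorted(
        doc_counts,
        key=lambda p: (doc_counts[p], score_sums[p] / doc_counts[p]),
        reverse=True,
    )
    # Each entry is (phrase, document count, mean score).
    return [(p, doc_counts[p], score_sums[p] / doc_counts[p]) for p in ranked[:top_n]]


# Invented toy data for illustration; real input would be `results` from above.
toy_results = [
    [('supervised learning', 0.54), ('training data', 0.42)],
    [('supervised learning', 0.50), ('machine learning', 0.49)],
    [('machine learning', 0.47), ('supervised learning', 0.52)],
]
summary = aggregate_keyphrases(toy_results, top_n=2)
for phrase, n_docs, mean_score in summary:
    print(f"{phrase}: {n_docs} docs, mean score {mean_score:.2f}")
```

With the 10,000 identical documents used in this section every phrase would appear in every document, so the summary is only informative on a corpus of distinct texts.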