{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # CPU\n", "os.environ['DISABLE_V2_BEHAVIOR'] = '1' # disable V2 Behavior - required for NER in TF2 right now" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **ShallowNLP** Tutorial\n", "\n", "The **ShallowNLP** module in *ktrain* is a small collection of text-analytic utilities to help analyze text data in English, Chinese, Russian, and other languages. All methods in **ShallowNLP** are for use on a normal laptop CPU - no GPUs are required. Thus, it is well-suited to those with minimal computational resources and no GPU access. \n", "\n", "Let's begin by importing the `shallownlp` module." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using DISABLE_V2_BEHAVIOR with TensorFlow\n", "using Keras version: 2.2.4-tf\n" ] } ], "source": [ "from ktrain.text import shallownlp as snlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SECTION 1: Ready-to-Use Named-Entity-Recognition\n", "\n", "**ShallowNLP** includes pre-trained Named Entity Recognition (NER) for English, Chinese, and Russian.\n", "\n", "### English NER\n", "\n", "Extracting entities from:\n", ">Xuetao Cao was head of the Chinese Academy of Medical Sciences and is the current president of Nankai University." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Xuetao Cao', 'PER'),\n", " ('Chinese Academy of Medical Sciences', 'ORG'),\n", " ('Nankai University', 'ORG')]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner = snlp.NER('en')\n", "text = \"\"\"\n", "Xuetao Cao was head of the Chinese Academy of Medical Sciences and is \n", "the current president of Nankai University.\n", "\"\"\"\n", "ner.predict(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ner.predict` method automatically merges tokens by entity. To see the unmerged results, set `merge_tokens=False`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Xuetao', 'B-PER'),\n", " ('Cao', 'I-PER'),\n", " ('was', 'O'),\n", " ('head', 'O'),\n", " ('of', 'O'),\n", " ('the', 'O'),\n", " ('Chinese', 'B-ORG'),\n", " ('Academy', 'I-ORG'),\n", " ('of', 'I-ORG'),\n", " ('Medical', 'I-ORG'),\n", " ('Sciences', 'I-ORG'),\n", " ('and', 'O'),\n", " ('is', 'O'),\n", " ('the', 'O'),\n", " ('current', 'O'),\n", " ('president', 'O'),\n", " ('of', 'O'),\n", " ('Nankai', 'B-ORG'),\n", " ('University', 'I-ORG'),\n", " ('.', 'O')]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner.predict(text, merge_tokens=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ner.predict` method typically operates on single sentences, as in the example above. For multi-sentence documents, sentences can be extracted with `snlp.sent_tokenize`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence #1: Paul Newman is a great actor .\n", "sentence #2: Tommy Wiseau is not .\n" ] } ], "source": [ "document = \"\"\"Paul Newman is a great actor. 
"Tommy Wiseau is not.\"\"\"\n", "sents = []\n", "for idx, sent in enumerate(snlp.sent_tokenize(document)):\n", " sents.append(sent)\n", " print('sentence #%d: %s' % (idx+1, sent))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Paul Newman', 'PER')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner.predict(sents[0])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Tommy Wiseau', 'PER')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner.predict(sents[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chinese NER\n", "Extracting entities from the Chinese translation of:\n", ">Xuetao Cao was head of the Chinese Academy of Medical Sciences and is the current president of Nankai University." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('曹雪涛', 'PER'), ('中国医学科学院', 'ORG'), ('南开大学', 'ORG')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner = snlp.NER('zh')\n", "ner.predict('曹雪涛曾任中国医学科学院院长,现任南开大学校长。')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discovered entities with English translations:\n", "- 曹雪涛 = Cao Xuetao (PER)\n", "- 中国医学科学院 = Chinese Academy of Medical Sciences (ORG)\n", "- 南开大学 = Nankai University (ORG)\n", "\n", "The `snlp.sent_tokenize` function can also be used with Chinese documents:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence #1: 这是关于史密斯博士的第一句话。\n", "sentence #2: 第二句话是关于琼斯先生的。\n" ] } ], "source": [ "document = \"\"\"这是关于史密斯博士的第一句话。第二句话是关于琼斯先生的。\"\"\"\n", "for idx, sent in enumerate(snlp.sent_tokenize(document)):\n", " print('sentence #%d: %s' % (idx+1, sent))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Russian NER\n", "Extracting entities from the Russian translation of:\n", ">Katerina Tikhonova, the youngest daughter of Russian President Vladimir Putin, was appointed head of a new artificial intelligence institute at Moscow State University." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Катерина Тихонова', 'PER'),\n", " ('России', 'LOC'),\n", " ('Владимира Путина', 'PER'),\n", " ('МГУ', 'ORG')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner = snlp.NER('ru')\n", "russian_sentence = \"\"\"Катерина Тихонова, младшая дочь президента России Владимира Путина, \n", "была назначена руководителем нового института искусственного интеллекта в МГУ.\"\"\"\n", "ner.predict(russian_sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discovered entities with English translations:\n", "- Катерина Тихонова = Katerina Tikhonova (PER)\n", "- России = Russia (LOC)\n", "- Владимира Путина = Vladimir Putin (PER)\n", "- МГУ = Moscow State University (ORG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SECTION 2: Text Classification\n", "\n", "**ShallowNLP** makes it easy to build a text classifier with minimal computational resources. \n",
"**ShallowNLP** includes the following sklearn-based text classification models: a non-neural version of [NBSVM](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf), Logistic Regression, and [Linear SVM with SGD training (SGDClassifier)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html). Logistic regression is the default classifier. For these examples, we will use [NBSVM](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).\n", "\n", "A classifier can be trained with minimal effort for both English and Chinese.\n", "\n", "### English Text Classification\n", "\n", "We'll use the IMDb movie review dataset [available here](https://ai.stanford.edu/~amaas/data/sentiment/) to build a sentiment analysis model for English." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "label names: ['neg', 'pos']\n", "validation accuracy: 92.03%\n", "prediction for \"I loved this movie because it was hilarious.\": 1 (pos)\n", "prediction for \"I hated this movie because it was boring.\": 0 (neg)\n" ] } ], "source": [ "datadir = r'/home/amaiya/data/aclImdb'\n", "(x_train, y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')\n", "(x_test, y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)\n", "print('label names: %s' % (label_names))\n", "clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')\n", "print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))\n", "pos_text = 'I loved this movie because it was hilarious.'\n", "neg_text = 'I hated this movie because it was boring.'\n", "print('prediction for \"%s\": %s (pos)' % (pos_text, clf.predict(pos_text)))\n", "print('prediction for \"%s\": %s (neg)' % (neg_text, clf.predict(neg_text)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chinese Text Classification\n", "\n", "We'll use the hotel review dataset [available here](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000) to build a sentiment analysis model for Chinese."
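, "\n", "As in the English example, `load_texts_from_folder` derives the label names from the subfolder names under the folder you pass in (this is where the `['neg', 'pos']` labels come from). Below is a minimal sketch of the assumed on-disk layout; the exact contents of each label folder are an assumption:\n", "\n", "```python\n", "# Assumed layout for snlp.Classifier.load_texts_from_folder\n", "# (one subfolder per label; one text file per review is an assumption):\n", "#   <datadir>/train/neg/...\n", "#   <datadir>/train/pos/...\n", "import os\n", "datadir = '/home/amaiya/data/ChnSentiCorp_htl_ba_6000'\n", "print(sorted(os.listdir(datadir + '/train')))  # expect: ['neg', 'pos']\n", "```"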
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "I0304 17:09:57.303427 140636486911808 __init__.py:111] Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "I0304 17:09:57.305477 140636486911808 __init__.py:131] Loading model from cache /tmp/jieba.cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: GB18030\n", "Decoding with GB18030 failed 1st attempt - using GB18030 with skips\n", "skipped 118 lines (0.3%) due to character decoding errors\n", "label names: ['neg', 'pos']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading model cost 0.640 seconds.\n", "I0304 17:09:57.945362 140636486911808 __init__.py:163] Loading model cost 0.640 seconds.\n", "Prefix dict has been built succesfully.\n", "I0304 17:09:57.946959 140636486911808 __init__.py:164] Prefix dict has been built succesfully.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "validation accuracy: 91.55%\n", "prediction for \"我喜欢这家酒店,因为它很干净。\": 1\n", "prediction for \"我讨厌这家酒店,因为它很吵。\": 0\n" ] } ], "source": [ "datadir = '/home/amaiya/data/ChnSentiCorp_htl_ba_6000'\n", "(texts, labels, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')\n", "print('label names: %s' % (label_names))\n", "from sklearn.model_selection import train_test_split\n", "x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1, random_state=42)\n", "clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')\n", "print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))\n", "pos_text = '我喜欢这家酒店,因为它很干净。' # I loved this hotel because it was very clean.\n", "neg_text = '我讨厌这家酒店,因为它很吵。' # I hated this hotel because it was noisy.\n", "print('prediction for \"%s\": %s' % (pos_text, clf.predict(pos_text)))\n", "print('prediction for \"%s\": %s' % (neg_text, clf.predict(neg_text)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning Hyperparameters of a Text Classifier\n", "\n", "The hyperparameters of a particular classifier can be tuned using the `grid_search` method. Let's tune the **C** hyperparameter of a Logistic Regression model to find the best value for this dataset." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clf__C: 1.0\n" ] } ], "source": [ "# setup data\n", "datadir = r'/home/amaiya/data/aclImdb'\n", "(x_train, y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')\n", "(x_test, y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)\n", "\n", "# initialize a model to optimize\n", "clf = snlp.Classifier()\n", "clf.create_model('logreg', x_train)\n", "\n", "# create parameter space for values of C\n", "parameters = {'clf__C': (1e0, 1e-1, 1e-2)}\n", "\n", "# tune\n", "clf.grid_search(parameters, x_train[:5000], y_train[:5000], n_jobs=-1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like a value of `1.0` is best. \n",
"We can then re-create the model with this hyperparameter value and proceed to train normally:\n", "\n", "```python\n", "clf.create_model('logreg', x_train, hp_dict={'C':1.0})\n", "clf.fit(x_train, y_train)\n", "clf.evaluate(x_test, y_test)\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SECTION 3: Examples of Searching Text\n", "\n", "Here we will show some simple searches over multi-language documents.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "document1 = \"\"\"\n", "Hello there,\n", "\n", "Hope this email finds you well.\n", "\n", "Are you available to talk about our meeting?\n", "\n", "If so, let us plan to schedule the meeting\n", "at the Hefei National Laboratory for Physical Sciences at the Microscale.\n", "\n", "As I always say: живи сегодня надейся на завтра\n", "\n", "Sincerely,\n", "John Doe\n", "合肥微尺度国家物理科学实验室\n", "\"\"\"\n", "\n", "document2 = \"\"\"\n", "This is a random document with Arabic about our meeting.\n", "\n", "عش اليوم الأمل ليوم غد\n", "\n", "Bye for now.\n", "\"\"\"\n", "\n", "docs = [document1, document2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Searching English\n", "\n", "The `search` function returns a list of documents that match the query. Each entry shows:\n", "1. the ID of the document\n", "2. the query (multiple queries can be supplied in a list, if desired)\n", "3. the number of word hits in the document\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('doc1', 'physical sciences', 1),\n", " ('doc1', 'meeting', 2),\n", " ('doc2', 'meeting', 1),\n", " ('doc2', 'Arabic', 1)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.search(['physical sciences', 'meeting', 'Arabic'], docs, keys=['doc1', 'doc2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Searching Chinese\n", "\n", "The `search` function returns a list of documents that match the query. Each entry shows:\n", "1. the ID of the document\n", "2. the query\n", "3. the number of word hits in the document\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('doc1', '合肥微尺度国家物理科学实验室', 7)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.search('合肥微尺度国家物理科学实验室', docs, keys=['doc1', 'doc2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Chinese, the number of word hits is the number of words in the query that appear in the document. Seven of the words in the string 合肥微尺度国家物理科学实验室 were found in `doc1`."
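, "\n", "As a rough illustration of where a count like this comes from, the query can be segmented into words before matching. Below is a minimal sketch assuming jieba-style segmentation (jieba is the tokenizer loaded in the Chinese classification example above; whether `search` uses this exact segmentation is an assumption):\n", "\n", "```python\n", "# Segment the Chinese query into words and count them; the exact\n", "# segmentation may vary with the jieba version and dictionary.\n", "import jieba\n", "query = '合肥微尺度国家物理科学实验室'\n", "words = jieba.lcut(query)\n", "print(words)       # the word pieces matched against each document\n", "print(len(words))  # compare with the 7 word hits reported for doc1\n", "```"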
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Searches\n", "\n", "The `search` function can also be used for other languages.\n", "\n", "#### Arabic" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "doc id:doc2\n", "query:عش اليوم الأمل ليوم غد\n", "# of matches in document:1\n" ] } ], "source": [ "for result in snlp.search('عش اليوم الأمل ليوم غد', docs, keys=['doc1', 'doc2']):\n", " print(\"doc id:%s\"% (result[0]))\n", " print('query:%s' % (result[1]))\n", " print('# of matches in document:%s' % (result[2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Russian" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('doc1', 'сегодня надейся на завтра', 1)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.search('сегодня надейся на завтра', docs, keys=['doc1', 'doc2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract Chinese, Russian, or Arabic from mixed-language documents" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['合肥微尺度国家物理科学实验室']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.find_chinese(document1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['живи', 'сегодня', 'надейся', 'на', 'завтра']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.find_russian(document1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['عش', 'اليوم', 'الأمل', 'ليوم', 'غد']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snlp.find_arabic(document2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }