{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*ktrain* uses TensorFlow 2. To support sequence-tagging, *ktrain* also currently uses the CRF module from `keras_contrib`, which is not yet fully compatible with TensorFlow 2.\n", "To use the BiLSTM-CRF model (which currently requires `keras_contrib`) for sequence-tagging in *ktrain*, you must disable V2 behavior in TensorFlow 2\n", "by adding the following line to the top of your notebook or script **before** importing *ktrain*:\n", "```python\n", "import os\n", "os.environ['DISABLE_V2_BEHAVIOR'] = '1'\n", "```\n", "Since we are employing a CRF layer in this notebook, we will set this value here:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "os.environ['DISABLE_V2_BEHAVIOR'] = '1'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using DISABLE_V2_BEHAVIOR with TensorFlow\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sequence Tagging\n", "\n", "Sequence tagging (or sequence labeling) involves classifying words or sequences of words as representing some category or concept of interest. One example of sequence tagging is Named Entity Recognition (NER), where we classify words or sequences of words that identify some entity such as a person, organization, or location. 
In this tutorial, we will show how to use *ktrain* to perform sequence tagging in three simple steps.\n", "\n", "## STEP 1: Load and Preprocess Data\n", "\n", "The `entities_from_txt` function can be used to load tagged sentences from a text file. The text file can be in one of two different formats: 1) the [CoNLL2003 format](https://www.aclweb.org/anthology/W03-0419) or 2) the [Groningen Meaning Bank (GMB) format](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). In both formats, there is one word and its associated tag on each line (where the word and tag are delimited by a space, tab, or comma). Words are ordered as they appear in the sentence. In the CoNLL2003 format, a blank line separates sentences. In the GMB format, there is a third column for Sentence ID that assigns a number to each row indicating the sentence to which the word belongs. If you are building a sequence tagger for your own use case with the `entities_from_txt` function, the training data should be formatted into one of these two formats. Alternatively, one can use the `entities_from_array` function, which simply expects arrays of the following form:\n", "```python\n", "x_train = [['Hello', 'world', '!'], ['Hello', 'Barack', 'Obama'], ['I', 'love', 'Chicago']]\n", "y_train = [['O', 'O', 'O'], ['O', 'B-PER', 'I-PER'], ['O', 'O', 'B-LOC']]\n", "```\n", "Note that the tags in this example follow the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).\n", "\n", "In this notebook, we will use `entities_from_txt` to build a sequence tagger using the Groningen Meaning Bank NER dataset available on Kaggle [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). The format essentially looks like this (with fields delimited by commas):\n", "```\n", " SentenceID Word Tag \n", " 1 Paul B-PER\n", " 1 Newman I-PER\n", " 1 is O\n", " 1 a O\n", " 1 great O\n", " 1 actor O\n", " 1 . 
O\n", " ```\n", "\n", "We will be using the file `ner_dataset.csv` (which conforms to the format above) and will load and preprocess it using the `entities_from_txt` function. The output is similar to data-loading functions used in previous tutorials and includes the processed training set, processed validation set, and an instance of `NERPreprocessor`. \n", "\n", "In the Kaggle dataset `ner_dataset.csv`, the three columns of interest (mentioned above) are labeled 'Sentence #', 'Word', and 'Tag'. Thus, we specify these in the call to the function." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: WINDOWS-1250 (if wrong, set manually)\n", "Number of sentences: 47959\n", "Number of words in the dataset: 35178\n", "Tags: ['B-art', 'I-art', 'I-eve', 'B-geo', 'B-gpe', 'I-per', 'O', 'B-tim', 'I-gpe', 'B-nat', 'B-eve', 'B-org', 'I-nat', 'B-per', 'I-org', 'I-tim', 'I-geo']\n", "Number of Labels: 17\n", "Longest sentence: 104 words\n" ] } ], "source": [ "DATAFILE = '/home/amaiya/data/groningen_meaning_bank/ner_dataset.csv'\n", "(trn, val, preproc) = text.entities_from_txt(DATAFILE,\n", " sentence_column='Sentence #',\n", " word_column='Word',\n", " tag_column='Tag', \n", " data_format='gmb',\n", " use_char=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When loading the dataset above, we specify `use_char=True` to instruct *ktrain* to extract the character vocabulary to be used in a character embedding layer of a model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Define a Model\n", "\n", "The `print_sequence_taggers` function shows that, as of this writing, *ktrain* supports both Bidirectional LSTM-CRF and Bidirectional LSTM as base models for sequence tagging. 
These base models can be used with different embedding schemes.\n", "\n", "For instance, the `bilstm-bert` model employs [BERT word embeddings](https://arxiv.org/abs/1810.04805) as features for a Bidirectional LSTM. See [this notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb) for an example of `bilstm-bert`. In this tutorial, we will use a Bidirectional LSTM model with a CRF layer. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bilstm: Bidirectional LSTM (https://arxiv.org/abs/1603.01360)\n", "bilstm-bert: Bidirectional LSTM w/ BERT embeddings\n", "bilstm-crf: Bidirectional LSTM-CRF (https://arxiv.org/abs/1603.01360)\n", "bilstm-elmo: Bidirectional LSTM w/ Elmo embeddings [English only]\n", "bilstm-crf-elmo: Bidirectional LSTM-CRF w/ Elmo embeddings [English only]\n" ] } ], "source": [ "text.print_sequence_taggers()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embedding schemes employed (combined with concatenation):\n", "\tword embeddings initialized with fasttext word vectors (cc.en.300.vec.gz)\n", "\tcharacter embeddings\n", "\n", "pretrained word embeddings will be loaded from:\n", "\thttps://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz\n", "loading pretrained word vectors...this may take a few moments...\n" ] }, { "data": { "text/html": [ "done." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "WV_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'\n", "model = text.sequence_tagger('bilstm-crf', preproc, wv_path_or_url=WV_URL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell above, notice that we supplied the `wv_path_or_url` argument. 
This directs *ktrain* to initialize word embeddings with one of the pretrained fasttext (word2vec) word vector sets from [Facebook's fasttext site](https://fasttext.cc/docs/en/crawl-vectors.html). When supplied with a valid URL to a `.vec.gz` file, the word vectors will be automatically downloaded, extracted, and loaded in STEP 2 (download location is `/ktrain_data`). To disable pretrained word embeddings, set `wv_path_or_url=None` and randomly initialized word embeddings will be employed. Use of pretrained embeddings will typically boost final accuracy. When used in combination with a model that uses an embedding scheme like BERT (e.g., `bilstm-bert`), the different word embeddings are stacked together using concatenation.\n", "\n", "Finally, we will wrap our selected model and datasets in a `Learner` object to facilitate training." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Train and Evaluate the Model\n", "\n", "Here, we will train for a single epoch using an initial learning rate of 0.01 with gradual decay using cosine annealing (via the `cycle_len=1` parameter) and see how well we do. The learning rate of `0.01` was chosen using the learning-rate finder (i.e., `lr_find`)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Train for 337 steps\n", "Epoch 1/1024\n", "337/337 [==============================] - 144s 426ms/step - loss: 1.2752\n", "Epoch 2/1024\n", "337/337 [==============================] - 138s 408ms/step - loss: 0.6956\n", "Epoch 3/1024\n", "337/337 [==============================] - 137s 407ms/step - loss: 0.2069\n", "Epoch 4/1024\n", "337/337 [==============================] - 136s 405ms/step - loss: 0.0684\n", "Epoch 5/1024\n", "160/337 [=============>................] - ETA: 1:12 - loss: 0.1804\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] } ], "source": [ "learner.lr_find()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_plot()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preparing train data ...done.\n", "preparing valid data ...done.\n", "338/338 [==============================] - 123s 365ms/step - loss: 4.6233 - val_loss: 4.5265\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit(1e-2, 1, cycle_len=1)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F1: 84.19\n", " precision recall f1-score support\n", "\n", " tim 0.90 0.86 0.88 2078\n", " geo 0.84 0.90 0.87 3728\n", " org 0.75 0.69 0.72 1981\n", " per 0.81 0.78 0.79 1717\n", " gpe 0.97 0.93 0.95 1540\n", " eve 0.60 0.21 0.31 29\n", " art 0.00 0.00 0.00 47\n", " nat 0.57 0.19 0.29 21\n", "\n", "micro avg 0.85 0.84 0.84 11141\n", "macro avg 0.84 0.84 0.84 11141\n", "\n" ] }, { "data": { "text/plain": [ "0.8418623591692684" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.validate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our F1-score is **84.19** after a single pass through the dataset. Not bad for a single epoch of training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's invoke `view_top_losses` to see the sentence we got the most wrong. This single sentence about James Brown contains 10 words that are misclassified. We can see here that our model has trouble with titles of songs. In addition, some of the ground truth labels for this example are sketchy and incomplete, which also makes things difficult." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total incorrect: 10\n", "Word True : (Pred)\n", "==============================\n", "Mr. :B-per (B-per)\n", "Brown :I-per (I-per)\n", "is :O (O)\n", "known :O (O)\n", "by :O (O)\n", "millions :O (O)\n", "of :O (O)\n", "fans :O (O)\n", "as :O (O)\n", "\" :O (O)\n", "The :O (O)\n", "Godfather :B-per (B-org)\n", "of :O (O)\n", "Soul :B-per (B-per)\n", "\" :O (O)\n", "thanks :O (O)\n", "to :O (O)\n", "such :O (O)\n", "classic :O (O)\n", "songs :O (O)\n", "as :O (O)\n", "\" :O (O)\n", "Please :B-art (O)\n", ", :O (O)\n", "Please :O (B-geo)\n", ", :O (O)\n", "Please :O (O)\n", ", :O (O)\n", "\" :O (O)\n", "\" :O (O)\n", "It :O (O)\n", "'s :O (O)\n", "a :O (O)\n", "Man :O (O)\n", "'s :O (O)\n", "World :O (O)\n", ", :O (O)\n", "\" :O (O)\n", "and :O (O)\n", "\" :O (O)\n", "Papa :B-art (B-org)\n", "'s :I-art (O)\n", "Got :I-art (O)\n", "a :I-art (O)\n", "Brand :I-art (B-org)\n", "New :I-art (I-org)\n", "Bag :I-art (I-org)\n", ". :O (O)\n", "\" :O (O)\n", "\n", "\n" ] } ], "source": [ "learner.view_top_losses(n=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Predictions on New Sentences\n", "\n", "Let's use our model to extract entities from new sentences. We begin by instantiating a `Predictor` object." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('As', 'O'),\n", " ('of', 'O'),\n", " ('2019', 'B-tim'),\n", " (',', 'O'),\n", " ('Donald', 'B-per'),\n", " ('Trump', 'I-per'),\n", " ('is', 'O'),\n", " ('still', 'O'),\n", " ('the', 'O'),\n", " ('President', 'B-per'),\n", " ('of', 'O'),\n", " ('the', 'O'),\n", " ('United', 'B-geo'),\n", " ('States', 'I-geo'),\n", " ('.', 'O')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict('As of 2019, Donald Trump is still the President of the United States.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the predictor for later deployment." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "predictor.save('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "reloaded_predictor = ktrain.load_predictor('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Paul', 'B-per'),\n", " ('Newman', 'I-per'),\n", " ('is', 'O'),\n", " ('my', 'O'),\n", " ('favorite', 'O'),\n", " ('American', 'B-gpe'),\n", " ('actor', 'O'),\n", " ('.', 'O')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reloaded_predictor.predict('Paul Newman is my favorite American actor.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Note on Sentence Tokenization\n", "\n", "The `predict` method typically operates on individual sentences rather than entire paragraphs or documents. After all, the model was trained on individual sentences. 
In production, you can use the `sent_tokenize` function to tokenize text into individual sentences.\n", "\n", "```python\n", "from ktrain import text\n", "text.textutils.sent_tokenize('This is the first sentence about Dr. Smith. This is the second sentence.')\n", "```\n", "\n", "The above will output:\n", "```\n", "['This is the first sentence about Dr . Smith .',\n", " 'This is the second sentence .']\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }