{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ktrain\n", "from ktrain import text as txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Load and Preprocess the Dataset\n", "\n", "A Dutch NER dataset can be downloaded from [here](https://www.clips.uantwerpen.be/conll2002/ner/).\n", "\n", "We use the `entities_from_conll2003` function to load and preprocess the data, as the dataset is in a standard **CoNLL** format. (Download the data from the link above to see what the format looks like.)\n", "\n", "See the *ktrain* [sequence-tagging tutorial](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-06-sequence-tagging.ipynb) for more information on how to load data in different ways." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: ISO-8859-1 (if wrong, set manually)\n", "Number of sentences: 15806\n", "Number of words in the dataset: 27803\n", "Tags: ['I-PER', 'B-MISC', 'O', 'B-PER', 'I-ORG', 'I-LOC', 'B-LOC', 'I-MISC', 'B-ORG']\n", "Number of Labels: 9\n", "Longest sentence: 859 words\n" ] } ], "source": [ "TDATA = 'data/dutch_ner/ned.train'\n", "VDATA = 'data/dutch_ner/ned.testb'\n", "(trn, val, preproc) = txt.entities_from_conll2003(TDATA, val_filepath=VDATA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Build the Model\n", "\n", "Next, we will build a Bidirectional LSTM model that uses transformer embeddings such as [BERT word embeddings](https://arxiv.org/abs/1810.04805). 
By default, the `bilstm-transformer` model uses a pretrained multilingual model (i.e., `bert-base-multilingual-cased`). However, since we are training a Dutch-language model, it is better to select a pretrained Dutch BERT model: `wietsedv/bert-base-dutch-cased`. A full list of available pretrained models is [here](https://huggingface.co/transformers/pretrained_models.html). One can also use [community-uploaded models](https://huggingface.co/models) that focus on specific domains, such as the biomedical or scientific domains (e.g., BioBERT, SciBERT). To use SciBERT, for example, set `transformer_model` to `allenai/scibert_scivocab_uncased`. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embedding schemes employed (combined with concatenation):\n", "\tword embeddings initialized with fasttext word vectors (cc.nl.300.vec.gz)\n", "\ttransformer embeddings with bert-base-dutch-cased\n", "\n", "pretrained word embeddings will be loaded from:\n", "\thttps://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.vec.gz\n", "loading pretrained word vectors...this may take a few moments...\n" ] }, { "data": { "text/html": [ "done." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "WV_URL='https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.vec.gz'\n", "model = txt.sequence_tagger('bilstm-transformer', preproc, \n", " transformer_model='wietsedv/bert-base-dutch-cased', wv_path_or_url=WV_URL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell above, notice that we supplied the `wv_path_or_url` argument. This directs *ktrain* to initialize word embeddings with one of the pretrained fasttext (word2vec) word vector sets from [Facebook's fasttext site](https://fasttext.cc/docs/en/crawl-vectors.html). 
When supplied with a valid URL to a `.vec.gz` file, the word vectors are automatically downloaded, extracted, and loaded in STEP 2 (download location is `/ktrain_data`). To disable pretrained word embeddings, set `wv_path_or_url=None`, and randomly initialized word embeddings will be used instead. Pretrained embeddings typically boost final accuracy. When used in combination with a model that uses an embedding scheme like BERT (e.g., `bilstm-transformer`), the different word embeddings are stacked together using concatenation.\n", "\n", "Finally, we will wrap our selected model and datasets in a `Learner` object to facilitate training." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Train the Model\n", "\n", "We will train for 5 epochs and decay the learning rate using cosine annealing. This is equivalent to one cycle with a length of 5 epochs. The weights for each epoch are saved in a checkpoint folder. We train with a learning rate of `0.01`, previously identified using our [learning-rate finder](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-02-tuning-learning-rates.ipynb)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preparing train data ...done.\n", "preparing valid data ...done.\n", "Train for 124 steps, validate for 41 steps\n", "Epoch 1/5\n", "124/124 [==============================] - 83s 666ms/step - loss: 0.0699 - val_loss: 0.0212\n", "Epoch 2/5\n", "124/124 [==============================] - 73s 587ms/step - loss: 0.0167 - val_loss: 0.0135\n", "Epoch 3/5\n", "124/124 [==============================] - 73s 585ms/step - loss: 0.0083 - val_loss: 0.0131\n", "Epoch 4/5\n", "124/124 [==============================] - 73s 592ms/step - loss: 0.0053 - val_loss: 0.0123\n", "Epoch 5/5\n", "124/124 [==============================] - 73s 591ms/step - loss: 0.0040 - val_loss: 0.0123\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit(0.01, 1, cycle_len=5, checkpoint_folder='/tmp/saved_weights')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.plot('lr')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown below, our model achieves an F1-Sccore of 83.04 with only a few minutes of training." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F1: 83.04\n", " precision recall f1-score support\n", "\n", " PER 0.92 0.94 0.93 1097\n", " LOC 0.88 0.90 0.89 772\n", " MISC 0.74 0.76 0.75 1187\n", " ORG 0.72 0.84 0.77 882\n", "\n", "micro avg 0.81 0.85 0.83 3938\n", "macro avg 0.81 0.85 0.83 3938\n", "\n" ] }, { "data": { "text/plain": [ "0.8303891290920322" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.validate(class_names=preproc.get_classes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Make Predictions" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Marke', 'B-PER'),\n", " ('Rutte', 'I-PER'),\n", " ('is', 'O'),\n", " ('een', 'O'),\n", " ('Nederlandse', 'B-MISC'),\n", " ('politicus', 'O'),\n", " ('die', 'O'),\n", " ('momenteel', 'O'),\n", " ('premier', 'O'),\n", " ('van', 'O'),\n", " ('Nederland', 'B-LOC'),\n", " ('is', 'O'),\n", " ('.', 'O')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dutch_text = \"\"\"Marke Rutte is een Nederlandse politicus die momenteel premier van Nederland is.\"\"\"\n", "predictor.predict(dutch_text)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "predictor.save('/tmp/my_dutch_nermodel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `predictor` can be re-loaded from 
disk with `load_predictor`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.load_predictor('/tmp/my_dutch_nermodel')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Marke', 'B-PER'),\n", " ('Rutte', 'I-PER'),\n", " ('is', 'O'),\n", " ('een', 'O'),\n", " ('Nederlandse', 'B-MISC'),\n", " ('politicus', 'O'),\n", " ('die', 'O'),\n", " ('momenteel', 'O'),\n", " ('premier', 'O'),\n", " ('van', 'O'),\n", " ('Nederland', 'B-LOC'),\n", " ('is', 'O'),\n", " ('.', 'O')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict(dutch_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }