{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Classification Example: Sentiment Analysis with IMDb Movie Reviews\n", "\n", "We will begin by importing some required modules for performing text classification in *ktrain*." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will load and preprocess the text data for training and validation. *ktrain* can load texts and associated labels from a variety of sources:\n", "\n", "- `texts_from_folder`: labels are represented as subfolders containing text files [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb) ]\n", "- `texts_from_csv`: texts and associated labels are stored in columns in a CSV file [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/toxic_comments-fasttext.ipynb) ]\n", "- `texts_from_df`: texts and associated labels are stored in columns in a *pandas* DataFrame [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb) ]\n", "- `texts_from_array`: texts and labels are loaded and preprocessed from an array [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/20newsgroup-distilbert.ipynb) ]\n", "\n", "For `texts_from_csv` and `texts_from_df`, labels can either be multi-hot- or one-hot-encoded with one column per class, or can be a single column storing integers or strings, like this:\n", "```python\n", "# my_training_data.csv\n", "TEXT,LABEL\n", "I like this movie,positive\n", "I hate this 
movie,negative\n", "```\n", "\n", "For `texts_from_array`, the labels are arrays in one of the following forms:\n", "```python\n", "# string labels\n", "y_train = ['negative', 'positive']\n", "# integer labels\n", "y_train = [0, 1] # indices must start from 0\n", "# multi or one-hot encoded labels (used for multi-label problems)\n", "y_train = [[1,0], [0,1]]\n", "```\n", "\n", "In the latter two cases, you must supply a `class_names` argument to `texts_from_array`, which tells *ktrain* how indices map to class names. In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n", "\n", "Sample arrays for `texts_from_array` might look like this:\n", "```python\n", "x_train = ['I hate this movie.', 'I like this movie.']\n", "y_train = ['negative', 'positive']\n", "x_test = ['I despise this movie.', 'I love this movie.']\n", "y_test = ['negative', 'positive']\n", "```\n", "\n", "All of the above methods transform the texts into a sequence of word IDs in one way or another, as expected by neural network models.\n", "\n", "\n", "In this first example problem, we use the ```texts_from_folder``` function to load documents as fixed-length sequences of word IDs from a folder of raw documents. 
This function assumes a directory structure like the following:\n", "\n", "```\n", " ├── datadir\n", " │ ├── train\n", " │ │ ├── class0 # folder containing documents of class 0\n", " │ │ ├── class1 # folder containing documents of class 1\n", " │ │ ├── class2 # folder containing documents of class 2\n", " │ │ └── classN # folder containing documents of class N\n", " │ └── test \n", " │ ├── class0 # folder containing documents of class 0\n", " │ ├── class1 # folder containing documents of class 1\n", " │ ├── class2 # folder containing documents of class 2\n", " │ └── classN # folder containing documents of class N\n", "```\n", "\n", "Each subfolder will contain documents in plain text format (e.g., `.txt` files) pertaining to the class represented by the subfolder.\n", "\n", "For our text classification example, we will again classify IMDb movie reviews as either positive or negative. However, instead of using the pre-processed version of the dataset pre-packaged with Keras, we will use the original (or raw) *aclImdb* dataset. The dataset can be downloaded from [here](http://ai.stanford.edu/~amaas/data/sentiment/). Set the ```DATADIR``` variable to the location of the extracted *aclImdb* folder.\n", "\n", "In the cell below, note that we supplied `preprocess_mode='standard'` to the data-loading function (which is the default). For pretrained models like BERT and DistilBERT, the dataset must be preprocessed in a specific way. If you are planning to use BERT for text classification, you should replace this argument with `preprocess_mode='bert'`. Since we will not be using BERT in this example, we leave it as `preprocess_mode='standard'`. See [this notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb) for an example of how to use BERT for text classification in *ktrain*. There is also a [DistilBERT example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/20newsgroup-distilbert.ipynb). 
\n", "**NOTE:** If using `preprocess_mode='bert'` or `preprocess_mode='distilbert'`, an English pretrained model is used for English, a Chinese pretrained model is used for Chinese, and a multilingual pretrained model is used for all other languages. For more flexibility in choosing the model used, you can use the alternative [Transformer API for text classification](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb) in *ktrain*. \n", "\n", "Please also note that, when specifying `preprocess_mode='distilbert'`, the first two return values are `TransformerDataset` objects, not Numpy arrays. So, it is best to always use `trn, val, preproc` on the left-hand side of the expression (instead of `(x_train, y_train), (x_test, y_test), preproc`) to avoid confusion, as shown below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: utf-8\n", "language: en\n", "Word Counts: 88582\n", "Nrows: 25000\n", "25000 train sequences\n", "train sequence lengths:\n", "\tmean : 237\n", "\t95percentile : 608\n", "\t99percentile : 923\n", "Adding 3-gram features\n", "max_features changed to 5151281 with addition of ngrams\n", "Average train sequence length with ngrams: 709\n", "train (w/ngrams) sequence lengths:\n", "\tmean : 709\n", "\t95percentile : 1821\n", "\t99percentile : 2766\n", "x_train shape: (25000,2000)\n", "y_train shape: (25000, 2)\n", "Is Multi-Label? 
False\n", "25000 test sequences\n", "test sequence lengths:\n", "\tmean : 230\n", "\t95percentile : 584\n", "\t99percentile : 900\n", "Average test sequence length with ngrams: 523\n", "test (w/ngrams) sequence lengths:\n", "\tmean : 524\n", "\t95percentile : 1295\n", "\t99percentile : 1971\n", "x_test shape: (25000,2000)\n", "y_test shape: (25000, 2)\n" ] } ], "source": [ "# load training and validation data from a folder\n", "DATADIR = 'data/aclImdb'\n", "trn, val, preproc = text.texts_from_folder(DATADIR, \n", " max_features=80000, maxlen=2000, \n", " ngram_range=3, \n", " preprocess_mode='standard',\n", " classes=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having loaded the data, we will now create a text classification model. The `print_text_classifiers` function prints some available models. The model selected should be consistent with the `preprocess_mode` selected above. \n", "\n", "(As mentioned above, one can also use the alternative `Transformer` API for text classification in *ktrain* to access an even larger library of Hugging Face Transformer models like RoBERTa and XLNet. See [this tutorial](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb) for more information on this.) \n", "\n", "In this example, the `text_classifier` function will return a [neural implementation of NBSVM](https://medium.com/@asmaiya/a-neural-implementation-of-nbsvm-in-keras-d4ef8c96cb7c), which is a strong baseline that can outperform more complex neural architectures. It may take a few moments to return as it builds a document-term matrix from the input data we provide it. The ```text_classifier``` function expects `trn` to be a preprocessed training set returned from the `texts_from*` function above. In this case where we have used `preprocess_mode='standard'`, `trn` is a numpy array with each document represented as a fixed-size sequence of word IDs."
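, "

The log-count ratios that NBSVM relies on can be sketched in plain Python. This is only a minimal illustration of the idea described in the NBSVM paper, not the code that *ktrain* runs: the toy documents, the smoothing constant `alpha`, and the helper name `log_count_ratios` are all assumptions made for this example.

```python
import math
from collections import Counter


def log_count_ratios(pos_docs, neg_docs, alpha=1.0):
    # Hypothetical helper: returns {word: log-count ratio} for two classes.
    # Each document is a list of tokens; alpha is an assumed smoothing constant.
    vocab = sorted({w for doc in pos_docs + neg_docs for w in doc})
    # Count each word at most once per document (binarized features).
    p = Counter(w for doc in pos_docs for w in set(doc))
    q = Counter(w for doc in neg_docs for w in set(doc))
    # Smoothed totals used to normalize the count vector of each class.
    p_total = sum(p[w] + alpha for w in vocab)
    q_total = sum(q[w] + alpha for w in vocab)
    return {
        w: math.log(((p[w] + alpha) / p_total) / ((q[w] + alpha) / q_total))
        for w in vocab
    }


pos = [['great', 'movie'], ['great', 'acting']]
neg = [['boring', 'movie'], ['boring', 'plot']]
ratios = log_count_ratios(pos, neg)
# Print the words ordered from most negative to most positive ratio.
print(sorted(ratios.items(), key=lambda kv: kv[1]))
```

Words that occur mostly in positive documents receive a positive ratio, words that occur mostly in negative documents receive a negative one, and words spread evenly across both classes end up near zero."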
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]\n", "logreg: logistic regression using a trainable Embedding layer\n", "nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]\n", "bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]\n", "standard_gru: simple 2-layer GRU with randomly initialized embeddings\n", "bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]\n", "distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]\n" ] } ], "source": [ "text.print_text_classifiers()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? False\n", "compiling word ID features...\n", "building document-term matrix... this may take a few moments...\n", "rows: 1-10000\n", "rows: 10001-20000\n", "rows: 20001-25000\n", "computing log-count ratios...\n", "done.\n" ] } ], "source": [ "# load an NBSVM model\n", "model = text.text_classifier('nbsvm', trn, preproc=preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we instantiate a Learner object and call the ```lr_find``` and ```lr_plot``` methods to help identify a good learning rate." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "learner = ktrain.get_learner(model, train_data=trn, val_data=val)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Epoch 1/5\n", "25000/25000 [==============================] - 6s 226us/step - loss: 0.6906 - acc: 0.5797\n", "Epoch 2/5\n", "25000/25000 [==============================] - 5s 206us/step - loss: 0.6071 - acc: 0.9114\n", "Epoch 3/5\n", "25000/25000 [==============================] - 5s 205us/step - loss: 0.2151 - acc: 0.9711\n", "Epoch 4/5\n", "16032/25000 [==================>...........] - ETA: 1s - loss: 0.0252 - acc: 0.9943\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] } ], "source": [ "learner.lr_find()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we will fit our model using an [SGDR learning rate schedule](https://github.com/amaiya/ktrain/blob/master/example-02-tuning-learning-rates.ipynb) by invoking the ```fit``` method with the *cycle_len* parameter (along with the *cycle_mult* parameter)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 25000 samples, validate on 25000 samples\n", "Epoch 1/7\n", "25000/25000 [==============================] - 7s 263us/step - loss: 0.2105 - acc: 0.9461 - val_loss: 0.2481 - val_acc: 0.9187\n", "Epoch 2/7\n", "25000/25000 [==============================] - 7s 261us/step - loss: 0.0458 - acc: 0.9936 - val_loss: 0.2266 - val_acc: 0.9218\n", "Epoch 3/7\n", "25000/25000 [==============================] - 6s 257us/step - loss: 0.0082 - acc: 0.9999 - val_loss: 0.2236 - val_acc: 0.9228\n", "Epoch 4/7\n", "25000/25000 [==============================] - 6s 256us/step - loss: 0.0069 - acc: 0.9999 - val_loss: 0.2169 - val_acc: 0.9227\n", "Epoch 5/7\n", "25000/25000 [==============================] - 6s 259us/step - loss: 0.0029 - acc: 1.0000 - val_loss: 0.2148 - val_acc: 0.9227\n", "Epoch 6/7\n", "25000/25000 [==============================] - 7s 261us/step - loss: 0.0020 - acc: 1.0000 - val_loss: 0.2142 - val_acc: 0.9228\n", "Epoch 7/7\n", "25000/25000 [==============================] - 6s 255us/step - loss: 0.0017 - acc: 1.0000 - val_loss: 0.2141 - val_acc: 0.9227\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit(0.001, 3, cycle_len=1, cycle_mult=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As can be seen, our final model yields a validation accuracy of 92.27%.\n", "\n", "### Making 
Predictions\n", "\n", "Let's predict the sentiment of new movie reviews (or comments in this case) using our trained model.\n", "\n", "The ```preproc``` object (returned by ```texts_from_folder```) is important here, as it is used to preprocess data in a way our model expects." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "data = [ 'This movie was horrible! The plot was boring. Acting was okay, though.',\n", " 'The film really sucked. I want my money back.',\n", " 'What a beautiful romantic comedy. 10/10 would see again!']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['neg', 'neg', 'pos']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, our model returns predictions that appear to be correct. The predictor instance can also be used to return \"probabilities\" of our predictions with respect to each class. Let us first print the classes and their order. The class *pos* stands for positive sentiment and *neg* stands for negative sentiment. Then, we will re-run ```predictor.predict``` with *return_proba=True* to see the probabilities." 
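, "

To make the link between class order and probabilities concrete, here is a small sketch in plain Python. It assumes the class order `['neg', 'pos']` and uses rounded, illustrative probability rows; the helper name `to_labels` is hypothetical and is not part of *ktrain*.

```python
def to_labels(proba_rows, classes):
    # Pick, for each row, the class whose probability is largest.
    labels = []
    for row in proba_rows:
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best])
    return labels


classes = ['neg', 'pos']          # assumed class order
proba = [[0.81, 0.19],            # illustrative values, one row per text
         [0.75, 0.25],
         [0.27, 0.73]]
print(to_labels(proba, classes))
```

Each column of a probability row lines up with one entry of the class list, which is why the class order has to be known before the numbers can be interpreted."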
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['neg', 'pos']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.get_classes()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.81179327, 0.18820675],\n", " [0.7463994 , 0.25360066],\n", " [0.26558533, 0.7344147 ]], dtype=float32)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict(data, return_proba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For text classifiers, there is also `predictor.predict_proba`, which simply calls `predict` with `return_proba=True`.\n", "\n", "Our movie review sentiment predictor can be saved to disk and reloaded/re-used later as part of an application. This is illustrated below:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "predictor.save('/tmp/my_moviereview_predictor')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.load_predictor('/tmp/my_moviereview_predictor')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pos']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict(['Groundhog Day is my favorite movie of all time!'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument that is set to 32 by default. The `batch_size` can also be set manually on the `Predictor` instance. 
That is, the `batch_size` used for inference and predictions can be increased with either of the following:\n", "```python\n", "# you can set the batch_size as an argument to load_predictor (or get_predictor)\n", "predictor = ktrain.load_predictor('/tmp/my_moviereview_predictor', batch_size=128)\n", "\n", "# you can also set the batch_size used for predictions this way\n", "predictor.batch_size = 128\n", "```\n", "Larger batch sizes can potentially speed predictions when `predictor.predict` is supplied with a list of examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-Label Text Classification: Identifying Toxic Online Comments\n", "\n", "In the previous example, the classes (or categories) were mutually exclusive. By contrast, in multi-label text classification, a document or text snippet can belong to multiple classes. Here, we will classify Wikipedia comments into one or more categories of so-called *toxic comments*. Categories of toxic online behavior include toxic, severe_toxic, obscene, threat, insult, and identity_hate. The dataset can be downloaded from the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) as a CSV file (i.e., download the file ```train.csv```). We will load the data using the ```texts_from_csv``` function. This function expects one column to contain the texts of documents and one or more other columns to store the labels. Labels can be in any of the following formats:\n", "\n", "```\n", "1. one-hot-encoded or arrays representing classes will have a single one in each row:\n", " Binary Classification (two classes):\n", " text|positive|negative\n", " I like this movie.|1|0\n", " I hated this movie.|0|1\n", " Multiclass Classification (more than two classes): \n", " text|negative|neutral|positive\n", " I hated this movie.|1|0|0 # negative\n", " I loved this movie.|0|0|1 # positive\n", " I saw the movie.|0|1|0 # neutral\n", "2. 
multi-hot-encoded arrays representing classes:\n", " Multi-label classification will have one or more ones in each row:\n", " text|politics|television|sports\n", " I will vote in 2020.|1|0|0 # politics\n", " I watched the debate on CNN.|1|1|0 # politics and television\n", " Did you watch the game on ESPN?|0|1|1 # sports and television\n", " I play basketball.|0|0|1 # sports \n", "3. labels are in a single column of string or integer values representing class labels\n", " Example with label_columns=['label'] and text_column='text':\n", " text|label\n", " I like this movie.|positive\n", " I hated this movie.|negative\n", "```\n", "\n", "Since the Toxic Comment Classification Challenge is a multi-label problem, we must use the second format, where labels are already multi-hot-encoded. Luckily, the `train.csv` file for this problem is already multi-hot-encoded, so no extra processing is required. \n", "\n", "Since `val_filepath is None`, 10% of the data will automatically be used as a validation set.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word Counts: 197340\n", "Nrows: 143613\n", "143613 train sequences\n", "Average train sequence length: 66\n", "15958 test sequences\n", "Average test sequence length: 66\n", "Pad sequences (samples x time)\n", "x_train shape: (143613,150)\n", "x_test shape: (15958,150)\n", "y_train shape: (143613,6)\n", "y_test shape: (15958,6)\n" ] } ], "source": [ "DATA_PATH = 'data/toxic-comments/train.csv'\n", "NUM_WORDS = 50000\n", "MAXLEN = 150\n", "trn, val, preproc = text.texts_from_csv(DATA_PATH,\n", " 'comment_text',\n", " label_columns = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"],\n", " val_filepath=None, # if None, 10% of data will be used for validation\n", " max_features=NUM_WORDS, maxlen=MAXLEN,\n", " ngram_range=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, as before, 
we load a text classification model and wrap the model and data in a Learner object. Instead of using the NBSVM model, we will explicitly request a different model called fasttext using the ```name``` parameter of ```text_classifier```. The fastText architecture was created by [Facebook](https://arxiv.org/abs/1607.01759) in 2016. (You can call ```print_text_classifiers``` to show the available text classification models.) " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nbsvm: NBSVM model (http://www.aclweb.org/anthology/P12-2018)\n", "fasttext: a fastText-like model (http://arxiv.org/pdf/1607.01759.pdf)\n", "logreg: logistic regression\n" ] } ], "source": [ "text.print_text_classifiers()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? True\n", "compiling word ID features...\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('fasttext', trn, preproc=preproc)\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we use our learning rate finder to find a good learning rate. In this case, a learning rate of 0.0007 appears to be good." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Epoch 1/5\n", "143613/143613 [==============================] - 47s 325us/step - loss: 0.7361 - acc: 0.5322\n", "Epoch 2/5\n", "143613/143613 [==============================] - 46s 323us/step - loss: 0.4683 - acc: 0.7714\n", "Epoch 3/5\n", "143613/143613 [==============================] - 46s 323us/step - loss: 0.0879 - acc: 0.9729\n", "Epoch 4/5\n", "143613/143613 [==============================] - 46s 323us/step - loss: 0.1106 - acc: 0.9686\n", "Epoch 5/5\n", "143613/143613 [==============================] - 46s 323us/step - loss: 0.1636 - acc: 0.9629\n", "\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] } ], "source": [ "learner.lr_find()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we will train our model for 8 epochs using ```autofit``` with a learning rate of 0.0007. Having explicitly specified the number of epochs, ```autofit``` will automatically employ a triangular learning rate policy. Our final ROC-AUC score is **0.98**.\n", "\n", "As shown in [this example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/toxic_comments-bigru.ipynb) on our GitHub project, even better results can be obtained using a Bidirectional GRU with pretrained word vectors (called `bigru` in *ktrain*)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using triangular learning rate policy with max lr of 0.0007...\n", "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/8\n", "143613/143613 [==============================] - 48s 333us/step - loss: 0.1140 - acc: 0.9630 - val_loss: 0.0530 - val_acc: 0.9812\n", "Epoch 2/8\n", "143613/143613 [==============================] - 47s 330us/step - loss: 0.0625 - acc: 0.9790 - val_loss: 0.0501 - val_acc: 0.9819\n", "Epoch 3/8\n", "143613/143613 [==============================] - 48s 331us/step - loss: 0.0572 - acc: 0.9801 - val_loss: 0.0491 - val_acc: 0.9821\n", "Epoch 4/8\n", "143613/143613 [==============================] - 47s 331us/step - loss: 0.0538 - acc: 0.9806 - val_loss: 0.0481 - val_acc: 0.9823\n", "Epoch 5/8\n", "143613/143613 [==============================] - 47s 329us/step - loss: 0.0517 - acc: 0.9813 - val_loss: 0.0476 - val_acc: 0.9823\n", "Epoch 6/8\n", "143613/143613 [==============================] - 47s 329us/step - loss: 0.0501 - acc: 0.9815 - val_loss: 0.0470 - val_acc: 0.9825\n", "Epoch 7/8\n", "143613/143613 [==============================] - 47s 331us/step - loss: 0.0486 - 
acc: 0.9820 - val_loss: 0.0468 - val_acc: 0.9824\n", "Epoch 8/8\n", "143613/143613 [==============================] - 47s 330us/step - loss: 0.0471 - acc: 0.9824 - val_loss: 0.0470 - val_acc: 0.9826\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.autofit(0.0007, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's compute the ROC-AUC of our final model for identifying toxic online behavior:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " ROC-AUC score: 0.980092 \n", "\n" ] } ], "source": [ "from sklearn.metrics import roc_auc_score\n", "x_test, y_test = val  # unpack the validation split returned by texts_from_csv\n", "y_pred = learner.model.predict(x_test, verbose=0)\n", "score = roc_auc_score(y_test, y_pred)\n", "print(\"\\n ROC-AUC score: %.6f \\n\" % (score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions\n", "\n", "As before, let's make some predictions about toxic comments using our model by wrapping it in a Predictor instance."
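, "

For a multi-label model, each prediction is a list of `(label, score)` pairs rather than a single class name, so turning scores into labels is left to the caller. The following sketch shows one common approach, a fixed threshold. The 0.5 cutoff, the helper name `labels_above`, and the sample scores are assumptions for illustration only.

```python
def labels_above(scored_labels, threshold=0.5):
    # Keep every label whose score reaches the (assumed) threshold.
    return [label for label, score in scored_labels if score >= threshold]


# Illustrative scores in the same shape as one multi-label prediction.
prediction = [('toxic', 0.87), ('severe_toxic', 0.01), ('obscene', 0.27),
              ('threat', 0.01), ('insult', 0.40), ('identity_hate', 0.02)]
print(labels_above(prediction))        # only labels scoring at least 0.5
print(labels_above(prediction, 0.25))  # a lower cutoff keeps more labels
```

A suitable threshold depends on how costly false positives are relative to missed toxic comments, so it is usually tuned on validation data rather than fixed in advance."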
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[('toxic', 0.5491581),\n", " ('severe_toxic', 0.02454061),\n", " ('obscene', 0.084347874),\n", " ('threat', 0.4110818),\n", " ('insult', 0.17229997),\n", " ('identity_hate', 0.08519211)]]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# correctly predict a toxic comment that includes a threat\n", "predictor.predict([\"If you don't stop immediately, I will kill you.\"])" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[('toxic', 0.021799222),\n", " ('severe_toxic', 7.991817e-07),\n", " ('obscene', 0.000504758),\n", " ('threat', 5.477591e-05),\n", " ('insult', 0.001496369),\n", " ('identity_hate', 9.472556e-05)]]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# non-toxic comment\n", "predictor.predict([\"Okay - I'm calling it a night. 
See you tomorrow.\"])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "predictor.save('/tmp/toxic_detector')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.load_predictor('/tmp/toxic_detector')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[('toxic', 0.86799675),\n", " ('severe_toxic', 0.008107864),\n", " ('obscene', 0.26740596),\n", " ('threat', 0.006626291),\n", " ('insult', 0.39607796),\n", " ('identity_hate', 0.023489485)]]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# model works correctly and as expected after reloading from disk\n", "predictor.predict([\"You have a really ugly face.\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `Transformer` API in *ktrain*\n", "\n", "If using transformer models like BERT, DistilBERT, or RoBERTa, *ktrain* includes an alternative API for text classification, which allows the use of **any** Hugging Face `transformers` model. 
This API can be used as follows:\n", "\n", "```python\n", "import ktrain\n", "from ktrain import text\n", "MODEL_NAME = 'bert-base-uncased'\n", "t = text.Transformer(MODEL_NAME, maxlen=500, \n", " classes=label_list)\n", "trn = t.preprocess_train(x_train, y_train)\n", "val = t.preprocess_test(x_test, y_test)\n", "model = t.get_classifier()\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)\n", "learner.fit_onecycle(3e-5, 1)\n", "```\n", "\n", "Note that `x_train` and `x_test` are the raw texts here:\n", "```python\n", "x_train = ['I hate this movie.', 'I like this movie.']\n", "```\n", "Similar to `texts_from_array`, the labels are arrays in one of the following forms:\n", "```python\n", "# string labels\n", "y_train = ['negative', 'positive']\n", "# integer labels\n", "y_train = [0, 1]\n", "# multi or one-hot encoded labels\n", "y_train = [[1,0], [0,1]]\n", "```\n", "In the latter two cases, you must supply a `class_names` argument to the `Transformer` constructor, which tells *ktrain* how indices map to class names. 
In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n", "\n", "For an example, see [this notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-AraBERT.ipynb), which builds an Arabic sentiment analysis model using [AraBERT](https://huggingface.co/aubmindlab/bert-base-arabert).\n", "\n", "\n", "For more information, see our tutorial on [text classification with Hugging Face Transformers](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb).\n", "\n", "You may also be interested in some of our blog posts on text classification:\n", "- [Text Classification With Hugging Face Transformers in TensorFlow 2 (Without Tears)](https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed)\n", "- [BERT Text Classification in 3 Lines of Code](https://towardsdatascience.com/bert-text-classification-in-3-lines-of-code-using-keras-264db7e7a358)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }