{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-Label Text Classification: Identifying Toxic Online Comments\n", "\n", "Here, we will classify Wikipedia comments into one or more categories of so-called *toxic comments*. Categories of toxic online behavior include toxic, severe_toxic, obscene, threat, insult, and identity_hate. The dataset can be downloaded from the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) as a CSV file (i.e., download the file ```train.csv```). We will load the data using the ```texts_from_csv``` method, which assumes the label_columns are already one-hot-encoded in the spreadsheet. 
Since *val_filepath* is None, 10% of the data will automatically be used as a validation set.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word Counts: 197516\n", "Nrows: 143613\n", "143613 train sequences\n", "Average train sequence length: 66\n", "x_train shape: (143613,150)\n", "y_train shape: (143613,6)\n", "15958 test sequences\n", "Average test sequence length: 66\n", "x_test shape: (15958,150)\n", "y_test shape: (15958,6)\n" ] } ], "source": [ "DATA_PATH = 'data/toxic-comments/train.csv'\n", "NUM_WORDS = 50000\n", "MAXLEN = 150\n", "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,\n", " 'comment_text',\n", " label_columns = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"],\n", " val_filepath=None, # if None, 10% of data will be used for validation\n", " max_features=NUM_WORDS, maxlen=MAXLEN,\n", " ngram_range=1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fasttext: a fastText-like model (http://arxiv.org/pdf/1607.01759.pdf)\n", "logreg: logistic regression using a trainable Embedding layer\n", "nbsvm: NBSVM model (http://www.aclweb.org/anthology/P12-2018)\n", "bigru: Bidirectional GRU with pretrained word vectors\n", "bert: Bidirectional Encoder Representations from Transformers (https://arxiv.org/abs/1810.04805)\n" ] } ], "source": [ "text.print_text_classifiers()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will employ a Bidirectional GRU with pretrained word vectors. The following code cell loads a BiGRU model and defines a ```Learner``` object based on that model. The file ```crawl-300d-2M.vec``` contains 2 million word vectors trained by Facebook and will be automatically downloaded for use with this model."
 ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? True\n", "compiling word ID features...\n", "max_features is 49325\n", "processing pretrained word vectors...\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('bigru', (x_train, y_train), preproc=preproc)\n", "\n", "learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we use our learning rate finder to find a good learning rate. In this case, a learning rate of 0.0007 appears to be good." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... this may take a few moments...\n", "Epoch 1/1\n", "100352/143613 [===================>..........] - ETA: 10:26 - loss: 0.3649 - acc: 0.7886\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find()\n", "learner.lr_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we will train our model for two epochs using ```autofit``` with a learning rate of 0.001." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# define a custom callback for ROC-AUC\n", "from tensorflow.keras.callbacks import Callback\n", "from sklearn.metrics import roc_auc_score\n", "class RocAucEvaluation(Callback):\n", "    def __init__(self, validation_data=(), interval=1):\n", "        super().__init__()\n", "\n", "        self.interval = interval\n", "        self.X_val, self.y_val = validation_data\n", "\n", "    def on_epoch_end(self, epoch, logs=None):\n", "        if epoch % self.interval == 0:\n", "            y_pred = self.model.predict(self.X_val, verbose=0)\n", "            score = roc_auc_score(self.y_val, y_pred)\n", "            print(\"\\n ROC-AUC - epoch: %d - score: %.6f \\n\" % (epoch+1, score))\n", "RocAuc = RocAucEvaluation(validation_data=(x_test, y_test), interval=1)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using triangular learning rate policy with max lr of 0.001...\n", "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/2\n", "143613/143613 [==============================] - 2157s 15ms/step - loss: 0.0668 - acc: 0.9787 - val_loss: 0.0410 - val_acc: 0.9843\n", "\n", " ROC-AUC - epoch: 1 - score: 0.986431 \n", "\n", "Epoch 2/2\n", "143613/143613 [==============================] - 2142s 15ms/step - loss: 0.0398 - acc: 0.9846 - val_loss: 0.0392 - val_acc: 0.9849\n", "\n", " ROC-AUC - epoch: 2 - score: 0.989871 \n", "\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# train\n", "learner.autofit(0.001, 2, 
callbacks=[RocAuc])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our final ROC-AUC score is **0.9899**." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }