{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "# Enumerate GPUs in PCI bus order and expose only device 0 to this process\n", "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we classify Wikipedia comments into one or more categories of *toxic comments*. The six categories are toxic, severe_toxic, obscene, threat, insult, and identity_hate. Because a single comment can belong to several categories at once, this is a multi-label classification problem. The dataset can be downloaded from the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) as a CSV file (download the file ```train.csv```). We load the data using the ```texts_from_csv``` method, which assumes the label columns are already encoded as 0/1 indicators in the spreadsheet, with one column per category. 
Since *val_filepath* is None, 10% of the data will automatically be used as a validation set.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word Counts: 196995\n", "Nrows: 143613\n", "143613 train sequences\n", "Average train sequence length: 66\n", "15958 test sequences\n", "Average test sequence length: 66\n", "Pad sequences (samples x time)\n", "x_train shape: (143613,150)\n", "x_test shape: (15958,150)\n", "y_train shape: (143613,6)\n", "y_test shape: (15958,6)\n" ] } ], "source": [ "DATA_PATH = 'data/toxic-comments/train.csv'\n", "NUM_WORDS = 50000\n", "MAXLEN = 150\n", "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,\n", " 'comment_text',\n", " label_columns = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"],\n", " val_filepath=None, # if None, 10% of data will be used for validation\n", " max_features=NUM_WORDS, maxlen=MAXLEN,\n", " ngram_range=1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? True\n", "compiling word ID features...\n", "max_features is 49350\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('fasttext', (x_train, y_train), \n", " preproc=preproc)\n", "learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... this may take a few moments...\n", "Epoch 1/5\n", " 47840/143613 [========>.....................] 
- ETA: 31s - loss: 0.4965 - acc: 0.7510\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find()\n", "learner.lr_plot()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "early_stopping automatically enabled at patience=5\n", "reduce_on_plateau automatically enabled at patience=2\n", "\n", "\n", "begin training using triangular learning rate policy with max lr of 0.001...\n", "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/1024\n", "143613/143613 [==============================] - 51s 356us/step - loss: 0.1358 - acc: 0.9530 - val_loss: 0.0536 - val_acc: 0.9817\n", "Epoch 2/1024\n", "143613/143613 [==============================] - 51s 355us/step - loss: 0.0643 - acc: 0.9784 - val_loss: 0.0504 - val_acc: 0.9826\n", "Epoch 3/1024\n", "143613/143613 [==============================] - 51s 356us/step - loss: 0.0577 - acc: 0.9797 - val_loss: 0.0483 - val_acc: 0.9831\n", "Epoch 4/1024\n", "143613/143613 [==============================] - 51s 352us/step - loss: 0.0540 - acc: 0.9806 - val_loss: 0.0475 - val_acc: 0.9830\n", "Epoch 5/1024\n", "143613/143613 [==============================] - 51s 355us/step - loss: 0.0520 - acc: 0.9811 - val_loss: 0.0471 - val_acc: 0.9832\n", "Epoch 6/1024\n", "143613/143613 [==============================] - 51s 355us/step - loss: 0.0500 - acc: 0.9818 - val_loss: 0.0469 - val_acc: 0.9833\n", "Epoch 7/1024\n", "143613/143613 [==============================] - 51s 353us/step - loss: 0.0484 - acc: 0.9820 - val_loss: 0.0466 - val_acc: 0.9832\n", "Epoch 8/1024\n", "143613/143613 [==============================] - 51s 358us/step - loss: 0.0475 - acc: 0.9823 - val_loss: 0.0470 - val_acc: 0.9830\n", "Epoch 9/1024\n", "143613/143613 [==============================] - 52s 360us/step - loss: 0.0465 - acc: 0.9826 - val_loss: 0.0470 - val_acc: 0.9831\n", "\n", "Epoch 00009: Reducing Max LR on Plateau: new max lr will be 0.0005 (if not 
early_stopping).\n", "Epoch 10/1024\n", "143613/143613 [==============================] - 52s 359us/step - loss: 0.0441 - acc: 0.9832 - val_loss: 0.0473 - val_acc: 0.9830\n", "Epoch 11/1024\n", "143613/143613 [==============================] - 52s 359us/step - loss: 0.0432 - acc: 0.9835 - val_loss: 0.0474 - val_acc: 0.9831\n", "\n", "Epoch 00011: Reducing Max LR on Plateau: new max lr will be 0.00025 (if not early_stopping).\n", "Epoch 12/1024\n", "143613/143613 [==============================] - 51s 357us/step - loss: 0.0420 - acc: 0.9838 - val_loss: 0.0477 - val_acc: 0.9830\n", "Restoring model weights from the end of the best epoch\n", "Epoch 00012: early stopping\n", "Weights from best epoch have been loaded into model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.autofit(0.001)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }