{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Chinese-Language Sentiment Analyzer\n", "\n", "In this notebook, we will build a Chinese-language text classification model in 4 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Chengwei Zhang's GitHub repository [here](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000).\n", "\n", "(**Disclaimer:** I don't speak Chinese. Please forgive mistakes.) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Load and Preprocess the Data\n", "\n", "First, we use the `texts_from_folder` function to load and preprocess the data. We assume that the data is in the following form:\n", "```\n", " ├── datadir\n", " │ ├── train\n", " │ │ ├── class0 # folder containing documents of class 0\n", " │ │ ├── class1 # folder containing documents of class 1\n", " │ │ ├── class2 # folder containing documents of class 2\n", " │ │ └── classN # folder containing documents of class N\n", "```\n", "We set `val_pct` to 0.1, which will automatically sample 10% of the data for validation. We specify `preprocess_mode='standard'` to employ normal text preprocessing. 
If you are using the BERT model (i.e., 'bert'), you should use `preprocess_mode='bert'`.\n", "\n", "**Notice that there is nothing special or extra we need to do here for non-English text.** *ktrain* automatically detects the language and character encoding and prepares the data and configures the model appropriately.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: GB18030\n", "Decoding with GB18030 failed 1st attempt - using GB18030 with skips\n", "skipped 107 lines (0.3%) due to character decoding errors\n", "skipped 11 lines (0.3%) due to character decoding errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "WARNING: Logging before flag parsing goes to stderr.\n", "I1001 17:33:09.975814 140013155014464 __init__.py:111] Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "I1001 17:33:09.978070 140013155014464 __init__.py:131] Loading model from cache /tmp/jieba.cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "language: zh-cn\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading model cost 0.652 seconds.\n", "I1001 17:33:10.629599 140013155014464 __init__.py:163] Loading model cost 0.652 seconds.\n", "Prefix dict has been built succesfully.\n", "I1001 17:33:10.631566 140013155014464 __init__.py:164] Prefix dict has been built succesfully.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word Counts: 22388\n", "Nrows: 5324\n", "5324 train sequences\n", "Average train sequence length: 82\n", "Adding 3-gram features\n", "max_features changed to 457800 with addition of ngrams\n", "Average train sequence length with ngrams: 245\n", "x_train shape: (5324,100)\n", "y_train shape: (5324,2)\n", "592 test sequences\n", "Average test sequence length: 75\n", "Average test sequence 
length with ngrams: 183\n", "x_test shape: (592,100)\n", "y_test shape: (592,2)\n" ] } ], "source": [ "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('data/ChnSentiCorp_htl_ba_6000', \n", " maxlen=100, \n", " max_features=30000,\n", " preprocess_mode='standard',\n", " train_test_names=['train'],\n", " val_pct=0.1,\n", " ngram_range=3,\n", " classes=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Create a Model and Wrap in Learner Object" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? False\n", "compiling word ID features...\n", "maxlen is 100\n", "building document-term matrix... this may take a few moments...\n", "rows: 1-5324\n", "computing log-count ratios...\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('nbsvm', (x_train, y_train) , preproc=preproc)\n", "learner = ktrain.get_learner(model, \n", " train_data=(x_train, y_train), \n", " val_data=(x_test, y_test), \n", " batch_size=32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Estimate the LR\n", "We'll use the *ktrain* learning rate finder to find a good learning rate to use with *nbsvm*. We then select the highest learning rate at which the loss is still falling.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Epoch 1/1024\n", "5324/5324 [==============================] - 1s 245us/step - loss: 0.6923 - acc: 0.5255\n", "Epoch 2/1024\n", "5324/5324 [==============================] - 1s 169us/step - loss: 0.6915 - acc: 0.5385\n", "Epoch 3/1024\n", "5324/5324 [==============================] - 1s 161us/step - loss: 0.6872 - acc: 0.6056\n", "Epoch 4/1024\n", "5324/5324 [==============================] - 1s 167us/step - loss: 0.6653 - acc: 0.7979\n", "Epoch 5/1024\n", "5324/5324 [==============================] - 1s 174us/step - loss: 0.5744 - acc: 0.9303\n", "Epoch 6/1024\n", "5324/5324 [==============================] - 1s 173us/step - loss: 0.3491 - acc: 0.9699\n", "Epoch 7/1024\n", "5324/5324 [==============================] - 1s 170us/step - loss: 0.1122 - acc: 0.9900\n", "Epoch 8/1024\n", "5324/5324 [==============================] - 1s 172us/step - loss: 0.0244 - acc: 0.9962\n", "Epoch 9/1024\n", "5324/5324 [==============================] - 1s 171us/step - loss: 0.0106 - acc: 0.9968\n", "Epoch 10/1024\n", "5324/5324 [==============================] - 1s 169us/step - loss: 0.0072 - acc: 0.9970\n", "Epoch 11/1024\n", " 352/5324 [>.............................] - ETA: 0s - loss: 0.0056 - acc: 0.9972 \n", "\n", "done.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Train the Model\n", "\n", "We will use the `autofit` method that employs a triangular learning rate policy with EarlyStopping and ReduceLROnPlateau automatically enabled, since the epochs argument is omitted. We monitor `val_acc`, so weights from the epoch with the highest validation accuracy will be automatically loaded into our model when training completes.\n", "\n", "As shown in the cell below, our final validation accuracy is **92%** with only 7 seconds of training!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "early_stopping automatically enabled at patience=5\n", "reduce_on_plateau automatically enabled at patience=2\n", "\n", "\n", "begin training using triangular learning rate policy with max lr of 0.007...\n", "Train on 5324 samples, validate on 592 samples\n", "Epoch 1/1024\n", "5324/5324 [==============================] - 1s 219us/step - loss: 0.3265 - acc: 0.8924 - val_loss: 0.2218 - val_acc: 0.9139\n", "Epoch 2/1024\n", "5324/5324 [==============================] - 1s 208us/step - loss: 0.0274 - acc: 0.9951 - val_loss: 0.2047 - val_acc: 0.9155\n", "Epoch 3/1024\n", "5324/5324 [==============================] - 1s 204us/step - loss: 0.0166 - acc: 0.9968 - val_loss: 0.2060 - val_acc: 0.9155\n", "Epoch 4/1024\n", "5324/5324 [==============================] - 1s 206us/step - loss: 0.0137 - acc: 0.9968 - val_loss: 0.2062 - val_acc: 0.9206\n", "Epoch 5/1024\n", "5324/5324 [==============================] - 1s 213us/step - loss: 0.0120 - acc: 0.9970 - val_loss: 0.2078 - val_acc: 0.9189\n", "Epoch 6/1024\n", "5324/5324 [==============================] - 1s 204us/step - loss: 0.0111 - acc: 0.9970 - val_loss: 0.2082 - val_acc: 0.9206\n", "\n", "Epoch 00006: Reducing Max 
LR on Plateau: new max lr will be 0.0035 (if not early_stopping).\n", "Epoch 7/1024\n", "5324/5324 [==============================] - 1s 211us/step - loss: 0.0103 - acc: 0.9970 - val_loss: 0.2090 - val_acc: 0.9206\n", "Weights from best epoch have been loaded into model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.autofit(7e-3, monitor='val_acc')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.91 0.94 0.92 310\n", " pos 0.93 0.89 0.91 282\n", "\n", " accuracy 0.92 592\n", " macro avg 0.92 0.91 0.92 592\n", "weighted avg 0.92 0.92 0.92 592\n", "\n" ] }, { "data": { "text/plain": [ "array([[290, 20],\n", " [ 30, 252]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.validate(class_names=preproc.get_classes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Misclassifications" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "id:294 | loss:5.13 | true:neg | pred:pos)\n", "\n", "酒店 环境 还 不错 , 装修 也 很 好 。 早餐 不怎么样 , 价格 偏高 。\n" ] } ], "source": [ "learner.view_top_losses(n=1, preproc=preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Google Translate, the above roughly translates to:\n", "```\n", "The hotel environment is not bad, the decoration is also very good. Breakfast is not good, the price is high.\n", "```\n", "\n", "This is a mixed review, but is labeled only as negative. Our classifier is understandably confused and predicts positive for this review."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions on New Data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "p = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting label for the text\n", "> \"*The view and service of this hotel were terrible and our room was dirty.*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict(\"这家酒店的看法和服务都很糟糕,我们的房间很脏。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting label for:\n", "> \"*I like the service of this hotel.*\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict('我喜欢这家酒店的服务')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }