{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Chinese-Language Sentiment Analyzer\n", "\n", "In this notebook, we will build a Chinese-language text classification model in 4 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Chengwei Zhang's GitHub repository [here](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000).\n", "\n", "(**Disclaimer:** I don't speak Chinese. Please forgive mistakes.) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Load and Preprocess the Data\n", "\n", "First, we use the `texts_from_folder` function to load and preprocess the data. We assume that the data is in the following form:\n", "```\n", " ├── datadir\n", " │ ├── train\n", " │ │ ├── class0 # folder containing documents of class 0\n", " │ │ ├── class1 # folder containing documents of class 1\n", " │ │ ├── class2 # folder containing documents of class 2\n", " │ │ └── classN # folder containing documents of class N\n", "```\n", "We set `val_pct` to 0.1, which automatically samples 10% of the data for validation. We specify `preprocess_mode='standard'` to employ normal text preprocessing. 
If you are using the BERT model (i.e., 'bert'), you should use `preprocess_mode='bert'`.\n", "\n", "**Notice that there is nothing special or extra we need to do here for non-English text.** *ktrain* automatically detects the language and character encoding, prepares the data, and configures the model appropriately.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: GB18030\n", "Decoding with GB18030 failed 1st attempt - using GB18030 with skips\n", "skipped 104 lines (0.3%) due to character decoding errors\n", "skipped 14 lines (0.4%) due to character decoding errors\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "WARNING: Logging before flag parsing goes to stderr.\n", "I1001 15:11:00.816586 140470484846400 __init__.py:111] Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "I1001 15:11:00.818966 140470484846400 __init__.py:131] Loading model from cache /tmp/jieba.cache\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "language: zh-cn\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading model cost 0.641 seconds.\n", "I1001 15:11:01.459813 140470484846400 __init__.py:163] Loading model cost 0.641 seconds.\n", "Prefix dict has been built succesfully.\n", "I1001 15:11:01.461843 140470484846400 __init__.py:164] Prefix dict has been built succesfully.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word Counts: 22066\n", "Nrows: 5324\n", "5324 train sequences\n", "Average train sequence length: 81\n", "x_train shape: (5324,75)\n", "y_train shape: (5324,2)\n", "592 test sequences\n", "Average test sequence length: 85\n", "x_test shape: (592,75)\n", "y_test shape: (592,2)\n" ] } ], "source": [ "(x_train, y_train), (x_test, y_test), preproc = 
text.texts_from_folder('data/ChnSentiCorp_htl_ba_6000', \n", " maxlen=75, \n", " max_features=30000,\n", " preprocess_mode='standard',\n", " train_test_names=['train'],\n", " val_pct=0.1,\n", " classes=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Create a Model and Wrap It in a Learner Object" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? False\n", "compiling word ID features...\n", "maxlen is 75\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('fasttext', (x_train, y_train), preproc=preproc)\n", "learner = ktrain.get_learner(model, \n", " train_data=(x_train, y_train), \n", " val_data=(x_test, y_test), \n", " batch_size=32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Estimate the LR\n", "We'll use the *ktrain* learning rate finder to find a good learning rate to use with *fasttext*. From the plot, we select the highest learning rate at which the loss is still falling.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Epoch 1/1024\n", "5324/5324 [==============================] - 2s 466us/step - loss: 0.9928 - acc: 0.5173\n", "Epoch 2/1024\n", "5324/5324 [==============================] - 2s 308us/step - loss: 1.0088 - acc: 0.5011\n", "Epoch 3/1024\n", "5324/5324 [==============================] - 2s 324us/step - loss: 0.9870 - acc: 0.5066\n", "Epoch 4/1024\n", "5324/5324 [==============================] - 2s 314us/step - loss: 0.9727 - acc: 0.5116\n", "Epoch 5/1024\n", "5324/5324 [==============================] - 2s 319us/step - loss: 0.8829 - acc: 0.5406\n", "Epoch 6/1024\n", "5324/5324 [==============================] - 2s 309us/step - loss: 0.6585 - acc: 0.6597\n", "Epoch 7/1024\n", "5324/5324 [==============================] - 2s 314us/step - loss: 0.5113 - acc: 0.7607\n", "Epoch 8/1024\n", "5324/5324 [==============================] - 2s 309us/step - loss: 0.4962 - acc: 0.7746\n", "Epoch 9/1024\n", "5324/5324 [==============================] - 2s 318us/step - loss: 0.6645 - acc: 0.5920\n", "Epoch 10/1024\n", "5324/5324 [==============================] - 2s 325us/step - loss: 0.7151 - acc: 0.4985\n", "Epoch 11/1024\n", "5324/5324 [==============================] - 2s 317us/step - loss: 0.8465 - acc: 0.5015\n", "Epoch 12/1024\n", " 416/5324 [=>............................] - ETA: 1s - loss: 2.3385 - acc: 0.5048\n", "\n", "done.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Train the Model\n", "\n", "We will use the `fit_onecycle` method that employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf) for 10 epochs (i.e., roughly 20 seconds)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.005...\n", "Train on 5324 samples, validate on 592 samples\n", "Epoch 1/10\n", "5324/5324 [==============================] - 2s 356us/step - loss: 0.7315 - acc: 0.6409 - val_loss: 0.4885 - val_acc: 0.7669\n", "Epoch 2/10\n", "5324/5324 [==============================] - 2s 352us/step - loss: 0.4666 - acc: 0.7855 - val_loss: 0.3647 - val_acc: 0.8530\n", "Epoch 3/10\n", "5324/5324 [==============================] - 2s 353us/step - loss: 0.3553 - acc: 0.8492 - val_loss: 0.3181 - val_acc: 0.8750\n", "Epoch 4/10\n", "5324/5324 [==============================] - 2s 356us/step - loss: 0.2746 - acc: 0.8875 - val_loss: 0.3126 - val_acc: 0.8699\n", "Epoch 5/10\n", "5324/5324 [==============================] - 2s 349us/step - loss: 0.2424 - acc: 0.9031 - val_loss: 0.3129 - val_acc: 0.8801\n", "Epoch 6/10\n", "5324/5324 [==============================] - 2s 353us/step - loss: 0.2130 - acc: 0.9174 - val_loss: 0.2984 - val_acc: 0.8750\n", "Epoch 7/10\n", "5324/5324 [==============================] - 2s 352us/step - loss: 0.1643 - acc: 0.9378 - val_loss: 0.2843 - val_acc: 0.9020\n", "Epoch 8/10\n", "5324/5324 [==============================] - 2s 352us/step - loss: 0.1301 - acc: 0.9517 - val_loss: 0.2865 - val_acc: 0.9037\n", "Epoch 9/10\n", "5324/5324 [==============================] - 2s 362us/step - loss: 0.1019 - acc: 0.9592 - val_loss: 0.3035 - val_acc: 0.9037\n", "Epoch 10/10\n", 
"5324/5324 [==============================] - 2s 363us/step - loss: 0.0823 - acc: 0.9728 - val_loss: 0.3098 - val_acc: 0.9037\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(5e-3, 10)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " neg 0.91 0.91 0.91 315\n", " pos 0.90 0.89 0.90 277\n", "\n", " accuracy 0.90 592\n", " macro avg 0.90 0.90 0.90 592\n", "weighted avg 0.90 0.90 0.90 592\n", "\n" ] }, { "data": { "text/plain": [ "array([[288, 27],\n", " [ 30, 247]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.validate(class_names=preproc.get_classes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Misclassifications" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "id:345 | loss:10.15 | true:pos | pred:neg)\n", "\n", "所谓山景房,就是非海景房而已,没有什么山景可言,海景房确实,有条件尽量选。只是这种房的窗帘边上拉不严,早上光线进来如同亮着灯一般,可能引发。另外窗外隔音不佳,如果呼呼明显,这想必也必不了了。\n" ] } ], "source": [ "learner.view_top_losses(n=1, preproc=preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Google Translate, the above roughly translates to:\n", "```\n", "\n", "The so-called mountain view room is just a non-sea view room, there is no mountain view at all, the sea view room is indeed, there are conditions to choose as much as possible. It’s just that the curtains in this room are not pulled up. The morning light comes in like a lit lamp, which may be triggered. In addition, the sound insulation outside the window is not good. If the whirring is obvious, it must be no longer necessary.\n", "```\n", "\n", "Mistranslations aside, this is clearly a negative review. 
It appears to have been incorrectly assigned a ground-truth label of positive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions on New Data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "p = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*The view and service of this hotel were terrible and our room was dirty.*\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict(\"这家酒店的看法和服务都很糟糕,我们的房间很脏。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*I like the service of this hotel.*\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict('我喜欢这家酒店的服务')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving Predictor for Later Deployment" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "p.save('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "p = ktrain.load_predictor('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the reloaded predictor still works\n", "p.predict(\"这家酒店的风景和服务都非常糟糕\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }