{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Chinese-Language Sentiment Analyzer\n", "\n", "In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Chengwei Zhang's GitHub repository [here](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000).\n", "\n", "(**Disclaimer:** I don't speak Chinese. Please forgive mistakes.) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Load and Preprocess the Data\n", "\n", "First, we use the `texts_from_folder` function to load and preprocess the data. We assume that the data is in the following form:\n", "```\n", " ├── datadir\n", " │ ├── train\n", " │ │ ├── class0 # folder containing documents of class 0\n", " │ │ ├── class1 # folder containing documents of class 1\n", " │ │ ├── class2 # folder containing documents of class 2\n", " │ │ └── classN # folder containing documents of class N\n", "```\n", "We set `val_pct` as 0.1, which will automatically sample 10% of the data for validation. Since we will be using a pretrained BERT model for classification, we specifiy `preprocess_mode='bert'`. 
If you are using any other model (e.g., `fasttext`), you should either omit this parameter or use `preprocess_mode='standard'`.\n", "\n", "**Notice that there is nothing special or extra we need to do here for non-English text.** *ktrain* automatically detects the language and character encoding, then prepares the data and configures the model appropriately.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "detected encoding: GB18030\n", "Decoding with GB18030 failed 1st attempt - using GB18030 with skips\n", "skipped 109 lines (0.3%) due to character decoding errors\n", "skipped 9 lines (0.2%) due to character decoding errors\n", "preprocessing train...\n", "language: zh-cn\n" ] }, { "data": { "text/html": [ "done." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "preprocessing test...\n", "language: zh-cn\n" ] }, { "data": { "text/html": [ "done." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "trn, val, preproc = text.texts_from_folder('/home/amaiya/data/ChnSentiCorp_htl_ba_6000', \n", " maxlen=75, \n", " max_features=30000,\n", " preprocess_mode='bert',\n", " train_test_names=['train'],\n", " val_pct=0.1,\n", " classes=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Create a Model and Wrap in Learner Object" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? 
False\n", "maxlen is 75\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('bert', trn, preproc=preproc)\n", "learner = ktrain.get_learner(model, \n", " train_data=trn, \n", " val_data=val, \n", " batch_size=32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Train the Model\n", "\n", "We will use the `fit_onecycle` method that employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf) for four epochs. We will save the weights from each epoch using the `checkpoint_folder` argument, so that we can go reload the weights from the best epoch in case we overfit." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 2e-05...\n", "Train on 5324 samples, validate on 592 samples\n", "Epoch 1/4\n", "5324/5324 [==============================] - 54s 10ms/step - loss: 0.3635 - acc: 0.8422 - val_loss: 0.2793 - val_acc: 0.8801\n", "Epoch 2/4\n", "5324/5324 [==============================] - 42s 8ms/step - loss: 0.2151 - acc: 0.9170 - val_loss: 0.2501 - val_acc: 0.9223\n", "Epoch 3/4\n", "5324/5324 [==============================] - 42s 8ms/step - loss: 0.1153 - acc: 0.9591 - val_loss: 0.2267 - val_acc: 0.9257\n", "Epoch 4/4\n", "5324/5324 [==============================] - 42s 8ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2596 - val_acc: 0.9324\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(2e-5, 4, checkpoint_folder='/tmp/saved_weights')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although Epoch 03 had the lowest validation loss, the final validation accuracy at the end of the last epoch is still the highest (i.e., **93.24%**), so we will just leave the model weights as they are this time." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Misclassifications" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "id:299 | loss:7.37 | true:neg | pred:pos)\n", "\n", "[CLS]酒店位置佳,出入西街比较方便;观景房是观西街的景,晚上虽然较吵但基本不会影响到睡眠;酒店对面就是在携程上预订自行车的九九车行,取车方便。通过携程订[SEP]\n" ] } ], "source": [ "learner.view_top_losses(n=1, preproc=preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Google Translate, the above roughly translates to:\n", "```\n", "\n", "Hotel location is good, access to West Street is more convenient; viewing room is the view of Guanxi Street, although the night is noisy, but it will not affect sleep; the opposite side of the hotel is the Jiujiu car line booking bicycles on Ctrip, easy to pick up. By Ctrip\n", "\n", "```\n", "\n", "Although there is a minor negative comment embedded in this review about noise, the review appears to be overall positive and was predicted as positive by our classifier. 
The ground-truth label, however, is negative, which may be a mistake and may explain the high loss.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions on New Data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "p = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting label for the text\n", "> \"*The view and service of this hotel were terrible and our room was dirty.*\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict(\"这家酒店的看法和服务都很糟糕,我们的房间很脏。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting label for:\n", "> \"*I like the service of this hotel.*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict('我喜欢这家酒店的服务')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save Predictor for Later Deployment" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "p.save('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "p = ktrain.load_predictor('/tmp/mypred')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "# still works\n", "p.predict('我喜欢这家酒店的服务')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }