{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building an Arabic Sentiment Analyzer\n", "\n", "In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model in just a few simple steps. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).\n", "\n", "Each entry in the dataset includes a review in Arabic and a rating between 1 and 5. We will convert this to a binary classification dataset by assigning reviews with a rating above 3 a positive label of 1 and reviews with a rating below 3 a negative label of 0.\n", "\n", "(**Disclaimer:** I don't speak Arabic. Please forgive mistakes.) \n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " text neg pos\n", "0 “ممتاز”. النظافة والطاقم متعاون. 1 0\n", "1 استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل... 0 1\n", "2 استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر... 0 1\n", "3 “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ... 1 0\n", "4 جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا... 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert ratings to binary labels (rating < 3 -> 'neg', otherwise 'pos')\n", "# and one-hot encode them into 'neg' and 'pos' columns\n", "import pandas as pd\n", "df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\\t', encoding='utf-16')\n", "df = df[['rating', 'review']] \n", "df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')\n", "df.columns = ['label', 'text']\n", "df = pd.concat([df, df.label.astype('str').str.get_dummies()], axis=1, sort=False)\n", "df = df[['text', 'neg', 'pos']]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Load and Preprocess the Data\n", "\n", "First, we use the `texts_from_df` function to load and preprocess the data into arrays that can be fed directly into a neural network model. \n", "\n", "We set `val_pct` to 0.1, which automatically samples 10% of the data for validation. We specify `preprocess_mode='bert'`, as we will be fine-tuning a BERT model in this example. If using a different model, you would select `preprocess_mode='standard'` instead.\n", "\n", "**Notice that there is nothing special or extra we need to do here for non-English text.** *ktrain* automatically detects the language and character encoding, prepares the data, and configures the model appropriately.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preprocessing train...\n", "language: ar\n" ] }, { "data": { "text/html": [ "done." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "preprocessing test...\n", "language: ar\n" ] }, { "data": { "text/html": [ "done." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(df, \n", " 'text', # name of column containing review text\n", " label_columns=['neg', 'pos'],\n", " maxlen=75, \n", " max_features=100000,\n", " preprocess_mode='bert',\n", " val_pct=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Create a Model and Wrap in Learner Object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will fine-tune a pretrained BERT model for text classification, which is why the data was preprocessed with `preprocess_mode='bert'`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? False\n", "maxlen is 75\n", "done.\n" ] } ], "source": [ "model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)\n", "learner = ktrain.get_learner(model, \n", " train_data=(x_train, y_train), \n", " val_data=(x_test, y_test), \n", " batch_size=32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Train the Model\n", "\n", "We will use the `fit_onecycle` method, which employs a 1cycle learning rate policy, and train for 1 epoch with a maximum learning rate of 2e-5.\n", "\n", "As shown in the cell below, our final validation accuracy is **95.53%** after a single epoch!" 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 2e-05...\n", "Train on 95128 samples, validate on 10570 samples\n", "Epoch 1/1\n", "95128/95128 [==============================] - 818s 9ms/step - loss: 0.1683 - acc: 0.9322 - val_loss: 0.1225 - val_acc: 0.9553\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(2e-5, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions on New Data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "p = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*The room was clean, the food excellent, and I loved the view from my room.*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*This hotel was too expensive and the staff is rude.*\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save our Predictor for Later 
Deployment" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# save model for later use\n", "p.save('/tmp/arabic_predictor')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# reload from disk\n", "p = ktrain.load_predictor('/tmp/arabic_predictor')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# still works as expected after reloading from disk\n", "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }