{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "import os\n",
    "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "using Keras version: 2.2.4\n"
     ]
    }
   ],
   "source": [
    "import ktrain\n",
    "from ktrain import text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building an Arabic Sentiment Analyzer\n",
    "\n",
    "In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model in 4 simple steps. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.\n",
    "\n",
    "The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).\n",
    "\n",
    "Each entry in the dataset includes a review in Arabic and a rating between 1 and 5.  We will convert this to a binary classification dataset by assigning reviews with a rating of above 3 a positive label of 1 and assigning reviews with a rating of less than 3 a negative label of 0.\n",
    "\n",
    "(**Disclaimer:** I don't speak Arabic. Please forgive mistakes.) \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>neg</th>\n",
       "      <th>pos</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>“ممتاز”. النظافة والطاقم متعاون.</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>“استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  neg  pos\n",
       "0                  “ممتاز”. النظافة والطاقم متعاون.     1    0\n",
       "1  استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...    0    1\n",
       "2  استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...    0    1\n",
       "3  “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...    1    0\n",
       "4  جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا...    0    1"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# convert ratings to a binary format:  1=positive, 0=negative\n",
    "import pandas as pd\n",
    "df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\\t', encoding='utf-16')\n",
    "df = df[['rating', 'review']] \n",
    "df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')\n",
    "df.columns = ['label', 'text']\n",
    "df = pd.concat([df, df.label.astype('str').str.get_dummies()], axis=1, sort=False)\n",
    "df = df[['text', 'neg', 'pos']]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 1:  Load and Preprocess the Data\n",
    "\n",
    "First, we use the `texts_from_df` function to load and preprocess the data in to arrays that can be directly fed into a neural network model. \n",
    "\n",
    "We set `val_pct` as 0.1, which will automatically sample 10% of the data for validation.  We specifiy `preprocess_mode='bert'`, as we will fine-tuning a BERT model in this example. If using a different model, you will select `preprocess_mode='standard'`.\n",
    "\n",
    "**Notice that there is nothing speical or extra we need to do here for non-English text.**  *ktrain* automatically detects the language and character encoding and prepares the data and configures the model appropriately.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing train...\n",
      "language: ar\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing test...\n",
      "language: ar\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(df, \n",
    "                                                                   'text', # name of column containing review text\n",
    "                                                                   label_columns=['neg', 'pos'],\n",
    "                                                                   maxlen=75, \n",
    "                                                                   max_features=100000,\n",
    "                                                                   preprocess_mode='bert',\n",
    "                                                                   val_pct=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 2:  Create a Model and Wrap in Learner Object"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will employ a neural implementation of the [NBSVM](https://www.aclweb.org/anthology/P12-2018/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Multi-Label? False\n",
      "maxlen is 75\n",
      "done.\n"
     ]
    }
   ],
   "source": [
    "model = text.text_classifier('bert', (x_train, y_train) , preproc=preproc)\n",
    "learner = ktrain.get_learner(model, \n",
    "                             train_data=(x_train, y_train), \n",
    "                             val_data=(x_test, y_test), \n",
    "                             batch_size=32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 3: Train the Model\n",
    "\n",
    "We will use the `fit_onecycle` method that employs a 1cycle learning rate policy and train 1 epoch.\n",
    "\n",
    "As shown in the cell below, our final validation accuracy is **95.53%** over a single epoch!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "begin training using onecycle policy with max lr of 2e-05...\n",
      "Train on 95128 samples, validate on 10570 samples\n",
      "Epoch 1/1\n",
      "95128/95128 [==============================] - 818s 9ms/step - loss: 0.1683 - acc: 0.9322 - val_loss: 0.1225 - val_acc: 0.9553\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f941a4bf9b0>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "learner.fit_onecycle(2e-5, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Making Predictions on New Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = ktrain.get_predictor(learner.model, preproc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for the text\n",
    "> \"*The room was clean, the food excellent, and I loved the view from my room.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for:\n",
    "> \"*This hotel was too expensive and the staff is rude.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'neg'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save our Predictor for Later Deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save model for later use\n",
    "p.save('/tmp/arabic_predictor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reload from disk\n",
    "p = ktrain.load_predictor('/tmp/arabic_predictor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# still works as expected after reloading from disk\n",
    "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}