{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "import os\n",
    "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building an Arabic Sentiment Analyzer With BERT\n",
    "\n",
    "In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model with minimal effort. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.\n",
    "\n",
    "The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).\n",
    "\n",
    "Each entry in the dataset includes a review in Arabic and a rating between 1 and 5.  We will convert this to a binary classification dataset by assigning reviews with a rating of above 3 a positive label and assigning reviews with a rating of less than 3 a negative label.\n",
    "\n",
    "(**Disclaimer:** I don't speak Arabic. Please forgive mistakes.) \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>neg</td>\n",
       "      <td>“ممتاز”. النظافة والطاقم متعاون.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>pos</td>\n",
       "      <td>استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>pos</td>\n",
       "      <td>استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>neg</td>\n",
       "      <td>“استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>pos</td>\n",
       "      <td>جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rating                                             review\n",
       "0    neg                  “ممتاز”. النظافة والطاقم متعاون. \n",
       "1    pos  استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...\n",
       "2    pos  استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...\n",
       "3    neg  “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...\n",
       "4    pos  جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا..."
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# convert ratings to a binary format:  pos=positive, neg=negative\n",
    "import pandas as pd\n",
    "df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\\t', encoding='utf-16')\n",
    "df = df[['rating', 'review']] \n",
    "df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's split out a training and validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(89843, 15855)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train = df.sample(frac=0.85, random_state=42)\n",
    "df_test = df.drop(df_train.index)\n",
    "len(df_train), len(df_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the [Transformer API in *ktrain*](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb), we can select any Hugging Face `transformers` model appropriate for our data.  Since we are dealing with Arabic, we will use [AraBERT](https://huggingface.co/aubmindlab/bert-base-arabert) by the AUB MIND Lab instead of multilingual BERT (which is normally used by *ktrain* for non-English datasets in the alternative [text_classifier API in *ktrain*](https://github.com/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-BERT.ipynb)).  As you can see below, with only 1 epoch, we obtain a **96.37** accuracy on the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing train...\n",
      "language: ar\n",
      "train sequence lengths:\n",
      "\tmean : 24\n",
      "\t95percentile : 67\n",
      "\t99percentile : 120\n"
     ]
    },
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Multi-Label? False\n",
      "preprocessing test...\n",
      "language: ar\n",
      "test sequence lengths:\n",
      "\tmean : 24\n",
      "\t95percentile : 67\n",
      "\t99percentile : 121\n"
     ]
    },
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "begin training using onecycle policy with max lr of 5e-05...\n",
      "Train for 2808 steps, validate for 496 steps\n",
      "2808/2808 [==============================] - 1104s 393ms/step - loss: 0.1447 - accuracy: 0.9466 - val_loss: 0.1054 - val_accuracy: 0.9637\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7f06344b84a8>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ktrain\n",
    "from ktrain import text\n",
    "MODEL_NAME = 'aubmindlab/bert-base-arabertv01'\n",
    "t = text.Transformer(MODEL_NAME, maxlen=128)\n",
    "trn = t.preprocess_train(df_train.review.values, df_train.rating.values)\n",
    "val = t.preprocess_test(df_test.review.values, df_test.rating.values)\n",
    "model = t.get_classifier()\n",
    "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)\n",
    "learner.fit_onecycle(5e-5, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Making Predictions on New Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = ktrain.get_predictor(learner.model, t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for the text\n",
    "> \"*The room was clean, the food excellent, and I loved the view from my room.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for:\n",
    "> \"*This hotel was too expensive and the staff is rude.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'neg'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save our Predictor for Later Deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save model for later use\n",
    "p.save('/tmp/arabic_predictor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reload from disk\n",
    "p = ktrain.load_predictor('/tmp/arabic_predictor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# still works as expected after reloading from disk\n",
    "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}