{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "# make GPU numbering follow PCI bus order and expose only GPU 0\n", "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building an Arabic Sentiment Analyzer With BERT\n", "\n", "In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model with minimal effort. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).\n", "\n", "Each entry in the dataset includes a review in Arabic and a rating between 1 and 5. We will convert this to a binary classification dataset by labeling reviews rated above 3 as positive and reviews rated below 3 as negative. Note that the code below labels any rating of 3 or higher as `pos`.\n", "\n", "(**Disclaimer:** I don't speak Arabic. Please forgive mistakes.)\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>rating</th><th>review</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>neg</td><td>“ممتاز”. النظافة والطاقم متعاون.</td></tr>\n",
"    <tr><th>1</th><td>pos</td><td>استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...</td></tr>\n",
"    <tr><th>2</th><td>pos</td><td>استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...</td></tr>\n",
"    <tr><th>3</th><td>neg</td><td>“استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...</td></tr>\n",
"    <tr><th>4</th><td>pos</td><td>جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا...</td></tr>\n",
"  </tbody>\n", "</table>" ], "text/plain": [ " rating review\n", "0 neg “ممتاز”. النظافة والطاقم متعاون. \n", "1 pos استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل...\n", "2 pos استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر...\n", "3 neg “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ...\n", "4 pos جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "# the reviews file is tab-delimited and UTF-16 encoded\n", "df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\\t', encoding='utf-16')\n", "# keep only the rating and review columns\n", "df = df[['rating', 'review']]\n", "# convert ratings to a binary format: below 3 becomes 'neg', everything else 'pos'\n", "df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data into a training set and a validation set." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(89843, 15855)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# randomly sample 85% of the rows for training; the remaining 15% form the validation set\n", "df_train = df.sample(frac=0.85, random_state=42)\n", "df_test = df.drop(df_train.index)\n", "len(df_train), len(df_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the [Transformer API in *ktrain*](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb), we can select any Hugging Face `transformers` model appropriate for our data. Since we are dealing with Arabic, we will use [AraBERT](https://huggingface.co/aubmindlab/bert-base-arabert) by the AUB MIND Lab (specifically, the `aubmindlab/bert-base-arabertv01` checkpoint) instead of multilingual BERT, which *ktrain* normally uses for non-English datasets in the alternative [text_classifier API in *ktrain*](https://github.com/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-BERT.ipynb). As you can see below, after only 1 epoch we obtain a validation accuracy of **96.37%**."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preprocessing train...\n", "language: ar\n", "train sequence lengths:\n", "\tmean : 24\n", "\t95percentile : 67\n", "\t99percentile : 120\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? False\n", "preprocessing test...\n", "language: ar\n", "test sequence lengths:\n", "\tmean : 24\n", "\t95percentile : 67\n", "\t99percentile : 121\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 5e-05...\n", "Train for 2808 steps, validate for 496 steps\n", "2808/2808 [==============================] - 1104s 393ms/step - loss: 0.1447 - accuracy: 0.9466 - val_loss: 0.1054 - val_accuracy: 0.9637\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ktrain\n", "from ktrain import text\n", "# AraBERT checkpoint on the Hugging Face model hub\n", "MODEL_NAME = 'aubmindlab/bert-base-arabertv01'\n", "# maxlen caps the length of each input sequence\n", "t = text.Transformer(MODEL_NAME, maxlen=128)\n", "trn = t.preprocess_train(df_train.review.values, df_train.rating.values)\n", "val = t.preprocess_test(df_test.review.values, df_test.rating.values)\n", "model = t.get_classifier()\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)\n", "# train for 1 epoch with the 1cycle policy and a maximum learning rate of 5e-5\n", "learner.fit_onecycle(5e-5, 1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions on New Data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# wrap the trained model and its preprocessor so we can predict on raw text\n", "p = ktrain.get_predictor(learner.model, t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*The room was clean, the food excellent, and I loved the view from my room.*\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the label for the text:\n", "> \"*This hotel was too expensive and the staff is rude.*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Save Our Predictor for Later Deployment" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# save the predictor (model and preprocessor) for later use\n", "p.save('/tmp/arabic_predictor')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# reload from disk\n", "p = ktrain.load_predictor('/tmp/arabic_predictor')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'pos'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# still works as expected after reloading from disk\n", "p.predict(\"الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }