{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building an Arabic Sentiment Analyzer With BERT\n", "\n", "In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model with minimal effort. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.\n", "\n", "The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).\n", "\n", "Each entry in the dataset includes a review in Arabic and a rating between 1 and 5. We will convert this to a binary classification dataset by labeling reviews rated above 3 as positive and reviews rated below 3 as negative.\n", "\n", "(**Disclaimer:** I don't speak Arabic. Please forgive any mistakes.)\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | rating | review |
|---|---|---|
| 0 | neg | “ممتاز”. النظافة والطاقم متعاون. |
| 1 | pos | استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل... |
| 2 | pos | استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر... |
| 3 | neg | “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ... |
| 4 | pos | جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا... |