{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multinomial Naive Bayes" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics import classification_report\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.model_selection import train_test_split\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cargamos el conjunto de datos de spam o no spam (ham)\n", "\n", "La idea es poder predecir si un mensaje entrante es spam o no lo es (en caso que no lo sea se le llama *ham*)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
targettext
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", "
" ], "text/plain": [ " target text\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = pd.read_table(\"./demo_3_dataset/spam_or_ham.txt\", header=None, names=[\"target\", \"text\"])\n", "dataset.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham:\n", "Nah I don't think he goes to usf, he lives around here though\n", "spam:\n", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n" ] } ], "source": [ "print('ham:')\n", "print(dataset.iloc[4,1])\n", "print('spam:')\n", "print(dataset.iloc[2,1])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5572, 2)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorize\n", "\n", "Transform the input from text into a bag of words matrix ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html))." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer()\n", "vectorized_data= vectorizer.fit_transform(dataset[\"text\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['huge', 'hugging', 'hugh', 'hugs', 'huh', 'hui', 'huiming', 'hum',\n", " 'humanities', 'humans', 'hun', 'hundred', 'hundreds', 'hungover',\n", " 'hungry', 'hunks', 'hunny', 'hunt', 'hunting', 'hurricanes',\n", " 'hurried', 'hurry', 'hurt', 'hurting', 'hurts', 'husband',\n", " 'hussey', 'hustle', 'hut', 'hv', 'hv9d', 'hvae', 'hw', 'hyde',\n", " 'hype', 'hypertension', 'hypotheticalhuagauahahuagahyuhagga',\n", " 'iam', 'ias', 'ibh', 'ibhltd', 'ibiza', 'ibm', 'ibn', 'ibored',\n", " 'ibuprofens', 'ic', 'iccha', 'ice', 'icic', 'icicibank', 'icky',\n", " 'icmb3cktz8r7', 'icon', 'id', 'idc', 'idea', 'ideal', 'ideas',\n", " 'identification', 'identifier', 'idew', 'idiot', 'idk', 'idps',\n", " 'idu', 'ie', 'if', 'iff', 'ifink', 'ig11', 'ignorant', 'ignore',\n", " 'ignoring', 'ihave', 'ijust', 'ikea', 'ikno', 'iknow', 'il',\n", " 'ileave', 'ill', 'illness', 'illspeak', 'ilol', 'im', 'image',\n", " 'images', 'imagination', 'imagine', 'imat', 'imf', 'img', 'imin',\n", " 'imma', 'immed', 'immediately', 'immunisation', 'imp', 'impatient',\n", " 'impede', 'implications', 'important', 'importantly', 'imposed',\n", " 'impossible', 'imposter', 'impress', 'impressed', 'impression',\n", " 'impressively', 'improve', 'improved', 'imprtant', 'in', 'in2',\n", " 'inc', 'inch', 'inches', 'incident', 'inclu', 'include',\n", " 'includes', 'including', 'inclusive', 'incomm', 'inconsiderate',\n", " 'inconvenience', 'inconvenient', 'incorrect', 'increase',\n", " 'incredible', 'increments', 'inde', 'indeed', 'independence',\n", " 'independently', 'index', 'india', 'indian', 'indians', 'indicate',\n", " 'individual', 'indyarocks', 'inever', 'infact', 'infections',\n", " 'infernal', 'influx', 'info', 'inform', 
'information', 'informed',\n",
"       'infra', 'infront', 'ing', 'ingredients', 'initiate', 'ink',\n",
"       'inlude', 'inmind', 'inner', 'innings', 'innocent', 'innu',\n",
"       'inperialmusic', 'inpersonation', 'inr', 'insects', 'insha',\n",
"       'inshah', 'inside', 'inspection', 'inst', 'install',\n",
"       'installation', 'installing', 'instant', 'instantly', 'instead',\n",
"       'instituitions', 'instructions', 'insurance', 'intelligent',\n",
"       'intend', 'intention', 'intentions', 'interest', 'interested',\n",
"       'interesting', 'interflora', 'interfued', 'internal', 'internet',\n",
"       'interview', 'interviews', 'interviw', 'intha', 'into', 'intrepid'],\n",
"      dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer.get_feature_names_out()[4000:4200]" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5572, 8713)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorized_data.shape" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       ...,\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorized_data.toarray()" ] },
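{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the bag-of-words representation above concrete, here is a small illustrative sketch (not part of the original pipeline): `toy_corpus` is a made-up three-message corpus, and the resulting matrix has one row per message, one column per vocabulary word, and word counts as entries." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch on made-up data: CountVectorizer builds the vocabulary\n", "# with fit_transform and returns a document-term count matrix.\n", "toy_corpus = [\n", "    \"free entry to win a prize\",\n", "    \"are you free for lunch today\",\n", "    \"win a free prize now\",\n", "]\n", "toy_vectorizer = CountVectorizer()\n", "toy_counts = toy_vectorizer.fit_transform(toy_corpus)\n", "print(toy_vectorizer.get_feature_names_out())\n", "print(toy_counts.toarray())" ] },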
"stream", "text": [ "spam 44 7732584351, Do you want a New Nokia 3510i colour phone DeliveredTomorrow? With 300 free minutes to any mobile + 100 free texts + Free Camcorder reply or call 08000930705.\n" ] } ], "source": [ "print(*dataset.iloc[caso])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Podemos obtener la probabilidad de que haya sido spam o ham:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[4.31330826e-18, 1.00000000e+00]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(clf.predict_log_proba(vectorized_data[caso]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso, la probabilidad de spam es mayor que la de ham. Veamos lo mismo pero para un caso que no era spam:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham Ugh my leg hurts. Musta overdid it on mon.\n", "[[9.99957311e-01 4.26892776e-05]]\n" ] } ], "source": [ "print(*dataset.iloc[5199])\n", "print(np.exp(clf.predict_log_proba(vectorized_data[5199])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso la probabilidad de ham es mayor que la de spam, por eso lo clasifica como ham.\n", "\n", "Veamos algunos casos del conjunto de prueba:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2095 True\n", "5343 True\n", "564 True\n", "3849 True\n", "3317 True\n", "5277 True\n", "1674 True\n", "3753 True\n", "5507 True\n", "265 True\n", "4413 True\n", "5111 True\n", "4896 True\n", "3161 True\n", "3743 True\n", "2887 True\n", "869 False\n", "4061 True\n", "1072 True\n", "4559 True\n", "Name: target, dtype: bool" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(X_test)[:20] == y_test[:20]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "El caso estudiado es:\n", "spam Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r \n", "\n", "Clasificó como ham y era spam. La probabilidad de ser ham fue 1.0000 y la de spam 0.0000\n", "Clasificó incorrectamente...\n" ] } ], "source": [ "caso = 869\n", "print(f\"El caso estudiado es:\")\n", "print(*dataset.iloc[caso],\"\\n\")\n", "print(f\"Clasificó como {clf.predict(vectorized_data)[caso]} y era {dataset.target[caso]}. 
], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 ('diplodatos')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "e6b65fc4380ac725e50a330b268a227bbdbe91bddfffbf68e5f7ce9848a2b8d5" } } }, "nbformat": 4, "nbformat_minor": 4 }