{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multinomial Naive Bayes" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics import classification_report\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.model_selection import train_test_split\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cargamos el conjunto de datos de spam o no spam (ham)\n", "\n", "La idea es poder predecir si un mensaje entrante es spam o no lo es (en caso que no lo sea se le llama *ham*)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
targettext
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", "
" ], "text/plain": [ " target text\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = pd.read_table(\"./demo_3_dataset/spam_or_ham.txt\", header=None, names=[\"target\", \"text\"])\n", "dataset.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham:\n", "Nah I don't think he goes to usf, he lives around here though\n", "spam:\n", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n" ] } ], "source": [ "print('ham:')\n", "print(dataset.iloc[4,1])\n", "print('spam:')\n", "print(dataset.iloc[2,1])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5572, 2)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorize\n", "\n", "Transform the input from text into a bag of words matrix ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html))." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer()\n", "vectorized_data= vectorizer.fit_transform(dataset[\"text\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['huge', 'hugging', 'hugh', 'hugs', 'huh', 'hui', 'huiming', 'hum',\n", " 'humanities', 'humans', 'hun', 'hundred', 'hundreds', 'hungover',\n", " 'hungry', 'hunks', 'hunny', 'hunt', 'hunting', 'hurricanes',\n", " 'hurried', 'hurry', 'hurt', 'hurting', 'hurts', 'husband',\n", " 'hussey', 'hustle', 'hut', 'hv', 'hv9d', 'hvae', 'hw', 'hyde',\n", " 'hype', 'hypertension', 'hypotheticalhuagauahahuagahyuhagga',\n", " 'iam', 'ias', 'ibh', 'ibhltd', 'ibiza', 'ibm', 'ibn', 'ibored',\n", " 'ibuprofens', 'ic', 'iccha', 'ice', 'icic', 'icicibank', 'icky',\n", " 'icmb3cktz8r7', 'icon', 'id', 'idc', 'idea', 'ideal', 'ideas',\n", " 'identification', 'identifier', 'idew', 'idiot', 'idk', 'idps',\n", " 'idu', 'ie', 'if', 'iff', 'ifink', 'ig11', 'ignorant', 'ignore',\n", " 'ignoring', 'ihave', 'ijust', 'ikea', 'ikno', 'iknow', 'il',\n", " 'ileave', 'ill', 'illness', 'illspeak', 'ilol', 'im', 'image',\n", " 'images', 'imagination', 'imagine', 'imat', 'imf', 'img', 'imin',\n", " 'imma', 'immed', 'immediately', 'immunisation', 'imp', 'impatient',\n", " 'impede', 'implications', 'important', 'importantly', 'imposed',\n", " 'impossible', 'imposter', 'impress', 'impressed', 'impression',\n", " 'impressively', 'improve', 'improved', 'imprtant', 'in', 'in2',\n", " 'inc', 'inch', 'inches', 'incident', 'inclu', 'include',\n", " 'includes', 'including', 'inclusive', 'incomm', 'inconsiderate',\n", " 'inconvenience', 'inconvenient', 'incorrect', 'increase',\n", " 'incredible', 'increments', 'inde', 'indeed', 'independence',\n", " 'independently', 'index', 'india', 'indian', 'indians', 'indicate',\n", " 'individual', 'indyarocks', 'inever', 'infact', 'infections',\n", " 'infernal', 'influx', 'info', 'inform', 
'information', 'informed',\n",
"       'infra', 'infront', 'ing', 'ingredients', 'initiate', 'ink',\n",
"       'inlude', 'inmind', 'inner', 'innings', 'innocent', 'innu',\n",
"       'inperialmusic', 'inpersonation', 'inr', 'insects', 'insha',\n",
"       'inshah', 'inside', 'inspection', 'inst', 'install',\n",
"       'installation', 'installing', 'instant', 'instantly', 'instead',\n",
"       'instituitions', 'instructions', 'insurance', 'intelligent',\n",
"       'intend', 'intention', 'intentions', 'interest', 'interested',\n",
"       'interesting', 'interflora', 'interfued', 'internal', 'internet',\n",
"       'interview', 'interviews', 'interviw', 'intha', 'into', 'intrepid'],\n",
"      dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer.get_feature_names_out()[4000:4200]" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5572, 8713)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorized_data.shape" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       ...,\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0],\n", "       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorized_data.toarray()" ] },
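{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the bag-of-words representation above concrete, here is a small illustrative sketch (not part of the original pipeline): `toy_corpus` is a made-up three-message corpus, and the resulting matrix has one row per message, one column per vocabulary word, and word counts as entries." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch on made-up data: CountVectorizer builds the vocabulary\n", "# with fit_transform and returns a document-term count matrix.\n", "toy_corpus = [\n", "    \"free entry to win a prize\",\n", "    \"are you free for lunch today\",\n", "    \"win a free prize now\",\n", "]\n", "toy_vectorizer = CountVectorizer()\n", "toy_counts = toy_vectorizer.fit_transform(toy_corpus)\n", "print(toy_vectorizer.get_feature_names_out())\n", "print(toy_counts.toarray())" ] },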
"stream", "text": [ "spam 44 7732584351, Do you want a New Nokia 3510i colour phone DeliveredTomorrow? With 300 free minutes to any mobile + 100 free texts + Free Camcorder reply or call 08000930705.\n" ] } ], "source": [ "print(*dataset.iloc[caso])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Podemos obtener la probabilidad de que haya sido spam o ham:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[4.31330826e-18, 1.00000000e+00]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(clf.predict_log_proba(vectorized_data[caso]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso, la probabilidad de spam es mayor que la de ham. Veamos lo mismo pero para un caso que no era spam:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham Ugh my leg hurts. Musta overdid it on mon.\n", "[[9.99957311e-01 4.26892776e-05]]\n" ] } ], "source": [ "print(*dataset.iloc[5199])\n", "print(np.exp(clf.predict_log_proba(vectorized_data[5199])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso la probabilidad de ham es mayor que la de spam, por eso lo clasifica como ham.\n", "\n", "Veamos algunos casos del conjunto de prueba:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2095 True\n", "5343 True\n", "564 True\n", "3849 True\n", "3317 True\n", "5277 True\n", "1674 True\n", "3753 True\n", "5507 True\n", "265 True\n", "4413 True\n", "5111 True\n", "4896 True\n", "3161 True\n", "3743 True\n", "2887 True\n", "869 False\n", "4061 True\n", "1072 True\n", "4559 True\n", "Name: target, dtype: bool" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(X_test)[:20] == y_test[:20]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "El caso estudiado es:\n", "spam Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r \n", "\n", "Clasificó como ham y era spam. La probabilidad de ser ham fue 1.0000 y la de spam 0.0000\n", "Clasificó incorrectamente...\n" ] } ], "source": [ "caso = 869\n", "print(f\"El caso estudiado es:\")\n", "print(*dataset.iloc[caso],\"\\n\")\n", "print(f\"Clasificó como {clf.predict(vectorized_data)[caso]} y era {dataset.target[caso]}. 
], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 ('diplodatos')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "e6b65fc4380ac725e50a330b268a227bbdbe91bddfffbf68e5f7ce9848a2b8d5" } } }, "nbformat": 4, "nbformat_minor": 4 }