{ "cells": [ { "cell_type": "markdown", "id": "1bd3a4e9-5e71-4d5f-8582-c2141e186370", "metadata": { "id": "1bd3a4e9-5e71-4d5f-8582-c2141e186370" }, "source": [ "# Multinomial Naïve Bayes" ] }, { "cell_type": "markdown", "id": "08fcb7b3-fc94-4ce9-b058-0b26a3b93f7a", "metadata": { "id": "08fcb7b3-fc94-4ce9-b058-0b26a3b93f7a" }, "source": [ "This Notebook will have you working and experimenting with the Multinomial Naïve Bayes classifier. First, you will transform the data given in the csv file into a count matrix, then calculate the priors. Use those priors to compute likelihoods according to Multinomial Naive Bayes and then classify the test data. Please note that the use of `sklearn` implementations is allowed only for the final question of the assignment; for other doubts regarding libraries you can reach out to the TAs.\n", "\n", "The dataset is about `Spam SMS`. There is one attribute, the `message`, and a class label which can be `spam` or `ham`. The data is present in `spam.csv` and contains 5,572 samples.\n", "For your convenience the data is already pre-processed and loaded, but I suggest you take a look at the code for your own knowledge; the vectorization part is left up to you and can be done easily." ] }, { "cell_type": "code", "execution_count": 1, "id": "aa8a96d4-58f3-4360-b6ae-a02405ffdddb", "metadata": { "id": "aa8a96d4-58f3-4360-b6ae-a02405ffdddb" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "630170e6-57f8-4dae-b644-ab275a3f53a2", "metadata": { "id": "630170e6-57f8-4dae-b644-ab275a3f53a2", "tags": [] }, "source": [ "## Reading text-based data using pandas" ] }, { "cell_type": "code", "execution_count": 2, "id": "2430d4f2-e3ed-4e09-9e50-a446889d58cf", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "2430d4f2-e3ed-4e09-9e50-a446889d58cf", "outputId": "4332c5cc-32fe-4efd-da0a-d8894b0df133" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " label message\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read file into pandas using a relative path\n", "\n", "df = pd.read_csv(\"./spam.csv\", encoding='latin-1')\n", "df.dropna(how=\"any\", inplace=True, axis=1)\n", "df.columns = ['label', 'message']\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "c5fbeec1-2b88-4af3-b8c6-906936938951", "metadata": { "id": "c5fbeec1-2b88-4af3-b8c6-906936938951" }, "source": [ "## Pre-processing\n", "\n", "- Our main issue with our data is that it is all in text format (strings). The classification algorithms that we usally use need some sort of numerical feature vector in order to perform the classification task. There are actually many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach, where each unique word in a text will be represented by one number.\n", "\n", "- As a first step, let's write a function that will split a message into its individual words and return a list. We'll also remove very common words, ('the', 'a', etc..). To do this we will take advantage of the NLTK library. It's pretty much the standard library in Python for processing text and has a lot of useful features. We'll only use some of the basic ones here." ] }, { "cell_type": "code", "execution_count": 3, "id": "d736a35c-e849-4029-bee7-42a31f7d0b11", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d736a35c-e849-4029-bee7-42a31f7d0b11", "outputId": "5326f654-9e60-4951-8477-2b54dfb0b7c8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n" ] } ], "source": [ "import string\n", "import nltk\n", "nltk.download('stopwords')\n", "\n", "from nltk.corpus import stopwords\n", "\n", "def text_process(mess):\n", " \"\"\"\n", " Takes in a string of text, then performs the following:\n", " 1. Remove all punctuation\n", " 2. Remove all stopwords\n", " 3. Returns a list of the cleaned text\n", " \"\"\"\n", " STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']\n", " # Check characters to see if they are in punctuation\n", " nopunc = [char for char in mess if char not in string.punctuation]\n", "\n", " # Join the characters again to form the string.\n", " nopunc = ''.join(nopunc)\n", " \n", " # Now just remove any stopwords\n", " return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])" ] }, { "cell_type": "code", "execution_count": 4, "id": "ed807b40-7762-4968-b546-2f72817beed3", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ed807b40-7762-4968-b546-2f72817beed3", "outputId": "02ee09cc-8762-4677-ad87-50465fe8202a" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " label message\n", "0 ham Go jurong point crazy Available bugis n great ...\n", "1 ham Ok lar Joking wif oni\n", "2 spam Free entry wkly comp win FA Cup final tkts 21s...\n", "3 ham dun say early hor c already say\n", "4 ham Nah think goes usf lives around though" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['message'] = df.message.apply(text_process)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "ad88d89f-1452-4e7a-9f9c-cb1094f890d1", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ad88d89f-1452-4e7a-9f9c-cb1094f890d1", "outputId": "31462e8c-82e6-46ab-e7fe-1b3b469f56d6" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " label message\n", "0 0 Go jurong point crazy Available bugis n great ...\n", "1 0 Ok lar Joking wif oni\n", "2 1 Free entry wkly comp win FA Cup final tkts 21s...\n", "3 0 dun say early hor c already say\n", "4 0 Nah think goes usf lives around though" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'] = df.label.map({'ham':0, 'spam':1})\n", "df.head()" ] }, { "cell_type": "markdown", "id": "f22eddba-546a-4e97-b013-2906e90946b7", "metadata": { "id": "f22eddba-546a-4e97-b013-2906e90946b7" }, "source": [ "## Splitting the data" ] }, { "cell_type": "code", "execution_count": 6, "id": "f859993a-2b29-4baf-a168-117151ed240b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f859993a-2b29-4baf-a168-117151ed240b", "outputId": "1c43a691-0556-4326-fe09-172c05576f96" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X: (5572,)\n", "y: (5572,)\n", "\n", "X_train: (4179,)\n", "y_train: (4179,)\n", "\n", "X_test: (1393,)\n", "y_test: (1393,)\n", "\n" ] } ], "source": [ "# split X and y into training and testing sets \n", "from sklearn.model_selection import train_test_split\n", "\n", "X = df.message\n", "y = df.label\n", "\n", "print(f'X: {X.shape}')\n", "print(f'y: {y.shape}')\n", "print()\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)\n", "\n", "print(f'X_train: {X_train.shape}')\n", "print(f'y_train: {y_train.shape}')\n", "print()\n", "\n", "print(f'X_test: {X_test.shape}')\n", "print(f'y_test: {y_test.shape}')\n", "print()" ] }, { "cell_type": "markdown", "id": "ba6deffc-0424-4cb6-85d2-eb74f97e8cdd", "metadata": { "id": "ba6deffc-0424-4cb6-85d2-eb74f97e8cdd", "tags": [] }, "source": [ "## Helper code / Example code for Representing text as Numerical data using Sci-kit learn\n", "\n", "📌 From the scikit-learn documentation:\n", "- Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.\n", "- We will use CountVectorizer to \"convert text into a matrix of token counts\":" ] }, { "cell_type": "code", "execution_count": 7, "id": "1309604c-b9c1-4910-ab60-432f7e12824b", "metadata": { "id": "1309604c-b9c1-4910-ab60-432f7e12824b" }, "outputs": [], "source": [ "# example text for model training (SMS messages)\n", "simple_train = ['call you tonight', 'Call me a cab', 'Please call me... 
PLEASE!']" ] }, { "cell_type": "code", "execution_count": 8, "id": "0fde24ff-b303-4526-899d-fb902eabc1d8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0fde24ff-b303-4526-899d-fb902eabc1d8", "outputId": "a9ec619e-c16e-4688-f1ed-d85dd17acc4e" }, "outputs": [ { "data": { "text/plain": [ "array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import and instantiate CountVectorizer (with the default parameters)\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vect = CountVectorizer()\n", "simple_train = vect.fit_transform(simple_train)\n", "\n", "vect.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 9, "id": "576f6e05-6e94-45a1-96cc-3e694cc971d6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "576f6e05-6e94-45a1-96cc-3e694cc971d6", "outputId": "d566e0ad-2994-4004-a515-41edb5de0ae2" }, "outputs": [ { "data": { "text/plain": [ "array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vect.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 10, "id": "0aef831c-50f0-4723-a04a-27af774f0542", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0aef831c-50f0-4723-a04a-27af774f0542", "outputId": "a6e5da20-9914-4ee4-aa64-04bd3ea6dc36" }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 0, 0, 1, 1],\n", " [1, 1, 1, 0, 0, 0],\n", " [0, 1, 1, 2, 0, 0]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert sparse matrix to a dense matrix\n", "simple_train.toarray()" ] }, { "cell_type": "markdown", "id": "aa523614-ff0d-4f22-b8f2-4505e33a2460", "metadata": { "id": "aa523614-ff0d-4f22-b8f2-4505e33a2460" }, "source": [ "In this scheme, features and samples are defined as follows:\n", "\n", "- Each individual token occurrence frequency (normalized or not) is treated as a feature.\n", "- The vector of all the token frequencies for a given document is considered a multivariate sample.\n", "\n", "A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus." ] }, { "cell_type": "code", "execution_count": 11, "id": "5d0a1561-6e5c-495e-b056-fc9fe376121d", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "5d0a1561-6e5c-495e-b056-fc9fe376121d", "outputId": "2321fa9f-01c8-4d33-e5c6-658fac8f2a9a" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " cab call me please tonight you\n", "0 0 1 0 0 1 1\n", "1 1 1 1 0 0 0\n", "2 0 1 1 2 0 0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the vocabulary and document-term matrix together\n", "pd.DataFrame(simple_train.toarray(), columns=vect.get_feature_names_out())" ] }, { "cell_type": "markdown", "id": "02042133-9bad-4aa3-b0be-a53158e7102e", "metadata": { "id": "02042133-9bad-4aa3-b0be-a53158e7102e" }, "source": [ "### Transform Testing data into a document-term matrix (using existing / training vocabulary)\n", "\n", "- You are supposed to use the training vocabolary to make the count matrix for test data" ] }, { "cell_type": "code", "execution_count": 12, "id": "1a50c164-91ac-4861-993d-e14a2be7c6aa", "metadata": { "id": "1a50c164-91ac-4861-993d-e14a2be7c6aa" }, "outputs": [], "source": [ "simple_test = [\"please don't call me\"]" ] }, { "cell_type": "code", "execution_count": 13, "id": "30e0b5eb-d7ab-468a-86e2-89dcd0fa0184", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "30e0b5eb-d7ab-468a-86e2-89dcd0fa0184", "outputId": "0efbd037-8881-4783-fb53-b99441c6444c" }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 1, 1, 0, 0]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simple_test_dtm = vect.transform(simple_test)\n", "simple_test_dtm.toarray()" ] }, { "cell_type": "code", "execution_count": 14, "id": "9df0574c-6088-4418-92b4-d6d69c9d9d41", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "id": "9df0574c-6088-4418-92b4-d6d69c9d9d41", "outputId": "250a73d8-a182-4b3d-8833-d12e8a835981" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " cab call me please tonight you\n", "0 0 1 1 1 0 0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the vocabulary and document-term matrix together\n", "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())" ] }, { "cell_type": "markdown", "id": "ec45268a-bc57-4a34-b496-b30751119403", "metadata": { "id": "ec45268a-bc57-4a34-b496-b30751119403" }, "source": [ "## Multinomial Naive Bayes Implementation\n", "\n", "- In this task you will implement Multinomial Naive Bayes from scratch; use numpy to vectorize your code and matplotlib to show your analysis.\n", "- Below, some information from the scikit-learn documentation about Multinomial Naive Bayes is given; it should give you some idea about using *Smoothing Priors*.\n", "- There is a sub-question for experimenting with $\\alpha > 0$; you don't have to implement it separately, try to incorporate it in the same Model Class / Function." ] }, { "cell_type": "markdown", "id": "d3c9e325-5e71-4810-83c9-a2e3b5a547b2", "metadata": { "id": "d3c9e325-5e71-4810-83c9-a2e3b5a547b2" }, "source": [ "📌 From the scikit-learn documentation:\n", "\n", "- Multinomial Naive Bayes implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).\n", "\n", "- The distribution $\\theta_y = (\\theta_{y1}, \\theta_{y2}, \\dots, \\theta_{yn})$ is parametrized by vectors for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\\theta_{yi}$ is the probability $P(x_i|y)$ of feature $i$ appearing in a sample belonging to class $y$.\n", "\n", "- The parameters $\\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:\n", "\n", "$$\n", "\\hat{\\theta}_{yi} = \\frac{N_{yi} + \\alpha}{N_{y} + \\alpha n}\n", "$$\n", "\n", " where $N_{yi} = \\sum_{x \\in T}{x_i}$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \\sum^{n}_{i=1}{N_{yi}}$ is the total count of all features for class $y$.\n", "\n", "- The smoothing prior $\\alpha \\gt 0$ accounts for features not present in the learning samples and **prevents zero probabilities** in further computations. 
Setting $\\alpha = 1$ is called Laplace smoothing, while $\\alpha \\lt 1$ is called Lidstone smoothing.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "14862c11-1b3c-4314-a64e-f98a185ce394", "metadata": { "id": "14862c11-1b3c-4314-a64e-f98a185ce394" }, "outputs": [], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics import accuracy_score\n" ] }, { "cell_type": "markdown", "id": "0dbcd1df-1c3d-477d-b0f6-c8c17868c797", "metadata": { "id": "0dbcd1df-1c3d-477d-b0f6-c8c17868c797" }, "source": [ "## Vectorizing Training Sample\n", "\n", "- Use the Helper code above to vectorize for training samples\n", "- Don't overthink it, its very easy to do" ] }, { "cell_type": "code", "execution_count": 16, "id": "c2561b73-fc1b-47aa-8ae9-6d1ba8ad3015", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c2561b73-fc1b-47aa-8ae9-6d1ba8ad3015", "outputId": "09c73440-8f6b-4c60-90a0-f29dada93576" }, "outputs": [ { "data": { "text/plain": [ "array(['008704050406', '0121', '01223585236', ..., 'ûïharry', 'ûò',\n", " 'ûówell'], dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "\n", "vect = CountVectorizer()\n", "X_train = vect.fit_transform(X_train)\n", "\n", "vect.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 17, "id": "4xY4phXLcCq1", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4xY4phXLcCq1", "outputId": "0b670820-35ff-4f9d-f60b-5ba529a5d70d" }, "outputs": [ { "data": { "text/plain": [ "(4179, 7996)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train = X_train.toarray()\n", "X_train.shape" ] }, { "cell_type": "code", "execution_count": 18, "id": "V3-pDtuhcCtZ", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 488 }, "id": "V3-pDtuhcCtZ", "outputId": "a4a03d05-0073-4e26-bec5-c63a16c56f8f" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " 008704050406 0121 01223585236 01223585334 0125698789 020603 \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... \n", "4174 0 0 0 0 0 0 \n", "4175 0 0 0 0 0 0 \n", "4176 0 0 0 0 0 0 \n", "4177 0 0 0 0 0 0 \n", "4178 0 0 0 0 0 0 \n", "\n", " 02070836089 02072069400 02073162414 02085076972 ... åòits \\\n", "0 0 0 0 0 ... 0 \n", "1 0 0 0 0 ... 0 \n", "2 0 0 0 0 ... 0 \n", "3 0 0 0 0 ... 0 \n", "4 0 0 0 0 ... 0 \n", "... ... ... ... ... ... ... \n", "4174 0 0 0 0 ... 0 \n", "4175 0 0 0 0 ... 0 \n", "4176 0 0 0 0 ... 0 \n", "4177 0 0 0 0 ... 0 \n", "4178 0 0 0 0 ... 0 \n", "\n", " åômorrow åôrents ìll ìï ìïll ûªve ûïharry ûò ûówell \n", "0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 \n", "... ... ... ... .. ... ... ... .. ... \n", "4174 0 0 0 0 0 0 0 0 0 \n", "4175 0 0 0 0 0 0 0 0 0 \n", "4176 0 0 0 0 0 0 0 0 0 \n", "4177 0 0 0 0 0 0 0 0 0 \n", "4178 0 0 0 0 0 0 0 0 0 \n", "\n", "[4179 rows x 7996 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(X_train, columns=vect.get_feature_names_out())" ] }, { "cell_type": "markdown", "id": "ec22e087-4e03-4428-8f47-4cb4f2ec1645", "metadata": { "id": "ec22e087-4e03-4428-8f47-4cb4f2ec1645" }, "source": [ "## Calculate Priors and Estimate Model's performance on Training Sample\n", "\n", "- Calculate priors based on Training Sample using your NB implementation\n", "- Evaluate your model's performance on Training Data ($\\alpha = 0$)" ] }, { "cell_type": "code", "execution_count": 19, "id": "1aa85aed-f38c-4acc-9c76-339d5914118d", "metadata": { "id": "1aa85aed-f38c-4acc-9c76-339d5914118d" }, "outputs": [], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "\n", "d=[[0 for i in range(len(vect.vocabulary_))] for j in range(2)]\n", "\n", "feature_names = vect.get_feature_names_out()\n", "\n", "for i in range(len(X_train)):\n", " if y_train.iloc[i]==0:\n", " arr = np.where(X_train[i]!=0)[0]\n", " for j in arr:\n", " d[0][j]+=1\n", " else:\n", " arr = np.where(X_train[i]!=0)[0]\n", " for j in arr:\n", " d[1][j]+=1\n", "\n", "counts={}\n", "counts[0]=len(np.where(y_train==0)[0])\n", "counts[1]=len(np.where(y_train==1)[0])" ] }, { "cell_type": "code", "execution_count": 20, "id": "vS8jxW8IcjUR", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vS8jxW8IcjUR", "outputId": "2330604c-5303-4c85-f03a-61d08b164b3c" }, "outputs": [ { "data": { "text/plain": [ "(4179,)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alpha=0\n", "Y_pred=[]\n", "for i in range(len(X_train)):\n", " arr= np.where(X_train[i]!=0)[0]\n", " prob1 = counts[0]/(counts[0]+counts[1])\n", " prob2 = counts[1]/(counts[0]+counts[1])\n", " for j in arr:\n", " prob1*= (d[0][j]+alpha)/(counts[0] + alpha*X_train.shape[1])\n", " prob2*= (d[1][j]+alpha)/(counts[1] + alpha*X_train.shape[1])\n", " if prob1>prob2:\n", " Y_pred.append(0)\n", " else:\n", " Y_pred.append(1)\n", " #print(prob1,prob2)\n", "Y_pred = np.asarray(Y_pred)\n", "Y_pred.shape" ] }, { "cell_type": "code", "execution_count": 21, "id": "C35FjHKBcjWo", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "C35FjHKBcjWo", "outputId": "6c20df5e-67bf-4b76-dd7b-693a9c267fb1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on train data with alpha=0: 0.9932998324958124\n" ] } 
], "source": [ "print(\"Accuracy on train data with alpha=0: \"+str(accuracy_score(y_train,Y_pred)))" ] }, { "cell_type": "code", "execution_count": 21, "id": "9eJ9kVDtcjZf", "metadata": { "id": "9eJ9kVDtcjZf" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d8a6ec92-2ba4-40bf-9b5d-01825338c034", "metadata": { "id": "d8a6ec92-2ba4-40bf-9b5d-01825338c034" }, "source": [ "## Vectorizing Test Sample\n", "\n", "- Use the Training Sample vocabulary to create word count matrix for test samples\n", "- This is also shown in the Helper code" ] }, { "cell_type": "code", "execution_count": 22, "id": "a16cc470-b033-444f-a366-5b3213eba877", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a16cc470-b033-444f-a366-5b3213eba877", "outputId": "1bae3b6b-54cb-4b1d-d13f-ebf24f1eb1ea" }, "outputs": [ { "data": { "text/plain": [ "(1393, 7996)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "X_test = vect.transform(X_test)\n", "X_test = X_test.toarray()\n", "X_test.shape\n" ] }, { "cell_type": "markdown", "id": "7c02b84a-3116-4260-846d-50e6849efe48", "metadata": { "id": "7c02b84a-3116-4260-846d-50e6849efe48" }, "source": [ "## Estimate Model's performance on Test Sample\n", "\n", "- Evaluate your model's performance on Test Sample, using the Training Priors ($\\alpha = 0$)" ] }, { "cell_type": "code", "execution_count": 23, "id": "cb763cb8-5a31-4d7d-b538-b741c8cfaae8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cb763cb8-5a31-4d7d-b538-b741c8cfaae8", "outputId": "ea63b445-0b5d-415d-a1cf-3bceb33c4ba0" }, "outputs": [ { "data": { "text/plain": [ "(1393,)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "\n", "alpha=0\n", "Y_pred=[]\n", "for i in range(len(X_test)):\n", " arr= np.where(X_test[i]!=0)[0]\n", " prob1 = counts[0]/(counts[0]+counts[1])\n", " prob2 = counts[1]/(counts[0]+counts[1])\n", " for j in arr:\n", " prob1*= (d[0][j]+alpha)/(counts[0] + alpha*X_test.shape[1])\n", " prob2*= (d[1][j]+alpha)/(counts[1] + alpha*X_test.shape[1])\n", " if prob1>prob2:\n", " Y_pred.append(0)\n", " else:\n", " Y_pred.append(1)\n", " #print(prob1,prob2)\n", "Y_pred = np.asarray(Y_pred)\n", "Y_pred.shape" ] }, { "cell_type": "code", "execution_count": 24, "id": "60UuKKyDeTBj", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "60UuKKyDeTBj", "outputId": "17a9ec76-5fd8-47c2-f7eb-63b2e02eef0a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on test data with alpha=0: 0.9533381191672649\n" ] } ], "source": [ "print(\"Accuracy on test data with alpha=0: \"+str(accuracy_score(y_test,Y_pred)))" ] }, { "cell_type": "markdown", "id": "b415868e-4b61-4a13-abb5-22124e5312bc", "metadata": { "id": "b415868e-4b61-4a13-abb5-22124e5312bc" }, "source": [ "## Select Smoothing Priors\n", "\n", "- Refactor your code to incorporate smoothing priors, select $\\alpha = 0$ for the previous estimates / sub-questions\n", "- Compare the performance with different values of $\\alpha \\gt 0$ as smoothing priors to take care of zero probabilities\n", "- You can display a Plot or Table to show the comparison." 
] }, { "cell_type": "code", "execution_count": 25, "id": "a707cc0d-1f16-40a5-93ec-0366165b077c", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 453 }, "id": "a707cc0d-1f16-40a5-93ec-0366165b077c", "outputId": "ef8002dc-f000-4db6-c05a-9241412e0e89" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on test data with alpha = 0.1 : 0.9827709978463748\n", "Accuracy on test data with alpha = 0.2 : 0.9849246231155779\n", "Accuracy on test data with alpha = 0.30000000000000004 : 0.9827709978463748\n", "Accuracy on test data with alpha = 0.4 : 0.9820531227566404\n", "Accuracy on test data with alpha = 0.5 : 0.9827709978463748\n", "Accuracy on test data with alpha = 0.6000000000000001 : 0.9827709978463748\n", "Accuracy on test data with alpha = 0.7000000000000001 : 0.9820531227566404\n", "Accuracy on test data with alpha = 0.8 : 0.9813352476669059\n", "Accuracy on test data with alpha = 0.9 : 0.9798994974874372\n", "Best accuracy on test is: 0.9849246231155779 for alpha=: 0.2\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEGCAYAAABy53LJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAAsTAAALEwEAmpwYAAAs4UlEQVR4nO3deZgU1fX/8fdhd0TcwIWwjIobEgQdUUED7hiNCy5RRwWjEEXcwHzVH0k0GKIhIorigooBJSJiVIhBoghxA3WQTYIoGkQQlbgjKtv5/XFrpB1mmO6hu6t7+vN6nn6murqq63Q39Om6t+655u6IiIgkq07cAYiISH5R4hARkZQocYiISEqUOEREJCVKHCIikpJ6cQeQDU2bNvXi4uK4wxARySuzZs36n7s3q7i+IBJHcXExZWVlcYchIpJXzOz9ytarqUpERFKixCEiIinJaOIws+5mtsjMFpvZtZU83trMpprZPDObbmYtEh4bYmYLzGyhmQ03M4vWT4+ec0502ymTr0FERH4sY4nDzOoCI4DjgbbA2WbWtsJmtwBj3L09MAi4Kdq3M9AFaA+0Aw4CuibsV+ruHaLbJ5l6DSIisqlMnnF0Aha7+3vuvgYYB5xcYZu2wPPR8rSExx1oBDQAGgL1gY8zGKuIiCQpk4njJ8AHCfeXResSzQV6RMunAtuY2Y7uPoOQSFZEtynuvjBhvwejZqrflTdhVWRmfcyszMzKVq5cmY7XkxPGjoXiYqhTJ/wdOzbuiESk0MTdOX410NXMZhOaopYD682sDbAv0IKQbI40s8OjfUrd/afA4dHtvMqe2N1HunuJu5c0a7bJZch5aexY6NMH3n8f3MPfPn2UPEQkuzKZOJYDLRPut4jW/cDdP3T3Hu7eERgYrfuCcPYx091XufsqYDJwaPT48ujv18DfCE1iBWHgQFi9+sfrVq8O60VEsiWTieN1YE8z283MGgBnARMTNzCzpmZWHsN1wKhoeSnhTKSemdUnnI0sjO43jfatD5wIvJnB15BTli5Nbb2ISCZkLHG4+zqgHzAFWAiMd/cFZjbIzE6KNusGLDKzt4GdgcHR+gnAu8B8Qj/IXHefROgon2Jm84A5hDOY+zL1GnJNq1aprRcRyQQrhBkAS0pKvDaUHHnoIejZM/RvlCsqgpEjobQ0vrhEpHYys1nuXlJxfdyd45KCoqKQNJo23bjuqquUNEQku5Q48sjQobD77vDRR6FTvFkzmDs37qhEpNAoceSJGTPC7coroW5d2Gor6NsX/vEPWLQo7uhEpJAoceSJoUNhu+3gggs2ruvbFxo2hGHDYgtLRAqQEkceeO89eOIJuPhiaNx44/qddoLzz4fRo6EWDY4XkRynxJEHbr89NE9ddtmmj/XvD999B3ffnf24RKQwKXHkuM8/hwcegLPPhubNN318n33ghBNgxIiQQEREMk2JI8eNHAnffBPOLKoyYAB88gk8/HD24hKRwqXEkcPWrIHhw+Hoo2H//averls36NgRbr0VNmzIWngiUqCUOHLYo4/Chx+GM4rNMQvbLFwIzzyTndhEpHCp5EiOcg9nEevWwfz5ITlsztq1sNtusPfeMHVqdmIUkdpNJUfyzPPPh1Hh/ftXnzQA6teHyy8P+82Zk/HwRKSAKXHkqKFDwziNc85Jfp8+fcI4j6FDMxeXiIgSRw76z39g8mTo1w8aNUp+v+22gwsvhHHjYNmyjIUnIgVOiSMH3XprqEV1ySWp73vFFeHKqjvuSH9cIiKgxJFzPv5447wbieXTk7XbbnDaaXDvvfD11+mPT0REiSPHjBgRrpC66qqaP8eAAfDllzBqVPXbioikSokjh6xeDXfdBb/4Bey1V82f5+CDoUsXuO22cDmviEg6KXHkkDFj4NNPqx/wl4wBA2DJklBVV0QknTQAMEds2AD77gtNmsBrryU3dmNz1q8PgwGbNg0TQG3p84lI4dEAwBz3j3/A22+HM4V0fMnXrRv6SV59FV55ZcufT0SknBJHjhg6FFq1gtNPT99z9uoF22+vAYEikl5KHDmgrAxeeCGMwahXL33Pu/XWYSzIk0/C4sXpe14RKWxKHDlg6NDQt3HRRel/7n79Qh2r225L/3OLSGFS4ojZ0qXw2GPQu3dIHum2666h3tWDD8Jnn6X/+UWk8ChxxOz228Pfyy/P3DH69w9jRO65J3PHEJHCocQRoy+/hPvugzPPDB3jmfLTn8Kxx4b6Vd9/n7njiEhhUOKI0f33h3pS6RjwV50BA+Cjj+CRRzJ/LBGp3TQAMCZr18Iee8Duu8P06Zk/nvvGecvnztWAQBGpngYA5pgJE+CDD7JztgEhUfTvH6ahffbZ7BxTRGqnjCYOM+tuZovMbLGZXVvJ463NbKqZzTOz6WbWIuGxIWa2wMwWmtlwsx//RjaziWb2ZibjzxT3cAnu3nvDCSdk77hnnw277BLm+xARqamMJQ4zqwuMAI4H2g
Jnm1nbCpvdAoxx9/bAIOCmaN/OQBegPdAOOAjomvDcPYBVmYo90154AWbNCiVB6mTxnK9hQ7jsMpgyBd7My5QrIrkgk19bnYDF7v6eu68BxgEnV9imLfB8tDwt4XEHGgENgIZAfeBjADNrDPQH/pjB2DNq6NBQfPD887N/7IsvhqIinXWISM1lMnH8BPgg4f6yaF2iuUCPaPlUYBsz29HdZxASyYroNsXdF0bb3QgMBVZnKvBMWrQIJk2Cvn3D9LDZtsMOcMEFMHZsuMpKRCRVcXeOXw10NbPZhKao5cB6M2sD7Au0ICSbI83scDPrAOzh7tXOMmFmfcyszMzKVq5cmblXkKJhw0KT0aWXxhfDlVeGq7ruvDO+GEQkf2UycSwHWibcbxGt+4G7f+juPdy9IzAwWvcF4exjpruvcvdVwGTg0OhWYmZLgJeAvcxsemUHd/eR7l7i7iXNmjVL6wurqZUrYfRoOO882Gmn+OJo0wZOPhnuvhu++Sa+OEQkP2UycbwO7Glmu5lZA+AsYGLiBmbW1MzKY7gOKJ8leynhTKSemdUnnI0sdPe73b25uxcDhwFvu3u3DL6GtLr7bvjuu3BZbNwGDAi1q0aPjjsSEck3GUsc7r4O6AdMARYC4919gZkNMrOTos26AYvM7G1gZ2BwtH4C8C4wn9APMtfdJ2Uq1mz47jsYMQJ+/vMw01/cunSBTp1C09n69XFHIyL5RCPHs+T++0MF3KlT4cgjYw3lB+PHwy9/GeYlP+WUuKMRkVxT1chxJY4s2LAB2rULneJvvJE75T7WrQv9HS1bwosvxh2NiOQalRyJ0TPPwMKFoW8jV5IGhNkGr7wSXnoJXnst7mhEJF8ocWTB0KHQvHloFso1F14I226reclFJHlKHBk2Zw48/3yYqKlBg7ij2dQ220CfPqHo4pIlcUcjIvlAiSPDhg6FrbcOX8656vLLQ82s8tkIRUQ2R4kjg5Ytg3HjQnPQ9tvHHU3VWrQIzWj33w9ffBF3NCKS65Q4MuiOO8IVVVdeGXck1RswAFatClPZiohsjhJHhnz9Ndx7L/ToAbvtFnc01evYEY44AoYPD3WsRESqosSRIaNGwZdfZm+Gv3QYMCA0r40fH3ckIpLLNAAwA9atgz33DJfgvvxy1g67xTZsgP32C+XeZ83KrTEnIpJ9GgCYRU88ES5tzaezDQhXVvXvD7Nnw/TpcUcjIrlKiSPNyucT32OPULo835x3HjRrpgGBIlI1JY40e+UVePXVcCVV3bpxR5O6Ro3CJFNPPx3KpIiIVKTEkWa33hrGbFxwQdyR1FzfviGBDBsWdyQikouUONLo3XdD/8bFF4fR4vmqWTM4/3wYMwY++STuaEQk1yhxpNFtt4WKs/36xR3JlrvqKvj+e7jrrrgjEZFco8SRJp99FsZunHNOuAw33+2zD5x4Ypi18Ntv445GRHKJEkea3HsvrF6dG/OJp8uAAfC//8FDD8UdiYjkEiWONFizJtSlOuYYaN8+7mjSp2tXOOCA0OG/YUPc0YhIrlDiSINHHoEVK/JvwF91zMJrWrQIJk+OOxoRyRUqObKF3GH//cPfefNqX5mOtWth991DCZXnn487GhHJJpUcyZDnnoP583NvPvF0qV8/TPQ0bVooRSIiosSxhYYOhV12CVdT1Va9e0PjxipDIiKBEscWePNNmDIljNto2DDuaDJnu+3goovg0UdD2XURKWxKHFvg1ltDCfKLL447ksy74opwZdXw4XFHIiJxU+KooY8+grFjQ02qHXeMO5rMKy6G00+HkSPD7IYiUriUOGrozjvDFUdXXRV3JNkzYECY1fCBB+KORETipMRRA998A3ffHebbaNMm7miyp1MnOOywUJNr3bq4oxGRuChx1MDo0aE2VW0qL5KsAQPg/ffh73+POxIRiYsGAKZo/fpQAHD77cOETbVx7MbmlL/+HXaAmTML7/WLFBINAEyTSZNg8eLwy7sQvzTr1g39Oq+9Bi+/HHc0IhKHjCYOM+tuZovMbLGZXVvJ463NbKqZzTOz6WbWIuGxIWa2wMwWmtlws/A1bWbPmNnc6LF7zCyrE7QOHQqtW8Npp2XzqLmlV69wxqEBgSKFKWOJI/pCHwEcD7QFzjazthU2uwUY4+7tgUHATdG+nYEuQHugHXAQ0DXa50x33z9a3ww4I1OvoaLXXoOXXgpjGurVy9ZRc09REVxyCTz1FLzzTtzRiEi2ZfKMoxOw2N3fc/c1wDjg5ArbtAXKS+dNS3jcgUZAA6AhUB/4GMDdv4q2qRc9nrVOmqFDoUkTuPDCbB0xd/XrF+pY3XZb3JGISLZlMnH8BPgg4f6yaF2iuUCPaPlUYBsz29HdZxASyYroNsXdF5bvZGZTgE+Ar4EJlR3czPqYWZmZla1cuXKLX8ySJTBhAvTpE5JHodtlFygthQcfhE8/jTsaEcmmuDvHrwa6mtlsQlPUcmC9mbUB9gVaEJLNkWZ2ePlO7n4csCvhbOTIyp7Y3Ue6e4m7lzRr1myLA739dqhTJ1SKlaB//zCt7D33xB2JiGRTJhPHcqBlwv0W0bofuPuH7t7D3TsCA6N1XxDOPma6+yp3XwVMBg6tsO93wFNs2vyVdl98AfffD2eeCS1bVrt5wWjXDo47Loyi//77uKMRkWypNnGY2S/MrCYJ5nVgTzPbzcwaAGcBEys8d9OE574OGBUtLyWcidQzs/qEs5GFZtbYzHaN9q0HnAC8VYPYUnLffbBqVe2b4S8dBgwIdbv+9re4IxGRbEkmIfwSeCe6PHafZJ/Y3dcB/YApwEJgvLsvMLNBZnZStFk3YJGZvQ3sDAyO1k8A3gXmE/pB5rr7JGBrYKKZzQPmEPo5MtpQsnZtqAjbrVuYf1t+7Oijwzzrt94aZkEUkdovqZHjZtYEOBu4gHAV04PAI+6eF3VSt2Tk+NixcO65YeDfiSemObBaYvToMLbjmWdC05WI1A5bNHI8ugR2AuGS2l0JfRBvmNllaY0yh4wdGwb6nXtuGLPxxRdxR5S7zj4btt0WTjklXEBQXBzeP6na2LHhfcq19ytX45Ic4+6bvQEnAU8Qmo1+A+wUrS8CllS3fy7cDjzwQE/Fww+7FxW5h8aXcCsqCutlUw8/7F6/vt6vZOXqv69cjUviA5R5Jd+p1TZVmdlo4AF3f6GSx45y96npTWXpl2pTVXFxqABbUevWYTyH/Jjer9RU9X41aAD775/1cH4wdy6sWbPpen2OhauqpqpkCmfcQBiEV/5EWwE7u/uSfEgaNbF0aWrrC53er9RU9b6sWQNNm2Y3lorHr4w+R6komcTxGNA54f76aN1BGYkoB7RqVfkvwlatsh9LPqjq/dphh9DgUYhVhKvy1FPh/ajsRL91a/jnP7MfU7mqzoSaNAnl9OtmtZyo5LJkOsfreag1BUC03CBzIcVv8OBQyC9RUVFYL5uq7P2qUyeUIjnvPM1RDmGA5JVXhgsIWrWCRo1+/Hgu/Puq7HOsWzdMF3zccWG8jggklzhWJoy7wMxOBv6XuZDiV1oKI0eGX4Bm4
e/IkWG9bKqy9+uvf4Ubb4RHHoEDD4Q5c+KOMj7vvgtduoSyNVdcAW+9FSoR5Nq/r8o+x9Gjwxzzr7wS+l+eey7eGCU3JNM5vgcwFmgOGKFw4fnuvjjz4aVHOmcAlNS88EK4XPfTT8MgwUsuKaymq/Hj4aKLwi/3Bx8MZxz5aMGCUHJn4UL4f/8PbrihsKcWKBQ1Hsfh7u+6+yGEEuj7unvnfEoaEq+f/SycbRx5JFx6KZxxRmGMifn2W7j4YvjlL0NNrzlz8jdpAOy3H7z+OvzqV6FJ64gjYNmyuKOSuCQ1ANDMTgD6Av3N7Pdm9vvMhiW1SbNm8I9/wF/+EjqHO3YMk2LVVm+9BQcfDPfeC9dcA//+d2j2yXdFRaGJ7eGHQyLs0AGefjruqCQOyRQ5vIdQr+oyQlPVGUAt+G8g2VSnDlx9Nbz4Yrii6LDDYNiw2lffasyY0KezYgVMngw33xwmvKpNSkth1ixo0SKU4bn66qov5ZXaKZkzjs7ufj7wubv/gVDefK/MhiW11SGHwOzZ4Qunf3846aTaMRHUqlWhXlfPnnDQQWEwXffucUeVOXvtBTNnQt++YWbMww+H//437qgkW5JJHN9Ff1ebWXNgLaFelUiNbL89PP443HEH/OtfocnjpZfijqrm5s0LyWLMGLj+epg6FZo3jzuqzGvUCEaMgMceC81zHTvC3/8ed1SSDckkjklmth3wF+ANYAmg2Rdki5iFectnzAhfQN26wU03wYYNcUeWPPdw+erBB4cO/+eeC1cbFdpAudNPD2eRe+0Fp50Gl10G331X/X6SvzabOKJJlqa6+xfu/jihb2Mfd1fnuKTFAQeE9vIzzgiXeXbvDh9/HHdU1fvqKzjnHPj1r0MzTfmVY4Vq993DWWP//mFGyM6d4Z134o5KMmWzicPdNwAjEu5/7+5fZjwqKShNmoQZBO+7L3Sed+gQmnty1axZIeE99lg4S3rmGdh557ijil+DBqG/Y9KkULrkgAM0M2RtlUxT1VQzO82skIZtSbaZhYFyr70W+kCOOQZ+/3tYty7uyDZyD7NBHnpoKCHy73/DtdeGK8ZkoxNP3Hi5bmlp+FxXr447KkmnZP7J/5pQ1PB7M/vKzL42s68yHJcUqJ/+NAw069UrlCw56ihYvjzuqOCzz6BHj1AypHv38MXYpUvcUeWuli1h2jQYOBBGjYJOncLoc6kdkhk5vo2713H3Bu7eJLrfJBvBSWHaeuvwZTNmTGgW6tAh3qqxM2aEK4aefjqMPXnqKdhxx/jiyRf16sEf/whTpsDKleHKs1Gjat/YnUKUzADAn1V2y0ZwUtjOOy8kjubN4YQT4P/+D9auzd7xN2yAIUNC53fduvDyy6HCrRptU3PMMeEM7dBD4cILVTG5Nkimqeo3CbffAZMIkzuJZNzee4eBZpdcEkqW/Oxn2ZmNbuXKkKyuuQZOPTVcbnpQrZ2BJvN23TWM2Rk0KFRMLikp7IrJ+S6ZpqpfJNyOAdoBn2c+NJFgq63grrtCpdn//Cc0Gz3xROaO9+9/h+axadPg7rvDcbfdNnPHKxR168LvfgfPPx9G2h9ySHh/1XSVf2pyPcgyYN90ByJSnTPOCL/827QJHdWXXx6ubkqX9evDL+Ijj4TGjeHVV0OFWzVNpVfXrhvHvfTtG8q1F0LF5NokmT6OO8xseHS7E3iRMIJcJOt23z30NVx1VShZ0rkzLE5Dkf8VK+DYY0PJkHPOCX0r+++/5c8rlSuvmDxkCDz5ZBjz8frrcUclyUrmjKMMmBXdZgDXuPu5GY1KZDMaNAiTQj31VCisd8ABMG5czZ/v2WdD09TMmWGypTFjwhmHZFadOvCb34TJvtavD5c318aKybVRMoljAvCwu49297HATDMrqm4nkUw76aTQ5PHTn4ZZBn/96zCBUrLWrQvjDI47DnbaaeP4ETVNZdehh4bP8YQTalfF5NosqZHjwFYJ97cCNPOw5IRWrWD6dLjuulBwsFOnML1pdT74IBRW/NOfwsjmV1+Ftm0zHa1UZfvtQ2Xd4cNrR8Xk2i6ZxNHI3VeV34mWdcYhOaN+/ZAAnnkmFEgsKYHRo6veftKk8MU0d26opTRyZJjdTuJlFirrvvIKNGyYnxWTC0UyieMbMzug/I6ZHQik0CAgkh3HHReSwcEHhyan888PI5WLi0N7euvWcPzxoSmkdWt4443QxCW55cADw2eTWDH5rrs2fo7FxTB2bNxRFjbzanqizOwgYBzwIWHq2F2AX7r7rMyHlx4lJSVeVlYWdxiSJevXw+DB4Qops007W489FiZODL9qJXe5hznO+/bdtNhlUVE4UywtjSe2QmFms9y9ZJP11SWOaOf6wN7R3UXunlThBzPrDtwO1AXud/ebKzzeGhgFNAM+A85192XRY0OAEwhnRc8CVxD6Vx4D9gDWA5Pc/drq4lDiKEy77FL53B6tW2dn9LmkR/Pm4XLpivQ5Zl5ViSOZcRyXAlu7+5vu/ibQ2Mz6JrFfXcJcHscDbYGzzaxi9+MtwBh3bw8MAm6K9u0MdAHaE0aqHwR0Ld/H3fcBOgJdzOz46mKRwvTJJ5WvX7o0u3HIlvnoo8rX63OMTzJ9HL3d/YvyO+7+OdA7if06AYvd/T13X0No7jq5wjZtgeej5WkJjzvQCGgANATqAx+7+2p3nxbFsYYwELFFErFIAWrVKrX1kpuq+ryaNctuHLJRMomjbuIkTtGZRIMk9vsJ8EHC/WXRukRzgR7R8qnANma2o7vPICSSFdFtirv/6CLLaB70XxAuF96EmfUxszIzK1u5cmUS4UptM3jwpldLFRWF9ZI/KvsczcIZ5TXXZLdisgTJJI5ngEfN7CgzOwp4BJicpuNfDXQ1s9mEpqjlwHoza0Ooh9WCkGyONLPDy3cys3pRHMPd/b3KntjdR7p7ibuXNNNPk4JUWho6UFu3Dl80rVurQzUfVfY5PvBAqCM2ZEiomPz++3FHWViSuaqqDtAHOCpaNQ/Yxd0vrWa/Q4Eb3P246P51AO5+UxXbNwbecvcWZvYbwviRG6PHfg985+5DovujgFXufnkyL1Kd4yK10/jx0Lt3uEz3wQfhlFPijqh2qXHnuLtvAF4FlhD6LY4Ekhiby+vAnma2m5k1AM4CJlYIqmmUmACuI1xhBbCUcCZSL7qiq2v5Mc3sj8C2wJVJxCAitdiZZ4YxH3vsEeZNueKK9FZMlspVmTjMbC8zu97M3gLuIHyZ4+5HuPud1T2xu68D+gFTCF/64919gZkNMrOTos26AYvM7G1gZ6C89XkC8C4wn9APMtfdJ5lZC2AgoVP9DTObY2YXpfyqRaTW2GOPjbMzDh+evorJUrUqm6rMbAOhhPqF7r44Wveeu++exfjSQk1VIoVh4sRQNWDdutAvctZZcUeU32rSVNWDcEXTNDO7L+oYV91QEclZ5RWT27WrWcVkSU6VicPdn3T3s4B9CJfGXgnsZGZ3m9mxWYpPRCQlrVqF6X+vvTa1ismSvGQ6x79x97+5+y8Il8fOBq7JeGQiIjVUv36orDt5cnIV
kyU1Kc057u6fR+Mjjqp+axGReHXvHpquOnUKfR89e8KqVdXtJdVJKXGIiOSb5s3huefghhvgoYfC2ce8eXFHld+UOESk1qtbN5TZnzoVvvoqnIHce6/mN68pJQ4RKRhHHBGarrp1CyVLzjoLvvwy7qjyjxKHiBSUnXaCf/4Tbr4ZHn8cDjgANMwrNUocIlJw6tQJlXVfeCFU1+3cGW6/XU1XyVLiEJGC1blzaLo6/vhQsuSUU+Czz2IOKg8ocYhIQdthB3jySbjttjDuo0MHeOWVmIPKcUocIlLwzEJl3VdeCYMHf/Yz+POfYcOGuCPLTUocIiKRkpJQpr1Hj1Cy5Oc/r3ru+kKmxCEikmDbbeHRR+Gee2D69NB0NX16zEHlGCUOEZEKzEJl3VdfhSZN4Kij4A9/gPXr444sNyhxiIhUYf/9wxiP0tJQsuSYY+DDD+OOKn5KHCIim9G4MYwZA3/9azgD6dABpkyJO6p4KXGIiCShZ89w9rHzzqHq7nXXhYRSXBwGFBYXw9ixcUeZHfXiDkBEJF/suy+89loYLHjzzSFhlF+y+/770KdPWC4tjS3ErNAZh4hICrbaKlTWbdp003Eeq1fDwIHxxJVNShwiIjXw6aeVr1+6NLtxxEGJQ0SkBlq1Sm19baLEISJSA4MHQ1HRj9fVqxfW13ZKHCIiNVBaCiNHQuvWYcDg1lvDunWwxx5xR5Z5ShwiIjVUWgpLloRO8g8/hJYtoVcv+PbbuCPLLCUOEZE0aNIERo2CRYvgt7+NO5rMUuIQEUmTo4+GSy6BYcPgpZfijiZzlDhERNJoyJAwirxXL/jmm7ijyQwlDhGRNGrcGB58EN59N5QlqY2UOERE0qxrV7j8crjjDpg2Le5o0i+jicPMupvZIjNbbGbXVvJ4azObambzzGy6mbVIeGyImS0ws4VmNtzMLFo/2Mw+MLNVmYxdRGRL3HQTtGkDv/oVfP113NGkV8YSh5nVBUYAxwNtgbPNrG2FzW4Bxrh7e2AQcFO0b2egC9AeaAccBHSN9pkEdMpU3CIi6VBUFEqxv/8+/N//xR1NemXyjKMTsNjd33P3NcA44OQK27QFno+WpyU87kAjoAHQEKgPfAzg7jPdfUUG4xYRSYsuXaB//zAN7bPPxh1N+mQycfwE+CDh/rJoXaK5QI9o+VRgGzPb0d1nEBLJiug2xd0XZjBWEZGMuPFG2GcfuPBC+PLLuKNJj7g7x68GuprZbEJT1HJgvZm1AfYFWhCSzZFmdngqT2xmfcyszMzKVq5cme64RUSSstVWoclq+fJw9lEbZDJxLAdaJtxvEa37gbt/6O493L0jMDBa9wXh7GOmu69y91XAZODQVA7u7iPdvcTdS5o1a7YFL0NEZMscfDBcc00YWf7Pf8YdzZbLZOJ4HdjTzHYzswbAWcDExA3MrKmZlcdwHTAqWl5KOBOpZ2b1CWcjaqoSkbx1/fXQrh307g2ffx53NFsmY4nD3dcB/YAphC/98e6+wMwGmdlJ0WbdgEVm9jawM1BekHgC8C4wn9APMtfdJ8EPl+kuA4rMbJmZ3ZCp1yAiki4NG8Lo0fDxx3DFFXFHs2XM3eOOIeNKSkq8rKws7jBERLj+ehg0CJ58Ek6ueJ1pjjGzWe5eUnF93J3jIiIFZeBA6NAB+vSB//0v7mhqRolDRCSLGjQIV1l9/jn06xd3NDWjxCEikmX77w+//z08+ig89ljc0aROiUNEJAbXXgsHHgh9+8Inn8QdTWqUOEREYlCvXrjK6quvwuRP+XSdkhKHiEhM9tsvlCT5+99h3Li4o0meEoeISIwGDIBDDoFLL4UVeVK+VYlDRCRGdeuGq6y+/RZ+/ev8aLJS4hARidnee8Of/gSTJsGYMXFHUz0lDhGRHHDFFXD44eHvsmVxR7N5ShwiIjmgTp1QPXftWrjootxuslLiEBHJEW3awJ//DFOmwAMPxB1N1ZQ4RERySN++cMQRYdKn99+PO5rKKXGIiOSQ8iYr9zDd7IYNcUe0KSUOEZEcU1wMQ4fC1Klw771xR7MpJQ4RkRzUuzcceyz85jfw3ntxR/NjShwiIjnIDO6/PwwQvOCC3GqyUuIQEclRLVvCbbfBCy/AHXfEHc1GShwiIjmsVy844QS47jp4++24owmUOEREcpgZjBwJDRuGJLJ+fdwRKXGIiOS85s1DU9WMGTBsWNzRKHGIiOSF0lI45RT47W9h4cJ4Y1HiEBHJA2Zwzz3QuDH07Anr1sUXixKHiEie2HlnuOsueP11+Mtf4otDiUNEJI+ceSaccQZcfz3Mnx9PDEocIiJ55q67YPvtQ5PV2rXZP74Sh4hInmnaNPR3zJ4dZg7MNiUOEZE8dOqp4UqrP/4xJJBsUuIQEclTw4eHs4+ePeH777N3XCUOEZE8tcMOcN99oZP8xhuzd1wlDhGRPHbiiaEUyc03h8t0syGjicPMupvZIjNbbGbXVvJ4azObambzzGy6mbVIeGyImS0ws4VmNtzMLFp/oJnNj57zh/UiIoVq2DDYZZfQZPXdd5k/XsYSh5nVBUYAxwNtgbPNrG2FzW4Bxrh7e2AQcFO0b2egC9AeaAccBHSN9rkb6A3sGd26Z+o1iIjkg+22gwceCKVIrr8+88fL5BlHJ2Cxu7/n7muAccDJFbZpCzwfLU9LeNyBRkADoCFQH/jYzHYFmrj7THd3YAxwSgZfg4hIXjjuuDBr4C23hGKImZTJxPET4IOE+8uidYnmAj2i5VOBbcxsR3efQUgkK6LbFHdfGO2/rJrnBMDM+phZmZmVrVy5cotfjIhIrhs6NEz+1LMnrF6duePE3Tl+NdDVzGYTmqKWA+vNrA2wL9CCkBiONLPDU3lidx/p7iXuXtKsWbN0xy0iknO22QZGjYJ33gl9HnXqQHExjB2b3uPUS+/T/chyoGXC/RbRuh+4+4dEZxxm1hg4zd2/MLPewEx3XxU9Nhk4FHgoep4qn1NEpJCtWAH16sHXX4f7778PffqE5dLS9Bwjk2ccrwN7mtluZtYAOAuYmLiBmTU1s/IYrgNGRctLCWci9cysPuFsZKG7rwC+MrNDoqupzgeeyuBrEBHJKwMHblpyffXqsD5dMpY43H0d0A+YAiwExrv7AjMbZGYnRZt1AxaZ2dvAzsDgaP0E4F1gPqEfZK67T4oe6wvcDyyOtpmcqdcgIpJvli5NbX1NWLg4qXYrKSnxsrKyuMMQEcm44uLQPFVR69awZElqz2Vms9y9pOL6uDvHRUQkjQYPhqKiH68rKgrr00WJQ0SkFikthZEjwxmGWfg7cmT6OsYhs1dViYhIDEpL05soKtIZh4iIpESJQ0REUqLEISIiKVHiEBGRlChxiIhISgpiAKCZrQQqGRKTlKbA/9IYTroortQortQortTU1rhau/smVWILInFsCTMrq2zkZNwUV2oUV2oUV2oKLS41VYmISEqUOEREJCVKHNUbGXcAVVBcqVFcqVFcqSmouNTHISIiKdEZh4iIpESJQ0REUqLEETGz7ma2yMwWm9m1lTz+MzN7w8zWmdnpORR
XfzP7j5nNM7OpZtY6R+K62Mzmm9kcM3vJzNrmQlwJ251mZm5mWbmEMon3q5eZrYzerzlmdlEuxBVtc2b0b2yBmf0tF+Iys2EJ79XbZvZFjsTVysymmdns6P/kz3MkrtbR98M8M5tuZi226IDuXvA3oC5hGtrdgQaE6WrbVtimGGgPjAFOz6G4jgCKouVLgEdzJK4mCcsnAc/kQlzRdtsALwAzgZJciAvoBdyZjX9XKca1JzAb2D66v1MuxFVh+8uAUbkQF6Ez+pJouS2wJEfiegzoGS0fCTy0JcfUGUfQCVjs7u+5+xpgHHBy4gbuvsTd5wEbciyuae6+Oro7E9iyXxLpi+urhLtbA9m4CqPauCI3An8GvstCTKnElW3JxNUbGOHunwO4+yc5Eleis4FHciQuB5pEy9sCH+ZIXG2B56PlaZU8nhIljuAnwAcJ95dF6+KWalwXApMzGlGQVFxmdqmZvQsMAS7PhbjM7ACgpbs/nYV4ko4rclrUlDDBzFrmSFx7AXuZ2ctmNtPMuudIXEBoggF2Y+OXYtxx3QCca2bLgH8SzoZyIa65QI9o+VRgGzPbsaYHVOKoJczsXKAE+EvcsZRz9xHuvgdwDfDbuOMxszrArcCAuGOpxCSg2N3bA88Co2OOp1w9QnNVN8Iv+/vMbLs4A6rgLGCCu6+PO5DI2cBf3b0F8HPgoejfXdyuBrqa2WygK7AcqPF7lgsvKBcsBxJ/4bWI1sUtqbjM7GhgIHCSu3+fK3ElGAecksmAItXFtQ3QDphuZkuAQ4CJWeggr/b9cvdPEz67+4EDMxxTUnERfr1OdPe17v5f4G1CIok7rnJnkZ1mKkgurguB8QDuPgNoRCg0GGtc7v6hu/dw946E7wrc/YsaHzHTHTf5cCP8qnqPcMpb3rm0XxXb/pXsdY5XGxfQkdAxtmcuvV+J8QC/AMpyIa4K208nO53jybxfuyYsnwrMzJG4ugOjo+WmhCaRHeOOK9puH2AJ0UDmHHm/JgO9ouV9CX0cGY0vybiaAnWi5cHAoC06Zjbe8Hy4EU4r346+hAdG6wYRfsUDHET49fUN8CmwIEfieg74GJgT3SbmSFy3AwuimKZt7gs8m3FV2DYriSPJ9+um6P2aG71f++RIXEZo3vsPMB84Kxfiiu7fANycjXhSeL/aAi9Hn+Mc4Ngciet04J1om/uBhltyPJUcERGRlKiPQ0REUqLEISIiKVHiEBGRlChxiIhISpQ4REQkJUocUhCiiqXHVVh3pZndvZl9pmerem6F415uZgvNbGyS2xeb2Ztbuo1IspQ4pFA8QhhlnCibo45T0Rc4xt1L4w5EpDJKHFIoJgAnmFkDCL/AgebAi2Z2t5mVRfNN/KGync1sVcLy6Wb212i5mZk9bmavR7cu0fquCfNFzDazbSp5zv5m9mZ0uzJadw+hPPZkM7uqwvbFZvaihXlh3jCzzpU8Zy8zeyo6W3rHzK5PeLiumd0Xvc5/mdlW0T69o9jnRq+lKOl3VQpTNkdd6qZbnDfgH8DJ0fK1wC3R8g7R37qE0eTto/vTiUaWA6sSnud0QiE7gL8Bh0XLrYCF0fIkoEu03BioVyGWAwkjsbeOHl8AdIweWwI0rST+IqBRtLwnURkXwlwxb0bLvYAVwI7AVsCbhOKXxcA6oEO03Xjg3Gh5x4Rj/BG4LO7PSrfcvumMQwpJYnNVYjPVmWb2BmHCov0IZSOSdTRwp5nNASYCTcysMaHsxK1mdjmwnbuvq7DfYcAT7v6Nu68C/g4cXs2x6hOq084nTMxTVZzPeiia+G30vIdF6//r7nOi5VmEZALQLjqTmQ+UEt4DkSrVizsAkSx6ChgWzclR5O6zzGw3Qsnpg9z986gJqlEl+ybW5kl8vA5wiLtXnBTqZjN7mlBD6GUzO87d39rC+K8i1CXbPzpuVRNRVawjVH4/sXLyesIZCYTCnae4+1wz60UooS5SJZ1xSMGIftlPA0ax8WyjCaFw5ZdmtjNwfBW7f2xm+0ZzK5yasP5fJEzWY2Ydor97uPt8d/8z8DqhkmuiF4FTzKzIzLaOnvPFal7CtsAKd98AnEdoWqvMMWa2Q9SHcQrh7GdztgFWmFl9whmHyGYpcUiheYTwi/0RAHefS2iieovQX1HVl+y1hD6SVwh9COUuB0qimfv+A1wcrb8y6vSeB6ylwsyM7v4G4Zf+a8CrwP3uPrua2O8CeprZXEIi+qaK7V4DHgfmAY+7e1k1z/u7KIaXCe+DyGapOq5ILRI1NZW4e7+4Y5HaS2ccIiKSEp1xiIhISnTGISIiKVHiEBGRlChxiIhISpQ4REQkJUocIiKSkv8PZwrfXX7CZHcAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "\n", "alphaArr=[]\n", "accArr=[]\n", "maxVal=0\n", "maxAlpha=0\n", "for alpha in range(1,10):\n", " alpha*=0.1\n", " Y_pred=[]\n", " for i in range(len(X_test)):\n", " arr= np.where(X_test[i]!=0)[0]\n", " prob1 = np.log(counts[0]/(counts[0]+counts[1]))\n", " prob2 = np.log(counts[1]/(counts[0]+counts[1]))\n", " for j in arr:\n", " prob1 += np.log((d[0][j]+alpha)/(counts[0] + alpha*X_test.shape[1]))\n", " prob2 += np.log((d[1][j]+alpha)/(counts[1] + alpha*X_test.shape[1]))\n", " if prob1>prob2:\n", " Y_pred.append(0)\n", " else:\n", " Y_pred.append(1)\n", " alphaArr.append(alpha)\n", " accArr.append(accuracy_score(y_test,Y_pred))\n", " print(\"Accuracy on test data with alpha = \"+str(alpha)+\" : \"+str(accuracy_score(y_test,Y_pred)))\n", " if accuracy_score(y_test,Y_pred)>maxVal:\n", " maxVal = accuracy_score(y_test,Y_pred)\n", " maxAlpha=alpha\n", "print(\"Best accuracy on test is: \"+str(maxVal)+\" for alpha=: \"+str(maxAlpha))\n", "\n", "plt.figure()\n", "plt.plot(alphaArr, accArr, 'bo-')\n", "plt.xlabel('Values of alpha')\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "d58051f8-67fb-4632-884b-23c719d4cae7", "metadata": { "id": "d58051f8-67fb-4632-884b-23c719d4cae7" }, "source": [ "## Comparison with Sci-kit Learn Implementation\n", "\n", "- Use sci-kit learn's `sklearn.naive_bayes.MultinomialNB` model to compare your implementation's performance\n", "- (Optional) try other classifiers from `sklearn.naive_bayes` and see if you can make them work`" ] }, { "cell_type": "code", "execution_count": 26, "id": "192a26af-2b0a-4165-9d36-b5ed774c11fb", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "192a26af-2b0a-4165-9d36-b5ed774c11fb", "outputId": "2e52a1f3-ac3f-4dbf-f0a9-752dddab5ef1" }, "outputs": [ { "data": { "text/plain": [ "0.9827709978463748" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Your code here\n", "\"\"\"\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "model = MultinomialNB()\n", "model.fit(X_train,y_train)\n", "\n", "Y_pred = model.predict(X_test)\n", "accuracy_score(y_test,Y_pred)\n" ] }, { "cell_type": "code", "execution_count": 26, "id": "7h6gHgLEgx_e", "metadata": { "id": "7h6gHgLEgx_e" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }