{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive bayes by hand" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you have 4 apples with these attributes" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "apples_docs = [\n", " \"red round\",\n", " \"red round\",\n", " \"green sour round\",\n", " \"green round\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and 3 bananas with these attributes:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "bananas_docs = [\n", " \"yellow skinny\",\n", " \"yellow skinny\",\n", " \"green skinny\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split into list of lists:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "apples = [a.split() for a in apples_docs]\n", "bananas = [b.split() for b in bananas_docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** What is the sorted set of all attributes (assign to vocabulary variable $V$)?\n", "\n", "(Let's ignore the unknown word issue in our vectors and in our computations.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "['green', 'red', 'round', 'skinny', 'sour', 'yellow']\n", " \n", "You can compute like this:\n", " \n", "```\n", "Va = set(np.concatenate(apples))\n", "Vb = set(np.concatenate(bananas))\n", "V = sorted(Va.union(Vb))\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q**. What is the word vector for the \"red round\" apple?\n", "\n", "The column values are 1 if the word is mentioned otherwise 0. Assume the sorted column order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", " The row vector is [0, 1, 1, 0, 0, 0] for \"red round\"\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q**. What is the word vector for the \"green sour round\" apple?\n", "\n", "The column values are 1 if the word is mentioned otherwise 0. Assume the sorted column order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", " The row vector is [1, 0, 1, 0, 1, 0] for \"green sour round\"\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at all fruit vectors now and fruit target column" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
greenredroundskinnysouryellowfruit
00110000
10110000
21010100
31010000
40001011
50001011
61001001
\n", "
" ], "text/plain": [ " green red round skinny sour yellow fruit\n", "0 0 1 1 0 0 0 0\n", "1 0 1 1 0 0 0 0\n", "2 1 0 1 0 1 0 0\n", "3 1 0 1 0 0 0 0\n", "4 0 0 0 1 0 1 1\n", "5 0 0 0 1 0 1 1\n", "6 1 0 0 1 0 0 1" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = np.zeros((7,len(V)))\n", "for i,row in enumerate(apples+bananas):\n", " for w in row:\n", " data[i,V.index(w)] = 1\n", "df = pd.DataFrame(data,columns=V,dtype=int)\n", "df['fruit'] = [0,0,0,0,1,1,1]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** What is a good estimate of $P(apple)$=`P_apple` and $P(banana)$=`P_banana`?\n", "\n", "(Define those variables)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "
\n",
    "P_apple = 4/7\n",
    "P_banana = 3/7\n",
    "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** What are good estimates of $P(red|apple)$ and $P(red|banana)$?\n", "\n", "Count number of times red appears and divide by number of words in apple docs, then do for banana." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "P(red|apple) = 2/9 words in apples are red and 0/6 words in bananas are red.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** What are good estimates of $P(green|apple)$ and $P(green|banana)$?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "2/9 apples are green and 1/6 bananas are green.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Laplace smoothing of $P(w|c)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** If $P(skinny|apple)=0$, what is our smoothed estimate?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "$P(skinny|apple) = (count(skinny,apple)+1)/(count(apple)+|V|) = (0+1)/(9+6) = .0666666$\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now, do that using vector operations to get smoothed `P_w_apple` from the apple records**\n", "\n", "Recall that `df[df.fruit==0]` gets you just the apple records. Your $P(skinny|apple)$ resuls should be:\n", "\n", "```\n", "green 0.200000\n", "red 0.200000\n", "round 0.333333\n", "skinny 0.066667\n", "sour 0.133333\n", "yellow 0.066667\n", "fruit 0.066667\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "
\n",
    "w_counts_apple = df[df.fruit==0].sum(axis=0)\n",
    "P_w_apple = (w_counts_apple+1) / (9+len(V))\n",
    "P_w_apple\n",
    "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Do that same thing to `P_w_banana` from the banana records**\n", "\n", "You should get:\n", "\n", "```\n", "green 0.166667\n", "red 0.083333\n", "round 0.083333\n", "skinny 0.333333\n", "sour 0.083333\n", "yellow 0.250000\n", "fruit 0.333333\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "
\n",
    "w_counts_banana = df[df.fruit==1].sum(axis=0)\n",
    "P_w_banana = (w_counts_banana+1) / (6+len(V))\n",
    "P_w_banana\n",
    "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** Given `P_w_apple`, what is `P_apple_redround`, the \"probability\" that \"red round\" is an apple?\n", "\n", "(We haven't normalized the scores (per our friend Bayes) so they aren't technically probabilities.) Just compute the score we'd use for classification per the lecture. Hint: `P_w_apple['skinny']` gives the estimate of $P(skinny|apple)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", " The answer is 0.03809523809523809 via:
\n", " P_apple_redround = P_apple * P_w_apple['red']*P_w_apple['round']\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q.** Given `P_w_banana`, what is `P_banana_redround`, the \"probability\" that \"red round\" is a banana?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", " The answer is 0.002976190476190476 via:
\n", " P_banana_redround = P_banana * P_w_banana['red']*P_w_banana['round']\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how to easily compute the probability of each word in V given class apple and class banana:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[P_w_apple[w] for w in V]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[P_w_banana[w] for w in V]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now, define a function for computing likelihood of a document index $d \\in [0,5]$**\n", "\n", "$$\n", "c^*= \\underset{c}{argmax} ~ P(c) \\prod_{w \\in V} P(w | c)^{n_w(d)}\n", "$$\n", "\n", "You have these pieces: `P_apple`, `P_w_apple`, and $n_w(d)$ is just the value in `df[w][d]`." ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "def likelihood_apple(d:int):\n", " return ...\n", "def likelihood_banana(d:int):\n", " return ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Solution\n", "
\n",
    "def likelihood_apple(d:int):\n",
    "    return P_apple * np.product([P_w_apple[w]**df[w][d] for w in V])\n",
    "def likelihood_banana(d:int):\n",
    "    return P_banana * np.product([P_w_banana[w]**df[w][d] for w in V])\n",
    "
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Run the following loop to make predictions for each document**\n", "\n", "Output should be:\n", "\n", "```\n", "red round : 0.038095, 0.002976 => apple\n", "red round : 0.038095, 0.002976 => apple\n", "green sour round : 0.005079, 0.000496 => apple\n", "green round : 0.038095, 0.005952 => apple\n", "yellow skinny : 0.002540, 0.035714 => banana\n", "yellow skinny : 0.002540, 0.035714 => banana\n", "green skinny : 0.007619, 0.023810 => banana\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "docs = apples_docs+bananas_docs\n", "for d in range(len(df)):\n", " a = likelihood_apple(d)\n", " b = likelihood_banana(d)\n", " print(f\"{docs[d]:17s}: {a:4f}, {b:4f} => {'apple' if a>b else 'banana'}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }