{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes by hand"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Imagine you have 4 apples with these attributes:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"apples_docs = [\n",
" \"red round\",\n",
" \"red round\",\n",
" \"green sour round\",\n",
" \"green round\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and 3 bananas with these attributes:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"bananas_docs = [\n",
" \"yellow skinny\",\n",
" \"yellow skinny\",\n",
" \"green skinny\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split each description into a list of attribute words, giving one list of lists per fruit:"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"apples = [a.split() for a in apples_docs]\n",
"bananas = [b.split() for b in bananas_docs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q.** What is the sorted set of all attributes (assign to vocabulary variable $V$)?\n",
"\n",
"(Let's ignore the unknown word issue in our vectors and in our computations.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Solution.**\n",
"\n",
"['green', 'red', 'round', 'skinny', 'sour', 'yellow']\n",
" \n",
"You can compute like this:\n",
" \n",
"```\n",
"Va = set(np.concatenate(apples))\n",
"Vb = set(np.concatenate(bananas))\n",
"V = sorted(Va.union(Vb))\n",
"```\n",
"\n",
"**Solution.** The row vector for \"red round\" is [0, 1, 1, 0, 0, 0].\n",
"\n",
"**Solution.** The row vector for \"green sour round\" is [1, 0, 1, 0, 1, 0].\n",
"\n",
"Document-term table (`fruit` is 0 for apples and 1 for bananas):\n",
"\n",
"| | green | red | round | skinny | sour | yellow | fruit |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |\n",
"| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |\n",
"| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |\n",
"| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |\n",
"| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |\n",
"| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |\n",
"| 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |\n",
"\n",
"```\n",
"# Class priors: 4 of the 7 documents are apples, 3 are bananas\n",
"P_apple = 4/7\n",
"P_banana = 3/7\n",
"```\n",
"\n",
"```\n",
"# Laplace-smoothed P(w | apple); the apple documents contain 9 words in total\n",
"w_counts_apple = df[df.fruit==0].sum(axis=0)\n",
"P_w_apple = (w_counts_apple+1) / (9+len(V))\n",
"P_w_apple\n",
"```\n",
"\n",
"```\n",
"# Laplace-smoothed P(w | banana); the banana documents contain 6 words in total\n",
"w_counts_banana = df[df.fruit==1].sum(axis=0)\n",
"P_w_banana = (w_counts_banana+1) / (6+len(V))\n",
"P_w_banana\n",
"```\n",
"\n",
"```\n",
"# Score proportional to P(class | document d): the class prior times the\n",
"# smoothed probability of each word present in row d of df.\n",
"# np.prod replaces np.product, which was removed in NumPy 2.0.\n",
"def likelihood_apple(d:int):\n",
"    return P_apple * np.prod([P_w_apple[w]**df[w][d] for w in V])\n",
"def likelihood_banana(d:int):\n",
"    return P_banana * np.prod([P_w_banana[w]**df[w][d] for w in V])\n",
"```"
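The pieces above (vocabulary, class priors, smoothed word probabilities, and the per-class score) can be put together in one place. The following is a minimal sketch in plain Python, without pandas, assuming the same seven training documents; the helper names `smoothed_word_probs`, `score`, and `classify` are illustrative and do not appear in the notebook. As in the exercise, unknown words are ignored as a concern: a word outside the vocabulary would raise a `KeyError` here.

```python
from collections import Counter
from math import prod

apples_docs = ["red round", "red round", "green sour round", "green round"]
bananas_docs = ["yellow skinny", "yellow skinny", "green skinny"]

apples = [d.split() for d in apples_docs]
bananas = [d.split() for d in bananas_docs]

# Vocabulary: sorted set of every attribute seen in either class
V = sorted({w for doc in apples + bananas for w in doc})

# Class priors: fraction of training documents in each class
n_docs = len(apples) + len(bananas)
priors = {"apple": len(apples) / n_docs, "banana": len(bananas) / n_docs}


def smoothed_word_probs(docs, vocab):
    """Laplace-smoothed P(w | class) = (count(w) + 1) / (total words + |V|)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


P_w = {
    "apple": smoothed_word_probs(apples, V),
    "banana": smoothed_word_probs(bananas, V),
}


def score(words, label):
    """Unnormalized posterior: P(class) times the product of P(w | class)."""
    # Assumes every word is in V (the unknown-word issue is ignored)
    return priors[label] * prod(P_w[label][w] for w in words)


def classify(doc):
    """Return the class label with the highest score for a description."""
    words = doc.split()
    return max(priors, key=lambda label: score(words, label))


for doc in ["red round", "green skinny"]:
    print(doc, "->", classify(doc))
```

The two scores for a document are not probabilities on their own; dividing each by their sum would give the normalized posterior, but the comparison alone is enough to pick a class.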