{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": [
"s1",
"content",
"l1"
]
},
"source": [
"# Joint Distributions\n",
"\n",
"Consider two discrete random variables X and Y. The function given by\n",
"f (x, y) = P(X = x, Y = y) for each pair of values (x, y) within the\n",
"range of X is called the joint probability distribution of X and Y.\n",
"\n",
"The joint probability mass function for discrete random variables (X=x, Y=y) is given by:\n",
"\n",
"${\\begin{aligned}\\mathrm {P} (X=x\\ \\mathrm {and} \\ Y=y)=\\mathrm {P} (Y=y\\mid X=x)\\cdot \\mathrm {P} (X=x)=\\mathrm {P} (X=x\\mid Y=y)\\cdot \\mathrm {P} (Y=y)\\end{aligned}}$\n",
"\n",
"\n",
"### Example\n",
"\n",
"A coin is tossed twice. Let X denote the number of heads on the first toss and Y the total number of heads on the 2 tosses. \n",
"Assume that the coin is biased and a head has a 60% chance of occurring:\n",
"\n",
"* X = First head\n",
"* Y = Number of heads in 2 tosses\n",
"\n",
"Compute the joint probability table and assign the values to the dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"tags": [
"s1",
"ce",
"l1"
]
},
"outputs": [],
"source": [
"# Assign the values of the dictionary of the form p_xy[X][Y] below\n",
"p_h = 0.6\n",
"p_t = 1-0.6\n",
"p_12 = 0\n",
"p_11 = 0\n",
"p_01 = 0\n",
"p_10 = 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"tags": [
"s1",
"l1",
"hint"
]
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"tags": [
"s1",
"l1",
"ans"
]
},
"outputs": [],
"source": [
"p_12 = p_h * p_h\n",
"p_11 = p_h * p_t\n",
"p_01 = p_t * p_h\n",
"p_00 = p_t * p_t\n",
"\n",
"print(\"p_12 %s, p_11 %s, p_01 %s, p_00 %.4s\" % (p_12, p_11, p_01, p_00))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"tags": [
"s1",
"hid",
"l1"
]
},
"outputs": [],
"source": [
"ref_tmp_var = False\n",
"\n",
"try:\n",
" if (abs(p_12 - 0.36)<0.1) and (abs(p_11 - 0.24) < 0.1) and (abs(p_01 - 0.24) < 0.1) and (abs(p_00 - .16) < 0.1): \n",
" ref_assert_var = True\n",
" ref_tmp_var = True\n",
" else:\n",
" ref_assert_var = False\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"except Exception:\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"\n",
"assert ref_tmp_var"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"l2",
"content",
"s2"
]
},
"source": [
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" - & 1st - Toss & 2nd-Toss & JP \\\\ \n",
" \\hline\n",
" HH & 0.6 & 0.6 & 0.36 \\\\ \n",
" \\hline\n",
" HT & 0.6 & 0.4 & 0.24 \\\\ \n",
" \\hline\n",
" TH & 0.4 & 0.6 & 0.24 \\\\ \n",
" \\hline\n",
" TT & 0.4 & 0.4 & 0.16 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"\n",
"The joint probability distribution looks like :\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" H:T & X & Y & JP \\\\ \n",
" \\hline\n",
" HH & 1 & 2 & 0.36 \\\\ \n",
" \\hline\n",
" HT & 1 & 1 & 0.24 \\\\ \n",
" \\hline\n",
" TH & 0 & 1 & 0.24 \\\\ \n",
" \\hline\n",
" TT & 0 & 0 & 0.16 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"We can now organize the above in the form of a map with Y, X as:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" Y:X-> & 0 & 1 \\\\ \n",
" \\hline\n",
" 0 & 0.16 & 0 \\\\ \n",
" \\hline\n",
" 1 & 0.24 & 0.24 \\\\ \n",
" \\hline\n",
" 2 & 0 & 0.36 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"## Marginal Distribution\n",
"\n",
"For a given two random variables X and Y whose joint distribution is known, the marginal distribution of X is simply the probability distribution of X averaging over information about Y. This is calculated by summing the joint probability distribution over Y.\n",
"\n",
"For discrete random variable , marginal distribution of variable X is obtained by summing up the distribution of X over values of Y.\n",
"\n",
"Let us consider the above joint distribution again:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" Y : X-> & 0 & 1 \\\\ \n",
" \\hline\n",
" 0 & 0.16 & 0 \\\\ \n",
" \\hline\n",
" 1 & 0.24 & 0.24 \\\\ \n",
" \\hline\n",
" 2 & 0 & 0.36 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"## Example\n",
"\n",
"* Compute the marginal distributions, f(X), f(y). Assign the list to the variables fX, fY."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"tags": [
"l2",
"ce",
"s2"
]
},
"outputs": [],
"source": [
"#Exercise\n",
"fX = []\n",
"fY = []"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"l2",
"s2",
"hint"
]
},
"source": [
"Sum over rows and columns for each marginal distribution."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"l2",
"s2",
"ans"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fX: [0.4, 0.6]\n",
"fY: [0.16, 0.48, 0.36]\n"
]
}
],
"source": [
"fX = [0.4, 0.6]\n",
"fY = [0.16, 0.48, 0.36]\n",
"\n",
"print(\"fX: \", fX)\n",
"print(\"fY: \", fY)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": [
"l2",
"hid",
"s2"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"continue\n"
]
}
],
"source": [
"ref_tmp_var = False\n",
"\n",
"try:\n",
" if fX == [0.4, 0.6] and fY == [0.16, 0.48, 0.36]: \n",
" ref_assert_var = True\n",
" ref_tmp_var = True\n",
" else:\n",
" ref_assert_var = False\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"except Exception:\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"\n",
"assert ref_tmp_var"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"l3",
"s3",
"content"
]
},
"source": [
"For the above joint distribution the marginal distribution is below:\n",
"\n",
"Marginal Distribution of X:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" X-> & 0 & 1 \\\\ \n",
" \\hline\n",
" f(x) & 0.4 & 0.6 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"Marginal Distribution of Y:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" Y-> & 0 & 1 & 2 \\\\ \n",
" \\hline\n",
" f(y) & 0.16 & 0.48 & 0.36 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"http://www.sci.csueastbay.edu/~btrumbo/Stat3401/Hand3401/JointDistnsCor.pdf\n",
"\n",
"\n",
"### Corpus of words\n",
"\n",
"Let us consider the case of a corpus (collection) of 100 words in a text. The words are tabulated below based on their frequency of occurrence and the probability - \n",
"c(w) = count\n",
"P(w) = Probability\n",
"X = word length\n",
"Y- number of Vowels.\n",
"\n",
"\n",
"Let us look at a joint probability table for this:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" word & c(w) & P(w) & X & Y \\\\ \n",
" \\hline\n",
" the & 30 & 0.30 & 3 & 1 \\\\ \n",
" \\hline\n",
" to & 18 & 0.18 & 2 & 1 \\\\ \n",
" \\hline\n",
" will & 16 & 0.16 & 4 & 1 \\\\ \n",
" \\hline\n",
" of & 10 & 0.10 & 2 & 1 \\\\ \n",
" \\hline\n",
" hello & 7 & 0.07 & 5 & 2 \\\\ \n",
" \\hline\n",
" in & 6 & 0.06 & 2 & 1 \\\\ \n",
" \\hline\n",
" tools & 4 & 0.04 & 5 & 2 \\\\ \n",
" \\hline\n",
" pose & 3 & 0.03 & 4 & 2 \\\\ \n",
" \\hline\n",
" taste & 3 & 0.03 & 5 & 2 \\\\ \n",
" \\hline\n",
" PGM & 3 & 0.03 & 3 & 0 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"From the above table, it is evident that the word \"the\" occurs 30 times (count column) out of a total of 100 words. Hence the probability of the word \"the\" is 0.30 (30/100 = 0.30). The X column refers to the length of the word. In this case x=3. The Y column refers to the number of vowels. In this case y=1. Similarly for the word \"to\" the probability of occurrence is 0.18 (18/100 = 0.18). X and Y are 2 and 1 respectively.\n",
"\n",
"For arriving at joint probability distribution of variables X and Y, we must consider all the combinations of X and Y that are observed. For example, let us consider all the words with a length of 2 (that is X=2) and with exactly 1 vowel (Y=1). We have 3 occurrences namely \"to\", \"of\" and \"in\". We can get the joint probability by summing up the individual probabilities for these words. Those are 0.18, 0.10 and 0.06. Hence for X=2, Y=1 the joint probability is 0.18+0.10+0.06 which is 0.34. Similarly calculating the joint probabilities for all combinations of X and Y we get the Joint Probability Distribution table. \n",
"\n",
"The joint probability distribution looks like this:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" Y/X-> & 2 & 3 & 4 & 5 \\\\ \n",
" \\hline\n",
" 0 & 0 & 0.03 & 0 & 0 \\\\ \n",
" \\hline\n",
" 1 & 0.34 & 0.30 & 0.16 & 0 \\\\ \n",
" \\hline\n",
" 2 & 0 & 0 & 0.03 & 0.14 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"### Exercise\n",
"\n",
"Find the marginal distribution of X and Y from the above joint probability distribution.\n",
"\n",
"Assign them to the variables fX and fY respectively.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"tags": [
"l3",
"s3",
"ce"
]
},
"outputs": [],
"source": [
"#Exercise\n",
"fX = []\n",
"fY = []"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"l3",
"s3",
"hint"
]
},
"source": [
"Sum over rows(for fY array) and columns(for fX array) for each marginal distribution."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [
"l3",
"s3",
"ans"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fX: [0.34, 0.33, 0.19, 0.14]\n",
"fY: [0.03, 0.8, 0.17]\n"
]
}
],
"source": [
"fX = [0.34, 0.33, 0.19, 0.14]\n",
"fY = [0.03, 0.80, 0.17]\n",
"\n",
"print(\"fX: \", fX)\n",
"print(\"fY: \", fY)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": [
"l3",
"s3",
"hid"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"continue\n"
]
}
],
"source": [
"ref_tmp_var = False\n",
"\n",
"try:\n",
" if fX == [0.34, 0.33, 0.19, 0.14] and fY == [0.03, 0.8, 0.17]: \n",
" ref_assert_var = True\n",
" ref_tmp_var = True\n",
" else:\n",
" ref_assert_var = False\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"except Exception:\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"\n",
"assert ref_tmp_var"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"l4",
"s4",
"content"
]
},
"source": [
"For the above joint distribution, the marginal distribution of X and Y are given below:\n",
"\n",
"Marginal Distribution of X:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" X-> & 2 & 3 & 4 & 5 \\\\ \n",
" \\hline\n",
" f(X) & 0.34 & 0.33 & 0.19 & 0.14 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"Marginal Distribution of Y:\n",
"\n",
"\\begin{array}{ l | c | r }\n",
" \\hline\n",
" Y-> & 0 & 1 & 2 \\\\ \n",
" \\hline\n",
" f(Y) & 0.03 & 0.80 & 0.17 \\\\ \n",
" \\hline\n",
"\\end{array}\n",
"\n",
"\n",
"\n",
"\n",
"## Fraud Modeling Example\n",
"\n",
"Consider a simple model of fraudulent transactions with data containing Sex (S), Age (A), Fraud (F), Jewelry (J) and probabilities P {P(S,A,F,J)}:\n",
"\n",
"| S | A | F | J | P |\n",
"|-----|-----|-----|-----|----------------|\n",
"| S_0 | A_0 | F_0 | J_0 | 0.0025 |\n",
"| S_0 | A_0 | F_0 | J_1 | 0.0100 |\n",
"| S_0 | A_0 | F_1 | J_0 | 0.1069 |\n",
"| ... | ... | ... | ... | ... |\n",
"| S_1 | A_2 | F_1 | J_1 | 0.0079 |\n",
"\n",
"\n",
"(F = No) corresponds to F_1\n",
"\n",
"* Compute p(S, A, F, J | F=No) and assign it to p_SAFJ "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": [
"l4",
"s4",
"ce"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" S | \n",
" A | \n",
" F | \n",
" J | \n",
" P | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" S_0 | \n",
" A_0 | \n",
" F_0 | \n",
" J_0 | \n",
" 0.0025 | \n",
"
\n",
" \n",
" 1 | \n",
" S_0 | \n",
" A_0 | \n",
" F_0 | \n",
" J_1 | \n",
" 0.0100 | \n",
"
\n",
" \n",
" 2 | \n",
" S_0 | \n",
" A_0 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.1069 | \n",
"
\n",
" \n",
" 3 | \n",
" S_0 | \n",
" A_0 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.0056 | \n",
"
\n",
" \n",
" 4 | \n",
" S_0 | \n",
" A_1 | \n",
" F_0 | \n",
" J_0 | \n",
" 0.0008 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" S A F J P\n",
"0 S_0 A_0 F_0 J_0 0.0025\n",
"1 S_0 A_0 F_0 J_1 0.0100\n",
"2 S_0 A_0 F_1 J_0 0.1069\n",
"3 S_0 A_0 F_1 J_1 0.0056\n",
"4 S_0 A_1 F_0 J_0 0.0008"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"fraud_data = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/Fraud/fraud_data.csv')\n",
"fraud_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"s4",
"l4",
"hint"
]
},
"source": [
"Use fraud_data['F'].str.contains('F_1')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": [
"s4",
"l4",
"ans"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" S | \n",
" A | \n",
" F | \n",
" J | \n",
" P | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" S_0 | \n",
" A_0 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.118778 | \n",
"
\n",
" \n",
" 3 | \n",
" S_0 | \n",
" A_0 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.006222 | \n",
"
\n",
" \n",
" 6 | \n",
" S_0 | \n",
" A_1 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.190000 | \n",
"
\n",
" \n",
" 7 | \n",
" S_0 | \n",
" A_1 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.010000 | \n",
"
\n",
" \n",
" 10 | \n",
" S_0 | \n",
" A_2 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.166222 | \n",
"
\n",
" \n",
" 11 | \n",
" S_0 | \n",
" A_2 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.008778 | \n",
"
\n",
" \n",
" 14 | \n",
" S_1 | \n",
" A_0 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.118778 | \n",
"
\n",
" \n",
" 15 | \n",
" S_1 | \n",
" A_0 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.006222 | \n",
"
\n",
" \n",
" 18 | \n",
" S_1 | \n",
" A_1 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.190000 | \n",
"
\n",
" \n",
" 19 | \n",
" S_1 | \n",
" A_1 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.010000 | \n",
"
\n",
" \n",
" 22 | \n",
" S_1 | \n",
" A_2 | \n",
" F_1 | \n",
" J_0 | \n",
" 0.166222 | \n",
"
\n",
" \n",
" 23 | \n",
" S_1 | \n",
" A_2 | \n",
" F_1 | \n",
" J_1 | \n",
" 0.008778 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" S A F J P\n",
"2 S_0 A_0 F_1 J_0 0.118778\n",
"3 S_0 A_0 F_1 J_1 0.006222\n",
"6 S_0 A_1 F_1 J_0 0.190000\n",
"7 S_0 A_1 F_1 J_1 0.010000\n",
"10 S_0 A_2 F_1 J_0 0.166222\n",
"11 S_0 A_2 F_1 J_1 0.008778\n",
"14 S_1 A_0 F_1 J_0 0.118778\n",
"15 S_1 A_0 F_1 J_1 0.006222\n",
"18 S_1 A_1 F_1 J_0 0.190000\n",
"19 S_1 A_1 F_1 J_1 0.010000\n",
"22 S_1 A_2 F_1 J_0 0.166222\n",
"23 S_1 A_2 F_1 J_1 0.008778"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p_SAFJ = fraud_data[fraud_data['F'].str.contains('F_1')]\n",
"p_SAFJ['P'] = p_SAFJ['P']/p_SAFJ['P'].sum()\n",
"p_SAFJ"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": [
"s4",
"hid",
"l4"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"continue\n"
]
}
],
"source": [
"ref_tmp_var = False\n",
"\n",
"try:\n",
" if abs(p_SAFJ['P'][2] - 0.1069) < 0.1: \n",
" ref_assert_var = True\n",
" ref_tmp_var = True\n",
" else:\n",
" ref_assert_var = False\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"except Exception:\n",
" print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
"\n",
"assert ref_tmp_var"
]
}
],
"metadata": {
"executed_sections": [],
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}