{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [ "s1", "content", "l1" ] }, "source": [ "# Joint Distributions\n", "\n", "Consider two discrete random variables X and Y. The function given by\n", "f(x, y) = P(X = x, Y = y) for each pair of values (x, y) within the\n", "range of (X, Y) is called the joint probability distribution of X and Y.\n", "\n", "By the product rule, the joint probability mass function of discrete random variables X and Y can be written as:\n", "\n", "${\\begin{aligned}\\mathrm {P} (X=x\\ \\mathrm {and} \\ Y=y)=\\mathrm {P} (Y=y\\mid X=x)\\cdot \\mathrm {P} (X=x)=\\mathrm {P} (X=x\\mid Y=y)\\cdot \\mathrm {P} (Y=y)\\end{aligned}}$\n", "\n", "\n", "### Example\n", "\n", "A coin is tossed twice. Let X denote the number of heads on the first toss and Y the total number of heads on the 2 tosses. \n", "Assume that the tosses are independent and that the coin is biased so that a head has a 60% chance of occurring:\n", "\n", "* X = number of heads on the first toss (0 or 1)\n", "* Y = total number of heads in the 2 tosses (0, 1 or 2)\n", "\n", "Compute the joint probabilities p_xy = P(X = x, Y = y) and assign them to the variables p_12, p_11, p_01 and p_00." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "ce", "l1" ] }, "outputs": [], "source": [ "# Assign the joint probabilities p_xy = P(X = x, Y = y) below\n", "p_h = 0.6\n", "p_t = 1 - p_h\n", "p_12 = 0\n", "p_11 = 0\n", "p_01 = 0\n", "p_00 = 0" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "l1", "hint" ] }, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "l1", "ans" ] }, "outputs": [], "source": [ "p_12 = p_h * p_h  # HH: X = 1, Y = 2\n", "p_11 = p_h * p_t  # HT: X = 1, Y = 1\n", "p_01 = p_t * p_h  # TH: X = 0, Y = 1\n", "p_00 = p_t * p_t  # TT: X = 0, Y = 0\n", "\n", "print(\"p_12 %.2f, p_11 %.2f, p_01 %.2f, p_00 %.2f\" % (p_12, p_11, p_01, p_00))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "hid", "l1" ] }, "outputs": [], "source": [ "ref_tmp_var = False\n", "\n", "try:\n", "    # Each joint probability must match its expected value within a small tolerance\n", "    if (abs(p_12 - 0.36) < 1e-6) and (abs(p_11 - 0.24) < 1e-6) and (abs(p_01 - 0.24) < 1e-6) and (abs(p_00 - 0.16) < 1e-6): \n", "        ref_assert_var = True\n", "        ref_tmp_var = True\n", "    else:\n", "        ref_assert_var = False\n", "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "except Exception:\n", "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "assert ref_tmp_var" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "The four possible outcomes of the two tosses and their joint probabilities (JP) are:\n", "\n", "\\begin{array}{ l | c | c | r }\n", " \\hline\n", " \\text{Outcome} & \\text{1st toss} & \\text{2nd toss} & JP \\\\ \n", " \\hline\n", " HH & 0.6 & 0.6 & 0.36 \\\\ \n", " \\hline\n", " HT & 0.6 & 0.4 & 0.24 \\\\ \n", " \\hline\n", " TH & 0.4 & 0.6 & 0.24 \\\\ \n", " \\hline\n", " TT & 0.4 & 0.4 & 0.16 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "\n", "In terms of X and Y, the joint probability distribution looks like:\n", "\n", "\\begin{array}{ l | c | c | r }\n", " \\hline\n", " \\text{Outcome} & X & Y & JP \\\\ \n", " \\hline\n", " HH & 1 & 2 & 0.36 \\\\ \n", " \\hline\n", " HT & 1 & 1 & 0.24 \\\\ \n", " \\hline\n", " TH & 0 & 1 & 0.24 \\\\ \n", " \\hline\n", " TT & 0 & 0 & 0.16 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "We can now organize the above as a table with one row per value of Y and one column per value of X:\n", "\n", "\\begin{array}{ l | c | r }\n", " \\hline\n", " Y:X-> & 0 & 1 \\\\ \n", " \\hline\n", " 0 & 0.16 & 0 \\\\ \n", " \\hline\n", " 1 & 0.24 & 0.24 \\\\ \n", " \\hline\n", " 2 & 0 & 0.36 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "## Marginal Distribution\n", "\n", "Given two random variables X and Y whose joint distribution is known, the marginal distribution of X is the probability distribution of X alone, regardless of the value of Y. It is obtained by summing the joint probability distribution over all values of Y, and likewise for Y:\n", "\n", "$f_X(x) = \\sum_y f(x, y), \\qquad f_Y(y) = \\sum_x f(x, y)$\n", "\n", "For discrete random variables laid out as in the table below, this means adding up the entries along each column (for X) or along each row (for Y).\n", "\n", "Let us consider the above joint distribution again:\n", "\n", "\\begin{array}{ l | c | r }\n", " \\hline\n", " Y : X-> & 0 & 1 \\\\ \n", " \\hline\n", " 0 & 0.16 & 0 \\\\ \n", " \\hline\n", " 1 & 0.24 & 0.24 \\\\ \n", " \\hline\n", " 2 & 0 & 0.36 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "## Example\n", "\n", "* Compute the marginal distributions f(x) and f(y). Assign them as lists to the variables fX and fY." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "l2", "ce", "s2" ] }, "outputs": [], "source": [ "# Exercise\n", "fX = []\n", "fY = []" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "l2", "s2", "hint" ] }, "source": [ "Sum over the rows (for fY) and over the columns (for fX) of the joint table." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fX: [0.4, 0.6]\n", "fY: [0.16, 0.48, 0.36]\n" ] } ], "source": [ "# f(x): column sums of the joint table; f(y): row sums\n", "fX = [0.4, 0.6]\n", "fY = [0.16, 0.48, 0.36]\n", "\n", "print(\"fX: \", fX)\n", "print(\"fY: \", fY)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "l2", "hid", "s2" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "try:\n", "    # Compare with a small tolerance so that values obtained by summing floats are accepted\n", "    expected = [0.4, 0.6] + [0.16, 0.48, 0.36]\n", "    given = list(fX) + list(fY)\n", "    if len(fX) == 2 and len(fY) == 3 and all(abs(g - e) < 1e-6 for g, e in zip(given, expected)): \n", "        ref_assert_var = True\n", "        ref_tmp_var = True\n", "    else:\n", "        ref_assert_var = False\n", "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "except Exception:\n", "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "assert ref_tmp_var" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "l3", "s3", "content" ] }, "source": [ "For the above joint distribution, the marginal distributions are given below:\n", "\n", "Marginal Distribution of X:\n", "\n", "\\begin{array}{ l | c | r }\n", " \\hline\n", " X-> & 0 & 1 \\\\ \n", " \\hline\n", " f(x) & 0.4 & 0.6 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "Marginal Distribution of Y:\n", "\n", "\\begin{array}{ l | c | c | r }\n", " \\hline\n", " Y-> & 0 & 1 & 2 \\\\ \n", " \\hline\n", " f(y) & 0.16 & 0.48 & 0.36 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "http://www.sci.csueastbay.edu/~btrumbo/Stat3401/Hand3401/JointDistnsCor.pdf\n", "\n", "\n", "### Corpus of words\n", "\n", "Let us consider the case of a corpus (collection) of 100 words in a text. The words are tabulated below with their frequency of occurrence and their probability, using the following notation:\n", "\n", "* c(w) = count of the word w\n", "* P(w) = probability of the word w\n", "* X = word length\n", "* Y = number of vowels\n", "\n", "\n", "The table for this corpus is:\n", "\n", "\\begin{array}{ l | c | c | c | r }\n", " \\hline\n", " word & c(w) & P(w) & X & Y \\\\ \n", " \\hline\n", " the & 30 & 0.30 & 3 & 1 \\\\ \n", " \\hline\n", " to & 18 & 0.18 & 2 & 1 \\\\ \n", " \\hline\n", " will & 16 & 0.16 & 4 & 1 \\\\ \n", " \\hline\n", " of & 10 & 0.10 & 2 & 1 \\\\ \n", " \\hline\n", " hello & 7 & 0.07 & 5 & 2 \\\\ \n", " \\hline\n", " in & 6 & 0.06 & 2 & 1 \\\\ \n", " \\hline\n", " tools & 4 & 0.04 & 5 & 2 \\\\ \n", " \\hline\n", " pose & 3 & 0.03 & 4 & 2 \\\\ \n", " \\hline\n", " taste & 3 & 0.03 & 5 & 2 \\\\ \n", " \\hline\n", " PGM & 3 & 0.03 & 3 & 0 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "In the above table, the word \"the\" occurs 30 times (count column) out of a total of 100 words, so its probability is 30/100 = 0.30. The X column gives the length of the word (x = 3 for \"the\") and the Y column gives its number of vowels (y = 1 for \"the\"). Similarly, the word \"to\" has probability 18/100 = 0.18, with x = 2 and y = 1.\n", "\n", "To arrive at the joint probability distribution of X and Y, we must consider every combination of X and Y that is observed. For example, consider all the words with a length of 2 (X = 2) and exactly 1 vowel (Y = 1). Three words qualify, namely \"to\", \"of\" and \"in\", with probabilities 0.18, 0.10 and 0.06. Summing these gives the joint probability for X = 2, Y = 1: 0.18 + 0.10 + 0.06 = 0.34. Calculating the joint probabilities for all other combinations of X and Y in the same way gives the joint probability distribution table. \n", "\n", "The joint probability distribution looks like this:\n", "\n", "\\begin{array}{ l | c | c | c | r }\n", " \\hline\n", " Y/X-> & 2 & 3 & 4 & 5 \\\\ \n", " \\hline\n", " 0 & 0 & 0.03 & 0 & 0 \\\\ \n", " \\hline\n", " 1 & 0.34 & 0.30 & 0.16 & 0 \\\\ \n", " \\hline\n", " 2 & 0 & 0 & 0.03 & 0.14 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "### Exercise\n", "\n", "Find the marginal distributions of X and Y from the above joint probability distribution.\n", "\n", "Assign them to the variables fX and fY respectively.\n", "\n" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "tags": [ "l3", "s3", "ce" ] }, "outputs": [], "source": [ "# Exercise\n", "fX = []\n", "fY = []" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "l3", "s3", "hint" ] }, "source": [ "Sum over the rows (for the fY list) and over the columns (for the fX list) of the joint table." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "l3", "s3", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fX: [0.34, 0.33, 0.19, 0.14]\n", "fY: [0.03, 0.8, 0.17]\n" ] } ], "source": [ "# f(x): column sums of the joint table; f(y): row sums\n", "fX = [0.34, 0.33, 0.19, 0.14]\n", "fY = [0.03, 0.80, 0.17]\n", "\n", "print(\"fX: \", fX)\n", "print(\"fY: \", fY)" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "l3", "s3", "hid" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "try:\n", "    # Compare with a small tolerance so that values obtained by summing floats are accepted\n", "    expected = [0.34, 0.33, 0.19, 0.14] + [0.03, 0.8, 0.17]\n", "    given = list(fX) + list(fY)\n", "    if len(fX) == 4 and len(fY) == 3 and all(abs(g - e) < 1e-6 for g, e in zip(given, expected)): \n", "        ref_assert_var = True\n", "        ref_tmp_var = True\n", "    else:\n", "        ref_assert_var = False\n", "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "except Exception:\n", "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "assert ref_tmp_var" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "l4", "s4", "content" 
] }, "source": [ "For the above joint distribution, the marginal distributions of X and Y are given below:\n", "\n", "Marginal Distribution of X:\n", "\n", "\\begin{array}{ l | c | c | c | r }\n", " \\hline\n", " X-> & 2 & 3 & 4 & 5 \\\\ \n", " \\hline\n", " f(X) & 0.34 & 0.33 & 0.19 & 0.14 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "Marginal Distribution of Y:\n", "\n", "\\begin{array}{ l | c | c | r }\n", " \\hline\n", " Y-> & 0 & 1 & 2 \\\\ \n", " \\hline\n", " f(Y) & 0.03 & 0.80 & 0.17 \\\\ \n", " \\hline\n", "\\end{array}\n", "\n", "\n", "\n", "\n", "## Fraud Modeling Example\n", "\n", "Consider a simple model of fraudulent transactions. The data contains the variables Sex (S), Age (A), Fraud (F) and Jewelry (J), together with the joint probability P = P(S, A, F, J) of each combination:\n", "\n", "| S | A | F | J | P |\n", "|-----|-----|-----|-----|----------------|\n", "| S_0 | A_0 | F_0 | J_0 | 0.0025 |\n", "| S_0 | A_0 | F_0 | J_1 | 0.0100 |\n", "| S_0 | A_0 | F_1 | J_0 | 0.1069 |\n", "| ... | ... | ... | ... | ... |\n", "| S_1 | A_2 | F_1 | J_1 | 0.0079 |\n", "\n", "\n", "Here F = No corresponds to the value F_1.\n", "\n", "* Compute P(S, A, F, J | F = No) and assign it to p_SAFJ. On the rows with F = F_1, this conditional probability equals P(S, A, F, J) / P(F = No), where P(F = No) is the sum of P over those rows." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "l4", "s4", "ce" ] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>S</th><th>A</th><th>F</th><th>J</th><th>P</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>S_0</td><td>A_0</td><td>F_0</td><td>J_0</td><td>0.0025</td></tr>\n", " <tr><th>1</th><td>S_0</td><td>A_0</td><td>F_0</td><td>J_1</td><td>0.0100</td></tr>\n", " <tr><th>2</th><td>S_0</td><td>A_0</td><td>F_1</td><td>J_0</td><td>0.1069</td></tr>\n", " <tr><th>3</th><td>S_0</td><td>A_0</td><td>F_1</td><td>J_1</td><td>0.0056</td></tr>\n", " <tr><th>4</th><td>S_0</td><td>A_1</td><td>F_0</td><td>J_0</td><td>0.0008</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " S A F J P\n", "0 S_0 A_0 F_0 J_0 0.0025\n", "1 S_0 A_0 F_0 J_1 0.0100\n", "2 S_0 A_0 F_1 J_0 0.1069\n", "3 S_0 A_0 F_1 J_1 0.0056\n", "4 S_0 A_1 F_0 J_0 0.0008" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "fraud_data = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/Fraud/fraud_data.csv')\n", "fraud_data.head()" ] },
{ "cell_type": "markdown", "metadata": { "tags": [ "s4", "l4", "hint" ] }, "source": [ "Use fraud_data['F'].str.contains('F_1') to select the rows with F = F_1, then renormalize P." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "s4", "l4", "ans" ] }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>S</th><th>A</th><th>F</th><th>J</th><th>P</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>2</th><td>S_0</td><td>A_0</td><td>F_1</td><td>J_0</td><td>0.118778</td></tr>\n", " <tr><th>3</th><td>S_0</td><td>A_0</td><td>F_1</td><td>J_1</td><td>0.006222</td></tr>\n", " <tr><th>6</th><td>S_0</td><td>A_1</td><td>F_1</td><td>J_0</td><td>0.190000</td></tr>\n", " <tr><th>7</th><td>S_0</td><td>A_1</td><td>F_1</td><td>J_1</td><td>0.010000</td></tr>\n", " <tr><th>10</th><td>S_0</td><td>A_2</td><td>F_1</td><td>J_0</td><td>0.166222</td></tr>\n", " <tr><th>11</th><td>S_0</td><td>A_2</td><td>F_1</td><td>J_1</td><td>0.008778</td></tr>\n", " <tr><th>14</th><td>S_1</td><td>A_0</td><td>F_1</td><td>J_0</td><td>0.118778</td></tr>\n", " <tr><th>15</th><td>S_1</td><td>A_0</td><td>F_1</td><td>J_1</td><td>0.006222</td></tr>\n", " <tr><th>18</th><td>S_1</td><td>A_1</td><td>F_1</td><td>J_0</td><td>0.190000</td></tr>\n", " <tr><th>19</th><td>S_1</td><td>A_1</td><td>F_1</td><td>J_1</td><td>0.010000</td></tr>\n", " <tr><th>22</th><td>S_1</td><td>A_2</td><td>F_1</td><td>J_0</td><td>0.166222</td></tr>\n", " <tr><th>23</th><td>S_1</td><td>A_2</td><td>F_1</td><td>J_1</td><td>0.008778</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " S A F J P\n", "2 S_0 A_0 F_1 J_0 0.118778\n", "3 S_0 A_0 F_1 J_1 0.006222\n", "6 S_0 A_1 F_1 J_0 0.190000\n", "7 S_0 A_1 F_1 J_1 0.010000\n", "10 S_0 A_2 F_1 J_0 0.166222\n", "11 S_0 A_2 F_1 J_1 0.008778\n", "14 S_1 A_0 F_1 J_0 0.118778\n", "15 S_1 A_0 F_1 J_1 0.006222\n", "18 S_1 A_1 F_1 J_0 0.190000\n", "19 S_1 A_1 F_1 J_1 0.010000\n", "22 S_1 A_2 F_1 J_0 0.166222\n", "23 S_1 A_2 F_1 J_1 0.008778" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only the rows with F = F_1 (F = No); copy so the assignment below does not raise SettingWithCopyWarning\n", "p_SAFJ = fraud_data[fraud_data['F'].str.contains('F_1')].copy()\n", "# Renormalize so that the conditional probabilities sum to 1\n", "p_SAFJ['P'] = p_SAFJ['P'] / p_SAFJ['P'].sum()\n", "p_SAFJ" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "s4", "hid", "l4" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "try:\n", "    # Row 2 must hold the renormalized value and the conditional distribution must sum to 1\n", "    if abs(p_SAFJ['P'][2] - 0.118778) < 1e-4 and abs(p_SAFJ['P'].sum() - 1) < 1e-6: \n", "        ref_assert_var = True\n", "        ref_tmp_var = True\n", "    else:\n", "        ref_assert_var = False\n", "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "except Exception:\n", "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }