{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Models, response schedules, and estimators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook summarizes some probability distributions and models related to them, and draws a distinction between a model and a response schedule." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some common probability distributions\n", "\n", "### Discrete\n", "\n", "+ Bernoulli: distribution of a single trial that can result in \"success\" (1) or \"failure\" (0). A random variable $X$ has the Bernoulli($p$) distribution iff \n", "$$\\Pr \\{ X=1 \\} = p, \\;\\; \\Pr \\{X=0\\} = 1-p.$$\n", "\n", "+ Binomial: distribution of the number of successes in $n$ independent Bernoulli($p$) trials. Special case: Bernoulli ($n=1$). A random variable $X$ has a Binomial($n,p$) distribution iff \n", "$$\\Pr \\{X=j\\} = {{n}\\choose{j}}p^j(1-p)^{n-j}, \\; j=0, 1, \\ldots, n.$$\n", "\n", "+ Geometric: distribution of the number of trials until the 1st success in independent Bernoulli($p$) trials. A random variable $X$ has a Geometric($p$) distribution iff \n", "$$ \\Pr \\{X=j\\} = (1-p)^{j-1}p, \\;\\; j=1, 2, \\ldots .$$\n", "\n", "+ Negative binomial: distribution of the number of trials until the $k$th success in independent Bernoulli($p$) trials. Special case: geometric ($k=1$). A random variable $X$ has a Negative Binomial distribution with parameters $p$ and $k$ distribution iff \n", "$$\\Pr \\{X=j\\} = {{j-1}\\choose{k-1}}(1-p)^{j-k}p^k, \\;\\; j=k, k+1, \\ldots .$$\n", "\n", "+ Poisson: limit of Binomial as $n \\rightarrow \\infty$ and $p \\rightarrow 0$, with $np= \\lambda$. A random variable $X$ has a Poisson($\\lambda$) distribution iff\n", "$$ \\Pr \\{X=j\\} = e^{-\\lambda} \\frac{\\lambda^j}{j!}, \\;\\;j=0, 1, \\ldots .$$\n", "\n", "+ Hypergeometric: number of \"good\" items in a simple random sample of size $n$ from a population of $N$ items of which $G$ are good. A random variable $X$ has a hypergeometric distribution with parameters $N$, $G$, and $n$ iff\n", "$$ \\Pr \\{X = j,\\; j = 1, \\ldots, k\\} = \\frac{{{G}\\choose{j}}{{N-G}\\choose{n-j}}}{{N}\\choose{n}}, \\;\\; j = \\max(0,n-(N-G)), \\ldots, \\min(n, G).$$\n", "\n", "+ Multinomial: joint distribution of the number of values in each of $k \\ge 2$ categories\n", "for $n$ IID draws with probability $\\pi_j$ of selecting value $j$ in each draw. Special cases: uniform distribution on $k$ outcomes ($n=1$, $\\pi_j = 1/k$), binomial ($k=2$). A random vector $(X_1, \\ldots, X_k)$ has a multinomial joint distribution with parameters $n$\n", "and $\\{\\pi_j\\}_{j=1}^k$ iff\n", "$$ \\Pr \\{X_j = x_j \\} = \\prod_{j=1}^k \\pi_j^{x_j} \\frac{n!}{x_1!x_2! \\cdots x_j!}, \\;\\; x_j \\ge 0,\\;\\; \\sum_{j=1}^k x_j = n.$$\n", "\n", "+ Multi-hypergeometric: joint distribution of the number of values in each of $k \\ge 2$ categories for $n$ draws without replacement from a finite population of $N$ items of\n", "which $N_j$ are in category $j$. Special case: hypergeometric ($k = 2$). A random vector $(X_1, \\ldots, X_k)$ has a multi-hypergeometric joint distribution with parameters $\\{N_j\\}_{j=1}^k$ iff\n", "$$ \\Pr \\{ X_j = x_j,\\; j = 1, \\ldots, k \\} = \\frac{{{N_1}\\choose{x_1}} \\cdots {{N_k}\\choose{x_k}}}{{{N}\\choose{n}}}, \\;\\; x_j \\ge 0;\\;\\; \\sum_j x_j = n; \\;\\; \\sum_j N_j = N.$$\n", "\n", "### Continuous\n", "\n", "+ Uniform on a domain $\\mathbf{S}$. A random variable $X$ has a Uniform distribution on $\\mathbf{S}$ iff\n", "$$ \\Pr \\{X \\in A\\} = \\frac{\\int_{A \\cap S} dx}{\\int_{S} dx}.$$\n", "(Here and below, $A$ needs to be a Lebesgue-measurable set; we will not worry about measurability.)\n", "\n", "+ Normal. A random variable $X$ has a normal distribution with mean $\\mu$ and variance $\\sigma^2$ iff\n", "$$ \\Pr \\{ X \\in A \\} = \\int_A \\frac{1}{\\sqrt{2 \\pi} \\sigma} e^{-(x-\\mu)^2/(2\\sigma^2)} dx.$$\n", "\n", "+ Distributions derived from the normal: Student's t, F, chi-square\n", "\n", "+ Exponential. A random variable $X$ has an exponential distribution with rate $\\lambda$ \n", "(mean $\\lambda^{-1}$) iff\n", "$$ \\Pr \\{ X \\in A \\} = \\int_{A \\cap [0, \\infty)} \\lambda e^{-\\lambda x} dx.$$\n", "\n", "+ Gamma. A random variable $X$ has a Gamma distribution with shape parameter $\\alpha$\n", "and rate parameter $\\beta$ iff\n", "$$ \\Pr \\{ X \\in A \\} = \\int_{A \\cap [0, \\infty)}\\frac{\\beta ^{\\alpha }}{\\Gamma (\\alpha )}x^{\\alpha -1}e^{-\\beta x} dx.$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's a model?\n", "\n", "An expression for the probability distribution of data $X$, usually \"indexed\" by a (possibly abstract, possibly infinite-dimensional) parameter, often relating some observables (_independent variables_, _covariates_, _explanatory variables_, _predictors_) to others (_dependent variables_, _response variables_, _data_, ...).\n", "\n", "$$\n", " X \\sim \\mathbb{P}_\\theta, \\;\\; \\theta \\in \\Theta.\n", "$$\n", "\n", "### Examples\n", "\n", "+ coins and 0-1 boxes\n", " - number of heads in 1 toss\n", " - number of heads in $n$ tosses\n", " - number of tosses to the first head\n", " - number of tosses to the $k$th head\n", " \n", "+ draws without replacement\n", " - boxes of numbers\n", " - boxes of categories\n", "\n", "+ radioactive decay\n", "\n", "+ Hooke's Law, Ohm's Law, Boyle's Law\n", "\n", "+ Conjoint analysis\n", "\n", "+ avian-turbine interactions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some models\n", "\n", "+ Linear regression \n", "\n", "+ Linear probability model\n", "\n", "+ Logit\n", "\n", "+ Probit\n", "\n", "+ Multinomial logit\n", "\n", "+ Poisson regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Response schedules and causal inference\n", "\n", "A response schedule is an assertion about how Nature generated the data: it says how one variable would respond if you intervened and changed the value of other variables.\n", "\n", "Regression is about _conditional expectation_: the expected value of\n", "the response variable for cases _selected_ on the basis of the values of the predictor variables.\n", "\n", "Causal inference is about _intervention_: what would happen if the values of the predictor variables were exogenously set to some values.\n", "\n", "Response schedules connect _selection_ to _intervention_.\n", "\n", "For conditioning to give the same result as intervention, the model has to be a response schedule, and the response schedule has to be correct.\n", "\n", "+ Linear: a model for real-valued outcomes. $Y_X = X\\beta + \\epsilon$. Nature picks $X$, multiplies by $\\beta$, adds $\\epsilon$. $X$ and $\\epsilon$ are independent.\n", "\n", " - Good examples (for suitable ranges of $X$ and suitable instrumental error): Hooke's law, Ohm's law, Boyle's law\n", " - Bad examples: most (if not all) applications in social science, including econometrics.\n", "\n", "+ Linear probability model: a model for binary outcomes. $\\Pr \\{Y_j = 1 | X \\} = X_j\\beta + \\epsilon$, where the components of $\\epsilon$ are IID with mean zero. Not guaranteed to give probabilities between 0 and 1 when fitted to data.\n", "\n", "+ Logit: a model for binary outcomes. Logistic distribution function is $\\Lambda(x) = e^x/(1+e^x)$. The logit function is $\\mathrm{logit} p \\equiv \\log [p/(1-p)]$, also called the _log odds ratio_. The logit model is that $\\{Y_j\\}$ are independent with $\\Pr \\{Y_j = 1 | X \\} = \\Lambda(X_j \\beta)$. Equivalently, $\\mathrm{logit} \\Pr(Y_j=1 | X) = X_j \\beta$. Also equivalently, the _latent variable_ formulation\n", "$$ Y_j = \\begin{cases} 1, & X_j\\beta + U_j \\ge 0\\\\ 0, & \\mathrm{otherwise,} \\end{cases}$$\n", "where $\\{U_j \\}$ are IID random variables with the logistic distribution, and are independent of $X$.\n", "\n", "+ Probit: a model for binary outcomes. Let $\\Phi$ denote the standard normal cdf. The probit model is that $\\{Y_j\\}$ are independent with $\\Pr \\{Y_j = 1 | X) = \\Phi(X_j \\beta)$. Equivalently, the latent variable formulation\n", "$$ Y_j = \\begin{cases} 1, & X_j\\beta + U_j \\ge 0\\\\ 0, &\\mathrm{otherwise,} \\end{cases}$$\n", "where $\\{U_j \\}$ are IID random variables with the standard normal distribution, and are independent of $X$.\n", "\n", "+ Multinomial logit: a model for categorical outcomes. Suppose there are $K$ categories.\n", "The multinomial logit model is that $\\{Y_j\\}$ are independent with\n", "$$ \\Pr \\{Y_j = k | X \\} = \\begin{cases} \\frac{e^{X_j \\beta_k}}{1 + \\sum_{\\ell=1}^{K-1}e^{X_j \\beta_\\ell}}, & k=1, \\ldots, K-1 \\\\ \\frac{1}{1 + \\sum_{\\ell=1}^{K-1}e^{X_j \\beta_\\ell}}, & k=K.\n", "\\end{cases}\n", "$$\n", "\n", "+ Poisson regression: a model for non-negative counts. The model is that $\\{Y_j\\}$ are independent Poisson random variables with corresponding rates $\\{\\lambda_j\\}$ and that\n", "$$ \\log \\lambda_j | X = X_j \\beta.$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }