{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is based partly on Freedman, D.A., 2009. [_Statistical Models: Theory and Practice_, Revised Edition](http://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850/), Cambridge University Press.\n", "\n", "Statistical models, and regression in particular, are used primarily for three purposes:\n", "\n", "1. _Description_: to summarize data\n", "2. _Prediction_: to predict future data\n", "3. _Causal Inference_: to predict what would happen in response to an intervention\n", "\n", "It is straightforward to check whether a regression model is a good summary of _existing_ data, although there is some subtlety in determining whether the summary is _good enough_. How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.\n", "\n", "Prediction is harder than description because it involves _extrapolation_: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., _stationary_) or are its properties changing with time?\n", "\n", "However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is _confounding_, especially when the data arise from _observational studies_ rather than _randomized, controlled experiments_. (_Natural experiments_ lie somewhere in between; there are few that approach the utility of\n", "a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)\n", "\n", "_Confounding_ happens when one factor or variable manifests in the outcome in a way that cannot be distinguished from the _treatment_.\n", "\n", "_Stratification_ (e.g., _cross tabulation_) can help reduce confounding. So can modeling—in some cases, but not in others. \n", "For modeling to help, it is generally necessary for the structure of the model to \n", "correspond to how the data were actually generated.\n", "Unfortunately, most models in science, especially social science, are chosen out of habit or computational\n", "convenience, not because they have a basis in the science itself.\n", "This often produces misleading results, and the misleading impression that those results have small\n", "uncertainties." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some notation\n", "\n", "If $\\{x_i\\}_{i=1}^n$ is a list of numbers, \n", "$$\n", "\\bar{x} \\equiv \\frac{1}{n} \\sum_{i=1}^n\n", "$$\n", "is the _mean_ of the list;\n", "$$\n", "\\mbox{var } x \\equiv \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2\n", "$$\n", "is the _variance_ of the list;\n", "$$\n", "s_x \\equiv \\sqrt{\\mbox{var } x}\n", "$$\n", "is the SD or _standard deviation_ of the list;\n", "and\n", "$$\n", " z_i \\equiv \\frac{x_i - \\bar{x}}{s_x}\n", "$$\n", "is _$x_i$ in standard units_.\n", "With $\\bar{y}$ and $s_y$ defined analogously for the list $\\{y_i\\}_{i=1}^n$, and if $s_x$\n", "and $s_y$ are nonzero,\n", "\n", "$$\n", " r_{xy} \\equiv \\frac{1}{n} \\sum_{i=1} \\frac{x_i - \\bar{x}}{s_x} \\cdot \\frac{y_i-\\bar{y}}{s_y}\n", "$$\n", "is the _correlation of $x$ and $y$_.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bivariate regression\n", "\n", "See [SticiGui: Regression](http://www.stat.berkeley.edu/~stark/SticiGui/Text/regression.htm)\n", "\n", "Suppose we observe pairs $\\{(x_i, y_i)\\}_{i=1}^n$.\n", "What straight line $y = a + bx$ comes closest to fitting these data, in the least-squares sense?\n", "That is, what values $a$ and $b$ minimize\n", "$$\n", " \\mbox{mean squared residual} = \\mbox{MSR} = \\equiv \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - (a + bx_i) \\right )^2?\n", "$$\n", "The solution is $b = r_{xy} \\frac{s_y}{s_x}$ and $a = \\bar{y} - b \\bar{x}$.\n", "\n", "_Proof._\n", "The mean squared residual is \n", "$$\n", " \\mbox{MSR} = \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - (a + bx_i) \\right )^2 = \n", " \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i^2 - 2y_i(a + bx_i) + (a+bx_i)^2 \\right ) =\n", " \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i^2 - 2y_i(a + bx_i) + a^2+2abx_i + b^2x_i^2 \\right ).\n", "$$\n", "\n", "This is quadratic in both $a$ and $b$, and has positive leading coefficients in both.\n", "Hence, its minimum occurs at a stationary point with respect to both $a$ and $b$.\n", "We can differentiate inside the sum and solve for the stationary point:\n", "\n", "$$\n", " 0 = \\frac{\\partial \\mbox{MSR}}{\\partial a} = \\frac{1}{n} \\sum_{i=1}^n 2(y_i - (a+b x_i))\n", " = 2 (\\bar{y} - a - b \\bar{x}).\n", "$$\n", "\n", "Hence, the optimal $a$ solves $a = \\bar{y} - b \\bar{x}$.\n", "Substituting this into the expression for mean squared residual gives\n", "\n", "$$ \n", "\\mbox{MSR} = \\frac{1}{n} \\sum_{i=1}^n \\left ( y_i - (\\bar{y} - b \\bar{x} + bx_i) \\right )^2 \n", "= \\frac{1}{n} \\sum_{i=1}^n \\left ( y_i - (\\bar{y} - b (\\bar{x} - x_i)) \\right )^2.\n", "$$\n", "\n", "Differentiating with respect to $b$ and finding the stationary point gives\n", "$$\n", " 0 = \\frac{\\partial \\mbox{MSR}}{\\partial b} = \n", " \\frac{1}{n} \\sum_{i=1}^n 2 \\left ( y_i - (\\bar{y} - b (\\bar{x} - x_i) ) \\right )(\\bar{x} - x_i)\n", " = \\frac{1}{n} \\sum_{i=1}^n ( y_i - \\bar{y})(\\bar{x} - x_i) + \\frac{b}{n} \\sum_{i=1}^n (\\bar{x}-x_i)^2\n", " = s_x s_y r_{xy} - b s_x^2,\n", "$$\n", "\n", "i.e., \n", "\n", "$$\n", "b = \\frac{s_x s_y r_{xy}}{s_x^2} = r_{xy}\\frac{s_y}{s_x}.\n", "$$\n", "\n", "The mean of the squared residuals from the regression line\n", "has a simple expression: \n", "\n", "$$\\frac{1}{n} \\sum_{i=1}^n e_i^2 = \n", "\\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - \\bar{y} + r_{xy} \\frac{s_y}{s_x} \\bar{x} - r_{xy} \\frac{s_y}{s_x} x_i) \\right )^2\n", "$$\n", "$$\n", "= \\frac{1}{n} \\sum_{i=1}^n ( y_i - \\bar{y})^2 + \n", "2r_{xy} \\frac{s_y}{s_x} \\frac{1}{n} \\sum_{i=1}^n (y_i - \\bar{y})(\\bar{x} - x_i)\n", "+ r_{xy}^2 \\frac{s_y^2}{s_x^2} \\frac{1}{n} \\sum_{i=1}^n (\\bar{x} - x_i)^2\n", "$$\n", "$$\n", "= s_y^2 + 2r_{xy} \\frac{s_y}{s_x} (-r_{xy}s_x s_y) + r_{xy}^2 \\frac{s_y^2}{s_x^2} s_x^2\n", "$$\n", "$$\n", "= s_y^2 - 2 r^2 s_y^2 + r^2 s_y^2 \n", "$$\n", "$$\n", "= s_y^2 (1 - r_{xy}^2).\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some statistical terminology: measuring error\n", "\n", "Suppose that $\\theta \\in \\Re^p$ is a parameter, and let $\\hat{\\theta}$ be an estimator of $\\theta$.\n", "The _bias_ of $\\hat{\\theta}$ is\n", "\n", "$$\n", " \\mbox{bias } \\hat{\\theta} = {\\mathbb E}(\\hat{\\theta} - \\theta) = {\\mathbb E} \\hat{\\theta} - \\theta.\n", "$$\n", "\n", "An estimate $\\hat{\\theta}$ of a parameter $\\theta \\in \\Theta$ is _unbiased_ \n", "if ${\\mathbb E}_\\eta \\hat{\\theta} = \\eta$ for all $\\eta \\in \\Theta$.\n", "\n", "The _mean squared error_ (MSE) of an estimate $\\hat{\\theta}$ of $\\theta$ is\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - \\theta \\|^2.\n", "$$\n", "\n", "The _variance_ of $\\hat{\\theta}$ is\n", "\n", "$$\n", "\\mbox{var } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "__Proposition.__\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = \\mbox{var } \\hat{\\theta} + \\| \\mbox{bias} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "_Proof._\n", "\n", "$$\n", "\\| \\hat{\\theta} - \\theta \\|^2 = \n", "\\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta}) \\|^2\n", "= (\\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta}))'\n", "(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta})) \n", "$$\n", "$$ =\n", "(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})\n", "+ 2 (\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "+ (\\theta - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "$$\n", "$$\n", "=\n", "\\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2 + \n", "2(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "+ \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "Hence,\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2\n", "+ {\\mathbb E} \\left ( 2(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta}) \\right )\n", "+ {\\mathbb E} \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2\n", "$$\n", "$$\n", "=\n", "{\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2\n", "+ 2 \\left ({\\mathbb E} \\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta}) \\right )\n", "+ {\\mathbb E} \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2\n", "$$\n", "$$\n", "= \\mbox{var } \\hat{\\theta} + \\| \\mbox{bias } \\hat{\\theta} \\|^2.\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Regression\n", "\n", "Multiple linear regression is a commonly used tool in most branches of science, as well as economics.\n", "It is frequently misinterpreted.\n", "\n", "We will start with an introduction to the linear algebra of constructing the least-squares estimate for\n", "linear regression, then discuss the features and limitations of regression, especially the limitations\n", "of drawing causal inferences from regression models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notation\n", "\n", "The basic relationship for linear regression is the equation\n", "\n", "$$\n", " Y = X \\beta + \\epsilon.\n", "$$\n", "\n", "Here, $Y$ is an $n$-vector of data, called the _dependent variable_, the _response_, the _explained variables_,\n", "the _modeled variables_, or the _left hand side_.\n", "The matrix $X \\in {\\mathcal M}(n,p)$ with $n \\ge p$, $\\mbox{rank}(X)=p$ (full rank)\n", "so that the columns of $X$ are linearly independent.\n", "\n", "The vector $\\beta$ is a $p$-vector of _parameters_, _model parameters_, or _coefficients_. \n", "The usual goal of regression is to estimate $\\beta$.\n", "\n", "The vector $\\epsilon$ is an $n$-vector of _error_, _disturbance_, or _noise_.\n", "\n", "We will usually assume that $\\epsilon$ is random, which makes $Y$ random as well.\n", "We will usually treat $X$ as fixed rather than random.\n", "\n", "There is an observation $Y_i$ for each _unit_ of observation, a row of $X$ for each observation,\n", "and a column of $X$ for each parameter (element of $\\beta$).\n", "The columns correspond to _explanatory variables_, _independent variables_, _covariates_, _control variables_,\n", "or _right-hand side variables_.\n", "\n", "We have observaations of $Y$, which are _assumed_ to be values of $X\\beta + \\epsilon$.\n", "The value of $\\beta$ is unknown.\n", "The value of $\\epsilon$ cannot be observed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standard assumption in multiple regression is that the noise terms $\\epsilon$ are random, with\n", "\n", "+ $\\{\\epsilon_i\\}$ iid (independent and identically distributed)\n", "+ ${\\mathbb E}\\epsilon_i = 0$\n", "+ $\\mbox{var}\\epsilon_i = \\sigma^2$, generally a known constant\n", "\n", "It is also standard to assume that if $X$ is random, $X$ and $\\epsilon$ are independent.\n", "Regardless, the value of $X$ is observed$.\n", "\n", "Since $X$ is observed, we can find $\\mbox{rank}(X)$.\n", "But there is no way to test whether the main assumptions are true, that is, whether\n", "+ $Y = X\\beta + \\epsilon$\n", "+ $\\{ \\epsilon_i \\}$ are iid with mean 0 and finite variance\n", "+ $X$ and $\\epsilon$ are independent.\n", "\n", "It is common to _condition_ on $X$; that is, to treat it as fixed erather than random." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ordinary least squares\n", "\n", "The ordinary least squares (OLS) estimate of $\\beta$ is\n", "\n", "$$ \n", "\\hat{\\beta} \\equiv \\left ( X'X \\right )^{-1}X' Y.\n", "$$\n", "\n", "The estimate $\\hat{\\beta}$ is a $p$-vector, just like $\\beta$.\n", "\n", "The _residuals_ or _fitting errors_ are\n", "$$\n", " e \\equiv Y - X \\hat{\\beta}.\n", "$$\n", "\n", "
\n", "__Theorem.__\n", "\n", "1. $e \\perp X$ (i.e., $X'e = {\\mathbf 0}$: the fitting errors are orthogonal to $X$)\n", "1. $\\min_{\\gamma \\in \\Re^p} \\| Y - X\\gamma\\|^2 = \\| Y - X \\hat{\\beta}\\|^2$ ($\\hat{\\beta}$ solves the least squares fitting problem)\n", "
\n", "\n", "This is a special case of the _Projection Theorem_.\n", "\n", "__Proof.__\n", "\n", "We prove the first part by direct calculation:\n", "$ X'e = X' (Y - X \\hat{\\beta})$, so\n", "\n", "$$ \\left ( X'X \\right )^{-1} X' e = \\left ( X'X \\right )^{-1} X' Y - \\left ( X'X \\right )^{-1} X'X \\hat{\\beta}\n", "= \\hat{\\beta} - \\hat{\\beta} = 0.\n", "$$\n", "\n", "For the second part, write $\\gamma = \\hat{\\beta} + (\\gamma - \\hat{\\beta})$.\n", "Then\n", "\n", "$$\n", "\\| Y - X \\gamma \\|^2 = \\| Y - X \\hat{\\beta} - X(\\gamma - \\hat{\\beta}) \\|^2 = \\| e - X (\\gamma - \\hat{\\beta})\\|^2\n", "$$\n", "$$\n", "= \\| e \\|^2 + 2 e' X(\\gamma - \\hat{\\beta}) + \\| X(\\gamma - \\hat{\\beta}) \\|^2\n", "\\ge \\| e \\|^2\n", "$$\n", "since $e'X = {\\mathbf 0}$.\n", "\n", "What does that orthogonality mean?\n", "Finding $\\hat{\\beta}$ amounts to finding the coefficients in a linear combination \n", "of columns of $X$ that give the best approximation to $Y$, in a least-squares sense.\n", "That is, we are approximating $Y \\in \\Re^n$ by an element of the $p$-dimensional\n", "subspace spanned by the columns of $X$.\n", "If the residuals had a non-zero component in that subspace, there would be a better approximation\n", "to $Y$ from that subspace.\n", "\n", "For instance, if any column of $X$ is constant, then the residuals should have zero mean:\n", "$\\sum_{i=1}^n e_j = 0$.\n", "Otherwise, the dot product of the residuals with that column of $X$ would not be zero:\n", "the residuals would not be orthogonal to $X$, and there would be an element of that\n", "subspace that is a better approximation to $Y$.\n", "\n", "Let's consider statistical properties of $\\hat{\\beta}$.\n", "We start with (conditional) bias; later we will get to the MSE.\n", "\n", "
\n", "__Theorem.__\n", "The ordinary least squares estimate $\\hat{\\beta}$ is _conditionally unbiased_: ${\\mathbb E} (\\hat{\\beta} \\mid X) = \\beta$.\n", "\n", "__Proof. __\n", "By calculation:\n", "\n", "$$ \n", "\\hat{\\beta} = (X'X)^{-1} X'Y = (X'X)^{-1} X'(X\\beta + \\epsilon)\n", "$$\n", "$$\n", "= (X'X)^{-1}X'X\\beta + (X'X)^{-1}X'\\epsilon\n", "= \\beta + (X'X)^{-1}X' \\epsilon.\n", "$$\n", "\n", "Now,\n", "$$ \n", "{\\mathbb E} \\left ( (X'X)^{-1} X' \\epsilon \\mid X \\right ) = (X'X)^{-1} X' E(\\epsilon \\mid X).\n", "$$\n", "\n", "Since $\\epsilon$ and $X$ are independent, \n", "$$\n", " {\\mathbb E} (\\epsilon \\mid X) = {\\mathbb E}\\epsilon = 0.\n", "$$\n", "\n", "Hence,\n", "$$\n", "{\\mathbb E} (\\hat{\\beta} \\vert X) = {\\mathbb E}(\\beta + (X'X)^{-1}X'\\epsilon \\mid X) = \\beta + {\\mathbf 0} = \\beta.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computational example\n", "\n", "We drop an object from an unknown height $h$, with an unknown velocity $v$, subject to an unknown\n", "constant acceleration $a$.\n", "\n", "We measure the height $Y_i$ of the object at times 0, 1, 2, and 3 seconds:\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "
$t_i$ $Y_i$
0 2.043
1 6.856
2 22.565
3 47.709
\n", "\n", "The height at time $t$ is\n", "\n", "$$\n", " Y_i = h + v t_i + (1/2) a t_i^2 + \\epsilon_i.\n", "$$\n", "\n", "This has three unknown parameters: the initial height $h$, the initial velocity $v$, and the acceleration $a$.\n", "It is an (unverifiable)\n", "assumption that the noise terms $\\{\\epsilon_i\\}$ are independent and have zero mean and finite variance.\n", "\n", "We can expand the data relationship as follows:\n", "\n", "$$Y_1 = h + vt_1 + 1/2 a t_1^2 + \\epsilon_1 = h \\times 1 + v \\times 0 + a \\times 0 + \\epsilon_1$$\n", "$$Y_2 = h + vt_2 + 1/2 a t_2^2 + \\epsilon_2 = h \\times 1 + v \\times 1 + a \\times (1/2)1^2 + \\epsilon_2$$\n", "$$Y_3 = h + vt_3 + 1/2 a t_3^2 + \\epsilon_3 = h \\times 1 + v \\times 2 + a \\times (1/2)2^2 + \\epsilon_3$$\n", "$$Y_4 = h + vt_4 + 1/2 a t_3^4 + \\epsilon_4 = h \\times 1 + v \\times 3 + a \\times (1/2)3^2 + \\epsilon_3.$$\n", "\n", "We can write these relations in matrix form, as follows.\n", "\n", "Define\n", "$$\n", "Y \\equiv\n", "\\begin{pmatrix}\n", " 2.043 \\\\\n", " 6.856 \\\\\n", " 22.565 \\\\\n", " 47.709\n", "\\end{pmatrix},\n", "$$\n", "\n", "$$\n", " X \\equiv \\begin{pmatrix}\n", " 1 & 0 & 0 \\\\\n", " 1 & 1 & 0.5 \\\\\n", " 1 & 2 & 2 \\\\\n", " 1 & 3 & 4.5\n", "\\end{pmatrix},\n", "$$\n", "\n", "$$\n", "\\beta \\equiv \\begin{pmatrix}\n", "h \\\\\n", "v \\\\\n", "a\n", "\\end{pmatrix},\n", "$$\n", "\n", "and\n", "\n", "$$\n", "\\epsilon \\equiv \\begin{pmatrix}\n", "\\epsilon_1 \\\\\n", "\\epsilon_2 \\\\\n", "\\epsilon_3 \\\\\n", "\\epsilon_4 \\\\\n", "\\end{pmatrix}\n", ".\n", "$$\n", "\n", "Then the data relationship can be written\n", "\n", "$$\n", "Y = X \\beta + \\epsilon.\n", "$$\n", "\n", "The least squares estimate $\\hat{\\beta}$ of $\\beta$ is\n", "$$\n", "\\hat{\\beta} = (X'X)^{-1}X'Y.\n", "$$\n", "If the Newtonian gravity data relations are correct and the observational\n", "noise $\\epsilon$ really has zero mean, then _formally_ $\\hat{\\beta}$ is an unbiased\n", "estimate of $\\beta$.\n", "However, we should not calculate $\\hat{\\beta}$ by inverting $X'X$, as discussed previously.\n", "Rather, we should use matrix factorization or back-substitution." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
100
1.01.00.5
122
1.03.04.5
\n" ], "text/latex": [ "\\begin{tabular}{lll}\n", "\t 1 & 0 & 0\\\\\n", "\t 1.0 & 1.0 & 0.5\\\\\n", "\t 1 & 2 & 2\\\\\n", "\t 1.0 & 3.0 & 4.5\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 1\n", "2. 1\n", "3. 1\n", "4. 1\n", "5. 0\n", "6. 1\n", "7. 2\n", "8. 3\n", "9. 0\n", "10. 0.5\n", "11. 2\n", "12. 4.5\n", "\n", "\n" ], "text/plain": [ " [,1] [,2] [,3]\n", "[1,] 1 0 0.0\n", "[2,] 1 1 0.5\n", "[3,] 1 2 2.0\n", "[4,] 1 3 4.5" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
1.96995
0.02245
10.1655
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t 1.96995\\\\\n", "\t 0.02245\\\\\n", "\t 10.1655\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 1.96995\n", "2. 0.0224499999999926\n", "3. 10.1655\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] 1.96995\n", "[2,] 0.02245\n", "[3,] 10.16550" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Least squares example in R\n", "Y <- c(2.043, 6.856, 22.565, 47.709);\n", "X <- matrix(c(1,0,0,1,1,0.5,1,2,2,1,3,9/2), nrow = 4, ncol=3, byrow = T);\n", "X\n", "betahat <- solve(t(X) %*% X, t(X) %*% Y);\n", "betahat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is, we would estimate that the object was dropped from a height of 1.970 m\n", "with an initial velocity of 0.022 m/s, under gravitational acceleration of 10.166 m/s2.\n", "\n", "Let's find the residual vector and verify that it is orthogonal to $X$:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
0.07305
-0.21915
0.21915
-0.07305
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t 0.07305\\\\\n", "\t -0.21915\\\\\n", "\t 0.21915\\\\\n", "\t -0.07305\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 0.0730499999999983\n", "2. -0.219149999999998\n", "3. 0.219149999999999\n", "4. -0.0730500000000092\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] 0.07305\n", "[2,] -0.21915\n", "[3,] 0.21915\n", "[4,] -0.07305" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "\n", "\n", "\t\n", "\n", "
-9.992007e-15-2.753353e-14-4.218847e-14
\n" ], "text/latex": [ "\\begin{tabular}{lll}\n", "\t -9.992007e-15 & -2.753353e-14 & -4.218847e-14\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. -9.99200722162641e-15\n", "2. -2.75335310107039e-14\n", "3. -4.21884749357559e-14\n", "\n", "\n" ], "text/plain": [ " [,1] [,2] [,3]\n", "[1,] -9.992007e-15 -2.753353e-14 -4.218847e-14" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e <- Y - X %*% betahat\n", "e\n", "t(e) %*% X" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
lm {stats}R Documentation
\n", "\n", "

Fitting Linear Models

\n", "\n", "

Description

\n", "\n", "

lm is used to fit linear models.\n", "It can be used to carry out regression,\n", "single stratum analysis of variance and\n", "analysis of covariance (although aov may provide a more\n", "convenient interface for these).\n", "

\n", "\n", "\n", "

Usage

\n", "\n", "
\n",
       "lm(formula, data, subset, weights, na.action,\n",
       "   method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n",
       "   singular.ok = TRUE, contrasts = NULL, offset, ...)\n",
       "
\n", "\n", "\n", "

Arguments

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
formula\n", "

an object of class \"formula\" (or one that\n", "can be coerced to that class): a symbolic description of the\n", "model to be fitted. The details of model specification are given\n", "under ‘Details’.

\n", "
data\n", "

an optional data frame, list or environment (or object\n", "coercible by as.data.frame to a data frame) containing\n", "the variables in the model. If not found in data, the\n", "variables are taken from environment(formula),\n", "typically the environment from which lm is called.

\n", "
subset\n", "

an optional vector specifying a subset of observations\n", "to be used in the fitting process.

\n", "
weights\n", "

an optional vector of weights to be used in the fitting\n", "process. Should be NULL or a numeric vector.\n", "If non-NULL, weighted least squares is used with weights\n", "weights (that is, minimizing sum(w*e^2)); otherwise\n", "ordinary least squares is used. See also ‘Details’,

\n", "
na.action\n", "

a function which indicates what should happen\n", "when the data contain NAs. The default is set by\n", "the na.action setting of options, and is\n", "na.fail if that is unset. The ‘factory-fresh’\n", "default is na.omit. Another possible value is\n", "NULL, no action. Value na.exclude can be useful.

\n", "
method\n", "

the method to be used; for fitting, currently only\n", "method = \"qr\" is supported; method = \"model.frame\" returns\n", "the model frame (the same as with model = TRUE, see below).

\n", "
model, x, y, qr\n", "

logicals. If TRUE the corresponding\n", "components of the fit (the model frame, the model matrix, the\n", "response, the QR decomposition) are returned.\n", "

\n", "
singular.ok\n", "

logical. If FALSE (the default in S but\n", "not in R) a singular fit is an error.

\n", "
contrasts\n", "

an optional list. See the contrasts.arg\n", "of model.matrix.default.

\n", "
offset\n", "

this can be used to specify an a priori known\n", "component to be included in the linear predictor during fitting.\n", "This should be NULL or a numeric vector of length equal to\n", "the number of cases. One or more offset terms can be\n", "included in the formula instead or as well, and if more than one are\n", "specified their sum is used. See model.offset.

\n", "
...\n", "

additional arguments to be passed to the low level\n", "regression fitting functions (see below).

\n", "
\n", "\n", "\n", "

Details

\n", "\n", "

Models for lm are specified symbolically. A typical model has\n", "the form response ~ terms where response is the (numeric)\n", "response vector and terms is a series of terms which specifies a\n", "linear predictor for response. A terms specification of the form\n", "first + second indicates all the terms in first together\n", "with all the terms in second with duplicates removed. A\n", "specification of the form first:second indicates the set of\n", "terms obtained by taking the interactions of all terms in first\n", "with all terms in second. The specification first*second\n", "indicates the cross of first and second. This is\n", "the same as first + second + first:second.\n", "

\n", "

If the formula includes an offset, this is evaluated and\n", "subtracted from the response.\n", "

\n", "

If response is a matrix a linear model is fitted separately by\n", "least-squares to each column of the matrix.\n", "

\n", "

See model.matrix for some further details. The terms in\n", "the formula will be re-ordered so that main effects come first,\n", "followed by the interactions, all second-order, all third-order and so\n", "on: to avoid this pass a terms object as the formula (see\n", "aov and demo(glm.vr) for an example).\n", "

\n", "

A formula has an implied intercept term. To remove this use either\n", "y ~ x - 1 or y ~ 0 + x. See formula for\n", "more details of allowed formulae.\n", "

\n", "

Non-NULL weights can be used to indicate that different\n", "observations have different variances (with the values in\n", "weights being inversely proportional to the variances); or\n", "equivalently, when the elements of weights are positive\n", "integers w_i, that each response y_i is the mean of\n", "w_i unit-weight observations (including the case that there are\n", "w_i observations equal to y_i and the data have been\n", "summarized).\n", "

\n", "

lm calls the lower level functions lm.fit, etc,\n", "see below, for the actual numerical computations. For programming\n", "only, you may consider doing likewise.\n", "

\n", "

All of weights, subset and offset are evaluated\n", "in the same way as variables in formula, that is first in\n", "data and then in the environment of formula.\n", "

\n", "\n", "\n", "

Value

\n", "\n", "

lm returns an object of class \"lm\" or for\n", "multiple responses of class c(\"mlm\", \"lm\").\n", "

\n", "

The functions summary and anova are used to\n", "obtain and print a summary and analysis of variance table of the\n", "results. The generic accessor functions coefficients,\n", "effects, fitted.values and residuals extract\n", "various useful features of the value returned by lm.\n", "

\n", "

An object of class \"lm\" is a list containing at least the\n", "following components:\n", "

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
coefficients\n", "

a named vector of coefficients

\n", "
residuals\n", "

the residuals, that is response minus fitted values.

\n", "
fitted.values\n", "

the fitted mean values.

\n", "
rank\n", "

the numeric rank of the fitted linear model.

\n", "
weights\n", "

(only for weighted fits) the specified weights.

\n", "
df.residual\n", "

the residual degrees of freedom.

\n", "
call\n", "

the matched call.

\n", "
terms\n", "

the terms object used.

\n", "
contrasts\n", "

(only where relevant) the contrasts used.

\n", "
xlevels\n", "

(only where relevant) a record of the levels of the\n", "factors used in fitting.

\n", "
offset\n", "

the offset used (missing if none were used).

\n", "
y\n", "

if requested, the response used.

\n", "
x\n", "

if requested, the model matrix used.

\n", "
model\n", "

if requested (the default), the model frame used.

\n", "
na.action\n", "

(where relevant) information returned by\n", "model.frame on the special handling of NAs.

\n", "
\n", "

In addition, non-null fits will have components assign,\n", "effects and (unless not requested) qr relating to the linear\n", "fit, for use by extractor functions such as summary and\n", "effects.\n", "

\n", "\n", "\n", "

Using time series

\n", "\n", "

Considerable care is needed when using lm with time series.\n", "

\n", "

Unless na.action = NULL, the time series attributes are\n", "stripped from the variables before the regression is done. (This is\n", "necessary as omitting NAs would invalidate the time series\n", "attributes, and if NAs are omitted in the middle of the series\n", "the result would no longer be a regular time series.)\n", "

\n", "

Even if the time series attributes are retained, they are not used to\n", "line up series, so that the time shift of a lagged or differenced\n", "regressor would be ignored. It is good practice to prepare a\n", "data argument by ts.intersect(..., dframe = TRUE),\n", "then apply a suitable na.action to that data frame and call\n", "lm with na.action = NULL so that residuals and fitted\n", "values are time series.\n", "

\n", "\n", "\n", "

Note

\n", "\n", "

Offsets specified by offset will not be included in predictions\n", "by predict.lm, whereas those specified by an offset term\n", "in the formula will be.\n", "

\n", "\n", "\n", "

Author(s)

\n", "\n", "

The design was inspired by the S function of the same name described\n", "in Chambers (1992). The implementation of model formula by Ross Ihaka\n", "was based on Wilkinson & Rogers (1973).\n", "

\n", "\n", "\n", "

References

\n", "\n", "

Chambers, J. M. (1992)\n", "Linear models.\n", "Chapter 4 of Statistical Models in S\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.\n", "

\n", "

Wilkinson, G. N. and Rogers, C. E. (1973)\n", "Symbolic descriptions of factorial models for analysis of variance.\n", "Applied Statistics, 22, 392–9.\n", "

\n", "\n", "\n", "

See Also

\n", "\n", "

summary.lm for summaries and anova.lm for\n", "the ANOVA table; aov for a different interface.\n", "

\n", "

The generic functions coef, effects,\n", "residuals, fitted, vcov.\n", "

\n", "

predict.lm (via predict) for prediction,\n", "including confidence and prediction intervals;\n", "confint for confidence intervals of parameters.\n", "

\n", "

lm.influence for regression diagnostics, and\n", "glm for generalized linear models.\n", "

\n", "

The underlying low level functions,\n", "lm.fit for plain, and lm.wfit for weighted\n", "regression fitting.\n", "

\n", "

More lm() examples are available e.g., in\n", "anscombe, attitude, freeny,\n", "LifeCycleSavings, longley,\n", "stackloss, swiss.\n", "

\n", "

biglm in package biglm for an alternative\n", "way to fit linear models to large datasets (especially those with many\n", "cases).\n", "

\n", "\n", "\n", "

Examples

\n", "\n", "
\n",
       "require(graphics)\n",
       "\n",
       "## Annette Dobson (1990) \"An Introduction to Generalized Linear Models\".\n",
       "## Page 9: Plant Weight Data.\n",
       "ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)\n",
       "trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)\n",
       "group <- gl(2, 10, 20, labels = c(\"Ctl\",\"Trt\"))\n",
       "weight <- c(ctl, trt)\n",
       "lm.D9 <- lm(weight ~ group)\n",
       "lm.D90 <- lm(weight ~ group - 1) # omitting intercept\n",
       "\n",
       "anova(lm.D9)\n",
       "summary(lm.D90)\n",
       "\n",
       "opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))\n",
       "plot(lm.D9, las = 1)      # Residuals, Fitted, ...\n",
       "par(opar)\n",
       "\n",
       "### less simple examples in \"See Also\" above\n",
       "
\n", "\n", "
[Package stats version 3.1.3 ]
" ], "text/latex": [ "\\inputencoding{utf8}\n", "\\HeaderA{lm}{Fitting Linear Models}{lm}\n", "\\keyword{regression}{lm}\n", "%\n", "\\begin{Description}\\relax\n", "\\code{lm} is used to fit linear models.\n", "It can be used to carry out regression,\n", "single stratum analysis of variance and\n", "analysis of covariance (although \\code{\\LinkA{aov}{aov}} may provide a more\n", "convenient interface for these).\n", "\\end{Description}\n", "%\n", "\\begin{Usage}\n", "\\begin{verbatim}\n", "lm(formula, data, subset, weights, na.action,\n", " method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n", " singular.ok = TRUE, contrasts = NULL, offset, ...)\n", "\\end{verbatim}\n", "\\end{Usage}\n", "%\n", "\\begin{Arguments}\n", "\\begin{ldescription}\n", "\\item[\\code{formula}] an object of class \\code{\"\\LinkA{formula}{formula}\"} (or one that\n", "can be coerced to that class): a symbolic description of the\n", "model to be fitted. The details of model specification are given\n", "under `Details'.\n", "\n", "\\item[\\code{data}] an optional data frame, list or environment (or object\n", "coercible by \\code{\\LinkA{as.data.frame}{as.data.frame}} to a data frame) containing\n", "the variables in the model. If not found in \\code{data}, the\n", "variables are taken from \\code{environment(formula)},\n", "typically the environment from which \\code{lm} is called.\n", "\n", "\\item[\\code{subset}] an optional vector specifying a subset of observations\n", "to be used in the fitting process.\n", "\n", "\\item[\\code{weights}] an optional vector of weights to be used in the fitting\n", "process. Should be \\code{NULL} or a numeric vector.\n", "If non-NULL, weighted least squares is used with weights\n", "\\code{weights} (that is, minimizing \\code{sum(w*e\\textasciicircum{}2)}); otherwise\n", "ordinary least squares is used. See also `Details',\n", "\n", "\\item[\\code{na.action}] a function which indicates what should happen\n", "when the data contain \\code{NA}s. The default is set by\n", "the \\code{na.action} setting of \\code{\\LinkA{options}{options}}, and is\n", "\\code{\\LinkA{na.fail}{na.fail}} if that is unset. The `factory-fresh'\n", "default is \\code{\\LinkA{na.omit}{na.omit}}. Another possible value is\n", "\\code{NULL}, no action. Value \\code{\\LinkA{na.exclude}{na.exclude}} can be useful.\n", "\n", "\\item[\\code{method}] the method to be used; for fitting, currently only\n", "\\code{method = \"qr\"} is supported; \\code{method = \"model.frame\"} returns\n", "the model frame (the same as with \\code{model = TRUE}, see below).\n", "\n", "\\item[\\code{model, x, y, qr}] logicals. If \\code{TRUE} the corresponding\n", "components of the fit (the model frame, the model matrix, the\n", "response, the QR decomposition) are returned.\n", "\n", "\n", "\\item[\\code{singular.ok}] logical. If \\code{FALSE} (the default in S but\n", "not in \\R{}) a singular fit is an error.\n", "\n", "\\item[\\code{contrasts}] an optional list. See the \\code{contrasts.arg}\n", "of \\code{\\LinkA{model.matrix.default}{model.matrix.default}}.\n", "\n", "\\item[\\code{offset}] this can be used to specify an \\emph{a priori} known\n", "component to be included in the linear predictor during fitting.\n", "This should be \\code{NULL} or a numeric vector of length equal to\n", "the number of cases. One or more \\code{\\LinkA{offset}{offset}} terms can be\n", "included in the formula instead or as well, and if more than one are\n", "specified their sum is used. See \\code{\\LinkA{model.offset}{model.offset}}.\n", "\n", "\\item[\\code{...}] additional arguments to be passed to the low level\n", "regression fitting functions (see below).\n", "\\end{ldescription}\n", "\\end{Arguments}\n", "%\n", "\\begin{Details}\\relax\n", "Models for \\code{lm} are specified symbolically. A typical model has\n", "the form \\code{response \\textasciitilde{} terms} where \\code{response} is the (numeric)\n", "response vector and \\code{terms} is a series of terms which specifies a\n", "linear predictor for \\code{response}. A terms specification of the form\n", "\\code{first + second} indicates all the terms in \\code{first} together\n", "with all the terms in \\code{second} with duplicates removed. A\n", "specification of the form \\code{first:second} indicates the set of\n", "terms obtained by taking the interactions of all terms in \\code{first}\n", "with all terms in \\code{second}. The specification \\code{first*second}\n", "indicates the \\emph{cross} of \\code{first} and \\code{second}. This is\n", "the same as \\code{first + second + first:second}.\n", "\n", "If the formula includes an \\code{\\LinkA{offset}{offset}}, this is evaluated and\n", "subtracted from the response.\n", "\n", "If \\code{response} is a matrix a linear model is fitted separately by\n", "least-squares to each column of the matrix.\n", "\n", "See \\code{\\LinkA{model.matrix}{model.matrix}} for some further details. The terms in\n", "the formula will be re-ordered so that main effects come first,\n", "followed by the interactions, all second-order, all third-order and so\n", "on: to avoid this pass a \\code{terms} object as the formula (see\n", "\\code{\\LinkA{aov}{aov}} and \\code{demo(glm.vr)} for an example).\n", "\n", "A formula has an implied intercept term. To remove this use either\n", "\\code{y \\textasciitilde{} x - 1} or \\code{y \\textasciitilde{} 0 + x}. See \\code{\\LinkA{formula}{formula}} for\n", "more details of allowed formulae.\n", "\n", "Non-\\code{NULL} \\code{weights} can be used to indicate that different\n", "observations have different variances (with the values in\n", "\\code{weights} being inversely proportional to the variances); or\n", "equivalently, when the elements of \\code{weights} are positive\n", "integers \\eqn{w_i}{}, that each response \\eqn{y_i}{} is the mean of\n", "\\eqn{w_i}{} unit-weight observations (including the case that there are\n", "\\eqn{w_i}{} observations equal to \\eqn{y_i}{} and the data have been\n", "summarized).\n", "\n", "\\code{lm} calls the lower level functions \\code{\\LinkA{lm.fit}{lm.fit}}, etc,\n", "see below, for the actual numerical computations. For programming\n", "only, you may consider doing likewise.\n", "\n", "All of \\code{weights}, \\code{subset} and \\code{offset} are evaluated\n", "in the same way as variables in \\code{formula}, that is first in\n", "\\code{data} and then in the environment of \\code{formula}.\n", "\\end{Details}\n", "%\n", "\\begin{Value}\n", "\\code{lm} returns an object of \\code{\\LinkA{class}{class}} \\code{\"lm\"} or for\n", "multiple responses of class \\code{c(\"mlm\", \"lm\")}.\n", "\n", "The functions \\code{summary} and \\code{\\LinkA{anova}{anova}} are used to\n", "obtain and print a summary and analysis of variance table of the\n", "results. The generic accessor functions \\code{coefficients},\n", "\\code{effects}, \\code{fitted.values} and \\code{residuals} extract\n", "various useful features of the value returned by \\code{lm}.\n", "\n", "An object of class \\code{\"lm\"} is a list containing at least the\n", "following components:\n", "\n", "\\begin{ldescription}\n", "\\item[\\code{coefficients}] a named vector of coefficients\n", "\\item[\\code{residuals}] the residuals, that is response minus fitted values.\n", "\\item[\\code{fitted.values}] the fitted mean values.\n", "\\item[\\code{rank}] the numeric rank of the fitted linear model.\n", "\\item[\\code{weights}] (only for weighted fits) the specified weights.\n", "\\item[\\code{df.residual}] the residual degrees of freedom.\n", "\\item[\\code{call}] the matched call.\n", "\\item[\\code{terms}] the \\code{\\LinkA{terms}{terms}} object used.\n", "\\item[\\code{contrasts}] (only where relevant) the contrasts used.\n", "\\item[\\code{xlevels}] (only where relevant) a record of the levels of the\n", "factors used in fitting.\n", "\\item[\\code{offset}] the offset used (missing if none were used).\n", "\\item[\\code{y}] if requested, the response used.\n", "\\item[\\code{x}] if requested, the model matrix used.\n", "\\item[\\code{model}] if requested (the default), the model frame used.\n", "\\item[\\code{na.action}] (where relevant) information returned by\n", "\\code{\\LinkA{model.frame}{model.frame}} on the special handling of \\code{NA}s.\n", "\n", "\\end{ldescription}\n", "In addition, non-null fits will have components \\code{assign},\n", "\\code{effects} and (unless not requested) \\code{qr} relating to the linear\n", "fit, for use by extractor functions such as \\code{summary} and\n", "\\code{\\LinkA{effects}{effects}}.\n", "\\end{Value}\n", "%\n", "\\begin{Section}{Using time series}\n", "Considerable care is needed when using \\code{lm} with time series.\n", "\n", "Unless \\code{na.action = NULL}, the time series attributes are\n", "stripped from the variables before the regression is done. (This is\n", "necessary as omitting \\code{NA}s would invalidate the time series\n", "attributes, and if \\code{NA}s are omitted in the middle of the series\n", "the result would no longer be a regular time series.)\n", "\n", "Even if the time series attributes are retained, they are not used to\n", "line up series, so that the time shift of a lagged or differenced\n", "regressor would be ignored. It is good practice to prepare a\n", "\\code{data} argument by \\code{\\LinkA{ts.intersect}{ts.intersect}(..., dframe = TRUE)},\n", "then apply a suitable \\code{na.action} to that data frame and call\n", "\\code{lm} with \\code{na.action = NULL} so that residuals and fitted\n", "values are time series.\n", "\\end{Section}\n", "%\n", "\\begin{Note}\\relax\n", "Offsets specified by \\code{offset} will not be included in predictions\n", "by \\code{\\LinkA{predict.lm}{predict.lm}}, whereas those specified by an offset term\n", "in the formula will be.\n", "\\end{Note}\n", "%\n", "\\begin{Author}\\relax\n", "The design was inspired by the S function of the same name described\n", "in Chambers (1992). The implementation of model formula by Ross Ihaka\n", "was based on Wilkinson \\& Rogers (1973).\n", "\\end{Author}\n", "%\n", "\\begin{References}\\relax\n", "Chambers, J. M. (1992)\n", "\\emph{Linear models.}\n", "Chapter 4 of \\emph{Statistical Models in S}\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth \\& Brooks/Cole.\n", "\n", "Wilkinson, G. N. and Rogers, C. E. (1973)\n", "Symbolic descriptions of factorial models for analysis of variance.\n", "\\emph{Applied Statistics}, \\bold{22}, 392--9.\n", "\\end{References}\n", "%\n", "\\begin{SeeAlso}\\relax\n", "\\code{\\LinkA{summary.lm}{summary.lm}} for summaries and \\code{\\LinkA{anova.lm}{anova.lm}} for\n", "the ANOVA table; \\code{\\LinkA{aov}{aov}} for a different interface.\n", "\n", "The generic functions \\code{\\LinkA{coef}{coef}}, \\code{\\LinkA{effects}{effects}},\n", "\\code{\\LinkA{residuals}{residuals}}, \\code{\\LinkA{fitted}{fitted}}, \\code{\\LinkA{vcov}{vcov}}.\n", "\n", "\\code{\\LinkA{predict.lm}{predict.lm}} (via \\code{\\LinkA{predict}{predict}}) for prediction,\n", "including confidence and prediction intervals;\n", "\\code{\\LinkA{confint}{confint}} for confidence intervals of \\emph{parameters}.\n", "\n", "\\code{\\LinkA{lm.influence}{lm.influence}} for regression diagnostics, and\n", "\\code{\\LinkA{glm}{glm}} for \\bold{generalized} linear models.\n", "\n", "The underlying low level functions,\n", "\\code{\\LinkA{lm.fit}{lm.fit}} for plain, and \\code{\\LinkA{lm.wfit}{lm.wfit}} for weighted\n", "regression fitting.\n", "\n", "More \\code{lm()} examples are available e.g., in\n", "\\code{\\LinkA{anscombe}{anscombe}}, \\code{\\LinkA{attitude}{attitude}}, \\code{\\LinkA{freeny}{freeny}},\n", "\\code{\\LinkA{LifeCycleSavings}{LifeCycleSavings}}, \\code{\\LinkA{longley}{longley}},\n", "\\code{\\LinkA{stackloss}{stackloss}}, \\code{\\LinkA{swiss}{swiss}}.\n", "\n", "\\code{biglm} in package \\Rhref{http://CRAN.R-project.org/package=biglm}{\\pkg{biglm}} for an alternative\n", "way to fit linear models to large datasets (especially those with many\n", "cases).\n", "\\end{SeeAlso}\n", "%\n", "\\begin{Examples}\n", "\\begin{ExampleCode}\n", "require(graphics)\n", "\n", "## Annette Dobson (1990) \"An Introduction to Generalized Linear Models\".\n", "## Page 9: Plant Weight Data.\n", "ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)\n", "trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)\n", "group <- gl(2, 10, 20, labels = c(\"Ctl\",\"Trt\"))\n", "weight <- c(ctl, trt)\n", "lm.D9 <- lm(weight ~ group)\n", "lm.D90 <- lm(weight ~ group - 1) # omitting intercept\n", "\n", "anova(lm.D9)\n", "summary(lm.D90)\n", "\n", "opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))\n", "plot(lm.D9, las = 1) # Residuals, Fitted, ...\n", "par(opar)\n", "\n", "### less simple examples in \"See Also\" above\n", "\\end{ExampleCode}\n", "\\end{Examples}" ], "text/plain": [ "lm package:stats R Documentation\n", "\n", "_\bF_\bi_\bt_\bt_\bi_\bn_\bg _\bL_\bi_\bn_\be_\ba_\br _\bM_\bo_\bd_\be_\bl_\bs\n", "\n", "_\bD_\be_\bs_\bc_\br_\bi_\bp_\bt_\bi_\bo_\bn:\n", "\n", " ‘lm’ is used to fit linear models. It can be used to carry out\n", " regression, single stratum analysis of variance and analysis of\n", " covariance (although ‘aov’ may provide a more convenient interface\n", " for these).\n", "\n", "_\bU_\bs_\ba_\bg_\be:\n", "\n", " lm(formula, data, subset, weights, na.action,\n", " method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n", " singular.ok = TRUE, contrasts = NULL, offset, ...)\n", " \n", "_\bA_\br_\bg_\bu_\bm_\be_\bn_\bt_\bs:\n", "\n", " formula: an object of class ‘\"formula\"’ (or one that can be coerced to\n", " that class): a symbolic description of the model to be\n", " fitted. The details of model specification are given under\n", " ‘Details’.\n", "\n", " data: an optional data frame, list or environment (or object\n", " coercible by ‘as.data.frame’ to a data frame) containing the\n", " variables in the model. If not found in ‘data’, the\n", " variables are taken from ‘environment(formula)’, typically\n", " the environment from which ‘lm’ is called.\n", "\n", " subset: an optional vector specifying a subset of observations to be\n", " used in the fitting process.\n", "\n", " weights: an optional vector of weights to be used in the fitting\n", " process. Should be ‘NULL’ or a numeric vector. If non-NULL,\n", " weighted least squares is used with weights ‘weights’ (that\n", " is, minimizing ‘sum(w*e^2)’); otherwise ordinary least\n", " squares is used. See also ‘Details’,\n", "\n", "na.action: a function which indicates what should happen when the data\n", " contain ‘NA’s. The default is set by the ‘na.action’ setting\n", " of ‘options’, and is ‘na.fail’ if that is unset. The\n", " ‘factory-fresh’ default is ‘na.omit’. Another possible value\n", " is ‘NULL’, no action. Value ‘na.exclude’ can be useful.\n", "\n", " method: the method to be used; for fitting, currently only ‘method =\n", " \"qr\"’ is supported; ‘method = \"model.frame\"’ returns the\n", " model frame (the same as with ‘model = TRUE’, see below).\n", "\n", "model, x, y, qr: logicals. If ‘TRUE’ the corresponding components of\n", " the fit (the model frame, the model matrix, the response, the\n", " QR decomposition) are returned.\n", "\n", "singular.ok: logical. If ‘FALSE’ (the default in S but not in R) a\n", " singular fit is an error.\n", "\n", "contrasts: an optional list. See the ‘contrasts.arg’ of\n", " ‘model.matrix.default’.\n", "\n", " offset: this can be used to specify an _a priori_ known component to\n", " be included in the linear predictor during fitting. This\n", " should be ‘NULL’ or a numeric vector of length equal to the\n", " number of cases. One or more ‘offset’ terms can be included\n", " in the formula instead or as well, and if more than one are\n", " specified their sum is used. See ‘model.offset’.\n", "\n", " ...: additional arguments to be passed to the low level regression\n", " fitting functions (see below).\n", "\n", "_\bD_\be_\bt_\ba_\bi_\bl_\bs:\n", "\n", " Models for ‘lm’ are specified symbolically. A typical model has\n", " the form ‘response ~ terms’ where ‘response’ is the (numeric)\n", " response vector and ‘terms’ is a series of terms which specifies a\n", " linear predictor for ‘response’. A terms specification of the\n", " form ‘first + second’ indicates all the terms in ‘first’ together\n", " with all the terms in ‘second’ with duplicates removed. A\n", " specification of the form ‘first:second’ indicates the set of\n", " terms obtained by taking the interactions of all terms in ‘first’\n", " with all terms in ‘second’. The specification ‘first*second’\n", " indicates the _cross_ of ‘first’ and ‘second’. This is the same\n", " as ‘first + second + first:second’.\n", "\n", " If the formula includes an ‘offset’, this is evaluated and\n", " subtracted from the response.\n", "\n", " If ‘response’ is a matrix a linear model is fitted separately by\n", " least-squares to each column of the matrix.\n", "\n", " See ‘model.matrix’ for some further details. The terms in the\n", " formula will be re-ordered so that main effects come first,\n", " followed by the interactions, all second-order, all third-order\n", " and so on: to avoid this pass a ‘terms’ object as the formula (see\n", " ‘aov’ and ‘demo(glm.vr)’ for an example).\n", "\n", " A formula has an implied intercept term. To remove this use\n", " either ‘y ~ x - 1’ or ‘y ~ 0 + x’. See ‘formula’ for more details\n", " of allowed formulae.\n", "\n", " Non-‘NULL’ ‘weights’ can be used to indicate that different\n", " observations have different variances (with the values in\n", " ‘weights’ being inversely proportional to the variances); or\n", " equivalently, when the elements of ‘weights’ are positive integers\n", " w_i, that each response y_i is the mean of w_i unit-weight\n", " observations (including the case that there are w_i observations\n", " equal to y_i and the data have been summarized).\n", "\n", " ‘lm’ calls the lower level functions ‘lm.fit’, etc, see below, for\n", " the actual numerical computations. For programming only, you may\n", " consider doing likewise.\n", "\n", " All of ‘weights’, ‘subset’ and ‘offset’ are evaluated in the same\n", " way as variables in ‘formula’, that is first in ‘data’ and then in\n", " the environment of ‘formula’.\n", "\n", "_\bV_\ba_\bl_\bu_\be:\n", "\n", " ‘lm’ returns an object of ‘class’ ‘\"lm\"’ or for multiple responses\n", " of class ‘c(\"mlm\", \"lm\")’.\n", "\n", " The functions ‘summary’ and ‘anova’ are used to obtain and print a\n", " summary and analysis of variance table of the results. The\n", " generic accessor functions ‘coefficients’, ‘effects’,\n", " ‘fitted.values’ and ‘residuals’ extract various useful features of\n", " the value returned by ‘lm’.\n", "\n", " An object of class ‘\"lm\"’ is a list containing at least the\n", " following components:\n", "\n", "coefficients: a named vector of coefficients\n", "\n", "residuals: the residuals, that is response minus fitted values.\n", "\n", "fitted.values: the fitted mean values.\n", "\n", " rank: the numeric rank of the fitted linear model.\n", "\n", " weights: (only for weighted fits) the specified weights.\n", "\n", "df.residual: the residual degrees of freedom.\n", "\n", " call: the matched call.\n", "\n", " terms: the ‘terms’ object used.\n", "\n", "contrasts: (only where relevant) the contrasts used.\n", "\n", " xlevels: (only where relevant) a record of the levels of the factors\n", " used in fitting.\n", "\n", " offset: the offset used (missing if none were used).\n", "\n", " y: if requested, the response used.\n", "\n", " x: if requested, the model matrix used.\n", "\n", " model: if requested (the default), the model frame used.\n", "\n", "na.action: (where relevant) information returned by ‘model.frame’ on\n", " the special handling of ‘NA’s.\n", " In addition, non-null fits will have components ‘assign’,\n", " ‘effects’ and (unless not requested) ‘qr’ relating to the linear\n", " fit, for use by extractor functions such as ‘summary’ and\n", " ‘effects’.\n", "\n", "_\bU_\bs_\bi_\bn_\bg _\bt_\bi_\bm_\be _\bs_\be_\br_\bi_\be_\bs:\n", "\n", " Considerable care is needed when using ‘lm’ with time series.\n", "\n", " Unless ‘na.action = NULL’, the time series attributes are stripped\n", " from the variables before the regression is done. (This is\n", " necessary as omitting ‘NA’s would invalidate the time series\n", " attributes, and if ‘NA’s are omitted in the middle of the series\n", " the result would no longer be a regular time series.)\n", "\n", " Even if the time series attributes are retained, they are not used\n", " to line up series, so that the time shift of a lagged or\n", " differenced regressor would be ignored. It is good practice to\n", " prepare a ‘data’ argument by ‘ts.intersect(..., dframe = TRUE)’,\n", " then apply a suitable ‘na.action’ to that data frame and call ‘lm’\n", " with ‘na.action = NULL’ so that residuals and fitted values are\n", " time series.\n", "\n", "_\bN_\bo_\bt_\be:\n", "\n", " Offsets specified by ‘offset’ will not be included in predictions\n", " by ‘predict.lm’, whereas those specified by an offset term in the\n", " formula will be.\n", "\n", "_\bA_\bu_\bt_\bh_\bo_\br(_\bs):\n", "\n", " The design was inspired by the S function of the same name\n", " described in Chambers (1992). The implementation of model formula\n", " by Ross Ihaka was based on Wilkinson & Rogers (1973).\n", "\n", "_\bR_\be_\bf_\be_\br_\be_\bn_\bc_\be_\bs:\n", "\n", " Chambers, J. M. (1992) _Linear models._ Chapter 4 of _Statistical\n", " Models in S_ eds J. M. Chambers and T. J. Hastie, Wadsworth &\n", " Brooks/Cole.\n", "\n", " Wilkinson, G. N. and Rogers, C. E. (1973) Symbolic descriptions of\n", " factorial models for analysis of variance. _Applied Statistics_,\n", " *22*, 392-9.\n", "\n", "_\bS_\be_\be _\bA_\bl_\bs_\bo:\n", "\n", " ‘summary.lm’ for summaries and ‘anova.lm’ for the ANOVA table;\n", " ‘aov’ for a different interface.\n", "\n", " The generic functions ‘coef’, ‘effects’, ‘residuals’, ‘fitted’,\n", " ‘vcov’.\n", "\n", " ‘predict.lm’ (via ‘predict’) for prediction, including confidence\n", " and prediction intervals; ‘confint’ for confidence intervals of\n", " _parameters_.\n", "\n", " ‘lm.influence’ for regression diagnostics, and ‘glm’ for\n", " *generalized* linear models.\n", "\n", " The underlying low level functions, ‘lm.fit’ for plain, and\n", " ‘lm.wfit’ for weighted regression fitting.\n", "\n", " More ‘lm()’ examples are available e.g., in ‘anscombe’,\n", " ‘attitude’, ‘freeny’, ‘LifeCycleSavings’, ‘longley’, ‘stackloss’,\n", " ‘swiss’.\n", "\n", " ‘biglm’ in package ‘biglm’ for an alternative way to fit linear\n", " models to large datasets (especially those with many cases).\n", "\n", "_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:\n", "\n", " require(graphics)\n", " \n", " ## Annette Dobson (1990) \"An Introduction to Generalized Linear Models\".\n", " ## Page 9: Plant Weight Data.\n", " ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)\n", " trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)\n", " group <- gl(2, 10, 20, labels = c(\"Ctl\",\"Trt\"))\n", " weight <- c(ctl, trt)\n", " lm.D9 <- lm(weight ~ group)\n", " lm.D90 <- lm(weight ~ group - 1) # omitting intercept\n", " \n", " anova(lm.D9)\n", " summary(lm.D90)\n", " \n", " opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))\n", " plot(lm.D9, las = 1) # Residuals, Fitted, ...\n", " par(opar)\n", " \n", " ### less simple examples in \"See Also\" above\n", " " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# R also has a function to fit linear models: lm.\n", "# Here is its help function:\n", "help(lm)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
formula {stats}R Documentation
\n", "\n", "

Model Formulae

\n", "\n", "

Description

\n", "\n", "

The generic function formula and its specific methods provide a\n", "way of extracting formulae which have been included in other objects.\n", "

\n", "

as.formula is almost identical, additionally preserving\n", "attributes when object already inherits from\n", "\"formula\".\n", "

\n", "\n", "\n", "

Usage

\n", "\n", "
\n",
       "formula(x, ...)\n",
       "as.formula(object, env = parent.frame())\n",
       "\n",
       "## S3 method for class 'formula'\n",
       "print(x, showEnv = !identical(e, .GlobalEnv), ...)\n",
       "
\n", "\n", "\n", "

Arguments

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
x, object\n", "

R object.

\n", "
...\n", "

further arguments passed to or from other methods.

\n", "
env\n", "

the environment to associate with the result, if not\n", "already a formula.

\n", "
showEnv\n", "

logical indicating if the environment should be printed\n", "as well.

\n", "
\n", "\n", "\n", "

Details

\n", "\n", "

The models fit by, e.g., the lm and glm functions\n", "are specified in a compact symbolic form.\n", "The ~ operator is basic in the formation of such models.\n", "An expression of the form y ~ model is interpreted\n", "as a specification that the response y is modelled\n", "by a linear predictor specified symbolically by model.\n", "Such a model consists of a series of terms separated\n", "by + operators.\n", "The terms themselves consist of variable and factor\n", "names separated by : operators.\n", "Such a term is interpreted as the interaction of\n", "all the variables and factors appearing in the term.\n", "

\n", "

In addition to + and :, a number of other operators are\n", "useful in model formulae. The * operator denotes factor\n", "crossing: a*b interpreted as a+b+a:b. The ^\n", "operator indicates crossing to the specified degree. For example\n", "(a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn\n", "expands to a formula containing the main effects for a,\n", "b and c together with their second-order interactions.\n", "The %in% operator indicates that the terms on its left are\n", "nested within those on the right. For example a + b %in% a\n", "expands to the formula a + a:b. The - operator removes\n", "the specified terms, so that (a+b+c)^2 - a:b is identical to\n", "a + b + c + b:c + a:c. It can also used to remove the\n", "intercept term: when fitting a linear model y ~ x - 1 specifies\n", "a line through the origin. A model with no intercept can be also\n", "specified as y ~ x + 0 or y ~ 0 + x.\n", "

\n", "

While formulae usually involve just variable and factor\n", "names, they can also involve arithmetic expressions.\n", "The formula log(y) ~ a + log(x) is quite legal.\n", "When such arithmetic expressions involve\n", "operators which are also used symbolically\n", "in model formulae, there can be confusion between\n", "arithmetic and symbolic operator use.\n", "

\n", "

To avoid this confusion, the function I()\n", "can be used to bracket those portions of a model\n", "formula where the operators are used in their\n", "arithmetic sense. For example, in the formula\n", "y ~ a + I(b+c), the term b+c is to be\n", "interpreted as the sum of b and c.\n", "

\n", "

Variable names can be quoted by backticks `like this` in\n", "formulae, although there is no guarantee that all code using formulae\n", "will accept such non-syntactic names.\n", "

\n", "

Most model-fitting functions accept formulae with right-hand-side\n", "including the function offset to indicate terms with a\n", "fixed coefficient of one. Some functions accept other\n", "‘specials’ such as strata or cluster (see the\n", "specials argument of terms.formula).\n", "

\n", "

There are two special interpretations of . in a formula. The\n", "usual one is in the context of a data argument of model\n", "fitting functions and means ‘all columns not otherwise in the\n", "formula’: see terms.formula. In the context of\n", "update.formula, only, it means ‘what was\n", "previously in this part of the formula’.\n", "

\n", "

When formula is called on a fitted model object, either a\n", "specific method is used (such as that for class \"nls\") or the\n", "default method. The default first looks for a \"formula\"\n", "component of the object (and evaluates it), then a \"terms\"\n", "component, then a formula parameter of the call (and evaluates\n", "its value) and finally a \"formula\" attribute.\n", "

\n", "

There is a formula method for data frames. If there is only\n", "one column this forms the RHS with an empty LHS. For more columns,\n", "the first column is the LHS of the formula and the remaining columns\n", "separated by + form the RHS.\n", "

\n", "\n", "\n", "

Value

\n", "\n", "

All the functions above produce an object of class \"formula\"\n", "which contains a symbolic model formula.\n", "

\n", "\n", "\n", "

Environments

\n", "\n", "

A formula object has an associated environment, and\n", "this environment (rather than the parent\n", "environment) is used by model.frame to evaluate variables\n", "that are not found in the supplied data argument.\n", "

\n", "

Formulas created with the ~ operator use the\n", "environment in which they were created. Formulas created with\n", "as.formula will use the env argument for their\n", "environment.\n", "

\n", "\n", "\n", "

References

\n", "\n", "

Chambers, J. M. and Hastie, T. J. (1992)\n", "Statistical models.\n", "Chapter 2 of Statistical Models in S\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.\n", "

\n", "\n", "\n", "

See Also

\n", "\n", "

I, offset.\n", "

\n", "

For formula manipulation: terms, and all.vars;\n", "for typical use: lm, glm, and\n", "coplot.\n", "

\n", "\n", "\n", "

Examples

\n", "\n", "
\n",
       "class(fo <- y ~ x1*x2) # \"formula\"\n",
       "fo\n",
       "typeof(fo)  # R internal : \"language\"\n",
       "terms(fo)\n",
       "\n",
       "environment(fo)\n",
       "environment(as.formula(\"y ~ x\"))\n",
       "environment(as.formula(\"y ~ x\", env = new.env()))\n",
       "\n",
       "\n",
       "## Create a formula for a model with a large number of variables:\n",
       "xnam <- paste0(\"x\", 1:25)\n",
       "(fmla <- as.formula(paste(\"y ~ \", paste(xnam, collapse= \"+\"))))\n",
       "
\n", "\n", "
[Package stats version 3.1.3 ]
" ], "text/latex": [ "\\inputencoding{utf8}\n", "\\HeaderA{formula}{Model Formulae}{formula}\n", "\\aliasA{[.formula}{formula}{[.formula}\n", "\\aliasA{as.formula}{formula}{as.formula}\n", "\\methaliasA{formula.data.frame}{formula}{formula.data.frame}\n", "\\methaliasA{formula.default}{formula}{formula.default}\n", "\\methaliasA{formula.formula}{formula}{formula.formula}\n", "\\methaliasA{formula.terms}{formula}{formula.terms}\n", "\\aliasA{print.formula}{formula}{print.formula}\n", "\\keyword{models}{formula}\n", "%\n", "\\begin{Description}\\relax\n", "The generic function \\code{formula} and its specific methods provide a\n", "way of extracting formulae which have been included in other objects.\n", "\n", "\\code{as.formula} is almost identical, additionally preserving\n", "attributes when \\code{object} already inherits from\n", "\\code{\"formula\"}.\n", "\\end{Description}\n", "%\n", "\\begin{Usage}\n", "\\begin{verbatim}\n", "formula(x, ...)\n", "as.formula(object, env = parent.frame())\n", "\n", "## S3 method for class 'formula'\n", "print(x, showEnv = !identical(e, .GlobalEnv), ...)\n", "\\end{verbatim}\n", "\\end{Usage}\n", "%\n", "\\begin{Arguments}\n", "\\begin{ldescription}\n", "\\item[\\code{x, object}] \\R{} object.\n", "\\item[\\code{...}] further arguments passed to or from other methods.\n", "\\item[\\code{env}] the environment to associate with the result, if not\n", "already a formula.\n", "\\item[\\code{showEnv}] logical indicating if the environment should be printed\n", "as well.\n", "\\end{ldescription}\n", "\\end{Arguments}\n", "%\n", "\\begin{Details}\\relax\n", "The models fit by, e.g., the \\code{\\LinkA{lm}{lm}} and \\code{\\LinkA{glm}{glm}} functions\n", "are specified in a compact symbolic form.\n", "The \\code{\\textasciitilde{}} operator is basic in the formation of such models.\n", "An expression of the form \\code{y \\textasciitilde{} model} is interpreted\n", "as a specification that the response \\code{y} is modelled\n", "by a linear predictor specified symbolically by \\code{model}.\n", "Such a model consists of a series of terms separated\n", "by \\code{+} operators.\n", "The terms themselves consist of variable and factor\n", "names separated by \\code{:} operators.\n", "Such a term is interpreted as the interaction of\n", "all the variables and factors appearing in the term.\n", "\n", "In addition to \\code{+} and \\code{:}, a number of other operators are\n", "useful in model formulae. The \\code{*} operator denotes factor\n", "crossing: \\code{a*b} interpreted as \\code{a+b+a:b}. The \\code{\\textasciicircum{}}\n", "operator indicates crossing to the specified degree. For example\n", "\\code{(a+b+c)\\textasciicircum{}2} is identical to \\code{(a+b+c)*(a+b+c)} which in turn\n", "expands to a formula containing the main effects for \\code{a},\n", "\\code{b} and \\code{c} together with their second-order interactions.\n", "The \\code{\\%in\\%} operator indicates that the terms on its left are\n", "nested within those on the right. For example \\code{a + b \\%in\\% a}\n", "expands to the formula \\code{a + a:b}. The \\code{-} operator removes\n", "the specified terms, so that \\code{(a+b+c)\\textasciicircum{}2 - a:b} is identical to\n", "\\code{a + b + c + b:c + a:c}. It can also used to remove the\n", "intercept term: when fitting a linear model \\code{y \\textasciitilde{} x - 1} specifies\n", "a line through the origin. A model with no intercept can be also\n", "specified as \\code{y \\textasciitilde{} x + 0} or \\code{y \\textasciitilde{} 0 + x}.\n", "\n", "While formulae usually involve just variable and factor\n", "names, they can also involve arithmetic expressions.\n", "The formula \\code{log(y) \\textasciitilde{} a + log(x)} is quite legal.\n", "When such arithmetic expressions involve\n", "operators which are also used symbolically\n", "in model formulae, there can be confusion between\n", "arithmetic and symbolic operator use.\n", "\n", "To avoid this confusion, the function \\code{\\LinkA{I}{I}()}\n", "can be used to bracket those portions of a model\n", "formula where the operators are used in their\n", "arithmetic sense. For example, in the formula\n", "\\code{y \\textasciitilde{} a + I(b+c)}, the term \\code{b+c} is to be\n", "interpreted as the sum of \\code{b} and \\code{c}.\n", "\n", "Variable names can be quoted by backticks \\code{`like this`} in\n", "formulae, although there is no guarantee that all code using formulae\n", "will accept such non-syntactic names.\n", "\n", "Most model-fitting functions accept formulae with right-hand-side\n", "including the function \\code{\\LinkA{offset}{offset}} to indicate terms with a\n", "fixed coefficient of one. Some functions accept other\n", "`specials' such as \\code{strata} or \\code{cluster} (see the\n", "\\code{specials} argument of \\code{\\LinkA{terms.formula}{terms.formula})}.\n", "\n", "There are two special interpretations of \\code{.} in a formula. The\n", "usual one is in the context of a \\code{data} argument of model\n", "fitting functions and means `all columns not otherwise in the\n", "formula': see \\code{\\LinkA{terms.formula}{terms.formula}}. In the context of\n", "\\code{\\LinkA{update.formula}{update.formula}}, \\bold{only}, it means `what was\n", "previously in this part of the formula'.\n", "\n", "When \\code{formula} is called on a fitted model object, either a\n", "specific method is used (such as that for class \\code{\"nls\"}) or the\n", "default method. The default first looks for a \\code{\"formula\"}\n", "component of the object (and evaluates it), then a \\code{\"terms\"}\n", "component, then a \\code{formula} parameter of the call (and evaluates\n", "its value) and finally a \\code{\"formula\"} attribute.\n", "\n", "There is a \\code{formula} method for data frames. If there is only\n", "one column this forms the RHS with an empty LHS. For more columns,\n", "the first column is the LHS of the formula and the remaining columns\n", "separated by \\code{+} form the RHS.\n", "\\end{Details}\n", "%\n", "\\begin{Value}\n", "All the functions above produce an object of class \\code{\"formula\"}\n", "which contains a symbolic model formula.\n", "\\end{Value}\n", "%\n", "\\begin{Section}{Environments}\n", "A formula object has an associated environment, and\n", "this environment (rather than the parent\n", "environment) is used by \\code{\\LinkA{model.frame}{model.frame}} to evaluate variables\n", "that are not found in the supplied \\code{data} argument.\n", "\n", "Formulas created with the \\code{\\textasciitilde{}} operator use the\n", "environment in which they were created. Formulas created with\n", "\\code{as.formula} will use the \\code{env} argument for their\n", "environment.\n", "\\end{Section}\n", "%\n", "\\begin{References}\\relax\n", "Chambers, J. M. and Hastie, T. J. (1992)\n", "\\emph{Statistical models.}\n", "Chapter 2 of \\emph{Statistical Models in S}\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth \\& Brooks/Cole.\n", "\\end{References}\n", "%\n", "\\begin{SeeAlso}\\relax\n", "\\code{\\LinkA{I}{I}}, \\code{\\LinkA{offset}{offset}}.\n", "\n", "For formula manipulation: \\code{\\LinkA{terms}{terms}}, and \\code{\\LinkA{all.vars}{all.vars}};\n", "for typical use: \\code{\\LinkA{lm}{lm}}, \\code{\\LinkA{glm}{glm}}, and\n", "\\code{\\LinkA{coplot}{coplot}}.\n", "\\end{SeeAlso}\n", "%\n", "\\begin{Examples}\n", "\\begin{ExampleCode}\n", "class(fo <- y ~ x1*x2) # \"formula\"\n", "fo\n", "typeof(fo) # R internal : \"language\"\n", "terms(fo)\n", "\n", "environment(fo)\n", "environment(as.formula(\"y ~ x\"))\n", "environment(as.formula(\"y ~ x\", env = new.env()))\n", "\n", "\n", "## Create a formula for a model with a large number of variables:\n", "xnam <- paste0(\"x\", 1:25)\n", "(fmla <- as.formula(paste(\"y ~ \", paste(xnam, collapse= \"+\"))))\n", "\\end{ExampleCode}\n", "\\end{Examples}" ], "text/plain": [ "formula package:stats R Documentation\n", "\n", "_\bM_\bo_\bd_\be_\bl _\bF_\bo_\br_\bm_\bu_\bl_\ba_\be\n", "\n", "_\bD_\be_\bs_\bc_\br_\bi_\bp_\bt_\bi_\bo_\bn:\n", "\n", " The generic function ‘formula’ and its specific methods provide a\n", " way of extracting formulae which have been included in other\n", " objects.\n", "\n", " ‘as.formula’ is almost identical, additionally preserving\n", " attributes when ‘object’ already inherits from ‘\"formula\"’.\n", "\n", "_\bU_\bs_\ba_\bg_\be:\n", "\n", " formula(x, ...)\n", " as.formula(object, env = parent.frame())\n", " \n", " ## S3 method for class 'formula'\n", " print(x, showEnv = !identical(e, .GlobalEnv), ...)\n", " \n", "_\bA_\br_\bg_\bu_\bm_\be_\bn_\bt_\bs:\n", "\n", "x, object: R object.\n", "\n", " ...: further arguments passed to or from other methods.\n", "\n", " env: the environment to associate with the result, if not already\n", " a formula.\n", "\n", " showEnv: logical indicating if the environment should be printed as\n", " well.\n", "\n", "_\bD_\be_\bt_\ba_\bi_\bl_\bs:\n", "\n", " The models fit by, e.g., the ‘lm’ and ‘glm’ functions are\n", " specified in a compact symbolic form. The ‘~’ operator is basic\n", " in the formation of such models. An expression of the form ‘y ~\n", " model’ is interpreted as a specification that the response ‘y’ is\n", " modelled by a linear predictor specified symbolically by ‘model’.\n", " Such a model consists of a series of terms separated by ‘+’\n", " operators. The terms themselves consist of variable and factor\n", " names separated by ‘:’ operators. Such a term is interpreted as\n", " the interaction of all the variables and factors appearing in the\n", " term.\n", "\n", " In addition to ‘+’ and ‘:’, a number of other operators are useful\n", " in model formulae. The ‘*’ operator denotes factor crossing:\n", " ‘a*b’ interpreted as ‘a+b+a:b’. The ‘^’ operator indicates\n", " crossing to the specified degree. For example ‘(a+b+c)^2’ is\n", " identical to ‘(a+b+c)*(a+b+c)’ which in turn expands to a formula\n", " containing the main effects for ‘a’, ‘b’ and ‘c’ together with\n", " their second-order interactions. The ‘%in%’ operator indicates\n", " that the terms on its left are nested within those on the right.\n", " For example ‘a + b %in% a’ expands to the formula ‘a + a:b’. The\n", " ‘-’ operator removes the specified terms, so that ‘(a+b+c)^2 -\n", " a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also used to\n", " remove the intercept term: when fitting a linear model ‘y ~ x - 1’\n", " specifies a line through the origin. A model with no intercept\n", " can be also specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.\n", "\n", " While formulae usually involve just variable and factor names,\n", " they can also involve arithmetic expressions. The formula ‘log(y)\n", " ~ a + log(x)’ is quite legal. When such arithmetic expressions\n", " involve operators which are also used symbolically in model\n", " formulae, there can be confusion between arithmetic and symbolic\n", " operator use.\n", "\n", " To avoid this confusion, the function ‘I()’ can be used to bracket\n", " those portions of a model formula where the operators are used in\n", " their arithmetic sense. For example, in the formula ‘y ~ a +\n", " I(b+c)’, the term ‘b+c’ is to be interpreted as the sum of ‘b’ and\n", " ‘c’.\n", "\n", " Variable names can be quoted by backticks ‘`like this`’ in\n", " formulae, although there is no guarantee that all code using\n", " formulae will accept such non-syntactic names.\n", "\n", " Most model-fitting functions accept formulae with right-hand-side\n", " including the function ‘offset’ to indicate terms with a fixed\n", " coefficient of one. Some functions accept other ‘specials’ such\n", " as ‘strata’ or ‘cluster’ (see the ‘specials’ argument of\n", " ‘terms.formula)’.\n", "\n", " There are two special interpretations of ‘.’ in a formula. The\n", " usual one is in the context of a ‘data’ argument of model fitting\n", " functions and means ‘all columns not otherwise in the formula’:\n", " see ‘terms.formula’. In the context of ‘update.formula’, *only*,\n", " it means ‘what was previously in this part of the formula’.\n", "\n", " When ‘formula’ is called on a fitted model object, either a\n", " specific method is used (such as that for class ‘\"nls\"’) or the\n", " default method. The default first looks for a ‘\"formula\"’\n", " component of the object (and evaluates it), then a ‘\"terms\"’\n", " component, then a ‘formula’ parameter of the call (and evaluates\n", " its value) and finally a ‘\"formula\"’ attribute.\n", "\n", " There is a ‘formula’ method for data frames. If there is only one\n", " column this forms the RHS with an empty LHS. For more columns,\n", " the first column is the LHS of the formula and the remaining\n", " columns separated by ‘+’ form the RHS.\n", "\n", "_\bV_\ba_\bl_\bu_\be:\n", "\n", " All the functions above produce an object of class ‘\"formula\"’\n", " which contains a symbolic model formula.\n", "\n", "_\bE_\bn_\bv_\bi_\br_\bo_\bn_\bm_\be_\bn_\bt_\bs:\n", "\n", " A formula object has an associated environment, and this\n", " environment (rather than the parent environment) is used by\n", " ‘model.frame’ to evaluate variables that are not found in the\n", " supplied ‘data’ argument.\n", "\n", " Formulas created with the ‘~’ operator use the environment in\n", " which they were created. Formulas created with ‘as.formula’ will\n", " use the ‘env’ argument for their environment.\n", "\n", "_\bR_\be_\bf_\be_\br_\be_\bn_\bc_\be_\bs:\n", "\n", " Chambers, J. M. and Hastie, T. J. (1992) _Statistical models._\n", " Chapter 2 of _Statistical Models in S_ eds J. M. Chambers and T.\n", " J. Hastie, Wadsworth & Brooks/Cole.\n", "\n", "_\bS_\be_\be _\bA_\bl_\bs_\bo:\n", "\n", " ‘I’, ‘offset’.\n", "\n", " For formula manipulation: ‘terms’, and ‘all.vars’; for typical\n", " use: ‘lm’, ‘glm’, and ‘coplot’.\n", "\n", "_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:\n", "\n", " class(fo <- y ~ x1*x2) # \"formula\"\n", " fo\n", " typeof(fo) # R internal : \"language\"\n", " terms(fo)\n", " \n", " environment(fo)\n", " environment(as.formula(\"y ~ x\"))\n", " environment(as.formula(\"y ~ x\", env = new.env()))\n", " \n", " \n", " ## Create a formula for a model with a large number of variables:\n", " xnam <- paste0(\"x\", 1:25)\n", " (fmla <- as.formula(paste(\"y ~ \", paste(xnam, collapse= \"+\"))))\n", " " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# It will help to be familiar with \"formula\" objects in R:\n", "help(formula)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\n", "Call:\n", "lm(formula = Y ~ X)\n", "\n", "Coefficients:\n", "(Intercept) X1 X2 X3 \n", " 1.96995 NA 0.02245 10.16550 \n" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "\n", "Call:\n", "lm(formula = Y ~ X[, 2:3])\n", "\n", "Coefficients:\n", "(Intercept) X[, 2:3]1 X[, 2:3]2 \n", " 1.96995 0.02245 10.16550 \n" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "\n", "Call:\n", "lm(formula = Y ~ X - 1)\n", "\n", "Coefficients:\n", " X1 X2 X3 \n", " 1.96995 0.02245 10.16550 \n" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "\n", "Call:\n", "lm(formula = Y ~ time + I(time^2/2))\n", "\n", "Coefficients:\n", "(Intercept) time I(time^2/2) \n", " 1.96995 0.02245 10.16550 \n" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In R, there are always several ways to do anything!\n", "\n", "# Here are several calls to lm() to fit the regression model. \n", "# Note that lm() uses QR factorization by default to solve least-squares problems (rather than inverting matrices)\n", "# R uses a fairly standard \"formula\" notation. \n", "# \"Y ~ X\" models Y as a linear function of X.\n", "# Y ~ X + I(X^2) models Y as a linear function of X and X^2.\n", "# By default, R includes an intercept (constant) term. You can alter this, or take advantage of it.\n", "\n", "lm(Y ~ X) # the constant term (intercept) corresponding to h is redundant: lm() automatically inserts an intercept.\n", " # Since the model has a constant, X1 won't be estimated.\n", "\n", "lm(Y ~ X[,2:3]) # this omits X1, the intercept column of X, from the model, since lm() automatically includes one.\n", "\n", "lm(Y ~ X - 1) # this tells lm() not to add an intercept term (we don't need it, since X already has a constant term).\n", "\n", "time <- 0:3; # set just the \"time\" variable\n", "lm(Y ~ time + I(time^2/2)) # Model the position (Y) as an affine function of t and t^2. \n", " # The intercept (h) term will be included automatically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standard errors\n", "\n", "Recall that there is in general no way to test whether the regression model is true; in most\n", "applications, that the form of the model is correct is simply an _assumption_.\n", "If the model is not correct, the parameters in the model may be meaningless;\n", "and it is not obvious what uncertainties in estimates of the model parameters mean\n", "when the form of the model is wrong.\n", "\n", "However, suppose that the regression model is _true_: $Y = X\\beta + \\epsilon$, \n", "where the components of $\\epsilon$ are uncorrelated, with mean zero and variance $\\sigma^2$,\n", "and $\\epsilon$ and $X$ are independent.\n", "Then, as we have seen, $\\hat{\\beta}$ is an unbiased estimate of \n", "$\\beta$, i.e., ${\\mathbb E}\\hat{\\beta} = \\beta$.\n", "\n", "Moreover, we can find the covariance of the parameter estimates as follows:\n", "\n", "
\n", "__Theorem (variance of $\\hat{\\beta}$).__\n", "\n", "$$\n", "\\mbox{cov}(\\hat{\\beta} \\mid X) = \\sigma^2 (X'X)^{-1}.\n", "$$\n", "\n", "_Proof._\n", "Since $\\hat{\\beta}$ is conditionally unbiased under these assumptions,\n", "\n", "$$\n", "\\hat{\\beta} - {\\mathbb E}(\\hat{\\beta} \\mid X) = (X'X)^{-1}X'(X\\beta + \\epsilon) - \\beta =\n", "\\beta + (X'X)^{-1}X'\\epsilon - \\beta = (X'X)^{-1}X'\\epsilon.\n", "$$ \n", "\n", "Hence,\n", "\n", "$$\n", "\\mbox{cov}(\\hat{\\beta} \\mid X) = \\mbox{cov}((X'X)^{-1}X'\\epsilon \\mid X) \n", "$$\n", "$$\n", "= {\\mathbb E} \\left ( ((X'X)^{-1}X' \\epsilon) ((X'X)^{-1}X' \\epsilon)' \\mid X \\right )\n", "$$\n", "$$\n", "= {\\mathbb E} \\left ( (X'X)^{-1}X' \\epsilon \\epsilon' X (X'X)^{-1}) \\mid X \\right )\n", "$$\n", "(since $(X'X)^{-1} = ((X'X)')^{-1}$)\n", "$$\n", "= (X'X)^{-1} X' {\\mathbb E}(\\epsilon \\epsilon' \\mid X) X (X'X)^{-1}\n", "$$\n", "(by the linearity of expectation)\n", "$$\n", "= (X'X)^{-1}X' \\sigma^2 {\\mathbf 1}_n X (X'X)^{-1}\n", "$$\n", "$$\n", "= \\sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \\sigma^2(X'X)^{-1}.\n", "$$\n", "\n", "In typical applications, $\\sigma^2$ is unknown.\n", "If we knew the disturbance term $\\epsilon$, we could estimate $\\sigma^2$ by\n", "$\\frac{1}{n} \\sum_{i=1}^n \\epsilon_i^2$.\n", "But the disturbance terms $\\epsilon$ are not observable.\n", "We do observe the residuals $e = Y - X\\hat{\\beta}$.\n", "If the model is true, we can use them to estimate $\\sigma^2$.\n", "\n", "The quantity $\\frac{1}{n} \\sum_{i=1}^n e_i^2$ tends to be an underestimate of $\\sigma^2$ because\n", "the model is opimized precisely to minimize this quantity over all possible values of $\\beta$.\n", "To compensate for that, we can divide the sum of squared residuals by $n-p$ rather than\n", "by $n$.\n", "The quantity\n", "\n", "$$\n", "\\hat{\\sigma}^2 \\equiv \\frac{1}{n-p} \\sum_{i=1}^n e_i^2\n", "$$\n", "is a conditionally unbiased estimate of $\\sigma^2$ given $X$, if the model is true.\n", "The number $n-p$ is called the _degrees of freedom_.\n", "\n", "Define\n", "$$\n", "\\hat{\\mbox{cov}}(\\hat{\\beta} | X) \\equiv \\hat{\\sigma}^2 (X'X)^{-1}.\n", "$$\n", "\n", "This is a conditionally unbiased estimate of the covariance of $\\hat{\\beta}$ given $X$.\n", "(Note that we _do_ have to invert a matrix to find this!)\n", "\n", "Let's find this quantity for our toy problem." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.10672605" ], "text/latex": [ "0.10672605" ], "text/markdown": [ "0.10672605" ], "text/plain": [ "[1] 0.106726" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# estimate the uncertainty of the model estimate.\n", "# first estimate sigma^2\n", "e <- Y - X %*% betahat; # the residuals\n", "dims <- dim(X); # the number of rows is n; the number of columns is p\n", "sigmahat2 <- sum(e^2)/(dims[1]-dims[2]);\n", "sigmahat2" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
0.10138975-0.11206235 0.05336302
-0.1120624 0.2614788-0.1600891
0.05336302-0.16008907 0.10672605
\n" ], "text/latex": [ "\\begin{tabular}{lll}\n", "\t 0.10138975 & -0.11206235 & 0.05336302\\\\\n", "\t -0.1120624 & 0.2614788 & -0.1600891\\\\\n", "\t 0.05336302 & -0.16008907 & 0.10672605\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 0.1013897475\n", "2. -0.1120623525\n", "3. 0.0533630249999997\n", "4. -0.1120623525\n", "5. 0.261478822499999\n", "6. -0.160089074999999\n", "7. 0.0533630249999999\n", "8. -0.160089074999999\n", "9. 0.10672605\n", "\n", "\n" ], "text/plain": [ " [,1] [,2] [,3]\n", "[1,] 0.10138975 -0.1120624 0.05336302\n", "[2,] -0.11206235 0.2614788 -0.16008907\n", "[3,] 0.05336302 -0.1600891 0.10672605" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# now use this to find the estimate of cov(betahat)\n", "covbhat <- sigmahat2 * solve(t(X) %*% X)\n", "covbhat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated covariance matrix of $\\hat{\\beta}$ gives us standard errors for the individual components of\n", "$\\hat{\\beta}$.\n", "The estimated uncertainty of the $j$th parameter is the square-root of the $(j,j)$ element of\n", "$\\hat{\\mbox{cov}}(\\hat{\\beta} | X)$." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 0.318417567825646
  2. \n", "\t
  3. 0.511349999999999
  4. \n", "\t
  5. 0.326689531512719
  6. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 0.318417567825646\n", "\\item 0.511349999999999\n", "\\item 0.326689531512719\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 0.318417567825646\n", "2. 0.511349999999999\n", "3. 0.326689531512719\n", "\n", "\n" ], "text/plain": [ "[1] 0.3184176 0.5113500 0.3266895" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find the square-root of the diagonal elements of the estimated covariance matrix\n", "sqrt(diag(covbhat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, for instance, the estimated standard error of the estimate of $h$ is 0.3184...\n", "\n", "Our estimates, with estimated uncertainties, are now\n", "\n", "\n", "\n", "\n", "\n", "\n", "
$h = 1.97 \\pm 0.318$ $v = 0.02 \\pm 0.511$ $a = 10.16 \\pm 0.327$
\n", "\n", "In \"reality,\" the data for this problem were generated using $h=3$, $v=0$, and $a = 9.8$,\n", "and the components of $\\epsilon$ were iid standard normal (pseudo-)random variables: $\\epsilon_i \\sim N(0,1)$.\n", "That is, $\\sigma = 1$, while we estimated $\\hat{\\sigma} = 0.327$!\n", "\n", "Let's find the actual errors and compare their magnitude to the estimated standard errors of their estimates.\n", "Student's t statistic is the true error divided by the estimated standard error." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
-1.03005
0.02245
0.3655
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t -1.03005\\\\\n", "\t 0.02245\\\\\n", "\t 0.3655\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. -1.03005\n", "2. 0.0224499999999926\n", "3. 0.365500000000006\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] -1.03005\n", "[2,] 0.02245\n", "[3,] 0.36550" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
-3.234903
0.04390339
1.118799
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t -3.234903\\\\\n", "\t 0.04390339\\\\\n", "\t 1.118799\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. -3.23490317143562\n", "2. 0.0439033929793539\n", "3. 1.11879924130895\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] -3.23490317\n", "[2,] 0.04390339\n", "[3,] 1.11879924" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find betahat - beta\n", "beta <- c(3, 0, 9.8);\n", "betahat - beta\n", "tstat <- (betahat - beta)/sqrt(diag(covbhat)); # this is Student's t statistic for the estimates\n", "tstat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated value of $h$ was off by more than 3.23 estimated standard errors.\n", "\n", "We haven't discussed this yet, but if the components of $\\epsilon$ are iid normal, then the $t$-statistics\n", "would have Student's $t$ distribution with $n-p$ degrees of freedom.\n", "In this case, $n=4$ and $p=3$, so the $t$-statistics have $4-3=1$ degree of freedom.\n", "\n", "The chance that such a variable would be larger in magnitude than 3.234903 can be found using the R\n", "function pt(), which finds probabilities for Student's $t$-distribution." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
TDist {stats}R Documentation
\n", "\n", "

The Student t Distribution

\n", "\n", "

Description

\n", "\n", "

Density, distribution function, quantile function and random\n", "generation for the t distribution with df degrees of freedom\n", "(and optional non-centrality parameter ncp).\n", "

\n", "\n", "\n", "

Usage

\n", "\n", "
\n",
       "dt(x, df, ncp, log = FALSE)\n",
       "pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)\n",
       "qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)\n",
       "rt(n, df, ncp)\n",
       "
\n", "\n", "\n", "

Arguments

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
x, q\n", "

vector of quantiles.

\n", "
p\n", "

vector of probabilities.

\n", "
n\n", "

number of observations. If length(n) > 1, the length\n", "is taken to be the number required.

\n", "
df\n", "

degrees of freedom (> 0, maybe non-integer). df\n", " = Inf is allowed.

\n", "
ncp\n", "

non-centrality parameter delta;\n", "currently except for rt(), only for abs(ncp) <= 37.62.\n", "If omitted, use the central t distribution.

\n", "
log, log.p\n", "

logical; if TRUE, probabilities p are given as log(p).

\n", "
lower.tail\n", "

logical; if TRUE (default), probabilities are\n", "P[X ≤ x], otherwise, P[X > x].

\n", "
\n", "\n", "\n", "

Details

\n", "\n", "

The t distribution with df = n degrees of\n", "freedom has density\n", "

\n", "

f(x) = Γ((n+1)/2) / (√(n π) Γ(n/2)) (1 + x^2/n)^-((n+1)/2)

\n", "\n", "

for all real x.\n", "It has mean 0 (for n > 1) and\n", "variance n/(n-2) (for n > 2).\n", "

\n", "

The general non-central t\n", "with parameters (df, Del) = (df, ncp)\n", "is defined as the distribution of\n", "T(df, Del) := (U + Del) / √(V/df) \n", "where U and V are independent random\n", "variables, U ~ N(0,1) and\n", "V ~ χ^2(df) (see Chisquare).\n", "

\n", "

The most used applications are power calculations for t-tests:
\n", "Let T= (mX - m0) / (S/sqrt(n))\n", "where\n", "mX is the mean and S the sample standard\n", "deviation (sd) of X_1, X_2, …, X_n which are\n", "i.i.d. N(μ, σ^2)\n", "Then T is distributed as non-central t with\n", "df= n - 1\n", "degrees of freedom and non-centrality parameter\n", "ncp = (μ - m0) * sqrt(n)/σ.\n", "

\n", "\n", "\n", "

Value

\n", "\n", "

dt gives the density,\n", "pt gives the distribution function,\n", "qt gives the quantile function, and\n", "rt generates random deviates.\n", "

\n", "

Invalid arguments will result in return value NaN, with a warning.\n", "

\n", "

The length of the result is determined by n for\n", "rt, and is the maximum of the lengths of the\n", "numerical arguments for the other functions. \n", "

\n", "

The numerical arguments other than n are recycled to the\n", "length of the result. Only the first elements of the logical\n", "arguments are used.\n", "

\n", "\n", "\n", "

Note

\n", "\n", "

Supplying ncp = 0 uses the algorithm for the non-central\n", "distribution, which is not the same algorithm used if ncp is\n", "omitted. This is to give consistent behaviour in extreme cases with\n", "values of ncp very near zero.\n", "

\n", "

The code for non-zero ncp is principally intended to be used\n", "for moderate values of ncp: it will not be highly accurate,\n", "especially in the tails, for large values.\n", "

\n", "\n", "\n", "

Source

\n", "\n", "

The central dt is computed via an accurate formula\n", "provided by Catherine Loader (see the reference in dbinom).\n", "

\n", "

For the non-central case of dt, C code contributed by\n", "Claus Ekstrøm based on the relationship (for\n", "x != 0) to the cumulative distribution.\n", "

\n", "

For the central case of pt, a normal approximation in the\n", "tails, otherwise via pbeta.\n", "

\n", "

For the non-central case of pt based on a C translation of\n", "

\n", "

Lenth, R. V. (1989). Algorithm AS 243 —\n", "Cumulative distribution function of the non-central t distribution,\n", "Applied Statistics 38, 185–189.\n", "

\n", "

This computes the lower tail only, so the upper tail suffers from\n", "cancellation and a warning will be given when this is likely to be\n", "significant.\n", "

\n", "

For central qt, a C translation of\n", "

\n", "

Hill, G. W. (1970) Algorithm 396: Student's t-quantiles.\n", "Communications of the ACM, 13(10), 619–620.\n", "

\n", "

altered to take account of\n", "

\n", "

Hill, G. W. (1981) Remark on Algorithm 396, ACM Transactions on\n", "Mathematical Software, 7, 250–1.\n", "

\n", "

The non-central case is done by inversion.\n", "

\n", "\n", "\n", "

References

\n", "\n", "

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)\n", "The New S Language.\n", "Wadsworth & Brooks/Cole. (Except non-central versions.)\n", "

\n", "

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995)\n", "Continuous Univariate Distributions, volume 2, chapters 28 and 31.\n", "Wiley, New York.\n", "

\n", "\n", "\n", "

See Also

\n", "\n", "

Distributions for other standard distributions, including\n", "df for the F distribution.\n", "

\n", "\n", "\n", "

Examples

\n", "\n", "
\n",
       "require(graphics)\n",
       "\n",
       "1 - pt(1:5, df = 1)\n",
       "qt(.975, df = c(1:10,20,50,100,1000))\n",
       "\n",
       "tt <- seq(0, 10, len = 21)\n",
       "ncp <- seq(0, 6, len = 31)\n",
       "ptn <- outer(tt, ncp, function(t, d) pt(t, df = 3, ncp = d))\n",
       "t.tit <- \"Non-central t - Probabilities\"\n",
       "image(tt, ncp, ptn, zlim = c(0,1), main = t.tit)\n",
       "persp(tt, ncp, ptn, zlim = 0:1, r = 2, phi = 20, theta = 200, main = t.tit,\n",
       "      xlab = \"t\", ylab = \"non-centrality parameter\",\n",
       "      zlab = \"Pr(T <= t)\")\n",
       "\n",
       "plot(function(x) dt(x, df = 3, ncp = 2), -3, 11, ylim = c(0, 0.32),\n",
       "     main = \"Non-central t - Density\", yaxs = \"i\")\n",
       "
\n", "\n", "
[Package stats version 3.1.3 ]
" ], "text/latex": [ "\\inputencoding{utf8}\n", "\\HeaderA{TDist}{The Student t Distribution}{TDist}\n", "\\aliasA{dt}{TDist}{dt}\n", "\\aliasA{pt}{TDist}{pt}\n", "\\aliasA{qt}{TDist}{qt}\n", "\\aliasA{rt}{TDist}{rt}\n", "\\keyword{distribution}{TDist}\n", "%\n", "\\begin{Description}\\relax\n", "Density, distribution function, quantile function and random\n", "generation for the t distribution with \\code{df} degrees of freedom\n", "(and optional non-centrality parameter \\code{ncp}).\n", "\\end{Description}\n", "%\n", "\\begin{Usage}\n", "\\begin{verbatim}\n", "dt(x, df, ncp, log = FALSE)\n", "pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", "qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", "rt(n, df, ncp)\n", "\\end{verbatim}\n", "\\end{Usage}\n", "%\n", "\\begin{Arguments}\n", "\\begin{ldescription}\n", "\\item[\\code{x, q}] vector of quantiles.\n", "\\item[\\code{p}] vector of probabilities.\n", "\\item[\\code{n}] number of observations. If \\code{length(n) > 1}, the length\n", "is taken to be the number required.\n", "\\item[\\code{df}] degrees of freedom (\\eqn{> 0}{}, maybe non-integer). \\code{df\n", " = Inf} is allowed.\n", "\\item[\\code{ncp}] non-centrality parameter \\eqn{\\delta}{};\n", "currently except for \\code{rt()}, only for \\code{abs(ncp) <= 37.62}.\n", "If omitted, use the central t distribution.\n", "\\item[\\code{log, log.p}] logical; if TRUE, probabilities p are given as log(p).\n", "\\item[\\code{lower.tail}] logical; if TRUE (default), probabilities are\n", "\\eqn{P[X \\le x]}{}, otherwise, \\eqn{P[X > x]}{}.\n", "\\end{ldescription}\n", "\\end{Arguments}\n", "%\n", "\\begin{Details}\\relax\n", "The \\eqn{t}{} distribution with \\code{df} \\eqn{= \\nu}{} degrees of\n", "freedom has density\n", "\\deqn{\n", " f(x) = \\frac{\\Gamma ((\\nu+1)/2)}{\\sqrt{\\pi \\nu} \\Gamma (\\nu/2)}\n", " (1 + x^2/\\nu)^{-(\\nu+1)/2}%\n", " }{}\n", "for all real \\eqn{x}{}.\n", "It has mean \\eqn{0}{} (for \\eqn{\\nu > 1}{}) and\n", "variance \\eqn{\\frac{\\nu}{\\nu-2}}{} (for \\eqn{\\nu > 2}{}).\n", "\n", "The general \\emph{non-central} \\eqn{t}{}\n", "with parameters \\eqn{(\\nu, \\delta)}{} \\code{= (df, ncp)}\n", "is defined as the distribution of\n", "\\eqn{T_{\\nu}(\\delta) := (U + \\delta)/\\sqrt{V/\\nu}}{}\n", "where \\eqn{U}{} and \\eqn{V}{} are independent random\n", "variables, \\eqn{U \\sim {\\cal N}(0,1)}{} and\n", "\\eqn{V \\sim \\chi^2_\\nu}{} (see \\LinkA{Chisquare}{Chisquare}).\n", "\n", "The most used applications are power calculations for \\eqn{t}{}-tests:\\\\{}\n", "Let \\eqn{T = \\frac{\\bar{X} - \\mu_0}{S/\\sqrt{n}}}{}\n", "where\n", "\\eqn{\\bar{X}}{} is the \\code{\\LinkA{mean}{mean}} and \\eqn{S}{} the sample standard\n", "deviation (\\code{\\LinkA{sd}{sd}}) of \\eqn{X_1, X_2, \\dots, X_n}{} which are\n", "i.i.d. \\eqn{{\\cal N}(\\mu, \\sigma^2)}{}\n", "Then \\eqn{T}{} is distributed as non-central \\eqn{t}{} with\n", "\\code{df}\\eqn{{} = n-1}{}\n", "degrees of freedom and \\bold{n}on-\\bold{c}entrality \\bold{p}arameter\n", "\\code{ncp}\\eqn{{} = (\\mu - \\mu_0) \\sqrt{n}/\\sigma}{}.\n", "\\end{Details}\n", "%\n", "\\begin{Value}\n", "\\code{dt} gives the density,\n", "\\code{pt} gives the distribution function,\n", "\\code{qt} gives the quantile function, and\n", "\\code{rt} generates random deviates.\n", "\n", "Invalid arguments will result in return value \\code{NaN}, with a warning.\n", "\n", "The length of the result is determined by \\code{n} for\n", "\\code{rt}, and is the maximum of the lengths of the\n", "numerical arguments for the other functions. \n", "\n", "The numerical arguments other than \\code{n} are recycled to the\n", "length of the result. Only the first elements of the logical\n", "arguments are used.\n", "\\end{Value}\n", "%\n", "\\begin{Note}\\relax\n", "Supplying \\code{ncp = 0} uses the algorithm for the non-central\n", "distribution, which is not the same algorithm used if \\code{ncp} is\n", "omitted. This is to give consistent behaviour in extreme cases with\n", "values of \\code{ncp} very near zero.\n", "\n", "The code for non-zero \\code{ncp} is principally intended to be used\n", "for moderate values of \\code{ncp}: it will not be highly accurate,\n", "especially in the tails, for large values.\n", "\\end{Note}\n", "%\n", "\\begin{Source}\\relax\n", "The central \\code{dt} is computed via an accurate formula\n", "provided by Catherine Loader (see the reference in \\code{\\LinkA{dbinom}{dbinom}}).\n", "\n", "For the non-central case of \\code{dt}, C code contributed by\n", "Claus Ekstrøm based on the relationship (for\n", "\\eqn{x \\neq 0}{}) to the cumulative distribution.\n", "\n", "For the central case of \\code{pt}, a normal approximation in the\n", "tails, otherwise via \\code{\\LinkA{pbeta}{pbeta}}.\n", "\n", "For the non-central case of \\code{pt} based on a C translation of\n", "\n", "Lenth, R. V. (1989). \\emph{Algorithm AS 243} ---\n", "Cumulative distribution function of the non-central \\eqn{t}{} distribution,\n", "\\emph{Applied Statistics} \\bold{38}, 185--189.\n", "\n", "This computes the lower tail only, so the upper tail suffers from\n", "cancellation and a warning will be given when this is likely to be\n", "significant.\n", "\n", "For central \\code{qt}, a C translation of\n", "\n", "Hill, G. W. (1970) Algorithm 396: Student's t-quantiles.\n", "\\emph{Communications of the ACM}, \\bold{13(10)}, 619--620.\n", "\n", "altered to take account of\n", "\n", "Hill, G. W. (1981) Remark on Algorithm 396, \\emph{ACM Transactions on\n", "Mathematical Software}, \\bold{7}, 250--1.\n", "\n", "The non-central case is done by inversion.\n", "\\end{Source}\n", "%\n", "\\begin{References}\\relax\n", "Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)\n", "\\emph{The New S Language}.\n", "Wadsworth \\& Brooks/Cole. (Except non-central versions.)\n", "\n", "Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995)\n", "\\emph{Continuous Univariate Distributions}, volume 2, chapters 28 and 31.\n", "Wiley, New York.\n", "\\end{References}\n", "%\n", "\\begin{SeeAlso}\\relax\n", "\\LinkA{Distributions}{Distributions} for other standard distributions, including\n", "\\code{\\LinkA{df}{df}} for the F distribution.\n", "\\end{SeeAlso}\n", "%\n", "\\begin{Examples}\n", "\\begin{ExampleCode}\n", "require(graphics)\n", "\n", "1 - pt(1:5, df = 1)\n", "qt(.975, df = c(1:10,20,50,100,1000))\n", "\n", "tt <- seq(0, 10, len = 21)\n", "ncp <- seq(0, 6, len = 31)\n", "ptn <- outer(tt, ncp, function(t, d) pt(t, df = 3, ncp = d))\n", "t.tit <- \"Non-central t - Probabilities\"\n", "image(tt, ncp, ptn, zlim = c(0,1), main = t.tit)\n", "persp(tt, ncp, ptn, zlim = 0:1, r = 2, phi = 20, theta = 200, main = t.tit,\n", " xlab = \"t\", ylab = \"non-centrality parameter\",\n", " zlab = \"Pr(T <= t)\")\n", "\n", "plot(function(x) dt(x, df = 3, ncp = 2), -3, 11, ylim = c(0, 0.32),\n", " main = \"Non-central t - Density\", yaxs = \"i\")\n", "\\end{ExampleCode}\n", "\\end{Examples}" ], "text/plain": [ "TDist package:stats R Documentation\n", "\n", "_\bT_\bh_\be _\bS_\bt_\bu_\bd_\be_\bn_\bt _\bt _\bD_\bi_\bs_\bt_\br_\bi_\bb_\bu_\bt_\bi_\bo_\bn\n", "\n", "_\bD_\be_\bs_\bc_\br_\bi_\bp_\bt_\bi_\bo_\bn:\n", "\n", " Density, distribution function, quantile function and random\n", " generation for the t distribution with ‘df’ degrees of freedom\n", " (and optional non-centrality parameter ‘ncp’).\n", "\n", "_\bU_\bs_\ba_\bg_\be:\n", "\n", " dt(x, df, ncp, log = FALSE)\n", " pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", " qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", " rt(n, df, ncp)\n", " \n", "_\bA_\br_\bg_\bu_\bm_\be_\bn_\bt_\bs:\n", "\n", " x, q: vector of quantiles.\n", "\n", " p: vector of probabilities.\n", "\n", " n: number of observations. If ‘length(n) > 1’, the length is\n", " taken to be the number required.\n", "\n", " df: degrees of freedom (> 0, maybe non-integer). ‘df = Inf’ is\n", " allowed.\n", "\n", " ncp: non-centrality parameter delta; currently except for ‘rt()’,\n", " only for ‘abs(ncp) <= 37.62’. If omitted, use the central t\n", " distribution.\n", "\n", "log, log.p: logical; if TRUE, probabilities p are given as log(p).\n", "\n", "lower.tail: logical; if TRUE (default), probabilities are P[X <= x],\n", " otherwise, P[X > x].\n", "\n", "_\bD_\be_\bt_\ba_\bi_\bl_\bs:\n", "\n", " The t distribution with ‘df’ = n degrees of freedom has density\n", "\n", " f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)\n", " \n", " for all real x. It has mean 0 (for n > 1) and variance n/(n-2)\n", " (for n > 2).\n", "\n", " The general _non-central_ t with parameters (df, Del) ‘= (df,\n", " ncp)’ is defined as the distribution of T(df, Del) := (U + Del) /\n", " sqrt(V/df) where U and V are independent random variables, U ~\n", " N(0,1) and V ~ chi^2(df) (see Chisquare).\n", "\n", " The most used applications are power calculations for t-tests:\n", " Let T= (mX - m0) / (S/sqrt(n)) where mX is the ‘mean’ and S the\n", " sample standard deviation (‘sd’) of X_1, X_2, ..., X_n which are\n", " i.i.d. N(mu, sigma^2) Then T is distributed as non-central t with\n", " ‘df’= n - 1 degrees of freedom and *n*on-*c*entrality *p*arameter\n", " ‘ncp’ = (mu - m0) * sqrt(n)/sigma.\n", "\n", "_\bV_\ba_\bl_\bu_\be:\n", "\n", " ‘dt’ gives the density, ‘pt’ gives the distribution function, ‘qt’\n", " gives the quantile function, and ‘rt’ generates random deviates.\n", "\n", " Invalid arguments will result in return value ‘NaN’, with a\n", " warning.\n", "\n", " The length of the result is determined by ‘n’ for ‘rt’, and is the\n", " maximum of the lengths of the numerical arguments for the other\n", " functions.\n", "\n", " The numerical arguments other than ‘n’ are recycled to the length\n", " of the result. Only the first elements of the logical arguments\n", " are used.\n", "\n", "_\bN_\bo_\bt_\be:\n", "\n", " Supplying ‘ncp = 0’ uses the algorithm for the non-central\n", " distribution, which is not the same algorithm used if ‘ncp’ is\n", " omitted. This is to give consistent behaviour in extreme cases\n", " with values of ‘ncp’ very near zero.\n", "\n", " The code for non-zero ‘ncp’ is principally intended to be used for\n", " moderate values of ‘ncp’: it will not be highly accurate,\n", " especially in the tails, for large values.\n", "\n", "_\bS_\bo_\bu_\br_\bc_\be:\n", "\n", " The central ‘dt’ is computed via an accurate formula provided by\n", " Catherine Loader (see the reference in ‘dbinom’).\n", "\n", " For the non-central case of ‘dt’, C code contributed by Claus\n", " Ekstrøm based on the relationship (for x != 0) to the cumulative\n", " distribution.\n", "\n", " For the central case of ‘pt’, a normal approximation in the tails,\n", " otherwise via ‘pbeta’.\n", "\n", " For the non-central case of ‘pt’ based on a C translation of\n", "\n", " Lenth, R. V. (1989). _Algorithm AS 243_ - Cumulative distribution\n", " function of the non-central t distribution, _Applied Statistics_\n", " *38*, 185-189.\n", "\n", " This computes the lower tail only, so the upper tail suffers from\n", " cancellation and a warning will be given when this is likely to be\n", " significant.\n", "\n", " For central ‘qt’, a C translation of\n", "\n", " Hill, G. W. (1970) Algorithm 396: Student's t-quantiles.\n", " _Communications of the ACM_, *13(10)*, 619-620.\n", "\n", " altered to take account of\n", "\n", " Hill, G. W. (1981) Remark on Algorithm 396, _ACM Transactions on\n", " Mathematical Software_, *7*, 250-1.\n", "\n", " The non-central case is done by inversion.\n", "\n", "_\bR_\be_\bf_\be_\br_\be_\bn_\bc_\be_\bs:\n", "\n", " Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n", " Language_. Wadsworth & Brooks/Cole. (Except non-central\n", " versions.)\n", "\n", " Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) _Continuous\n", " Univariate Distributions_, volume 2, chapters 28 and 31. Wiley,\n", " New York.\n", "\n", "_\bS_\be_\be _\bA_\bl_\bs_\bo:\n", "\n", " Distributions for other standard distributions, including ‘df’ for\n", " the F distribution.\n", "\n", "_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:\n", "\n", " require(graphics)\n", " \n", " 1 - pt(1:5, df = 1)\n", " qt(.975, df = c(1:10,20,50,100,1000))\n", " \n", " tt <- seq(0, 10, len = 21)\n", " ncp <- seq(0, 6, len = 31)\n", " ptn <- outer(tt, ncp, function(t, d) pt(t, df = 3, ncp = d))\n", " t.tit <- \"Non-central t - Probabilities\"\n", " image(tt, ncp, ptn, zlim = c(0,1), main = t.tit)\n", " persp(tt, ncp, ptn, zlim = 0:1, r = 2, phi = 20, theta = 200, main = t.tit,\n", " xlab = \"t\", ylab = \"non-centrality parameter\",\n", " zlab = \"Pr(T <= t)\")\n", " \n", " plot(function(x) dt(x, df = 3, ncp = 2), -3, 11, ylim = c(0, 0.32),\n", " main = \"Non-central t - Density\", yaxs = \"i\")\n", " " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "help(pt)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
0.1908651
0.9720682
0.4643426
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t 0.1908651\\\\\n", "\t 0.9720682\\\\\n", "\t 0.4643426\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 0.190865080036943\n", "2. 0.9720681690153\n", "3. 0.46434261452647\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] 0.1908651\n", "[2,] 0.9720682\n", "[3,] 0.4643426" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how surprising is it for our estimates to be off by so much?\n", "2*pt(abs(tstat),1, lower.tail = F) # estimated p-values: twice the upper tail probability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is, there's about a 19% chance that the estimated value of $h$ would differ from the true value of $h$ by as much as it did, assuming the model is true—which it is (ignoring the distinction between truly random and pseudo-random)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standard errors and the \"hat matrix\"\n", "\n", "Below we prove that $\\hat{\\sigma}^2$ is an unbiased estimate of $\\sigma^2$ if the model is true.\n", "\n", "The key to the proof is the \"hat matrix\" $H$, so called because it puts a \"hat\" on $Y$ (i.e., $HY = \\hat{Y} = X\\hat{\\beta}$).\n", "\n", "We have\n", "\n", "$$\n", "\\hat{Y} = X \\hat{\\beta} = X(X'X)^{-1}X' Y = HY,\n", "$$\n", "where \n", "$$\n", "H \\equiv X(X'X)^{-1}X'.\n", "$$\n", "\n", "The hat matrix has many useful properties:\n", "\n", "1. $Y - \\hat{Y} = ({\\mathbf 1}_n - H)Y$.\n", "1. $H' = (X(X'X)^{-1}X')' = H$ ($H$ is symmetric)\n", "1. $H^2 = (X(X'X)^{-1}X')(X(X'X)^{-1}X') = H$ ($H$ is _idempotent_)\n", "1. $HX = X(X'X)^{-1}X'X = X$\n", "1. $e = Y - HY \\perp X$ (the residual vector is orthogonal to $X$; i.e., $H$ _projects_ $Y$ onto the subspace spanned by the columns of $X$)\n", "1. $({\\mathbf 1}_n - H)X = 0$\n", "1. $({\\mathbf 1}_n - H)H = {\\mathbf 0}_n$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now prove that $\\hat{\\sigma}^2 = \\frac{1}{n-p} \\sum_{i=1}^n e_i^2$ is an unbiased estimate of $\\sigma^2$.\n", "\n", "\n", "
\n", "__Theorem.__\n", "${\\mathbb E}(\\hat{\\sigma}^2 \\mid X) = \\sigma^2$.\n", "\n", "_Proof._\n", "\n", "$$\n", "e = ({\\mathbf 1}_n - H)Y = ({\\mathbf 1}_n - H)(X\\beta + \\epsilon \n", "$$\n", "$$\n", "= ({\\mathbf 1}_n - H)X\\beta + ({\\mathbf 1}_n - H)\\epsilon = ({\\mathbf 1}_n - H)\\epsilon.\n", "$$\n", "\n", "Hence,\n", "\n", "$$\n", "\\| e \\|^2 = \\epsilon'({\\mathbf 1}_n - H)'({\\mathbf 1}_n - H)\\epsilon = \\epsilon' \\tilde{H} \\epsilon,\n", "$$\n", "where \n", "$$\n", "\\tilde{H} = ({\\mathbf 1}_n - H)'({\\mathbf 1}_n - H) = ({\\mathbf 1}_n - H) - H({\\mathbf 1}_n - H) = ({\\mathbf 1}_n - H).\n", "$$\n", "\n", "$$\n", "{\\mathbb E} \\left ( \\epsilon'\\tilde{H}\\epsilon \\mid X \\right ) = \n", "$$\n", "[TO DO: finish]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OLS is BLUE\n", "\n", "The Gauss-Markov Theorem says that if $Y = X\\beta + \\epsilon$ with $X$ fixed, \n", "and the components of $\\epsilon$ are uncorrelated and have mean zero and common variance $\\sigma^2$, then\n", "the linear estimator of $\\beta$ that has smallest variance among all unbiased estimates\n", "(the _best linear unbiased estimator_ or _BLUE_) is the ordinary least squares estimate \n", "$$\n", "\\hat{\\beta}_{\\mbox{OLS}} = (X'X)^{-1}X'Y.\n", "$$\n", "\n", "\n", "[TO DO: add proof.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized least squares\n", "\n", "The model we have been studying assumes that $\\epsilon$ and $X$ are independent and that the components of $\\epsilon$ are uncorrelated, have zero mean, and have the same variance $\\sigma^2$ (i.e., ${\\mathbb E} \\epsilon = 0$, and \n", "$\\mbox{cov } \\epsilon = \\sigma^2 {\\mathbf 1}_n$).\n", "\n", "When the covariance matrix is not diagonal but we still have ${\\mathbb E} \\epsilon = 0$ and $\\epsilon$ independent of $X$, the ordinary least squares estimate is still unbiased. \n", "\n", "$$\n", " {\\mathbb E} (\\hat{\\beta}_{\\mbox{OLS}} \\mid X) = {\\mathbb E} \\left ( (X'X)^{-1}X'Y \\mid X \\right )\n", "$$\n", "$$\n", "= {\\mathbb E} \\left ( (X'X)^{-1}X'(X\\beta + \\epsilon) \\mid X \\right ) =\n", " {\\mathbb E} \\left ( \\beta + (X'X)^{-1}X'\\epsilon \\mid X \\right )\n", "$$\n", "$$\n", "= \\beta + (X'X)^{-1}X'{\\mathbb E}(\\epsilon \\mid X) = \\beta + 0 = \\beta.\n", "$$\n", "\n", "Moreover,\n", "$$\n", "{\\mbox cov } (\\hat{\\beta}_{\\mbox{OLS}} \\mid X) = {\\mathbb E} \\left ( \\left [ (X'X)^{-1}X'\\epsilon \\right ] \n", "\\left [(X'X)^{-1}X'\\epsilon \\right ]' \\mid X \\right )\n", "$$\n", "$$\n", "= {\\mathbb E} \\left ( (X'X)^{-1} X' \\epsilon \\epsilon' X (X'X)^{-1} \\mid X \\right )\n", "$$\n", "$$\n", "= (X'X)^{-1}X'{\\mathbb E} (\\epsilon \\epsilon' \\mid X) X (X'X)^{-1}\n", "$$\n", "$$\n", "= (X'X)^{-1}X' \\Sigma X (X'X)^{-1}.\n", "$$\n", "\n", "In general, this is not diagonal and the OLS estimate is not longer BLUE\n", "unless $\\Sigma = \\sigma^2 {\\mathbf 1}_n$.\n", "\n", "If the covariance matrix ${\\mbox cov }\\epsilon = \\Sigma$ is known and positive definite, we can take it into account through _generalized least squares_ (GLS), as follows.\n", "Since $\\Sigma$ is positive definite, its square root $\\Sigma^{1/2}$ exists and is invertible.\n", "We change the problem by pre-multiplying by $\\Sigma^{-1/2}$:\n", "\n", "$$\n", " \\Sigma^{-1/2}Y = (\\Sigma^{-1/2}X) \\beta + \\Sigma^{-1/2} \\epsilon.\n", "$$\n", "\n", "In this new problem the unknown parameter is still $\\beta$, but there is a different \"noise,\" $\\Sigma^{-1/2}\\epsilon$,\n", "we have new data $\\Sigma^{-1/2}Y$, and we have a new design matrix $\\Sigma^{-1/2} X$.\n", "Since $\\Sigma$, $Y$, and $X$ are known, we can set up this regression problem using only things we know.\n", "\n", "The components of the new noise terms are uncorrelated:\n", "\n", "$$\n", " \\mbox{cov } (\\Sigma^{-1/2} \\epsilon | X) = {\\mathbb E} (\\Sigma^{-1/2} \\epsilon \\epsilon' \\Sigma^{-1/2} \\mid X)\n", "$$\n", "$$\n", "= \\Sigma^{-1/2} {\\mathbb E} ( \\epsilon \\epsilon' \\mid X) \\Sigma^{-1/2} = \\Sigma^{-1/2} \\Sigma \\Sigma^{-1/2} = \n", "{\\mathbf 1}_n.\n", "$$\n", "This transforms the original problem into a problem where the covariance matrix of the new noise is a diagonal\n", "with equal elements.\n", "OLS is BLUE for this problem.\n", "That leads to the estimator\n", "$$\n", "\\hat{\\beta}_{\\mbox{GLS}} \\equiv \\left [ (\\Sigma^{-1/2} X)'(\\Sigma^{-1/2} X) \\right ]^{-1}(\\Sigma^{-1/2} X)'\\Sigma^{-1/2}Y\n", "$$\n", "$$\n", "= \\left [ X'\\Sigma^{-1/2}\\Sigma^{-1/2}X \\right ] ^{-1} X' \\Sigma^{-1/2}\\Sigma^{-1/2}Y =\n", "(X' \\Sigma^{-1}X)^{-1}X'\\Sigma^{-1} Y.\n", "$$\n", "\n", "To construct this estimator, we need $X'\\Sigma^{-1}X$ to be invertible.\n", "Is it?\n", "$$\n", "X'\\Sigma^{-1}X = (\\Sigma^{-1/2}X)'(\\Sigma^{-1/2}X).\n", "$$\n", "The matrix $\\Sigma^{-1/2}X$ is full rank, so we are OK.\n", "\n", "It's straightforward to show that $\\hat{\\beta}_{\\mbox{GLS}}$ is conditionally unbiased given $X$.\n", "Also,\n", "\n", "$$\n", " \\mbox{cov } (\\hat{\\beta}_{\\mbox{GLS}} \\mid X) = (\\tilde{X}'\\tilde{X})^{-1},\n", "$$\n", "where $\\tilde{X} = \\Sigma^{-1/2}X$, i.e.,\n", "$$\n", " \\mbox{cov } (\\hat{\\beta}_{\\mbox{GLS}} \\mid X) = \\left [ (\\Sigma^{-1/2}X)'(\\Sigma^{-1/2}X) \\right ] ^{-1}\n", " = (X' \\Sigma^{-1} X)^{-1}.\n", "$$\n", "\n", "For fixed $X$, $\\hat{\\beta}_{\\mbox{GLS}}$ is BLUE.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: weighted least squares\n", "\n", "Suppose $\\Sigma = \\lambda \\Gamma$, where $\\lambda$ is an unknown positive constant and\n", "$\\Gamma$ is a known, positive definite matrix.\n", "\n", "We re-write the regression equation as\n", "\n", "$$\n", " (\\Gamma^{-1/2} Y) = (\\Gamma^{-1/2}X) \\beta + (\\Gamma^{-1/2} \\epsilon).\n", "$$\n", "\n", "Then we still have ${\\mathbb E} (\\Gamma^{-1/2}\\epsilon) = 0$ and\n", "$$\n", "\\mbox{cov } (\\Gamma^{-1/2} \\epsilon) = {\\mathbb E} ((\\Gamma^{-1/2} \\epsilon)(\\Gamma^{-1/2} \\epsilon)')\n", "$$\n", "$$\n", "= \n", " {\\mathbb E} (\\Gamma^{-1/2} \\epsilon \\epsilon' \\Gamma^{-1/2}) = \\Gamma^{-1/2} {\\mathbb E}( \\epsilon \\epsilon') \\Gamma^{-1/2}\n", "$$\n", "$$\n", "= \\Gamma^{-1'2} \\lambda \\Gamma \\Gamma^{-1/2} = \\lambda {\\mathbf 1}_n.\n", "$$\n", "\n", "Thus, the new problem fits the standard OLS model with $\\lambda = \\sigma^2$, and we have\n", "\n", "$$\n", "\\hat{\\beta} = \\left ( ( \\Gamma^{-1/2}X)'(\\Gamma^{-1/2}X \\right )^{-1} (\\Gamma^{-1/2}X)'(\\Gamma^{-1/2}Y)\n", "= (X'\\Gamma^{-1}X)^{-1}X'\\Gamma^{-1}Y.\n", "$$\n", "\n", "Also,\n", "\n", "$$\n", "\\mbox{cov } \\hat{\\beta} = \\lambda \\left ( (\\Gamma^{-1/2}X)'(\\Gamma^{-1/2}X) \\right )^{-1} = \\lambda (X'\\Gamma^{-1}X)^{-1}.\n", "$$\n", "\n", "An unbiased estimate of $\\lambda$ is\n", "\n", "$$\n", "\\hat{\\lambda} = \\frac{1}{n-p} \\sum_{i=1}^n e_i^2 =\n", "\\frac{1}{n-p} \\sum_{i=1}^n \\left [ (\\Gamma^{-1/2} Y)_j - (\\Gamma^{-1/2}X\\hat{\\beta})_j \\right ]^2\n", "$$\n", "$$\n", "= \\frac{1}{n-p} \\sum_{i=1}^n \\left [ \\Gamma^{-1/2}(Y - X \\hat{\\beta} )\\right ]_j^2.\n", "$$\n", "\n", "The estimated covariance matrix of $\\hat{\\beta}$ is\n", "\n", "$$\n", "\\hat{\\mbox{cov}} \\hat{\\beta} = \\hat{\\lambda} (X'\\Gamma^{-1}X)^{-1}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numerical example\n", "\n", "Suppose that the measuring instrument the falling weight example has correlated errors with\n", "covariance matrix $\\lambda \\Gamma$, with $\\lambda > 0$ unknown and\n", "\n", "$$\n", "\\Gamma = \\begin{pmatrix}\n", " 1 & 0.2 & 0.1 & 0.1 \\\\\n", " 0.2 & 1 & 0.2 & 0.1 \\\\\n", " 0.1 & 0.2 & 1 & 0.2 \\\\\n", " 0.1 & 0.1 & 0.2 & 1\n", " \\end{pmatrix}\n", " .\n", "$$\n", "\n", "Let's see how this changes the estimate." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
1.00.20.10.1
0.21.00.20.1
0.10.21.00.2
0.10.10.21.0
\n" ], "text/latex": [ "\\begin{tabular}{llll}\n", "\t 1.0 & 0.2 & 0.1 & 0.1\\\\\n", "\t 0.2 & 1.0 & 0.2 & 0.1\\\\\n", "\t 0.1 & 0.2 & 1.0 & 0.2\\\\\n", "\t 0.1 & 0.1 & 0.2 & 1.0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 1\n", "2. 0.2\n", "3. 0.1\n", "4. 0.1\n", "5. 0.2\n", "6. 1\n", "7. 0.2\n", "8. 0.1\n", "9. 0.1\n", "10. 0.2\n", "11. 1\n", "12. 0.2\n", "13. 0.1\n", "14. 0.1\n", "15. 0.2\n", "16. 1\n", "\n", "\n" ], "text/plain": [ " [,1] [,2] [,3] [,4]\n", "[1,] 1.0 0.2 0.1 0.1\n", "[2,] 0.2 1.0 0.2 0.1\n", "[3,] 0.1 0.2 1.0 0.2\n", "[4,] 0.1 0.1 0.2 1.0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
1.98456
0.01271
10.1655
\n" ], "text/latex": [ "\\begin{tabular}{l}\n", "\t 1.98456\\\\\n", "\t 0.01271\\\\\n", "\t 10.1655\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 1.98456000000002\n", "2. 0.0127099999999441\n", "3. 10.1655\n", "\n", "\n" ], "text/plain": [ " [,1]\n", "[1,] 1.98456\n", "[2,] 0.01271\n", "[3,] 10.16550" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# R example.\n", "gam <- matrix(c(1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1, .2, .1, .1, .2, 1), nrow=4, ncol=4, byrow=T)\n", "gam\n", "decomp <- eigen(gam)\n", "sqGamInv <- decomp$vectors %*% diag(1/sqrt(decomp$values)) %*% t(decomp$vectors);\n", "gY <- sqGamInv %*% Y;\n", "gX <- sqGamInv %*% X;\n", "betahat_GLS <- solve(t(gX) %*% gX, t(gX) %*% gY )\n", "betahat_GLS # the GLS estimate " ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.142301400000001" ], "text/latex": [ "0.142301400000001" ], "text/markdown": [ "0.142301400000001" ], "text/plain": [ "[1] 0.1423014" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Estimate lambda\n", "lambdahat <- sum((gY - gX %*% betahat_GLS)^2)/(dims[1]-dims[2]);\n", "lambdahat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feasible GLS\n", "\n", "Usually $\\Sigma$ is unknown: we have to estimate it to take advantage of the covariance of $\\epsilon$.\n", "But a covariance matrix has $n^2$ unknowns, and we only have $n$ data, so it's impossible to estimate $\\Sigma$\n", "without constraints.\n", "(There are $(n^2+n)/2$ unknowns, since $\\Sigma$ is symmetric.)\n", "\n", "Substituting an estimate $\\hat{\\Sigma}$ for $\\Sigma$ in GLS is called _feasible GLS_ or the _Aitken method_.\n", "\n", "The resulting _feasible GLS estimate_ is\n", "\n", "$$\n", "\\hat{\\beta}_{\\mbox{FGLS}} \\equiv (X'\\hat{\\Sigma}X)^{-1}X' \\hat{\\Sigma}^{-1}Y,\n", "$$\n", "which has estimated covariance\n", "$$\n", "\\hat{\\mbox{cov}}(\\hat{\\beta}_{\\mbox{FGLS}} \\mid X) = (X' \\hat{\\Sigma}^{-1} X)^{-1}.\n", "$$\n", "\n", "Sometimes, this is a good estimator; sometimes it has terrible performance.\n", "Moreover, FGLS is a _nonlinear_ estimator, and is usually biased. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dummy variables\n", "\n", "Imagine a covariate that is binary or categorical, such as gender, or whether a subject\n", "in an experiment receives a treatment, or which treatment the subject receives.\n", "\n", "Dummy variables are a way to inforporate such a covariate into a regression model.\n", "That is not always the right way to incorporate such covariates—especially in\n", "the context of an experiment—but it is extremely common in practice.\n", "\n", "Suppose there is a covariate that can take any of $k$ levels. \n", "Augment the list of variables in the model by including $k-1$ new variables.\n", "The $i$th such variable will be equal to 1 if the case in question has the $i$th\n", "level of treatment and 0 if not, for $i=1, \\ldots, k-1$.\n", "The $k$th level corresponds to having all $k-1$ of those variables equal to zero.\n", "If we used $k$ such variables, they would be linearly dependent:\n", "the $k$th variable would be 1 minus the sum of the first $k-1$\n", "variables.\n", "\n", "### What does the model say?\n", "\n", "[To do: discuss constancy assumptions, additive effects, independence of $\\epsilon$ and $X$, equal variances, uncorrelated, etc.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Illustration: Student Evaluations\n", "\n", "We will deliberately mis-analyze the MacNell et al. (2014) data using a dummy variable for\n", "the identified gender of the instructor and a dummy variable for the true gender of\n", "the instructor.\n", "Note that the assumptions of the regression model are __not__ satisfied here.\n", "They are violated in many ways, including the lack of any justification for\n", "an underlying linear model.\n", "The permutation analysis in the chapter on inference is far preferable,\n", "in part because it matches the randomization that was used in the experiment.\n", "Regression modeling ignores that randomization.\n", "\n", "However, regression can sometimes adjust for the effect of confounding variables.\n", "We will illustrate this by example—but the circumstances in\n", "which the approach gives reliable inferences are limited, and typically cannot be checked\n", "from the data themselves.\n", "\n", "### The model\n", "\n", "Here is an example of a regression model for student evaluations:\n", "\n", "$$\n", " Y = X \\beta + \\epsilon,\n", "$$\n", "\n", "where $Y_i$ is the rating assigned by student $i$; $X_{i1} = 1$ for all $i$ (to account for the\n", "fact that a typical rating is about 4),\n", "and $X_{i2} = 1$ if student $i$ was assigned to an instructor identified as\n", "male, and $X_{i2} = 0$ student $i$ was assigned to an instructor identified as\n", "female; where $\\epsilon$ and $X$ are independent; and the components of $\\epsilon$ \n", "are uncorrelated, have zero mean,\n", "have common variance $\\sigma^2$.\n", "\n", "\n", "In this model, the rating student $i$ assigns to an instructor is $\\beta_1 + \\epsilon_i$ \n", "if the instructor\n", "is apparently female and $\\beta_1+ \\beta_2 + \\epsilon_i$ if the instructor is\n", "apparently male. \n", "instructor; $\\beta_1$ if assigned apparently female.\n", "\n", "[Show that $\\epsilon$ and $X$ are not independent for reasonable ticket models.\n", "Show ${\\mathbb E} \\epsilon \\ne 0$ for reasonable ticket models. Etc. Note that both\n", "have non-interference built in.]\n", "\n", "In finding $p$-values in the regression model, the randomness is in $\\epsilon$, the\n", "(invented) randomness of individual students' responses.\n", "The $p$-values are with respect to different realizations of the individual \"noise\"\n", "terms.\n", "The randomness in assigning students to sections of the course is hidden in $X$, but\n", "the regression—and the $p$-value calculation—are conditional on $X$.\n", "\n", "In the \"ticket\" model, the randomness comes from the random assignment of students to \n", "sections of the class, and the $p$-value calculations imagine different random\n", "assignments.\n", "\n", "Both models assume that each student's response does not depend on other students'\n", "assignments to sections, the _non interference_ assumption.\n", "That assumption might not be accurate if students study in groups, but\n", "\n", "The regression model assumes that $X$ and $\\epsilon$ are independent.\n", "Suppose that the expected rating both for\n", "apparently male and apparently female instructors is 4 (corresponding to $\\beta_1 = 4$ and $\\beta_2 = 0$).\n", "Suppose that student $i$ would rate an apparently female instructor 5 and an\n", "apparently male instructor 3.\n", "Then $X_{i2}$ and $\\epsilon_i$ are not independent: $\\epsilon_i = -1$ if $X_{i2} = 1$\n", "amd $\\epsilon_i = +1$ if $X_{i2} = 0$.\n", "\n", "The regression model assumes that $\\epsilon_i$ has mean zero, for all $i$,\n", "so every student's expected rating of an apparently female instructor is the same,\n", "amd every student's expected rating of an apparently male instructor is the same.\n", "There is no reason that should be true.\n", "Again, suppose that the average rating both for\n", "apparently male and for apparently female instructors is 4.\n", "Imagine that student $i$ will always rate an instructor 5, regardless of the instructor's apparent\n", "gender.\n", "Then $\\epsilon_i = 1$ and hence ${\\mathbb E} \\epsilon_i = +1$, violating\n", "this assumption of the regression model.\n", "\n", "In short, the assumptions of the regression model seem highly contrived." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-0.6087 0.3913 0.3913 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.6087 0.6134 7.513 2.71e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.942 on 42 degrees of freedom\n", "Multiple R-squared: 0.5734,\tAdjusted R-squared: 0.5632 \n", "F-statistic: 56.45 on 1 and 42 DF, p-value: 2.706e-09\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-0.6087 0.3913 0.3913 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.6087 0.6134 7.513 2.71e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.942 on 42 degrees of freedom\n", "Multiple R-squared: 0.5734,\tAdjusted R-squared: 0.5632 \n", "F-statistic: 56.45 on 1 and 42 DF, p-value: 2.706e-09\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.1739 -0.1739 0.8261 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.1739 0.5668 7.364 4.4e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.718 on 42 degrees of freedom\n", "Multiple R-squared: 0.5636,\tAdjusted R-squared: 0.5532 \n", "F-statistic: 54.23 on 1 and 42 DF, p-value: 4.397e-09\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.1739 -0.1739 0.8261 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.1739 0.5566 7.499 2.84e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.67 on 42 degrees of freedom\n", "Multiple R-squared: 0.5724,\tAdjusted R-squared: 0.5622 \n", "F-statistic: 56.23 on 1 and 42 DF, p-value: 2.839e-09\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.2174 -0.2174 0.7826 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.2174 0.5673 7.434 3.51e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.721 on 42 degrees of freedom\n", "Multiple R-squared: 0.5682,\tAdjusted R-squared: 0.5579 \n", "F-statistic: 55.26 on 1 and 42 DF, p-value: 3.505e-09\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.9565 0.0435 1.0435 3.5000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.9565 0.5488 7.209 7.31e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.632 on 42 degrees of freedom\n", "Multiple R-squared: 0.5531,\tAdjusted R-squared: 0.5424 \n", "F-statistic: 51.97 on 1 and 42 DF, p-value: 7.309e-09\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.1739 -0.1739 0.8261 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.1739 0.5803 7.193 7.72e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.783 on 42 degrees of freedom\n", "Multiple R-squared: 0.5519,\tAdjusted R-squared: 0.5413 \n", "F-statistic: 51.73 on 1 and 42 DF, p-value: 7.718e-09\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.3478 -0.3478 0.6522 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.3478 0.5519 7.878 8.28e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.647 on 42 degrees of freedom\n", "Multiple R-squared: 0.5964,\tAdjusted R-squared: 0.5868 \n", "F-statistic: 62.07 on 1 and 42 DF, p-value: 8.284e-10\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.9565 0.0435 1.0435 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.9565 0.5563 7.112 1.01e-08 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.668 on 42 degrees of freedom\n", "Multiple R-squared: 0.5463,\tAdjusted R-squared: 0.5355 \n", "F-statistic: 50.58 on 1 and 42 DF, p-value: 1.005e-08\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.2609 -0.2609 0.7391 3.5000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.2609 0.5388 7.908 7.52e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.584 on 42 degrees of freedom\n", "Multiple R-squared: 0.5982,\tAdjusted R-squared: 0.5887 \n", "F-statistic: 62.54 on 1 and 42 DF, p-value: 7.523e-10\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-0.8696 0.1304 1.1304 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.8696 0.5643 6.857 2.33e-08 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.706 on 42 degrees of freedom\n", "Multiple R-squared: 0.5282,\tAdjusted R-squared: 0.517 \n", "F-statistic: 47.02 on 1 and 42 DF, p-value: 2.326e-08\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.5217 0.4783 0.4783 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.5217 0.5834 7.751 1.25e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.798 on 42 degrees of freedom\n", "Multiple R-squared: 0.5886,\tAdjusted R-squared: 0.5788 \n", "F-statistic: 60.08 on 1 and 42 DF, p-value: 1.249e-09\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.3043 -0.3043 0.6957 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.3043 0.6035 7.132 9.42e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.894 on 42 degrees of freedom\n", "Multiple R-squared: 0.5477,\tAdjusted R-squared: 0.537 \n", "F-statistic: 50.86 on 1 and 42 DF, p-value: 9.421e-09\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.913 0.087 1.087 4.000 5.000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.9130 0.5663 6.909 1.96e-08 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.716 on 42 degrees of freedom\n", "Multiple R-squared: 0.532,\tAdjusted R-squared: 0.5208 \n", "F-statistic: 47.74 on 1 and 42 DF, p-value: 1.959e-08\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-1.1739 -0.1739 0.8261 4.0000 5.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.1739 0.5713 7.306 5.33e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 2.74 on 42 degrees of freedom\n", "Multiple R-squared: 0.5596,\tAdjusted R-squared: 0.5491 \n", "F-statistic: 53.37 on 1 and 42 DF, p-value: 5.327e-09\n", "\n" ] } ], "source": [ "# Use regression with a dummy variable for the instructor's identified gender \n", "ratings <- read.csv(\"Data/Macnell-RatingsData.csv\", as.is=T); # reads a .csv file into a DataFrame\n", "\n", "character <- setdiff(names(ratings),c(\"group\",\"gender\",\"tagender\",\"taidgender\",\"age\"));\n", "\n", "# for each characteristic, we will model the student's rating of the instructor as\n", "# using regression. The model will include a dummy variable for the instructor's perceived gender.\n", "#\n", "# Then we will try a variety of adjustments--for the instructor's true gender and for the\n", "# student's gender, as well as including an \"intercept\" or not\n", "\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender - 1) # omit the intercept\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the highly significant results.\n", "This is not evidence that gender matters; raterh, it is strong evidence that the average rating is not zero—if there is no intercept, having the coefficient of taidgender = 0 means that the data are just iid noice with zero mean, which is a bogus assumption! The mean of the data is large (about 4), so we are bound to reject the null hypothesis in this case." ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0000 -0.6087 0.3913 0.3913 1.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.0000 0.2303 17.371 <2e-16 ***\n", "ratings$taidgender 0.6087 0.3148 1.933 0.0601 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.03 on 41 degrees of freedom\n", "Multiple R-squared: 0.08355,\tAdjusted R-squared: 0.06119 \n", "F-statistic: 3.738 on 1 and 41 DF, p-value: 0.06012\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0000 -0.6087 0.3913 0.3913 1.0000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.0000 0.2303 17.371 <2e-16 ***\n", "ratings$taidgender 0.6087 0.3148 1.933 0.0601 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.03 on 41 degrees of freedom\n", "Multiple R-squared: 0.08355,\tAdjusted R-squared: 0.06119 \n", "F-statistic: 3.738 on 1 and 41 DF, p-value: 0.06012\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6500 -0.4120 0.3500 0.8261 1.3500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.6500 0.2313 15.783 <2e-16 ***\n", "ratings$taidgender 0.5239 0.3162 1.657 0.105 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.034 on 41 degrees of freedom\n", "Multiple R-squared: 0.06275,\tAdjusted R-squared: 0.0399 \n", "F-statistic: 2.745 on 1 and 41 DF, p-value: 0.1052\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6000 -0.1739 -0.1739 0.6130 1.4000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.6000 0.2212 16.278 <2e-16 ***\n", "ratings$taidgender 0.5739 0.3024 1.898 0.0648 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.989 on 41 degrees of freedom\n", "Multiple R-squared: 0.08076,\tAdjusted R-squared: 0.05834 \n", "F-statistic: 3.602 on 1 and 41 DF, p-value: 0.06476\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6500 -0.2174 0.3500 0.7826 1.3500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.6500 0.2329 15.675 <2e-16 ***\n", "ratings$taidgender 0.5674 0.3184 1.782 0.0821 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.041 on 41 degrees of freedom\n", "Multiple R-squared: 0.07189,\tAdjusted R-squared: 0.04925 \n", "F-statistic: 3.176 on 1 and 41 DF, p-value: 0.08215\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.50000 -0.72826 0.04348 1.04348 1.50000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.5000 0.2367 14.78 <2e-16 ***\n", "ratings$taidgender 0.4565 0.3237 1.41 0.166 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.059 on 41 degrees of freedom\n", "Multiple R-squared: 0.04627,\tAdjusted R-squared: 0.02301 \n", "F-statistic: 1.989 on 1 and 41 DF, p-value: 0.166\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7000 -0.1739 0.3000 0.8261 1.3000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.7000 0.2506 14.763 <2e-16 ***\n", "ratings$taidgender 0.4739 0.3427 1.383 0.174 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.121 on 41 degrees of freedom\n", "Multiple R-squared: 0.04457,\tAdjusted R-squared: 0.02127 \n", "F-statistic: 1.913 on 1 and 41 DF, p-value: 0.1742\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.5500 -0.3478 0.4500 0.6522 1.4500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.5500 0.2268 15.655 <2e-16 ***\n", "ratings$taidgender 0.7978 0.3101 2.573 0.0138 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.014 on 41 degrees of freedom\n", "Multiple R-squared: 0.139,\tAdjusted R-squared: 0.118 \n", "F-statistic: 6.621 on 1 and 41 DF, p-value: 0.0138\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.50000 -0.50000 0.04348 1.04348 1.50000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.5000 0.2565 13.644 <2e-16 ***\n", "ratings$taidgender 0.4565 0.3507 1.302 0.2 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.147 on 41 degrees of freedom\n", "Multiple R-squared: 0.03968,\tAdjusted R-squared: 0.01626 \n", "F-statistic: 1.694 on 1 and 41 DF, p-value: 0.2003\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.5000 -0.3804 -0.2609 0.7391 1.5000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.5000 0.2079 16.837 <2e-16 ***\n", "ratings$taidgender 0.7609 0.2842 2.677 0.0106 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9297 on 41 degrees of freedom\n", "Multiple R-squared: 0.1488,\tAdjusted R-squared: 0.128 \n", "F-statistic: 7.166 on 1 and 41 DF, p-value: 0.01064\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6500 -0.8696 0.1304 0.7402 1.3500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.6500 0.2240 16.292 <2e-16 ***\n", "ratings$taidgender 0.2196 0.3063 0.717 0.478 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.002 on 41 degrees of freedom\n", "Multiple R-squared: 0.01238,\tAdjusted R-squared: -0.01171 \n", "F-statistic: 0.5137 on 1 and 41 DF, p-value: 0.4776\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.8500 -0.5217 0.1500 0.4783 1.1500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.8500 0.1984 19.402 <2e-16 ***\n", "ratings$taidgender 0.6717 0.2713 2.476 0.0175 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.8874 on 41 degrees of freedom\n", "Multiple R-squared: 0.1301,\tAdjusted R-squared: 0.1088 \n", "F-statistic: 6.129 on 1 and 41 DF, p-value: 0.01752\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.9500 -0.3044 0.0500 0.6956 1.0500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.9500 0.2204 17.925 <2e-16 ***\n", "ratings$taidgender 0.3543 0.3013 1.176 0.246 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9855 on 41 degrees of freedom\n", "Multiple R-squared: 0.03263,\tAdjusted R-squared: 0.009038 \n", "F-statistic: 1.383 on 1 and 41 DF, p-value: 0.2464\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.50000 -0.50000 0.08696 1.08696 1.50000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.5000 0.2812 12.448 1.64e-15 ***\n", "ratings$taidgender 0.4130 0.3844 1.074 0.289 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.257 on 41 degrees of freedom\n", "Multiple R-squared: 0.02738,\tAdjusted R-squared: 0.00366 \n", "F-statistic: 1.154 on 1 and 41 DF, p-value: 0.2889\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7000 -0.1739 0.3000 0.8261 1.3000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.7000 0.2250 16.446 <2e-16 ***\n", "ratings$taidgender 0.4739 0.3076 1.541 0.131 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.006 on 41 degrees of freedom\n", "Multiple R-squared: 0.05472,\tAdjusted R-squared: 0.03167 \n", "F-statistic: 2.373 on 1 and 41 DF, p-value: 0.1311\n", "\n" ] } ], "source": [ "# include an intercept NOTE HOW BIG THE CHANGE IS!\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ ratings$taidgender ) # include the intercept\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# allow for a real difference between the two instructors by including gender (really a proxy for the two instructors)\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ \n", " ratings$taidgender \n", " + ratings$tagender \n", " - 1 ) # omit the intercept\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.3897 -0.3897 0.5483 1.1239 3.0620 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.4981 0.4110 3.645 0.000761 ***\n", "ratings$tagender 0.5136 0.4458 1.152 0.256055 \n", "ratings$gender 1.9380 0.2389 8.111 5.59e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.463 on 40 degrees of freedom\n", "Multiple R-squared: 0.8995,\tAdjusted R-squared: 0.892 \n", "F-statistic: 119.3 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.3897 -0.3897 0.5483 1.1239 3.0620 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.4981 0.4110 3.645 0.000761 ***\n", "ratings$tagender 0.5136 0.4458 1.152 0.256055 \n", "ratings$gender 1.9380 0.2389 8.111 5.59e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.463 on 40 degrees of freedom\n", "Multiple R-squared: 0.8995,\tAdjusted R-squared: 0.892 \n", "F-statistic: 119.3 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6121 -0.7091 0.0705 1.1322 3.1940 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.3523 0.4055 3.335 0.00185 ** \n", "ratings$tagender 0.3174 0.4398 0.722 0.47464 \n", "ratings$gender 1.8060 0.2358 7.661 2.29e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.444 on 40 degrees of freedom\n", "Multiple R-squared: 0.8828,\tAdjusted R-squared: 0.874 \n", "F-statistic: 100.4 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.8464 -0.6099 0.1536 0.7849 3.1583 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.3734 0.3896 3.525 0.00108 ** \n", "ratings$tagender 0.1630 0.4226 0.386 0.70182 \n", "ratings$gender 1.8417 0.2265 8.130 5.28e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.387 on 40 degrees of freedom\n", "Multiple R-squared: 0.89,\tAdjusted R-squared: 0.8818 \n", "F-statistic: 107.9 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.9570 -0.5723 0.0430 1.3926 3.1963 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.3839 0.4015 3.446 0.00135 ** \n", "ratings$tagender 0.3496 0.4355 0.803 0.42687 \n", "ratings$gender 1.8037 0.2335 7.726 1.87e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.429 on 40 degrees of freedom\n", "Multiple R-squared: 0.8865,\tAdjusted R-squared: 0.878 \n", "F-statistic: 104.1 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.3328 -0.6125 0.1081 1.1079 3.3336 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.2257 0.3855 3.180 0.00284 ** \n", "ratings$tagender 0.5591 0.4181 1.337 0.18866 \n", "ratings$gender 1.6664 0.2241 7.436 4.67e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.372 on 40 degrees of freedom\n", "Multiple R-squared: 0.8843,\tAdjusted R-squared: 0.8756 \n", "F-statistic: 101.9 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.98771 -0.26829 0.01229 1.06223 3.05609 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.25253 0.40651 3.081 0.00372 ** \n", "ratings$tagender 0.09989 0.44093 0.227 0.82193 \n", "ratings$gender 1.94391 0.23635 8.225 3.93e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.447 on 40 degrees of freedom\n", "Multiple R-squared: 0.8846,\tAdjusted R-squared: 0.876 \n", "F-statistic: 102.2 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7624 -0.5702 0.3289 1.4298 3.2653 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.6434 0.4115 3.994 0.000271 ***\n", "ratings$tagender 0.2931 0.4463 0.657 0.515176 \n", "ratings$gender 1.7347 0.2392 7.251 8.41e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.465 on 40 degrees of freedom\n", "Multiple R-squared: 0.8823,\tAdjusted R-squared: 0.8734 \n", "F-statistic: 99.91 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.8227 -0.7780 0.1773 1.0973 3.2134 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.1961 0.4029 2.969 0.00503 ** \n", "ratings$tagender 0.2495 0.4370 0.571 0.57131 \n", "ratings$gender 1.7866 0.2342 7.627 2.55e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.434 on 40 degrees of freedom\n", "Multiple R-squared: 0.8751,\tAdjusted R-squared: 0.8658 \n", "F-statistic: 93.44 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.8183 -0.3668 0.1817 1.0545 3.1273 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.45767 0.33871 4.304 0.000105 ***\n", "ratings$tagender 0.07282 0.36739 0.198 0.843874 \n", "ratings$gender 1.87272 0.19693 9.510 8.07e-12 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.206 on 40 degrees of freedom\n", "Multiple R-squared: 0.9167,\tAdjusted R-squared: 0.9104 \n", "F-statistic: 146.7 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7595 -0.7595 0.2405 1.2262 3.1508 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.10675 0.42540 2.602 0.0129 * \n", "ratings$tagender 0.06112 0.46141 0.132 0.8953 \n", "ratings$gender 1.84919 0.24733 7.477 4.1e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.514 on 40 degrees of freedom\n", "Multiple R-squared: 0.8593,\tAdjusted R-squared: 0.8487 \n", "F-statistic: 81.43 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.1464 -0.4829 -0.1464 1.3393 3.0126 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.5017 0.3722 4.035 0.000239 ***\n", "ratings$tagender 0.1715 0.4037 0.425 0.673143 \n", "ratings$gender 1.9874 0.2164 9.185 2.11e-11 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.325 on 40 degrees of freedom\n", "Multiple R-squared: 0.9121,\tAdjusted R-squared: 0.9055 \n", "F-statistic: 138.4 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.2501 -0.5044 -0.1543 1.2728 3.0500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.2543 0.4133 3.034 0.00422 ** \n", "ratings$tagender 0.3501 0.4483 0.781 0.43942 \n", "ratings$gender 1.9500 0.2403 8.114 5.54e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.472 on 40 degrees of freedom\n", "Multiple R-squared: 0.8887,\tAdjusted R-squared: 0.8803 \n", "F-statistic: 106.4 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.8037 -0.8037 0.4014 1.2169 3.2929 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.2051 0.4503 2.676 0.0107 * \n", "ratings$tagender -0.1845 0.4884 -0.378 0.7076 \n", "ratings$gender 1.8915 0.2618 7.225 9.14e-09 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.603 on 40 degrees of freedom\n", "Multiple R-squared: 0.8447,\tAdjusted R-squared: 0.8331 \n", "F-statistic: 72.53 on 3 and 40 DF, p-value: 3.152e-16\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.99601 -0.38870 0.00399 0.94702 3.11394 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 1.2787 0.3904 3.275 0.00219 ** \n", "ratings$tagender 0.2239 0.4235 0.529 0.59994 \n", "ratings$gender 1.8861 0.2270 8.309 3.03e-10 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.39 on 40 degrees of freedom\n", "Multiple R-squared: 0.8921,\tAdjusted R-squared: 0.884 \n", "F-statistic: 110.2 on 3 and 40 DF, p-value: < 2.2e-16\n", "\n" ] } ], "source": [ "# adjust for the identified gender, the true gender, and the student's gender\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ \n", " ratings$taidgender \n", " + ratings$tagender \n", " + ratings$gender \n", " ) # include the intercept \n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0833 -0.5833 0.3636 0.4167 1.1250 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.58333 0.30401 15.076 <2e-16 ***\n", "as.factor(ratings$group)4 0.05303 0.43960 0.121 0.905 \n", "as.factor(ratings$group)5 -0.70833 0.48068 -1.474 0.149 \n", "as.factor(ratings$group)6 -0.50000 0.42994 -1.163 0.252 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.053 on 39 degrees of freedom\n", "Multiple R-squared: 0.08828,\tAdjusted R-squared: 0.01815 \n", "F-statistic: 1.259 on 3 and 39 DF, p-value: 0.3019\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0833 -0.5833 0.3636 0.4167 1.1250 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.58333 0.30401 15.076 <2e-16 ***\n", "as.factor(ratings$group)4 0.05303 0.43960 0.121 0.905 \n", "as.factor(ratings$group)5 -0.70833 0.48068 -1.474 0.149 \n", "as.factor(ratings$group)6 -0.50000 0.42994 -1.163 0.252 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.053 on 39 degrees of freedom\n", "Multiple R-squared: 0.08828,\tAdjusted R-squared: 0.01815 \n", "F-statistic: 1.259 on 3 and 39 DF, p-value: 0.3019\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6250 -0.4375 0.3333 0.7500 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.2500 0.3056 13.908 <2e-16 ***\n", "as.factor(ratings$group)4 -0.1591 0.4419 -0.360 0.721 \n", "as.factor(ratings$group)5 -0.6250 0.4832 -1.294 0.203 \n", "as.factor(ratings$group)6 -0.5833 0.4322 -1.350 0.185 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.059 on 39 degrees of freedom\n", "Multiple R-squared: 0.06604,\tAdjusted R-squared: -0.005806 \n", "F-statistic: 0.9192 on 3 and 39 DF, p-value: 0.4406\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.75000 -0.25000 -0.09091 0.62500 1.50000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.2500 0.2911 14.600 <2e-16 ***\n", "as.factor(ratings$group)4 -0.1591 0.4209 -0.378 0.7075 \n", "as.factor(ratings$group)5 -0.5000 0.4603 -1.086 0.2840 \n", "as.factor(ratings$group)6 -0.7500 0.4117 -1.822 0.0762 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.008 on 39 degrees of freedom\n", "Multiple R-squared: 0.09097,\tAdjusted R-squared: 0.02104 \n", "F-statistic: 1.301 on 3 and 39 DF, p-value: 0.2878\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3333 0.2500 0.6667 1.5000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.3333 0.3060 14.161 <2e-16 ***\n", "as.factor(ratings$group)4 -0.2424 0.4425 -0.548 0.5869 \n", "as.factor(ratings$group)5 -0.8333 0.4839 -1.722 0.0929 . \n", "as.factor(ratings$group)6 -0.5833 0.4328 -1.348 0.1855 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.06 on 39 degrees of freedom\n", "Multiple R-squared: 0.08519,\tAdjusted R-squared: 0.01482 \n", "F-statistic: 1.211 on 3 and 39 DF, p-value: 0.3187\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.12500 -0.82955 0.09091 1.00000 1.87500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.00000 0.30674 13.040 8.45e-16 ***\n", "as.factor(ratings$group)4 -0.09091 0.44355 -0.205 0.8387 \n", "as.factor(ratings$group)5 -0.87500 0.48500 -1.804 0.0789 . \n", "as.factor(ratings$group)6 -0.25000 0.43380 -0.576 0.5677 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.063 on 39 degrees of freedom\n", "Multiple R-squared: 0.08617,\tAdjusted R-squared: 0.01587 \n", "F-statistic: 1.226 on 3 and 39 DF, p-value: 0.3133\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.4167 0.2500 0.5833 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.4167 0.3267 13.519 2.66e-16 ***\n", "as.factor(ratings$group)4 -0.5076 0.4724 -1.074 0.289 \n", "as.factor(ratings$group)5 -0.7917 0.5166 -1.533 0.133 \n", "as.factor(ratings$group)6 -0.6667 0.4620 -1.443 0.157 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.132 on 39 degrees of freedom\n", "Multiple R-squared: 0.07339,\tAdjusted R-squared: 0.002112 \n", "F-statistic: 1.03 on 3 and 39 DF, p-value: 0.3901\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3636 0.5833 0.6364 1.2500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.3333 0.2983 14.529 <2e-16 ***\n", "as.factor(ratings$group)4 0.0303 0.4313 0.070 0.9443 \n", "as.factor(ratings$group)5 -0.5833 0.4716 -1.237 0.2235 \n", "as.factor(ratings$group)6 -0.9167 0.4218 -2.173 0.0359 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.033 on 39 degrees of freedom\n", "Multiple R-squared: 0.15,\tAdjusted R-squared: 0.08465 \n", "F-statistic: 2.295 on 3 and 39 DF, p-value: 0.09292\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.62500 -0.52083 0.08333 1.00000 1.58333 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 3.91667 0.33876 11.562 3.6e-14 ***\n", "as.factor(ratings$group)4 0.08333 0.48985 0.170 0.866 \n", "as.factor(ratings$group)5 -0.29167 0.53563 -0.545 0.589 \n", "as.factor(ratings$group)6 -0.50000 0.47909 -1.044 0.303 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.174 on 39 degrees of freedom\n", "Multiple R-squared: 0.0441,\tAdjusted R-squared: -0.02943 \n", "F-statistic: 0.5997 on 3 and 39 DF, p-value: 0.619\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6250 -0.3750 -0.1818 0.6667 1.5833 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.3333 0.2738 15.824 <2e-16 ***\n", "as.factor(ratings$group)4 -0.1515 0.3960 -0.383 0.704 \n", "as.factor(ratings$group)5 -0.7083 0.4330 -1.636 0.110 \n", "as.factor(ratings$group)6 -0.9167 0.3873 -2.367 0.023 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9486 on 39 degrees of freedom\n", "Multiple R-squared: 0.1569,\tAdjusted R-squared: 0.09209 \n", "F-statistic: 2.42 on 3 and 39 DF, p-value: 0.08061\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.5644 0.4167 0.6439 1.4546 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.1667 0.2880 14.469 <2e-16 ***\n", "as.factor(ratings$group)4 -0.6212 0.4164 -1.492 0.144 \n", "as.factor(ratings$group)5 -0.4167 0.4553 -0.915 0.366 \n", "as.factor(ratings$group)6 -0.5833 0.4073 -1.432 0.160 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9976 on 39 degrees of freedom\n", "Multiple R-squared: 0.06872,\tAdjusted R-squared: -0.002918 \n", "F-statistic: 0.9593 on 3 and 39 DF, p-value: 0.4216\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0000 -0.4546 0.2500 0.4167 1.2500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.5833 0.2611 17.557 <2e-16 ***\n", "as.factor(ratings$group)4 -0.1288 0.3775 -0.341 0.7348 \n", "as.factor(ratings$group)5 -0.5833 0.4128 -1.413 0.1655 \n", "as.factor(ratings$group)6 -0.8333 0.3692 -2.257 0.0297 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9043 on 39 degrees of freedom\n", "Multiple R-squared: 0.1407,\tAdjusted R-squared: 0.0746 \n", "F-statistic: 2.129 on 3 and 39 DF, p-value: 0.1122\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.1250 -0.3636 0.1667 0.7500 1.1667 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.2500 0.2899 14.659 <2e-16 ***\n", "as.factor(ratings$group)4 0.1136 0.4192 0.271 0.788 \n", "as.factor(ratings$group)5 -0.1250 0.4584 -0.273 0.787 \n", "as.factor(ratings$group)6 -0.4167 0.4100 -1.016 0.316 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.004 on 39 degrees of freedom\n", "Multiple R-squared: 0.04435,\tAdjusted R-squared: -0.02916 \n", "F-statistic: 0.6033 on 3 and 39 DF, p-value: 0.6168\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.5000 -0.5000 0.5000 0.6364 1.6364 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.4167 0.3534 12.496 3.26e-15 ***\n", "as.factor(ratings$group)4 -1.0530 0.5111 -2.060 0.0461 * \n", "as.factor(ratings$group)5 -0.9167 0.5588 -1.640 0.1090 \n", "as.factor(ratings$group)6 -0.9167 0.4998 -1.834 0.0743 . \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.224 on 39 degrees of freedom\n", "Multiple R-squared: 0.1229,\tAdjusted R-squared: 0.05539 \n", "F-statistic: 1.821 on 3 and 39 DF, p-value: 0.1593\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group))\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3333 0.2500 0.6667 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "(Intercept) 4.3333 0.2952 14.678 <2e-16 ***\n", "as.factor(ratings$group)4 -0.3333 0.4269 -0.781 0.440 \n", "as.factor(ratings$group)5 -0.7083 0.4668 -1.517 0.137 \n", "as.factor(ratings$group)6 -0.5833 0.4175 -1.397 0.170 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.023 on 39 degrees of freedom\n", "Multiple R-squared: 0.07095,\tAdjusted R-squared: -0.0005127 \n", "F-statistic: 0.9928 on 3 and 39 DF, p-value: 0.4063\n", "\n" ] } ], "source": [ "# model using \"group\" variable; note how R deals with the fact that the levels are dependent when there is\n", "# an intercept.\n", "# use as.factor() to get R to treat the variable as a dummy\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group) ) # include the constant term for now\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0833 -0.5833 0.3636 0.4167 1.1250 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.5833 0.3040 15.08 < 2e-16 ***\n", "as.factor(ratings$group)4 4.6364 0.3175 14.60 < 2e-16 ***\n", "as.factor(ratings$group)5 3.8750 0.3723 10.41 8.16e-13 ***\n", "as.factor(ratings$group)6 4.0833 0.3040 13.43 3.28e-16 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.053 on 39 degrees of freedom\n", "Multiple R-squared: 0.9492,\tAdjusted R-squared: 0.944 \n", "F-statistic: 182.3 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0833 -0.5833 0.3636 0.4167 1.1250 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.5833 0.3040 15.08 < 2e-16 ***\n", "as.factor(ratings$group)4 4.6364 0.3175 14.60 < 2e-16 ***\n", "as.factor(ratings$group)5 3.8750 0.3723 10.41 8.16e-13 ***\n", "as.factor(ratings$group)6 4.0833 0.3040 13.43 3.28e-16 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.053 on 39 degrees of freedom\n", "Multiple R-squared: 0.9492,\tAdjusted R-squared: 0.944 \n", "F-statistic: 182.3 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6250 -0.4375 0.3333 0.7500 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.2500 0.3056 13.908 < 2e-16 ***\n", "as.factor(ratings$group)4 4.0909 0.3192 12.818 1.46e-15 ***\n", "as.factor(ratings$group)5 3.6250 0.3743 9.686 6.27e-12 ***\n", "as.factor(ratings$group)6 3.6667 0.3056 11.999 1.15e-14 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.059 on 39 degrees of freedom\n", "Multiple R-squared: 0.9385,\tAdjusted R-squared: 0.9322 \n", "F-statistic: 148.9 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.75000 -0.25000 -0.09091 0.62500 1.50000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.2500 0.2911 14.60 < 2e-16 ***\n", "as.factor(ratings$group)4 4.0909 0.3040 13.46 3.10e-16 ***\n", "as.factor(ratings$group)5 3.7500 0.3565 10.52 6.00e-13 ***\n", "as.factor(ratings$group)6 3.5000 0.2911 12.02 1.08e-14 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.008 on 39 degrees of freedom\n", "Multiple R-squared: 0.9433,\tAdjusted R-squared: 0.9375 \n", "F-statistic: 162.3 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3333 0.2500 0.6667 1.5000 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.3333 0.3060 14.161 < 2e-16 ***\n", "as.factor(ratings$group)4 4.0909 0.3196 12.799 1.53e-15 ***\n", "as.factor(ratings$group)5 3.5000 0.3748 9.339 1.71e-11 ***\n", "as.factor(ratings$group)6 3.7500 0.3060 12.254 6.00e-15 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.06 on 39 degrees of freedom\n", "Multiple R-squared: 0.9391,\tAdjusted R-squared: 0.9329 \n", "F-statistic: 150.4 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.12500 -0.82955 0.09091 1.00000 1.87500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.0000 0.3067 13.040 8.45e-16 ***\n", "as.factor(ratings$group)4 3.9091 0.3204 12.201 6.87e-15 ***\n", "as.factor(ratings$group)5 3.1250 0.3757 8.318 3.58e-10 ***\n", "as.factor(ratings$group)6 3.7500 0.3067 12.225 6.46e-15 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.063 on 39 degrees of freedom\n", "Multiple R-squared: 0.9324,\tAdjusted R-squared: 0.9254 \n", "F-statistic: 134.4 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.4167 0.2500 0.5833 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.4167 0.3267 13.52 2.66e-16 ***\n", "as.factor(ratings$group)4 3.9091 0.3412 11.46 4.75e-14 ***\n", "as.factor(ratings$group)5 3.6250 0.4001 9.06 3.88e-11 ***\n", "as.factor(ratings$group)6 3.7500 0.3267 11.48 4.48e-14 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.132 on 39 degrees of freedom\n", "Multiple R-squared: 0.9312,\tAdjusted R-squared: 0.9241 \n", "F-statistic: 132 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3636 0.5833 0.6364 1.2500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.3333 0.2982 14.53 < 2e-16 ***\n", "as.factor(ratings$group)4 4.3636 0.3115 14.01 < 2e-16 ***\n", "as.factor(ratings$group)5 3.7500 0.3653 10.27 1.21e-12 ***\n", "as.factor(ratings$group)6 3.4167 0.2982 11.46 4.76e-14 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.033 on 39 degrees of freedom\n", "Multiple R-squared: 0.9429,\tAdjusted R-squared: 0.937 \n", "F-statistic: 161 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.62500 -0.52083 0.08333 1.00000 1.58333 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 3.9167 0.3388 11.562 3.60e-14 ***\n", "as.factor(ratings$group)4 4.0000 0.3538 11.305 7.10e-14 ***\n", "as.factor(ratings$group)5 3.6250 0.4149 8.737 1.01e-10 ***\n", "as.factor(ratings$group)6 3.4167 0.3388 10.086 2.01e-12 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.174 on 39 degrees of freedom\n", "Multiple R-squared: 0.9185,\tAdjusted R-squared: 0.9101 \n", "F-statistic: 109.9 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.6250 -0.3750 -0.1818 0.6667 1.5833 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.3333 0.2738 15.82 < 2e-16 ***\n", "as.factor(ratings$group)4 4.1818 0.2860 14.62 < 2e-16 ***\n", "as.factor(ratings$group)5 3.6250 0.3354 10.81 2.70e-13 ***\n", "as.factor(ratings$group)6 3.4167 0.2738 12.48 3.42e-15 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9486 on 39 degrees of freedom\n", "Multiple R-squared: 0.9497,\tAdjusted R-squared: 0.9446 \n", "F-statistic: 184.2 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.5644 0.4167 0.6439 1.4546 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.1667 0.2880 14.47 < 2e-16 ***\n", "as.factor(ratings$group)4 3.5455 0.3008 11.79 1.99e-14 ***\n", "as.factor(ratings$group)5 3.7500 0.3527 10.63 4.38e-13 ***\n", "as.factor(ratings$group)6 3.5833 0.2880 12.44 3.72e-15 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9976 on 39 degrees of freedom\n", "Multiple R-squared: 0.9405,\tAdjusted R-squared: 0.9344 \n", "F-statistic: 154 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.0000 -0.4546 0.2500 0.4167 1.2500 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.5833 0.2611 17.56 < 2e-16 ***\n", "as.factor(ratings$group)4 4.4545 0.2727 16.34 < 2e-16 ***\n", "as.factor(ratings$group)5 4.0000 0.3197 12.51 3.14e-15 ***\n", "as.factor(ratings$group)6 3.7500 0.2611 14.37 < 2e-16 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 0.9043 on 39 degrees of freedom\n", "Multiple R-squared: 0.9601,\tAdjusted R-squared: 0.956 \n", "F-statistic: 234.5 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.1250 -0.3636 0.1667 0.7500 1.1667 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.2500 0.2899 14.66 < 2e-16 ***\n", "as.factor(ratings$group)4 4.3636 0.3028 14.41 < 2e-16 ***\n", "as.factor(ratings$group)5 4.1250 0.3551 11.62 3.11e-14 ***\n", "as.factor(ratings$group)6 3.8333 0.2899 13.22 5.43e-16 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.004 on 39 degrees of freedom\n", "Multiple R-squared: 0.9494,\tAdjusted R-squared: 0.9443 \n", "F-statistic: 183.1 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.5000 -0.5000 0.5000 0.6364 1.6364 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.4167 0.3534 12.496 3.26e-15 ***\n", "as.factor(ratings$group)4 3.3636 0.3692 9.112 3.33e-11 ***\n", "as.factor(ratings$group)5 3.5000 0.4329 8.086 7.28e-10 ***\n", "as.factor(ratings$group)6 3.5000 0.3534 9.903 3.37e-12 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.224 on 39 degrees of freedom\n", "Multiple R-squared: 0.9117,\tAdjusted R-squared: 0.9026 \n", "F-statistic: 100.7 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ as.factor(ratings$group) - \n", " 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.7500 -0.3333 0.2500 0.6667 1.3750 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "as.factor(ratings$group)3 4.3333 0.2952 14.68 < 2e-16 ***\n", "as.factor(ratings$group)4 4.0000 0.3084 12.97 9.99e-16 ***\n", "as.factor(ratings$group)5 3.6250 0.3616 10.03 2.38e-12 ***\n", "as.factor(ratings$group)6 3.7500 0.2952 12.70 1.95e-15 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.023 on 39 degrees of freedom\n", "Multiple R-squared: 0.943,\tAdjusted R-squared: 0.9372 \n", "F-statistic: 161.4 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n" ] } ], "source": [ "# model using \"group\" variable.\n", "# use as.factor() to get R to treat the variable as a dummy\n", "# Omit the intercept\n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ as.factor(ratings$group) - 1) # omit the intercept\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.6610 -0.5996 0.3390 0.6526 2.8263 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.9249 0.8741 4.490 6.16e-05 ***\n", "ratings$tagender 0.3136 0.4103 0.764 0.44927 \n", "ratings$gender 2.1737 0.2303 9.438 1.28e-11 ***\n", "ratings$taidgender:ratings$gender -1.8126 0.5903 -3.071 0.00388 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.33 on 39 degrees of freedom\n", "Multiple R-squared: 0.9191,\tAdjusted R-squared: 0.9108 \n", "F-statistic: 110.7 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] respect\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.6610 -0.5996 0.3390 0.6526 2.8263 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.9249 0.8741 4.490 6.16e-05 ***\n", "ratings$tagender 0.3136 0.4103 0.764 0.44927 \n", "ratings$gender 2.1737 0.2303 9.438 1.28e-11 ***\n", "ratings$taidgender:ratings$gender -1.8126 0.5903 -3.071 0.00388 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.33 on 39 degrees of freedom\n", "Multiple R-squared: 0.9191,\tAdjusted R-squared: 0.9108 \n", "F-statistic: 110.7 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] caring\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.9971 -0.5246 0.1475 0.9204 3.0015 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.3344 0.8946 3.727 0.000613 ***\n", "ratings$tagender 0.1541 0.4200 0.367 0.715698 \n", "ratings$gender 1.9985 0.2357 8.478 2.21e-10 ***\n", "ratings$taidgender:ratings$gender -1.4805 0.6041 -2.451 0.018852 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.361 on 39 degrees of freedom\n", "Multiple R-squared: 0.8984,\tAdjusted R-squared: 0.888 \n", "F-statistic: 86.22 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] enthusiastic\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.07312 -0.45347 0.08235 0.54653 2.96581 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.37947 0.85239 3.965 0.000304 ***\n", "ratings$tagender -0.00237 0.40014 -0.006 0.995304 \n", "ratings$gender 2.03656 0.22460 9.067 3.8e-11 ***\n", "ratings$taidgender:ratings$gender -1.49838 0.57562 -2.603 0.012999 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.297 on 39 degrees of freedom\n", "Multiple R-squared: 0.9063,\tAdjusted R-squared: 0.8967 \n", "F-statistic: 94.33 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] communicate\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.1930 -0.5344 -0.1024 0.9401 2.9913 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.4949 0.8751 3.994 0.000279 ***\n", "ratings$tagender 0.1756 0.4108 0.427 0.671379 \n", "ratings$gender 2.0087 0.2306 8.711 1.09e-10 ***\n", "ratings$taidgender:ratings$gender -1.5768 0.5910 -2.668 0.011054 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.331 on 39 degrees of freedom\n", "Multiple R-squared: 0.904,\tAdjusted R-squared: 0.8942 \n", "F-statistic: 91.82 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] helpful\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-2.76834 -0.78077 -0.05169 1.03079 3.11583 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.4680 0.8227 4.215 0.000143 ***\n", "ratings$tagender 0.3743 0.3862 0.969 0.338415 \n", "ratings$gender 1.8842 0.2168 8.691 1.16e-10 ***\n", "ratings$taidgender:ratings$gender -1.6748 0.5556 -3.015 0.004509 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.252 on 39 degrees of freedom\n", "Multiple R-squared: 0.9062,\tAdjusted R-squared: 0.8965 \n", "F-statistic: 94.16 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] feedback\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.2539 -0.5203 0.1990 0.8016 2.9285 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.13717 0.90379 3.471 0.00128 ** \n", "ratings$tagender -0.05545 0.42427 -0.131 0.89669 \n", "ratings$gender 2.12695 0.23815 8.931 5.68e-11 ***\n", "ratings$taidgender:ratings$gender -1.40768 0.61033 -2.306 0.02649 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.375 on 39 degrees of freedom\n", "Multiple R-squared: 0.8985,\tAdjusted R-squared: 0.888 \n", "F-statistic: 86.27 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] prompt\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.05733 -0.37746 -0.05733 0.69819 3.00916 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.28125 0.85603 5.001 1.25e-05 ***\n", "ratings$tagender 0.07565 0.40185 0.188 0.85166 \n", "ratings$gender 1.99084 0.22556 8.826 7.76e-11 ***\n", "ratings$taidgender:ratings$gender -1.97028 0.57808 -3.408 0.00153 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.302 on 39 degrees of freedom\n", "Multiple R-squared: 0.9093,\tAdjusted R-squared: 0.9 \n", "F-statistic: 97.72 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] consistent\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.02803 -0.47156 0.07008 0.95445 3.03504 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.03236 0.89781 3.377 0.00167 ** \n", "ratings$tagender 0.09811 0.42146 0.233 0.81714 \n", "ratings$gender 1.96496 0.23657 8.306 3.72e-10 ***\n", "ratings$taidgender:ratings$gender -1.37153 0.60629 -2.262 0.02934 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.366 on 39 degrees of freedom\n", "Multiple R-squared: 0.8896,\tAdjusted R-squared: 0.8783 \n", "F-statistic: 78.57 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] fair\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.04457 -0.50273 0.05483 0.86143 2.93072 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.48160 0.71829 4.847 2.03e-05 ***\n", "ratings$tagender -0.09399 0.33719 -0.279 0.78191 \n", "ratings$gender 2.06928 0.18927 10.933 1.93e-13 ***\n", "ratings$taidgender:ratings$gender -1.51172 0.48506 -3.117 0.00343 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.093 on 39 degrees of freedom\n", "Multiple R-squared: 0.9333,\tAdjusted R-squared: 0.9264 \n", "F-statistic: 136.4 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] responsive\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.12009 -0.73658 -0.08528 0.91472 3.04229 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 4.3317 0.8308 5.214 6.37e-06 ***\n", "ratings$tagender -0.2047 0.3900 -0.525 0.602692 \n", "ratings$gender 2.1624 0.2189 9.878 3.62e-12 ***\n", "ratings$taidgender:ratings$gender -2.4088 0.5610 -4.293 0.000113 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.264 on 39 degrees of freedom\n", "Multiple R-squared: 0.9045,\tAdjusted R-squared: 0.8947 \n", "F-statistic: 92.3 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] praised\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.4200 -0.4200 0.2563 0.6541 2.8051 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.94806 0.76826 5.139 8.07e-06 ***\n", "ratings$tagender -0.03009 0.36065 -0.083 0.93393 \n", "ratings$gender 2.22503 0.20244 10.991 1.65e-13 ***\n", "ratings$taidgender:ratings$gender -1.82722 0.51881 -3.522 0.00111 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.169 on 39 degrees of freedom\n", "Multiple R-squared: 0.9333,\tAdjusted R-squared: 0.9265 \n", "F-statistic: 136.5 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] knowledgeable\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.5392 -0.5392 -0.1096 0.7076 2.7990 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.8393 0.8661 4.433 7.35e-05 ***\n", "ratings$tagender 0.1371 0.4066 0.337 0.73782 \n", "ratings$gender 2.2010 0.2282 9.645 7.06e-12 ***\n", "ratings$taidgender:ratings$gender -1.9308 0.5849 -3.301 0.00206 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.318 on 39 degrees of freedom\n", "Multiple R-squared: 0.913,\tAdjusted R-squared: 0.9041 \n", "F-statistic: 102.3 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] clear\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.2322 -0.5763 0.1429 0.9817 3.2590 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.5171 0.9856 3.569 0.000971 ***\n", "ratings$tagender -0.3750 0.4627 -0.811 0.422506 \n", "ratings$gender 2.1161 0.2597 8.148 6.01e-10 ***\n", "ratings$taidgender:ratings$gender -1.7269 0.6656 -2.595 0.013273 * \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.499 on 39 degrees of freedom\n", "Multiple R-squared: 0.8676,\tAdjusted R-squared: 0.854 \n", "F-statistic: 63.88 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n", "[1] overall\n", "\n", "Call:\n", "lm(formula = unlist(ratings[ch]) ~ ratings$taidgender + ratings$tagender + \n", " ratings$gender + ratings$taidgender * ratings$gender - 1)\n", "\n", "Residuals:\n", " Min 1Q Median 3Q Max \n", "-3.2292 -0.4522 0.0530 0.7969 2.9114 \n", "\n", "Coefficients:\n", " Estimate Std. Error t value Pr(>|t|) \n", "ratings$taidgender 3.36381 0.84848 3.965 0.000304 ***\n", "ratings$tagender 0.05205 0.39831 0.131 0.896710 \n", "ratings$gender 2.08856 0.22357 9.342 1.7e-11 ***\n", "ratings$taidgender:ratings$gender -1.55738 0.57298 -2.718 0.009747 ** \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "Residual standard error: 1.291 on 39 degrees of freedom\n", "Multiple R-squared: 0.9093,\tAdjusted R-squared: 0.8999 \n", "F-statistic: 97.7 on 4 and 39 DF, p-value: < 2.2e-16\n", "\n" ] } ], "source": [ "# model using taidgender variable and adjust for student gender and an interaction between\n", "# student gender and taid gender\n", "# Omit the intercept\n", "# adjust for \n", "for (ch in character) {\n", " reg_mod <- lm(unlist(ratings[ch]) ~ \n", " ratings$taidgender + \n", " ratings$tagender + \n", " ratings$gender +\n", " ratings$taidgender*ratings$gender\n", " - 1 ) # omit intercept\n", " print(ch, quote=F)\n", " print(summary(reg_mod))\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "\n", "You might have noticed that we can get an enormous range of $p$-values for the effect of perceived gender,\n", "depending on the other terms we choose to keep in the model.\n", "In particular, omitting the intercept term will make many things appear significant, since\n", "the data have a large mean, which the other variables cannot capture well.\n", "\n", "Analyses like these are rather _ad hoc_—and they don't match the underlying experiment in the first place!\n", "\n", "_Why_ should a student's rating of an instructor be a linear combination of these things, with additive \"noise\"?\n", "\n", "Moreover, if we perform a variety of analyses and then just keep the one we like best,\n", "the $p$-values and other statistical summaries become meaningless.\n", "This can lead to \"significance hunting\" or \"significance mining.\"\n", "Methods that can control for the selection process—deciding which terms to keep in\n", "the model on the basis of the data—are a current area of active research in Statistics.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment\n", "\n", "The directory /Data has a .csv file called \"ModuloRimbSpesePrestazOccasGratuiteRisoluz.49E2013_NonRes_Ingl.csv\"\n", "downloaded from http://nats.sct.gob.mx/english/go-to-tables/table-8-domestic-passenger-travel/table-8-1-domestic-passenger-travel-by-mode/\n", "\n", "The data are reported passenger miles annually in the U.S.A. from 1990–2012 for a variety of modes of transportation.\n", "The data are _time series_: they give a sequence of observations of the same variables at different times.\n", "\n", "Please provide your _code_ and your _results_ for the following.\n", "The data editing to remove variables, transpose tables, etc., should be done using R commands—not using\n", "point-and-click tools. _Do not use Excel for any part of this assignment_.\n", "\n", "1. Data munging. The .csv file has some notes at the bottom; those notes will make it hard to load the data file into R.\n", " 1. Make a \"clean\" version of the table without the notes and with appropriate column headers. It is OK to use a text editor to do this.\n", " 1. Read the data into an R dataframe and print summary statistics of the dataframe\n", " 1. Transpose the table so that the columns are modes of transportation and the rows are years, using R commands.\n", " 1. Use R commands to delete any variables for which data are unavailable.\n", " 1. Print the transposed data, after deleting those variables.\n", " 1. Visually inspect the data to look for linear dependence. Are any of the variables obviously linearly dependent? If so, which? \n", " 1. Given what the table represents, _should_ there be linear dependencies among the rows?\n", "1. Data modeling\n", " 1. Fit a regression model to the data by ordinary least squares using lm(), modeling bus travel as a linear function of rail travel, air travel, (air travel)2, and personal vehicle travel. \n", " 1. Discuss the R output for the fitted model. What do the numbers mean? What do you conclude?\n", "1. Testing\n", " 1. Because these data are a time series, we expect trends. Variables with trends will tend to be associated with each other, even if there is no real connection between them. A common first step in analyzing time series is to \"pre-whiten\" the data, for instance, by removing trends. Construct new variables from the old variables by taking year-to-year differences. For example, the first element of the new \"rail\" variable should be the amount of rail travel in 1991, minus the amount in 1990. See the documentation for the R function diff().\n", " 1. Develop a permutation test to check whether year-over-year differences in rail travel are associated with year-over-year differences in motorcycle travel. As a test statistic, use the _Pearson correlation coefficient_ $r_{xy}$. \n", " 1. Is the association positive or negative?\n", " 1. Use $10^5$ random permutations to estimate the $p$-value of the hypothesis of no association.\n", " 1. Find a 95% upper confidence bound on the true $p$-value from the estimated $p$-value.\n", " 1. Explain clearly what the null hypothesis means and how to interpret the estimated $p$-value. How strong is the evidence for association? What would association signify?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression\n", "\n", "[TO DO]" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.1.3" } }, "nbformat": 4, "nbformat_minor": 0 }