{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is based partly on Freedman, D.A., 2009. [_Statistical Models: Theory and Practice_, Revised Edition](http://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850/), Cambridge University Press.\n", "\n", "Statistical models, and regression in particular, are used primarily for three purposes:\n", "\n", "1. _Description_: to summarize data\n", "2. _Prediction_: to predict future data\n", "3. _Causal Inference_: to predict what would happen in response to an intervention\n", "\n", "It is straightforward to check whether a regression model is a good summary of _existing_ data, although there is some subtlety in determining whether the summary is _good enough_. How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.\n", "\n", "Prediction is harder than description because it involves _extrapolation_: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., _stationary_) or are its properties changing with time?\n", "\n", "However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is _confounding_, especially when the data arise from _observational studies_ rather than _randomized, controlled experiments_. (_Natural experiments_ lie somewhere in between; there are few that approach the utility of\n", "a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)\n", "\n", "_Confounding_ happens when one factor or variable manifests in the outcome in a way that cannot be distinguished from the _treatment_.\n", "\n", "_Stratification_ (e.g., _cross tabulation_) can help reduce confounding. So can modeling—in some cases, but not in others. \n", "For modeling to help, it is generally necessary for the structure of the model to \n", "correspond to how the data were actually generated.\n", "Unfortunately, most models in science, especially social science, are chosen out of habit or computational\n", "convenience, not because they have a basis in the science itself.\n", "This often produces misleading results, and the misleading impression that those results have small\n", "uncertainties." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some notation\n", "\n", "If $\\{x_i\\}_{i=1}^n$ is a list of numbers, \n", "$$\n", "\\bar{x} \\equiv \\frac{1}{n} \\sum_{i=1}^n\n", "$$\n", "is the _mean_ of the list;\n", "$$\n", "\\mbox{var } x \\equiv \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2\n", "$$\n", "is the _variance_ of the list;\n", "$$\n", "s_x \\equiv \\sqrt{\\mbox{var } x}\n", "$$\n", "is the SD or _standard deviation_ of the list;\n", "and\n", "$$\n", " z_i \\equiv \\frac{x_i - \\bar{x}}{s_x}\n", "$$\n", "is _$x_i$ in standard units_.\n", "With $\\bar{y}$ and $s_y$ defined analogously for the list $\\{y_i\\}_{i=1}^n$, and if $s_x$\n", "and $s_y$ are nonzero,\n", "\n", "$$\n", " r_{xy} \\equiv \\frac{1}{n} \\sum_{i=1} \\frac{x_i - \\bar{x}}{s_x} \\cdot \\frac{y_i-\\bar{y}}{s_y}\n", "$$\n", "is the _correlation of $x$ and $y$_.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bivariate regression\n", "\n", "See [SticiGui: Regression](http://www.stat.berkeley.edu/~stark/SticiGui/Text/regression.htm)\n", "\n", "Suppose we observe pairs $\\{(x_i, y_i)\\}_{i=1}^n$.\n", "What straight line $y = a + bx$ comes closest to fitting these data, in the least-squares sense?\n", "That is, what values $a$ and $b$ minimize\n", "$$\n", " \\mbox{mean squared residual} = \\mbox{MSR} = \\equiv \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - (a + bx_i) \\right )^2?\n", "$$\n", "The solution is $b = r_{xy} \\frac{s_y}{s_x}$ and $a = \\bar{y} - b \\bar{x}$.\n", "\n", "_Proof._\n", "The mean squared residual is \n", "$$\n", " \\mbox{MSR} = \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - (a + bx_i) \\right )^2 = \n", " \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i^2 - 2y_i(a + bx_i) + (a+bx_i)^2 \\right ) =\n", " \\frac{1}{n}\\sum_{i=1}^n \\left ( y_i^2 - 2y_i(a + bx_i) + a^2+2abx_i + b^2x_i^2 \\right ).\n", "$$\n", "\n", "This is quadratic in both $a$ and $b$, and has positive leading coefficients in both.\n", "Hence, its minimum occurs at a stationary point with respect to both $a$ and $b$.\n", "We can differentiate inside the sum and solve for the stationary point:\n", "\n", "$$\n", " 0 = \\frac{\\partial \\mbox{MSR}}{\\partial a} = \\frac{1}{n} \\sum_{i=1}^n 2(y_i - (a+b x_i))\n", " = 2 (\\bar{y} - a - b \\bar{x}).\n", "$$\n", "\n", "Hence, the optimal $a$ solves $a = \\bar{y} - b \\bar{x}$.\n", "Substituting this into the expression for mean squared residual gives\n", "\n", "$$ \n", "\\mbox{MSR} = \\frac{1}{n} \\sum_{i=1}^n \\left ( y_i - (\\bar{y} - b \\bar{x} + bx_i) \\right )^2 \n", "= \\frac{1}{n} \\sum_{i=1}^n \\left ( y_i - (\\bar{y} - b (\\bar{x} - x_i)) \\right )^2.\n", "$$\n", "\n", "Differentiating with respect to $b$ and finding the stationary point gives\n", "$$\n", " 0 = \\frac{\\partial \\mbox{MSR}}{\\partial b} = \n", " \\frac{1}{n} \\sum_{i=1}^n 2 \\left ( y_i - (\\bar{y} - b (\\bar{x} - x_i) ) \\right )(\\bar{x} - x_i)\n", " = \\frac{1}{n} \\sum_{i=1}^n ( y_i - \\bar{y})(\\bar{x} - x_i) + \\frac{b}{n} \\sum_{i=1}^n (\\bar{x}-x_i)^2\n", " = s_x s_y r_{xy} - b s_x^2,\n", "$$\n", "\n", "i.e., \n", "\n", "$$\n", "b = \\frac{s_x s_y r_{xy}}{s_x^2} = r_{xy}\\frac{s_y}{s_x}.\n", "$$\n", "\n", "The mean of the squared residuals from the regression line\n", "has a simple expression: \n", "\n", "$$\\frac{1}{n} \\sum_{i=1}^n e_i^2 = \n", "\\frac{1}{n}\\sum_{i=1}^n \\left ( y_i - \\bar{y} + r_{xy} \\frac{s_y}{s_x} \\bar{x} - r_{xy} \\frac{s_y}{s_x} x_i) \\right )^2\n", "$$\n", "$$\n", "= \\frac{1}{n} \\sum_{i=1}^n ( y_i - \\bar{y})^2 + \n", "2r_{xy} \\frac{s_y}{s_x} \\frac{1}{n} \\sum_{i=1}^n (y_i - \\bar{y})(\\bar{x} - x_i)\n", "+ r_{xy}^2 \\frac{s_y^2}{s_x^2} \\frac{1}{n} \\sum_{i=1}^n (\\bar{x} - x_i)^2\n", "$$\n", "$$\n", "= s_y^2 + 2r_{xy} \\frac{s_y}{s_x} (-r_{xy}s_x s_y) + r_{xy}^2 \\frac{s_y^2}{s_x^2} s_x^2\n", "$$\n", "$$\n", "= s_y^2 - 2 r^2 s_y^2 + r^2 s_y^2 \n", "$$\n", "$$\n", "= s_y^2 (1 - r_{xy}^2).\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some statistical terminology: measuring error\n", "\n", "Suppose that $\\theta \\in \\Re^p$ is a parameter, and let $\\hat{\\theta}$ be an estimator of $\\theta$.\n", "The _bias_ of $\\hat{\\theta}$ is\n", "\n", "$$\n", " \\mbox{bias } \\hat{\\theta} = {\\mathbb E}(\\hat{\\theta} - \\theta) = {\\mathbb E} \\hat{\\theta} - \\theta.\n", "$$\n", "\n", "An estimate $\\hat{\\theta}$ of a parameter $\\theta \\in \\Theta$ is _unbiased_ \n", "if ${\\mathbb E}_\\eta \\hat{\\theta} = \\eta$ for all $\\eta \\in \\Theta$.\n", "\n", "The _mean squared error_ (MSE) of an estimate $\\hat{\\theta}$ of $\\theta$ is\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - \\theta \\|^2.\n", "$$\n", "\n", "The _variance_ of $\\hat{\\theta}$ is\n", "\n", "$$\n", "\\mbox{var } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "__Proposition.__\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = \\mbox{var } \\hat{\\theta} + \\| \\mbox{bias} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "_Proof._\n", "\n", "$$\n", "\\| \\hat{\\theta} - \\theta \\|^2 = \n", "\\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta}) \\|^2\n", "= (\\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta}))'\n", "(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta} - (\\theta - {\\mathbb E} \\hat{\\theta})) \n", "$$\n", "$$ =\n", "(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})\n", "+ 2 (\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "+ (\\theta - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "$$\n", "$$\n", "=\n", "\\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2 + \n", "2(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta})\n", "+ \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2.\n", "$$\n", "\n", "Hence,\n", "\n", "$$\n", "\\mbox{MSE } \\hat{\\theta} = {\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2\n", "+ {\\mathbb E} \\left ( 2(\\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta}) \\right )\n", "+ {\\mathbb E} \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2\n", "$$\n", "$$\n", "=\n", "{\\mathbb E} \\| \\hat{\\theta} - {\\mathbb E} \\hat{\\theta} \\|^2\n", "+ 2 \\left ({\\mathbb E} \\hat{\\theta} - {\\mathbb E} \\hat{\\theta})'(\\theta - {\\mathbb E} \\hat{\\theta}) \\right )\n", "+ {\\mathbb E} \\| \\theta - {\\mathbb E} \\hat{\\theta} \\|^2\n", "$$\n", "$$\n", "= \\mbox{var } \\hat{\\theta} + \\| \\mbox{bias } \\hat{\\theta} \\|^2.\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Regression\n", "\n", "Multiple linear regression is a commonly used tool in most branches of science, as well as economics.\n", "It is frequently misinterpreted.\n", "\n", "We will start with an introduction to the linear algebra of constructing the least-squares estimate for\n", "linear regression, then discuss the features and limitations of regression, especially the limitations\n", "of drawing causal inferences from regression models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notation\n", "\n", "The basic relationship for linear regression is the equation\n", "\n", "$$\n", " Y = X \\beta + \\epsilon.\n", "$$\n", "\n", "Here, $Y$ is an $n$-vector of data, called the _dependent variable_, the _response_, the _explained variables_,\n", "the _modeled variables_, or the _left hand side_.\n", "The matrix $X \\in {\\mathcal M}(n,p)$ with $n \\ge p$, $\\mbox{rank}(X)=p$ (full rank)\n", "so that the columns of $X$ are linearly independent.\n", "\n", "The vector $\\beta$ is a $p$-vector of _parameters_, _model parameters_, or _coefficients_. \n", "The usual goal of regression is to estimate $\\beta$.\n", "\n", "The vector $\\epsilon$ is an $n$-vector of _error_, _disturbance_, or _noise_.\n", "\n", "We will usually assume that $\\epsilon$ is random, which makes $Y$ random as well.\n", "We will usually treat $X$ as fixed rather than random.\n", "\n", "There is an observation $Y_i$ for each _unit_ of observation, a row of $X$ for each observation,\n", "and a column of $X$ for each parameter (element of $\\beta$).\n", "The columns correspond to _explanatory variables_, _independent variables_, _covariates_, _control variables_,\n", "or _right-hand side variables_.\n", "\n", "We have observaations of $Y$, which are _assumed_ to be values of $X\\beta + \\epsilon$.\n", "The value of $\\beta$ is unknown.\n", "The value of $\\epsilon$ cannot be observed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standard assumption in multiple regression is that the noise terms $\\epsilon$ are random, with\n", "\n", "+ $\\{\\epsilon_i\\}$ iid (independent and identically distributed)\n", "+ ${\\mathbb E}\\epsilon_i = 0$\n", "+ $\\mbox{var}\\epsilon_i = \\sigma^2$, generally a known constant\n", "\n", "It is also standard to assume that if $X$ is random, $X$ and $\\epsilon$ are independent.\n", "Regardless, the value of $X$ is observed$.\n", "\n", "Since $X$ is observed, we can find $\\mbox{rank}(X)$.\n", "But there is no way to test whether the main assumptions are true, that is, whether\n", "+ $Y = X\\beta + \\epsilon$\n", "+ $\\{ \\epsilon_i \\}$ are iid with mean 0 and finite variance\n", "+ $X$ and $\\epsilon$ are independent.\n", "\n", "It is common to _condition_ on $X$; that is, to treat it as fixed erather than random." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ordinary least squares\n", "\n", "The ordinary least squares (OLS) estimate of $\\beta$ is\n", "\n", "$$ \n", "\\hat{\\beta} \\equiv \\left ( X'X \\right )^{-1}X' Y.\n", "$$\n", "\n", "The estimate $\\hat{\\beta}$ is a $p$-vector, just like $\\beta$.\n", "\n", "The _residuals_ or _fitting errors_ are\n", "$$\n", " e \\equiv Y - X \\hat{\\beta}.\n", "$$\n", "\n", "
$t_i$ | $Y_i$ | \n", "
---|---|
0 | 2.043 |
1 | 6.856 |
2 | 22.565 |
3 | 47.709 |
1 | 0 | 0 |
1.0 | 1.0 | 0.5 |
1 | 2 | 2 |
1.0 | 3.0 | 4.5 |
1.96995 |
0.02245 |
10.1655 |
0.07305 |
-0.21915 |
0.21915 |
-0.07305 |
-9.992007e-15 | -2.753353e-14 | -4.218847e-14 |
lm {stats} | R Documentation |
lm
is used to fit linear models.\n",
"It can be used to carry out regression,\n",
"single stratum analysis of variance and\n",
"analysis of covariance (although aov
may provide a more\n",
"convenient interface for these).\n",
"
\n", "lm(formula, data, subset, weights, na.action,\n", " method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n", " singular.ok = TRUE, contrasts = NULL, offset, ...)\n", "\n", "\n", "\n", "
formula | \n",
"\n",
" an object of class |
data | \n",
"\n",
" an optional data frame, list or environment (or object\n",
"coercible by |
subset | \n",
"\n",
" an optional vector specifying a subset of observations\n", "to be used in the fitting process. \n", " |
weights | \n",
"\n",
" an optional vector of weights to be used in the fitting\n",
"process. Should be |
na.action | \n",
"\n",
" a function which indicates what should happen\n",
"when the data contain |
method | \n",
"\n",
" the method to be used; for fitting, currently only\n",
" |
model, x, y, qr | \n",
"\n",
" logicals. If |
singular.ok | \n",
"\n",
" logical. If |
contrasts | \n",
"\n",
" an optional list. See the |
offset | \n",
"\n",
" this can be used to specify an a priori known\n",
"component to be included in the linear predictor during fitting.\n",
"This should be |
... | \n",
"\n",
" additional arguments to be passed to the low level\n", "regression fitting functions (see below). \n", " |
Models for lm
are specified symbolically. A typical model has\n",
"the form response ~ terms
where response
is the (numeric)\n",
"response vector and terms
is a series of terms which specifies a\n",
"linear predictor for response
. A terms specification of the form\n",
"first + second
indicates all the terms in first
together\n",
"with all the terms in second
with duplicates removed. A\n",
"specification of the form first:second
indicates the set of\n",
"terms obtained by taking the interactions of all terms in first
\n",
"with all terms in second
. The specification first*second
\n",
"indicates the cross of first
and second
. This is\n",
"the same as first + second + first:second
.\n",
"
If the formula includes an offset
, this is evaluated and\n",
"subtracted from the response.\n",
"
If response
is a matrix a linear model is fitted separately by\n",
"least-squares to each column of the matrix.\n",
"
See model.matrix
for some further details. The terms in\n",
"the formula will be re-ordered so that main effects come first,\n",
"followed by the interactions, all second-order, all third-order and so\n",
"on: to avoid this pass a terms
object as the formula (see\n",
"aov
and demo(glm.vr)
for an example).\n",
"
A formula has an implied intercept term. To remove this use either\n",
"y ~ x - 1
or y ~ 0 + x
. See formula
for\n",
"more details of allowed formulae.\n",
"
Non-NULL
weights
can be used to indicate that different\n",
"observations have different variances (with the values in\n",
"weights
being inversely proportional to the variances); or\n",
"equivalently, when the elements of weights
are positive\n",
"integers w_i, that each response y_i is the mean of\n",
"w_i unit-weight observations (including the case that there are\n",
"w_i observations equal to y_i and the data have been\n",
"summarized).\n",
"
lm
calls the lower level functions lm.fit
, etc,\n",
"see below, for the actual numerical computations. For programming\n",
"only, you may consider doing likewise.\n",
"
All of weights
, subset
and offset
are evaluated\n",
"in the same way as variables in formula
, that is first in\n",
"data
and then in the environment of formula
.\n",
"
lm
returns an object of class
\"lm\"
or for\n",
"multiple responses of class c(\"mlm\", \"lm\")
.\n",
"
The functions summary
and anova
are used to\n",
"obtain and print a summary and analysis of variance table of the\n",
"results. The generic accessor functions coefficients
,\n",
"effects
, fitted.values
and residuals
extract\n",
"various useful features of the value returned by lm
.\n",
"
An object of class \"lm\"
is a list containing at least the\n",
"following components:\n",
"
coefficients | \n",
"\n",
" a named vector of coefficients \n", " |
residuals | \n",
"\n",
" the residuals, that is response minus fitted values. \n", " |
fitted.values | \n",
"\n",
" the fitted mean values. \n", " |
rank | \n",
"\n",
" the numeric rank of the fitted linear model. \n", " |
weights | \n",
"\n",
" (only for weighted fits) the specified weights. \n", " |
df.residual | \n",
"\n",
" the residual degrees of freedom. \n", " |
call | \n",
"\n",
" the matched call. \n", " |
terms | \n",
"\n",
" the |
contrasts | \n",
"\n",
" (only where relevant) the contrasts used. \n", " |
xlevels | \n",
"\n",
" (only where relevant) a record of the levels of the\n", "factors used in fitting. \n", " |
offset | \n",
"\n",
" the offset used (missing if none were used). \n", " |
y | \n",
"\n",
" if requested, the response used. \n", " |
x | \n",
"\n",
" if requested, the model matrix used. \n", " |
model | \n",
"\n",
" if requested (the default), the model frame used. \n", " |
na.action | \n",
"\n",
" (where relevant) information returned by\n",
" |
In addition, non-null fits will have components assign
,\n",
"effects
and (unless not requested) qr
relating to the linear\n",
"fit, for use by extractor functions such as summary
and\n",
"effects
.\n",
"
Considerable care is needed when using lm
with time series.\n",
"
Unless na.action = NULL
, the time series attributes are\n",
"stripped from the variables before the regression is done. (This is\n",
"necessary as omitting NA
s would invalidate the time series\n",
"attributes, and if NA
s are omitted in the middle of the series\n",
"the result would no longer be a regular time series.)\n",
"
Even if the time series attributes are retained, they are not used to\n",
"line up series, so that the time shift of a lagged or differenced\n",
"regressor would be ignored. It is good practice to prepare a\n",
"data
argument by ts.intersect(..., dframe = TRUE)
,\n",
"then apply a suitable na.action
to that data frame and call\n",
"lm
with na.action = NULL
so that residuals and fitted\n",
"values are time series.\n",
"
Offsets specified by offset
will not be included in predictions\n",
"by predict.lm
, whereas those specified by an offset term\n",
"in the formula will be.\n",
"
The design was inspired by the S function of the same name described\n", "in Chambers (1992). The implementation of model formula by Ross Ihaka\n", "was based on Wilkinson & Rogers (1973).\n", "
\n", "\n", "\n", "Chambers, J. M. (1992)\n", "Linear models.\n", "Chapter 4 of Statistical Models in S\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.\n", "
\n", "Wilkinson, G. N. and Rogers, C. E. (1973)\n", "Symbolic descriptions of factorial models for analysis of variance.\n", "Applied Statistics, 22, 392–9.\n", "
\n", "\n", "\n", "summary.lm
for summaries and anova.lm
for\n",
"the ANOVA table; aov
for a different interface.\n",
"
The generic functions coef
, effects
,\n",
"residuals
, fitted
, vcov
.\n",
"
predict.lm
(via predict
) for prediction,\n",
"including confidence and prediction intervals;\n",
"confint
for confidence intervals of parameters.\n",
"
lm.influence
for regression diagnostics, and\n",
"glm
for generalized linear models.\n",
"
The underlying low level functions,\n",
"lm.fit
for plain, and lm.wfit
for weighted\n",
"regression fitting.\n",
"
More lm()
examples are available e.g., in\n",
"anscombe
, attitude
, freeny
,\n",
"LifeCycleSavings
, longley
,\n",
"stackloss
, swiss
.\n",
"
biglm
in package biglm for an alternative\n",
"way to fit linear models to large datasets (especially those with many\n",
"cases).\n",
"
\n", "require(graphics)\n", "\n", "## Annette Dobson (1990) \"An Introduction to Generalized Linear Models\".\n", "## Page 9: Plant Weight Data.\n", "ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)\n", "trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)\n", "group <- gl(2, 10, 20, labels = c(\"Ctl\",\"Trt\"))\n", "weight <- c(ctl, trt)\n", "lm.D9 <- lm(weight ~ group)\n", "lm.D90 <- lm(weight ~ group - 1) # omitting intercept\n", "\n", "anova(lm.D9)\n", "summary(lm.D90)\n", "\n", "opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))\n", "plot(lm.D9, las = 1) # Residuals, Fitted, ...\n", "par(opar)\n", "\n", "### less simple examples in \"See Also\" above\n", "\n", "\n", "
formula {stats} | R Documentation |
The generic function formula
and its specific methods provide a\n",
"way of extracting formulae which have been included in other objects.\n",
"
as.formula
is almost identical, additionally preserving\n",
"attributes when object
already inherits from\n",
"\"formula\"
.\n",
"
\n", "formula(x, ...)\n", "as.formula(object, env = parent.frame())\n", "\n", "## S3 method for class 'formula'\n", "print(x, showEnv = !identical(e, .GlobalEnv), ...)\n", "\n", "\n", "\n", "
x, object | \n",
"\n",
" R object. \n", " |
... | \n",
"\n",
" further arguments passed to or from other methods. \n", " |
env | \n",
"\n",
" the environment to associate with the result, if not\n", "already a formula. \n", " |
showEnv | \n",
"\n",
" logical indicating if the environment should be printed\n", "as well. \n", " |
The models fit by, e.g., the lm
and glm
functions\n",
"are specified in a compact symbolic form.\n",
"The ~
operator is basic in the formation of such models.\n",
"An expression of the form y ~ model
is interpreted\n",
"as a specification that the response y
is modelled\n",
"by a linear predictor specified symbolically by model
.\n",
"Such a model consists of a series of terms separated\n",
"by +
operators.\n",
"The terms themselves consist of variable and factor\n",
"names separated by :
operators.\n",
"Such a term is interpreted as the interaction of\n",
"all the variables and factors appearing in the term.\n",
"
In addition to +
and :
, a number of other operators are\n",
"useful in model formulae. The *
operator denotes factor\n",
"crossing: a*b
interpreted as a+b+a:b
. The ^
\n",
"operator indicates crossing to the specified degree. For example\n",
"(a+b+c)^2
is identical to (a+b+c)*(a+b+c)
which in turn\n",
"expands to a formula containing the main effects for a
,\n",
"b
and c
together with their second-order interactions.\n",
"The %in%
operator indicates that the terms on its left are\n",
"nested within those on the right. For example a + b %in% a
\n",
"expands to the formula a + a:b
. The -
operator removes\n",
"the specified terms, so that (a+b+c)^2 - a:b
is identical to\n",
"a + b + c + b:c + a:c
. It can also used to remove the\n",
"intercept term: when fitting a linear model y ~ x - 1
specifies\n",
"a line through the origin. A model with no intercept can be also\n",
"specified as y ~ x + 0
or y ~ 0 + x
.\n",
"
While formulae usually involve just variable and factor\n",
"names, they can also involve arithmetic expressions.\n",
"The formula log(y) ~ a + log(x)
is quite legal.\n",
"When such arithmetic expressions involve\n",
"operators which are also used symbolically\n",
"in model formulae, there can be confusion between\n",
"arithmetic and symbolic operator use.\n",
"
To avoid this confusion, the function I()
\n",
"can be used to bracket those portions of a model\n",
"formula where the operators are used in their\n",
"arithmetic sense. For example, in the formula\n",
"y ~ a + I(b+c)
, the term b+c
is to be\n",
"interpreted as the sum of b
and c
.\n",
"
Variable names can be quoted by backticks `like this`
in\n",
"formulae, although there is no guarantee that all code using formulae\n",
"will accept such non-syntactic names.\n",
"
Most model-fitting functions accept formulae with right-hand-side\n",
"including the function offset
to indicate terms with a\n",
"fixed coefficient of one. Some functions accept other\n",
"‘specials’ such as strata
or cluster
(see the\n",
"specials
argument of terms.formula)
.\n",
"
There are two special interpretations of .
in a formula. The\n",
"usual one is in the context of a data
argument of model\n",
"fitting functions and means ‘all columns not otherwise in the\n",
"formula’: see terms.formula
. In the context of\n",
"update.formula
, only, it means ‘what was\n",
"previously in this part of the formula’.\n",
"
When formula
is called on a fitted model object, either a\n",
"specific method is used (such as that for class \"nls\"
) or the\n",
"default method. The default first looks for a \"formula\"
\n",
"component of the object (and evaluates it), then a \"terms\"
\n",
"component, then a formula
parameter of the call (and evaluates\n",
"its value) and finally a \"formula\"
attribute.\n",
"
There is a formula
method for data frames. If there is only\n",
"one column this forms the RHS with an empty LHS. For more columns,\n",
"the first column is the LHS of the formula and the remaining columns\n",
"separated by +
form the RHS.\n",
"
All the functions above produce an object of class \"formula\"
\n",
"which contains a symbolic model formula.\n",
"
A formula object has an associated environment, and\n",
"this environment (rather than the parent\n",
"environment) is used by model.frame
to evaluate variables\n",
"that are not found in the supplied data
argument.\n",
"
Formulas created with the ~
operator use the\n",
"environment in which they were created. Formulas created with\n",
"as.formula
will use the env
argument for their\n",
"environment.\n",
"
Chambers, J. M. and Hastie, T. J. (1992)\n", "Statistical models.\n", "Chapter 2 of Statistical Models in S\n", "eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.\n", "
\n", "\n", "\n", "I
, offset
.\n",
"
For formula manipulation: terms
, and all.vars
;\n",
"for typical use: lm
, glm
, and\n",
"coplot
.\n",
"
\n", "class(fo <- y ~ x1*x2) # \"formula\"\n", "fo\n", "typeof(fo) # R internal : \"language\"\n", "terms(fo)\n", "\n", "environment(fo)\n", "environment(as.formula(\"y ~ x\"))\n", "environment(as.formula(\"y ~ x\", env = new.env()))\n", "\n", "\n", "## Create a formula for a model with a large number of variables:\n", "xnam <- paste0(\"x\", 1:25)\n", "(fmla <- as.formula(paste(\"y ~ \", paste(xnam, collapse= \"+\"))))\n", "\n", "\n", "
0.10138975 | -0.11206235 | 0.05336302 |
-0.1120624 | 0.2614788 | -0.1600891 |
0.05336302 | -0.16008907 | 0.10672605 |
$h = 1.97 \\pm 0.318$ | \n", "$v = 0.02 \\pm 0.511$ | \n", "$a = 10.16 \\pm 0.327$ | \n", "
-1.03005 |
0.02245 |
0.3655 |
-3.234903 |
0.04390339 |
1.118799 |
TDist {stats} | R Documentation |
Density, distribution function, quantile function and random\n",
"generation for the t distribution with df
degrees of freedom\n",
"(and optional non-centrality parameter ncp
).\n",
"
\n", "dt(x, df, ncp, log = FALSE)\n", "pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", "qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)\n", "rt(n, df, ncp)\n", "\n", "\n", "\n", "
x, q | \n",
"\n",
" vector of quantiles. \n", " |
p | \n",
"\n",
" vector of probabilities. \n", " |
n | \n",
"\n",
" number of observations. If |
df | \n",
"\n",
" degrees of freedom (> 0, maybe non-integer). |
ncp | \n",
"\n",
" non-centrality parameter delta;\n",
"currently except for |
log, log.p | \n",
"\n",
" logical; if TRUE, probabilities p are given as log(p). \n", " |
lower.tail | \n",
"\n",
" logical; if TRUE (default), probabilities are\n", "P[X ≤ x], otherwise, P[X > x]. \n", " |
The t distribution with df
= n degrees of\n",
"freedom has density\n",
"
f(x) = Γ((n+1)/2) / (√(n π) Γ(n/2)) (1 + x^2/n)^-((n+1)/2)
\n", "\n", "for all real x.\n", "It has mean 0 (for n > 1) and\n", "variance n/(n-2) (for n > 2).\n", "
\n", "The general non-central t\n",
"with parameters (df, Del) = (df, ncp)
\n",
"is defined as the distribution of\n",
"T(df, Del) := (U + Del) / √(V/df) \n",
"where U and V are independent random\n",
"variables, U ~ N(0,1) and\n",
"V ~ χ^2(df) (see Chisquare).\n",
"
The most used applications are power calculations for t-tests:
\n",
"Let T= (mX - m0) / (S/sqrt(n))\n",
"where\n",
"mX is the mean
and S the sample standard\n",
"deviation (sd
) of X_1, X_2, …, X_n which are\n",
"i.i.d. N(μ, σ^2)\n",
"Then T is distributed as non-central t with\n",
"df
= n - 1\n",
"degrees of freedom and non-centrality parameter\n",
"ncp
= (μ - m0) * sqrt(n)/σ.\n",
"
dt
gives the density,\n",
"pt
gives the distribution function,\n",
"qt
gives the quantile function, and\n",
"rt
generates random deviates.\n",
"
Invalid arguments will result in return value NaN
, with a warning.\n",
"
The length of the result is determined by n
for\n",
"rt
, and is the maximum of the lengths of the\n",
"numerical arguments for the other functions. \n",
"
The numerical arguments other than n
are recycled to the\n",
"length of the result. Only the first elements of the logical\n",
"arguments are used.\n",
"
Supplying ncp = 0
uses the algorithm for the non-central\n",
"distribution, which is not the same algorithm used if ncp
is\n",
"omitted. This is to give consistent behaviour in extreme cases with\n",
"values of ncp
very near zero.\n",
"
The code for non-zero ncp
is principally intended to be used\n",
"for moderate values of ncp
: it will not be highly accurate,\n",
"especially in the tails, for large values.\n",
"
The central dt
is computed via an accurate formula\n",
"provided by Catherine Loader (see the reference in dbinom
).\n",
"
For the non-central case of dt
, C code contributed by\n",
"Claus Ekstrøm based on the relationship (for\n",
"x != 0) to the cumulative distribution.\n",
"
For the central case of pt
, a normal approximation in the\n",
"tails, otherwise via pbeta
.\n",
"
For the non-central case of pt
based on a C translation of\n",
"
Lenth, R. V. (1989). Algorithm AS 243 —\n", "Cumulative distribution function of the non-central t distribution,\n", "Applied Statistics 38, 185–189.\n", "
\n", "This computes the lower tail only, so the upper tail suffers from\n", "cancellation and a warning will be given when this is likely to be\n", "significant.\n", "
\n", "For central qt
, a C translation of\n",
"
Hill, G. W. (1970) Algorithm 396: Student's t-quantiles.\n", "Communications of the ACM, 13(10), 619–620.\n", "
\n", "altered to take account of\n", "
\n", "Hill, G. W. (1981) Remark on Algorithm 396, ACM Transactions on\n", "Mathematical Software, 7, 250–1.\n", "
\n", "The non-central case is done by inversion.\n", "
\n", "\n", "\n", "Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)\n", "The New S Language.\n", "Wadsworth & Brooks/Cole. (Except non-central versions.)\n", "
\n", "Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995)\n", "Continuous Univariate Distributions, volume 2, chapters 28 and 31.\n", "Wiley, New York.\n", "
\n", "\n", "\n", "Distributions for other standard distributions, including\n",
"df
for the F distribution.\n",
"
\n", "require(graphics)\n", "\n", "1 - pt(1:5, df = 1)\n", "qt(.975, df = c(1:10,20,50,100,1000))\n", "\n", "tt <- seq(0, 10, len = 21)\n", "ncp <- seq(0, 6, len = 31)\n", "ptn <- outer(tt, ncp, function(t, d) pt(t, df = 3, ncp = d))\n", "t.tit <- \"Non-central t - Probabilities\"\n", "image(tt, ncp, ptn, zlim = c(0,1), main = t.tit)\n", "persp(tt, ncp, ptn, zlim = 0:1, r = 2, phi = 20, theta = 200, main = t.tit,\n", " xlab = \"t\", ylab = \"non-centrality parameter\",\n", " zlab = \"Pr(T <= t)\")\n", "\n", "plot(function(x) dt(x, df = 3, ncp = 2), -3, 11, ylim = c(0, 0.32),\n", " main = \"Non-central t - Density\", yaxs = \"i\")\n", "\n", "\n", "
0.1908651 |
0.9720682 |
0.4643426 |
1.0 | 0.2 | 0.1 | 0.1 |
0.2 | 1.0 | 0.2 | 0.1 |
0.1 | 0.2 | 1.0 | 0.2 |
0.1 | 0.1 | 0.2 | 1.0 |
1.98456 |
0.01271 |
10.1655 |