{ "cells": [ { "cell_type": "markdown", "id": "e2e302a3", "metadata": {}, "source": [ "# FORMULAS" ] }, { "cell_type": "markdown", "id": "623e728d", "metadata": {}, "source": [ "You can perform an ANOVA test in R using the `aov` built-in function. As we mentioned in the lecture, this function uses **formulas** as their inputs. Let's briefly describe their meaning and uses." ] }, { "cell_type": "markdown", "id": "c2db3226", "metadata": {}, "source": [ "A formula object is just a variable, but a special type that specifies a **relationship** between other variables. A formula is specified using the \"tilde operator\". ~. A very simple example of a formula is shown below:" ] }, { "cell_type": "code", "execution_count": 7, "id": "fbaaa5f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula1 <- out ~ pred\n", "formula1" ] }, { "cell_type": "markdown", "id": "35ea1386", "metadata": {}, "source": [ "Normally, the variable on the left-hand side of a tilde, ~ is called the \"dependent variable\", while the variables on the right-hand side are called the \"independent variables\" and are joined by plus signs +. That is why, you could also consider other examples involving" ] }, { "cell_type": "code", "execution_count": 19, "id": "5585a8a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred1 + pred2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula2 <- out ~ pred1 + pred2 # more than one variable on the right\n", "formula2" ] }, { "cell_type": "markdown", "id": "4241f0f7", "metadata": {}, "source": [ "**N.B.** The names for these variables change depending on the context. You might have already seen independent variables appear as \"predictor (variable)\", \"controlled variable\", \"feature\", etc. Similarly, you might come across dependent variables as \"response variable\", \"outcome variable\" or \"label\"." ] }, { "cell_type": "markdown", "id": "7b1ef572", "metadata": {}, "source": [ "Some times, we will need or want to create a formula from an R object, such as a string. In such cases, you can use the formula or as.formula() function" ] }, { "cell_type": "code", "execution_count": 12, "id": "12756dfc", "metadata": {}, "outputs": [], "source": [ "?as.formula" ] }, { "cell_type": "code", "execution_count": 18, "id": "bb012dcf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "'out ~ pred'" ], "text/latex": [ "'out \\textasciitilde{} pred'" ], "text/markdown": [ "'out ~ pred'" ], "text/plain": [ "[1] \"out ~ pred\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'character'" ], "text/latex": [ "'character'" ], "text/markdown": [ "'character'" ], "text/plain": [ "[1] \"character\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "out ~ pred" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'formula'" ], "text/latex": [ "'formula'" ], "text/markdown": [ "'formula'" ], "text/plain": [ "[1] \"formula\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula5<- \"out ~ pred\"\n", "formula5\n", "class(formula5)\n", "\n", "formula5<-as.formula(\"out ~ pred\")\n", "formula5\n", "class(formula5)" ] }, { "cell_type": "markdown", "id": "ecbb9de4", "metadata": {}, "source": [ "## Operators\n", "\n", "We have just seen that the independent variables can be joined with the + symbol. However, this is not the only symbol that we can use in your formulas. Let's have a look at other kind of symbols:\n", "\n", "\n", "\"~\" : As we saw above, this operator separates the dependent variable from the independent variables. For example, y ~ x means \"y is predicted by x\".\n", "\n", "\"+\" : As we saw above, this operator adds independent variables to the model. For example, y ~ x + z means \"y is predicted by x and z\".\n", "\n", "\"-\" : This operator removes independent variables from the model. For example, y ~ x - z means \"y is predicted by x, but not z\".\n", "\n", "\"*\" : This operator includes all possible interactions between the predictor variables. For example, y ~ x * z means \"y is predicted by the main effects of x and z, as well as their interaction\"." ] }, { "cell_type": "markdown", "id": "3a224cba", "metadata": {}, "source": [ "## Functions\n", "\n", "You can also use functions within formulas to transform variables or perform other operations. For example, y ~ log(x) means \"y is predicted by the logarithm of x\"." ] }, { "cell_type": "code", "execution_count": 21, "id": "ad00a4b7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred1 * pred2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "~var1 + var2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula3 <- out ~ pred1 * pred2 # different relationship between predictors\n", "formula3\n", "\n", "formula4 <- ~ var1 + var2 # a ’one-sided’ formula\n", "formula4" ] }, { "cell_type": "markdown", "id": "19152786", "metadata": {}, "source": [ "# Statistical test for two variables (continuous vs categorical)" ] }, { "cell_type": "markdown", "id": "350c78c1", "metadata": {}, "source": [ "Let's generate some data for the rest of the tutorial" ] }, { "cell_type": "code", "execution_count": 2, "id": "90f71bf5", "metadata": {}, "outputs": [], "source": [ "set.seed(1234)\n", "\n", "# Generate random data for group 1 with mean 10 and standard deviation 2\n", "welch.data.1<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 13, sd = 3), group='b'))\n", "\n", "welch.data.2<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 11, sd = 3), group='b'))\n", "\n", "# Generate random data for group 1 with mean 10 and standard deviation 2\n", "students.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 11, sd = 2), group='b'))\n", "\n", "# Generate random data for group 1 with mean 10 and standard deviation 2\n", "anova.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 12, sd = 2), group='b'), \n", " data.frame(value=rnorm(25, mean = 10, sd = 2), group='c'))" ] }, { "cell_type": "markdown", "id": "13a1dd5e", "metadata": {}, "source": [ "
Practice: Visualize the above data frames using boxplots." ] }, { "cell_type": "markdown", "id": "0bb8cfc5", "metadata": {}, "source": [ "## WELCH'S T-TEST\n", "\n", "To perform a Welch's t-test, you can use the R function `t.test`. If we look at its documenation, this is what we see:" ] }, { "cell_type": "markdown", "id": "a302af7f", "metadata": {}, "source": [ " ## Default S3 method:\n", " t.test(x, y = NULL,\n", " alternative = c(\"two.sided\", \"less\", \"greater\"),\n", " mu = 0, paired = FALSE, var.equal = FALSE,\n", " conf.level = 0.95, ...)```" ] }, { "cell_type": "markdown", "id": "39be12db", "metadata": {}, "source": [ "This function is used in the same as we saw for performing a one-sample t-test. The only thing that changes is that now you have two populations, which you have to pass to the x and y arguments in `t.test`. In addition, if you perform a Welch's t-test, you are asumming that the variances between both populations are differences. To specify this in `t.test` you have to set ```var.equal = FALSE```, otherwise you would be performing a Student's t-test:" ] }, { "cell_type": "code", "execution_count": 36, "id": "2665e850", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\tWelch Two Sample t-test\n", "\n", "data: welch.data.1[welch.data.1$group == \"a\", \"value\"] and welch.data.1[welch.data.1$group == \"b\", \"value\"]\n", "t = -2.445, df = 44.835, p-value = 0.01848\n", "alternative hypothesis: true difference in means is not equal to 0\n", "95 percent confidence interval:\n", " -2.7186234 -0.2625635\n", "sample estimates:\n", "mean of x mean of y \n", " 9.516435 11.007029 \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "t.test(x=welch.data.1[welch.data.1$group==\"a\", \"value\"],\n", " y=welch.data.1[welch.data.1$group==\"b\", \"value\"], \n", " var.equal = FALSE)" ] }, { "cell_type": "markdown", "id": "da5d7660", "metadata": {}, "source": [ "As a result of this test, assuming a type I error $\\alpha=0.05$, we should reject the null that assumes that the means of both populations are equal. " ] }, { "cell_type": "markdown", "id": "e112fcc1", "metadata": {}, "source": [ "## STUDENT'S T-TEST\n", "\n", "Running a Student's t-test is similar to Welch's, but setting var.equal = TRUE:" ] }, { "cell_type": "code", "execution_count": 37, "id": "024476d4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\tTwo Sample t-test\n", "\n", "data: students.data[students.data$group == \"a\", \"value\"] and students.data[students.data$group == \"b\", \"value\"]\n", "t = -1.7401, df = 48, p-value = 0.08826\n", "alternative hypothesis: true difference in means is not equal to 0\n", "95 percent confidence interval:\n", " -1.8455038 0.1331359\n", "sample estimates:\n", "mean of x mean of y \n", " 10.11186 10.96805 \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "t.test(x=students.data[students.data$group==\"a\", \"value\"],\n", " y=students.data[students.data$group==\"b\", \"value\"], \n", " var.equal = TRUE)" ] }, { "cell_type": "markdown", "id": "ecdd4a20", "metadata": {}, "source": [ "## T-test using formulas" ] }, { "cell_type": "markdown", "id": "6b1f2781", "metadata": {}, "source": [ "As you can use, running these tests from a data frame like the ones we have above can make our code a bit lengthy and difficult to read. Fortunately, for these cases, we can use the formula syntax we introduced at the beginning of this tutorial. If we look at the documentation of `t.test`, this is what it says about using formulas:" ] }, { "cell_type": "markdown", "id": "1482e012", "metadata": {}, "source": [ " ## S3 method for class 'formula'\n", " t.test(formula, data, subset, na.action, ...)" ] }, { "cell_type": "markdown", "id": "1562d064", "metadata": {}, "source": [ "Using formulas, running the previous examples is very easy:" ] }, { "cell_type": "code", "execution_count": 35, "id": "52c7f10c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\tWelch Two Sample t-test\n", "\n", "data: value by group\n", "t = -2.445, df = 44.835, p-value = 0.01848\n", "alternative hypothesis: true difference in means between group a and group b is not equal to 0\n", "95 percent confidence interval:\n", " -2.7186234 -0.2625635\n", "sample estimates:\n", "mean in group a mean in group b \n", " 9.516435 11.007029 \n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "\tTwo Sample t-test\n", "\n", "data: value by group\n", "t = -1.7401, df = 48, p-value = 0.08826\n", "alternative hypothesis: true difference in means between group a and group b is not equal to 0\n", "95 percent confidence interval:\n", " -1.8455038 0.1331359\n", "sample estimates:\n", "mean in group a mean in group b \n", " 10.11186 10.96805 \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Run Welch's t-test\n", "t.test(value ~ group, data = welch.data.1, var.equal = FALSE)\n", "\n", "# Run Student's t-test\n", "t.test(value ~ group, data = students.data, var.equal = TRUE)" ] }, { "cell_type": "markdown", "id": "4ccb7115", "metadata": {}, "source": [ "## ANOVA TEST" ] }, { "cell_type": "markdown", "id": "4fbb842d", "metadata": {}, "source": [ "To run an anova test, we will the `aov` built-in function. This function has the following form:\n", "\n", " aov(formula, data = NULL, projections = FALSE, qr = TRUE,\n", " contrasts = NULL, ...)\n", " \n", "Here we can only use formulas, but the syntax is similar to what we have seen already for the function `t-test`." ] }, { "cell_type": "code", "execution_count": 47, "id": "2dbd605f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Call:\n", " aov(formula = value ~ group, data = anova.data)\n", "\n", "Terms:\n", " group Residuals\n", "Sum of Squares 44.6691 348.7716\n", "Deg. of Freedom 2 72\n", "\n", "Residual standard error: 2.20092\n", "Estimated effects may be unbalanced" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "res.aov<-aov(value~group, data = anova.data)\n", "res.aov" ] }, { "cell_type": "markdown", "id": "51bb8c93", "metadata": {}, "source": [ "mmm... And the p-value? Just running the above formula will not give you the p-value directly. For that, you will need to pass the resulting object from calling `aov` to the `summary` function:" ] }, { "cell_type": "code", "execution_count": 51, "id": "5ba072de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Df Sum Sq Mean Sq F value Pr(>F) \n", "group 2 44.7 22.335 4.611 0.0131 *\n", "Residuals 72 348.8 4.844 \n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summary(res.aov)" ] }, { "cell_type": "markdown", "id": "9ebfc279", "metadata": {}, "source": [ "So we finally got our p-value, which, assuming a type I error $\\alpha=0.05$, suggests that we should reject the null that assumes that the means of all populations are equal. " ] }, { "cell_type": "markdown", "id": "699baf97", "metadata": {}, "source": [ "### POST-HOC ANALYSIS" ] }, { "cell_type": "markdown", "id": "5a37a639", "metadata": {}, "source": [ "The above result indicates that there is a significant difference in means between at least one pair of populations, but it does not specify which pairs have a significant difference. To identify significant pairs, we need to conduct a **post-hoc analysis**, which involves performing a Student's t-test for all possible population pairs in our data. \n", "\n", "However, since multiple tests will be performed in this analysis, we must adjust our type I error $\\alpha$ to account for it. Correcting for multiple testing is crucial and will be discussed in detail towards the end of the course. There are multiple procedures for doing this kind of correction, but for now we will just assume a **Bonferroni** correction, which just divides the original type I error $\\alpha$ by the number of tests performed. As a result, in the above result, since we had three groups, we will be performing three tests. Assuming a typical type I error $\\alpha=0.05$, that means that the Bonferroni corrected type I error will be $\\alpha=0.05/3$. That is, now, to claim that of the possible three tests are significant, their p-value must equal or smaller than 0.05/3, instead of the usual 0.05.\n", "\n", "In R, we can run a post-hoc analysis performing pairwise comparisons using t-tests with the function `pairwise.t.test`:" ] }, { "cell_type": "code", "execution_count": 59, "id": "8a4ee445", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\tPairwise comparisons using t tests with pooled SD \n", "\n", "data: anova.data$value and anova.data$group \n", "\n", " a b \n", "b 0.012 - \n", "c 1.000 0.137\n", "\n", "P value adjustment method: bonferroni " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pairwise.t.test(anova.data$value, \n", " anova.data$group,\n", " p.adjust.method = \"bonferroni\")" ] }, { "cell_type": "markdown", "id": "d856138b", "metadata": {}, "source": [ "From these results, we can claim that the only the means between the group a and b are different." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.2" } }, "nbformat": 4, "nbformat_minor": 5 }