{ "cells": [ { "cell_type": "markdown", "id": "e2e302a3", "metadata": {}, "source": [ "# FORMULAS" ] }, { "cell_type": "markdown", "id": "623e728d", "metadata": {}, "source": [ "You can perform an ANOVA test in R using the built-in `aov` function. As we mentioned in the lecture, this function takes a **formula** as its input. Let's briefly describe what formulas mean and how they are used." ] }, { "cell_type": "markdown", "id": "c2db3226", "metadata": {}, "source": [ "A formula is an R object of a special type that specifies a **relationship** between other variables. It is written using the tilde operator, `~`. A very simple example of a formula is shown below:" ] }, { "cell_type": "code", "execution_count": 7, "id": "fbaaa5f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula1 <- out ~ pred\n", "formula1" ] }, { "cell_type": "markdown", "id": "35ea1386", "metadata": {}, "source": [ "Normally, the variable on the left-hand side of the tilde `~` is called the \"dependent variable\", while the variables on the right-hand side are called the \"independent variables\" and are joined by plus signs `+`. A formula can therefore have more than one independent variable:" ] }, { "cell_type": "code", "execution_count": 19, "id": "5585a8a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred1 + pred2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula2 <- out ~ pred1 + pred2 # more than one variable on the right\n", "formula2" ] }, { "cell_type": "markdown", "id": "4241f0f7", "metadata": {}, "source": [ "**N.B.** The names for these variables change depending on the context. You might have already seen independent variables appear as \"predictor (variable)\", \"controlled variable\", \"feature\", etc. Similarly, you might come across dependent variables as \"response variable\", \"outcome variable\" or \"label\"."
] }, { "cell_type": "markdown", "id": "7b1ef572", "metadata": {}, "source": [ "Sometimes we will need or want to create a formula from another R object, such as a string. In such cases, you can use the `formula()` or `as.formula()` function:" ] }, { "cell_type": "code", "execution_count": 12, "id": "12756dfc", "metadata": {}, "outputs": [], "source": [ "?as.formula" ] }, { "cell_type": "code", "execution_count": 18, "id": "bb012dcf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "'out ~ pred'" ], "text/latex": [ "'out \\textasciitilde{} pred'" ], "text/markdown": [ "'out ~ pred'" ], "text/plain": [ "[1] \"out ~ pred\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'character'" ], "text/latex": [ "'character'" ], "text/markdown": [ "'character'" ], "text/plain": [ "[1] \"character\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "out ~ pred" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'formula'" ], "text/latex": [ "'formula'" ], "text/markdown": [ "'formula'" ], "text/plain": [ "[1] \"formula\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula5<- \"out ~ pred\"\n", "formula5\n", "class(formula5)\n", "\n", "formula5<-as.formula(\"out ~ pred\")\n", "formula5\n", "class(formula5)" ] }, { "cell_type": "markdown", "id": "ecbb9de4", "metadata": {}, "source": [ "## Operators\n", "\n", "We have just seen that the independent variables can be joined with the + symbol. However, this is not the only symbol that you can use in formulas. Let's have a look at some of the others:\n", "\n", "\n", "\"~\" : As we saw above, this operator separates the dependent variable from the independent variables. For example, y ~ x means \"y is predicted by x\".\n", "\n", "\"+\" : As we saw above, this operator adds independent variables to the model. 
For example, y ~ x + z means \"y is predicted by x and z\".\n", "\n", "\"-\" : This operator removes independent variables from the model. For example, y ~ x - z means \"y is predicted by x, but not z\".\n", "\n", "\"*\" : This operator includes all possible interactions between the predictor variables. For example, y ~ x * z means \"y is predicted by the main effects of x and z, as well as their interaction\"." ] }, { "cell_type": "markdown", "id": "3a224cba", "metadata": {}, "source": [ "## Functions\n", "\n", "You can also use functions within formulas to transform variables or perform other operations. For example, y ~ log(x) means \"y is predicted by the logarithm of x\"." ] }, { "cell_type": "code", "execution_count": 21, "id": "ad00a4b7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "out ~ pred1 * pred2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "~var1 + var2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "formula3 <- out ~ pred1 * pred2 # different relationship between predictors\n", "formula3\n", "\n", "formula4 <- ~ var1 + var2 # a 'one-sided' formula\n", "formula4" ] }, { "cell_type": "markdown", "id": "19152786", "metadata": {}, "source": [ "# Statistical test for two variables (continuous vs categorical)" ] }, { "cell_type": "markdown", "id": "350c78c1", "metadata": {}, "source": [ "Let's generate some data for the rest of the tutorial:" ] }, { "cell_type": "code", "execution_count": 2, "id": "90f71bf5", "metadata": {}, "outputs": [], "source": [ "set.seed(1234)\n", "\n", "# Two groups of 25 values each: means 10 and 13, standard deviations 2 and 3\n", "welch.data.1<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 13, sd = 3), group='b'))\n", "\n", "# Same standard deviations (2 and 3), but closer means: 10 and 11\n", "welch.data.2<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 11, sd = 3), group='b'))\n", "\n",
"# Two groups of 25 values each: means 10 and 11, common standard deviation 2\n", "students.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 11, sd = 2), group='b'))\n", "\n", "# Three groups of 25 values each: means 10, 12 and 10, common standard deviation 2\n", "anova.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),\n", " data.frame(value=rnorm(25, mean = 12, sd = 2), group='b'), \n", " data.frame(value=rnorm(25, mean = 10, sd = 2), group='c'))" ] }, { "cell_type": "markdown", "id": "13a1dd5e", "metadata": {}, "source": [ "