{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Chapter 7: Analysis of Variance (Anova).ipynb", "provenance": [], "collapsed_sections": [ "1mrsPw8OSpPG", "dthsvFWS-UmB", "f2CRhyLNZZ5N", "EVRQF0HkZhxd", "6r7fJg2TiQlk" ] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **`Chapter 7: Analysis of Variance (Anova)`**" ], "metadata": { "id": "Wb9LyXR91KSv" } }, { "cell_type": "markdown", "source": [ "**Table of Content:**\n", "\n", "- [Import Libraries](#Import_Libraries)\n", "- [7.1. One-Way Analysis of Variance](#One-Way_Analysis_of_Variance)\n", " - [7.1.1. Equal Sample Sizes](#Equal_Sample_Sizes)\n", " - [7.1.2. Unequal Sample Sizes](#Unequal_Sample_Sizes)\n", "- [7.2. Two-Way Analysis of Variance](#Two-Way_Analysis_of_Variance) " ], "metadata": { "id": "w5zb7kZV85_J" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "## **Import Libraries**" ], "metadata": { "id": "1mrsPw8OSpPG" } }, { "cell_type": "code", "source": [ "!pip install --upgrade scipy" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1GroeqIoC6fr", "outputId": "139f4f13-d57b-472c-9ff7-e83fea5ac7d8" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (1.7.3)\n", "Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /usr/local/lib/python3.7/dist-packages (from scipy) (1.21.6)\n" ] } ] }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patches as mpatches\n", "from matplotlib import collections as mc\n", "import seaborn as sns\n", "import math\n", "from scipy import stats\n", "from scipy.stats import norm\n", "from 
scipy.stats import chi2\n", "from scipy.stats import t\n", "from scipy.stats import f\n", "from scipy.stats import bernoulli\n", "from scipy.stats import binom\n", "from scipy.stats import nbinom\n", "from scipy.stats import geom\n", "from scipy.stats import poisson\n", "from scipy.stats import uniform\n", "from scipy.stats import randint\n", "from scipy.stats import expon\n", "from scipy.stats import gamma\n", "from scipy.stats import beta\n", "from scipy.stats import weibull_min\n", "from scipy.stats import hypergeom\n", "from scipy.stats import shapiro\n", "from scipy.stats import pearsonr\n", "from scipy.stats import normaltest\n", "from scipy.stats import anderson\n", "from scipy.stats import spearmanr\n", "from scipy.stats import kendalltau\n", "from scipy.stats import chi2_contingency\n", "from scipy.stats import ttest_ind\n", "from scipy.stats import ttest_rel\n", "from scipy.stats import mannwhitneyu\n", "from scipy.stats import wilcoxon\n", "from scipy.stats import kruskal\n", "from scipy.stats import friedmanchisquare\n", "from statsmodels.tsa.stattools import adfuller\n", "from statsmodels.tsa.stattools import kpss\n", "from statsmodels.stats.weightstats import ztest\n", "import statsmodels.api as sm\n", "from sklearn.linear_model import LinearRegression\n", "from scipy.stats import f_oneway\n", "from statsmodels.formula.api import ols\n", "from statsmodels.stats.anova import anova_lm\n", "from scipy.integrate import quad\n", "from statsmodels.stats.outliers_influence import summary_table\n", "from statsmodels.sandbox.regression.predstd import wls_prediction_std\n", "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", "from IPython.display import display, Latex\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "warnings.simplefilter(action='ignore', category=FutureWarning)" ], "metadata": { "id": "o60rxBbwmJM5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "\n", "\n", 
"## **7.1. One-Way Analysis of Variance:**" ], "metadata": { "id": "dthsvFWS-UmB" } }, { "cell_type": "markdown", "source": [ "This technique, which is rather general and can be used to make inferences about a multitude of parameters relating to population means, is known as the analysis of variance.\n", "\n", "We suppose that we have been provided samples of size $n$ from $m$ distinct populations and that we want to use these data to test the hypothesis that the $m$ population means are equal." ], "metadata": { "id": "yRiF35t89W4t" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "### **7.1.1. Equal Sample Sizes:**" ], "metadata": { "id": "f2CRhyLNZZ5N" } }, { "cell_type": "markdown", "source": [ "Since the mean of a random variable depends only on a single factor, namely, the sample the variable is from, this scenario is said to constitute a one-way analysis of variance.\n", "\n", "One way of thinking about this is to imagine that we have $m$ different treatments, where the result of applying treatment $i$ on an item is a normal random variable with mean $\\mu_i$ and variance $\\sigma^2$. We are then interested in testing the hypothesis that all treatments have the same effect, by applying each treatment to a (different) sample of $n$ items and then analyzing the result.\n", "\n", "Consider $m$ independent samples, each of size $n$, where the members of the ith sample $X_{i1}, X_{i2}, . . . 
, X_{in}$ are normal random variables with unknown mean $\\mu_i$ and unknown variance $\\sigma^2$.\n", "\n", "$X_{ij} \\sim N(\\mu_i, \\sigma^2) \\quad i=1,...,m,\\ j=1,...,n$\n", "\n", "$\\\\ $\n", "\n", "$H_0: \\mu_1=\\mu_2=...=\\mu_m$\n", "\n", "$H_1:$ not all the means are equal (at least two of them differ)." ], "metadata": { "id": "MfiPoYrp-4U4" } }, { "cell_type": "markdown", "source": [ "**Within samples sum of squares:**\n", "\n", "Since there are a total of $nm$ independent normal random variables $X_{ij}$, it follows that the sum of the squares of their standardized versions will be a chi-square random variable with $nm$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}-E[X_{ij}])^2}{\\sigma^2} = \\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}- \\mu_i)^2}{\\sigma^2} \\sim \\chi^2_{nm}$\n", "\n", "To obtain estimators for the $m$ unknown parameters $\\mu_1, . . . ,\\mu_m$, let $X_{i.}$ denote the average of all the elements in sample $i$:\n", "\n", "$X_{i.} = \\sum_{j=1}^n \\frac{X_{ij}}{n}$\n", "\n", "The variable $X_{i.}$ is the sample mean of the $i$th population, and as such is the estimator of the population mean $\\mu_i$ for $i=1,...,m$.\n", "\n", "If we now replace each $\\mu_i$ by its estimator $X_{i.}$, the resulting quantity will have a chi-square distribution with $nm-m$ degrees of freedom. (Recall that 1 degree of freedom is lost for each parameter that is estimated.)\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}- X_{i.})^2}{\\sigma^2} \\sim \\chi^2_{nm-m}$\n", "\n", "$SS_W = \\sum_{i=1}^m \\sum_{j=1}^n (X_{ij}- X_{i.})^2$\n", "\n", "$\\frac{E[SS_W]}{\\sigma^2} = nm-m \\quad \\rightarrow \\quad \\frac{E[SS_W]}{nm-m} = \\sigma^2$\n", "\n", "$\\frac{SS_W}{nm-m}$ is an unbiased estimator of $\\sigma^2$." ], "metadata": { "id": "IsmBOUo2EB6_" } }, { "cell_type": "markdown", "source": [ "**Between samples sum of squares:**\n", "\n", "Assume that $H_0$ is true, so that all the population means $\\mu_i$ are equal, say, $\\mu_i = \\mu$ for all $i$. 
Under this condition it follows that the $m$ sample means $X_{1.}, X_{2.}, \\ldots, X_{m.}$ will all be normally distributed with the same mean $\\mu$ and the same variance $\\frac{\\sigma^2}{n}$. Hence, the sum of squares of the $m$ standardized variables $\\frac{X_{i.}-\\mu}{\\sqrt{\\frac{\\sigma^2}{n}}} = \\frac{\\sqrt{n}(X_{i.}-\\mu)}{\\sigma}$ will be a chi-square random variable with $m$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\frac{n(X_{i.}-\\mu)^2}{\\sigma^2} \\sim \\chi_m^2$\n", "\n", "Now, when all the population means are equal to $\\mu$, the estimator of $\\mu$ is the average of all the $nm$ data values. That is, the estimator of $\\mu$ is $X_{..}$.\n", "\n", "$X_{..} = \\frac{\\sum_{i=1}^m \\sum_{j=1}^n X_{ij}}{nm} = \\frac{\\sum_{i=1}^m X_{i.}}{m}$\n", "\n", "If we now substitute $X_{..}$ for the unknown parameter $\\mu$ in the expression $\\sum_{i=1}^m \\frac{n(X_{i.}-\\mu)^2}{\\sigma^2}$, it follows, when $H_0$ is true, that the resulting quantity will be a chi-square random variable with $m-1$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\frac{n(X_{i.}-X_{..})^2}{\\sigma^2} \\sim \\chi_{m-1}^2$\n", "\n", "$SS_b = n \\sum_{i=1}^m (X_{i.}-X_{..})^2$\n", "\n", "When $H_0$ is true:\n", "\n", "$\\frac{E[SS_b]}{\\sigma^2} = m-1 \\quad \\rightarrow \\quad \\frac{E[SS_b]}{m-1} = \\sigma^2$\n", "\n", "$\\frac{SS_b}{m-1}$ is an estimator of $\\sigma^2$.\n", "\n", "
<table>\n", "<tr><th>Estimators of $\\sigma^2$</th><th>Conditions</th></tr>\n", "<tr><td>$\\frac{SS_W}{nm-m}$</td><td>Always true</td></tr>\n", "<tr><td>$\\frac{SS_b}{m-1}$</td><td>Only when $H_0$ is true</td></tr>\n", "</table>
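As a quick numerical check of the two estimators in the table above, the sketch below simulates repeated data sets (the group means, `m`, `n`, and `sigma` are arbitrary choices for illustration) and averages each estimator: $\frac{SS_W}{nm-m}$ stays near $\sigma^2$ whether or not $H_0$ holds, while $\frac{SS_b}{m-1}$ inflates once the means differ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 4, 30, 2.0            # m samples of size n; true sigma^2 = 4

def estimators(means):
    # one data set X[i, j] ~ N(mu_i, sigma^2), then the two variance estimators
    X = rng.normal(loc=np.array(means)[:, None], scale=sigma, size=(m, n))
    row_means = X.mean(axis=1, keepdims=True)
    SS_W = ((X - row_means) ** 2).sum()
    SS_b = n * ((row_means - X.mean()) ** 2).sum()
    return SS_W / (n * m - m), SS_b / (m - 1)

# average both estimators over many simulated data sets
null = np.mean([estimators([5, 5, 5, 5]) for _ in range(2000)], axis=0)  # H0 true
alt = np.mean([estimators([5, 6, 7, 8]) for _ in range(2000)], axis=0)   # H0 false

print('H0 true: ', null)   # both entries close to sigma^2 = 4
print('H0 false:', alt)    # first entry still near 4; second much larger
```

Under the alternative, $E\left[\frac{SS_b}{m-1}\right] = \sigma^2 + \frac{n\sum_i \alpha_i^2}{m-1}$ with $\alpha_i = \mu_i - \mu$, which is why the second estimator drifts upward when the means differ.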
" ], "metadata": { "id": "bS5UmLn3IhDF" } }, { "cell_type": "markdown", "source": [ "Because it can be shown that $\\frac{SS_b}{m-1}$ will tend to exceed $\\sigma^2$ when $H_0$ is not true, the test statistic is:\n", "\n", "$F_0 = \\frac{\\frac{SS_b}{m-1}}{\\frac{SS_W}{nm-m}}$\n", "\n", "$\\\\ $\n", "\n", "Significance level = $\\alpha$\n", "\n", "$\\\\ $\n", "\n", "We accept $H_0$ if:\n", "\n", "1. $F_0 < F_{m-1,\\ nm-m,\\ \\alpha}$\n", "\n", "2. P_value = $P(F_{m-1,\\ nm-m} \\geq F_0) > \\alpha$" ], "metadata": { "id": "icLzqh0lRl3g" } }, { "cell_type": "markdown", "source": [ "The sum of squares identity:\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^n X_{ij}^2 = nmX_{..}^2 + SS_b + SS_W$" ], "metadata": { "id": "xlWt1DIPYtaY" } }, { "cell_type": "markdown", "source": [ "**Summary:**\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>\n", "<tr><th>Source of Variation</th><th>Sum of Squares</th><th>Degrees of Freedom</th><th>Mean of Squares</th><th>Value of Test Statistic</th></tr>\n", "<tr><td>Between Samples</td><td>$SS_b=n\\sum_{i=1}^m(X_{i.}-X_{..})^2$</td><td>$m-1$</td><td>$MS_b = \\frac{SS_b}{m-1}$</td><td>$F_0 = \\frac{\\frac{SS_b}{m-1}}{\\frac{SS_W}{nm-m}}$</td></tr>\n", "<tr><td>Within Samples</td><td>$SS_W = \\sum_{i=1}^m \\sum_{j=1}^n (X_{ij}- X_{i.})^2$</td><td>$nm-m$</td><td>$MS_W = \\frac{SS_W}{nm-m}$</td><td></td></tr>\n", "<tr><td>Total</td><td>$SS_T = SS_W + SS_b = \\sum_{i=1}^m \\sum_{j=1}^n (X_{ij}- X_{..})^2$</td><td>$nm-1$</td><td></td><td></td></tr>\n", "</table>
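The entries of this summary table can be computed directly; the following sketch applies the formulas to the same three samples that are passed to `f_oneway` in the next code cell, and checks the result against scipy (`f.sf` gives the upper-tail probability, `f.ppf` the critical value).

```python
import numpy as np
from scipy.stats import f, f_oneway

samples = np.array([[220, 251, 226, 246, 260],
                    [244, 235, 232, 242, 225],
                    [252, 272, 250, 238, 256]], dtype=float)
m, n = samples.shape

row_means = samples.mean(axis=1)          # X_i.
grand_mean = samples.mean()               # X_..

SS_b = n * ((row_means - grand_mean) ** 2).sum()
SS_W = ((samples - row_means[:, None]) ** 2).sum()

F0 = (SS_b / (m - 1)) / (SS_W / (n * m - m))
p_value = f.sf(F0, m - 1, n * m - m)           # P(F_{m-1, nm-m} >= F0)
critical = f.ppf(1 - 0.05, m - 1, n * m - m)   # F_{m-1, nm-m, alpha}

print(f'F0 = {F0:.4f}, p-value = {p_value:.4f}, critical value = {critical:.4f}')
print(f_oneway(*samples))                      # same statistic and p-value
```

Both acceptance criteria agree here: $F_0$ falls below the critical value exactly when the p-value exceeds $\alpha$.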
" ], "metadata": { "id": "Ti5EqBy9JrrG" } }, { "cell_type": "markdown", "source": [ "You can do this test with f_oneway from scipy library." ], "metadata": { "id": "dq1LdNgzQcq-" } }, { "cell_type": "code", "source": [ "Sample1 = [220, 251, 226, 246, 260]\n", "Sample2 = [244, 235, 232, 242, 225]\n", "Sample3 = [252, 272, 250, 238, 256]\n", "\n", "alpha = 0.05\n", "results = f_oneway(Sample1, Sample2, Sample3)\n", "\n", "print(results, '\\n')\n", "\n", "if results[1] < alpha:\n", " print(f'Since p_value < {alpha}, reject null hypothesis.')\n", "else:\n", " print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected.')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RoAxmqus7_B2", "outputId": "e84caf74-683b-4402-cd9c-75fd43a202c1" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "F_onewayResult(statistic=2.6009238802972487, pvalue=0.11524892355706169) \n", "\n", "Since p_value > 0.05, the null hypothesis cannot be rejected.\n" ] } ] }, { "cell_type": "markdown", "source": [ "\n", "\n", "### **7.1.2. Unequal Sample Sizes:**" ], "metadata": { "id": "EVRQF0HkZhxd" } }, { "cell_type": "markdown", "source": [ "Suppose that we have $m$ normal samples of respective sizes $n_1, n_2, ... , n_m$. That is, the data consist of the $\\sum_{i=1}^m n_i$\n", "independent random variables $Xij,\\ j = 1, ... , n_i,\\ i = 1, ... 
, m$, where $X_{ij} \\sim N(\\mu_i, \\sigma^2)$.\n", "\n", "$\\\\ $\n", "\n", "$H_0: \\mu_1=\\mu_2=...=\\mu_m$\n", "\n", "$H_1:$ not all the means are equal (at least two of them differ)." ], "metadata": { "id": "qdRARCxRZttP" } }, { "cell_type": "markdown", "source": [ "**Within samples sum of squares:**\n", "\n", "Since there are a total of $\\sum_{i=1}^m n_i$ independent normal random variables $X_{ij}$, it follows that the sum of the squares of their standardized versions will be a chi-square random variable with $\\sum_{i=1}^m n_i$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^{n_i} \\frac{(X_{ij}-E[X_{ij}])^2}{\\sigma^2} = \\sum_{i=1}^m \\sum_{j=1}^{n_i} \\frac{(X_{ij}- \\mu_i)^2}{\\sigma^2} \\sim \\chi^2_{\\sum_{i=1}^m n_i}$\n", "\n", "To obtain estimators for the $m$ unknown parameters $\\mu_1, . . . ,\\mu_m$, let $X_{i.}$ denote the average of all the elements in sample $i$:\n", "\n", "$X_{i.} = \\sum_{j=1}^{n_i} \\frac{X_{ij}}{n_i}$\n", "\n", "The variable $X_{i.}$ is the sample mean of the $i$th population, and as such is the estimator of the population mean $\\mu_i$ for $i=1,...,m$.\n", "\n", "If we now replace each $\\mu_i$ by its estimator $X_{i.}$, the resulting quantity will have a chi-square distribution with $\\sum_{i=1}^m n_i - m$ degrees of freedom. (Recall that 1 degree of freedom is lost for each parameter that is estimated.)\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^{n_i} \\frac{(X_{ij}- X_{i.})^2}{\\sigma^2} \\sim \\chi^2_{\\sum_{i=1}^m n_i - m}$\n", "\n", "$SS_W = \\sum_{i=1}^m \\sum_{j=1}^{n_i} (X_{ij}- X_{i.})^2$\n", "\n", "$\\frac{E[SS_W]}{\\sigma^2} = \\sum_{i=1}^m n_i - m \\quad \\rightarrow \\quad \\frac{E[SS_W]}{\\sum_{i=1}^m n_i - m} = \\sigma^2$\n", "\n", "$\\frac{SS_W}{\\sum_{i=1}^m n_i - m}$ is an unbiased estimator of $\\sigma^2$."
], "metadata": { "id": "PHvjr5IDarxc" } }, { "cell_type": "markdown", "source": [ "**Between samples sum of squares:**\n", "\n", "Assume that $H_0$ is true, so that all the population means $\\mu_i$ are equal, say, $\\mu_i = \\mu$ for all $i$. Under this condition it follows that the $m$ sample means $X_{1.}, X_{2.}, \\ldots, X_{m.}$ will all be normally distributed with mean $\\mu$, where $X_{i.}$ now has variance $\\frac{\\sigma^2}{n_i}$. Hence, the sum of squares of the $m$ standardized variables $\\frac{X_{i.}-\\mu}{\\sqrt{\\frac{\\sigma^2}{n_i}}} = \\frac{\\sqrt{n_i}(X_{i.}-\\mu)}{\\sigma}$ will be a chi-square random variable with $m$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\frac{n_i(X_{i.}-\\mu)^2}{\\sigma^2} \\sim \\chi_m^2$\n", "\n", "Now, when all the population means are equal to $\\mu$, the estimator of $\\mu$ is the average of all the $\\sum_{i=1}^m n_i$ data values. That is, the estimator of $\\mu$ is $X_{..}$, the weighted average of the sample means.\n", "\n", "$X_{..} = \\frac{\\sum_{i=1}^m \\sum_{j=1}^{n_i} X_{ij}}{\\sum_{i=1}^m n_i} = \\frac{\\sum_{i=1}^m n_i X_{i.}}{\\sum_{i=1}^m n_i}$\n", "\n", "If we now substitute $X_{..}$ for the unknown parameter $\\mu$ in the expression $\\sum_{i=1}^m \\frac{n_i(X_{i.}-\\mu)^2}{\\sigma^2}$, it follows, when $H_0$ is true, that the resulting quantity will be a chi-square random variable with $m-1$ degrees of freedom.\n", "\n", "$\\sum_{i=1}^m \\frac{n_i(X_{i.}-X_{..})^2}{\\sigma^2} \\sim \\chi_{m-1}^2$\n", "\n", "$SS_b = \\sum_{i=1}^m n_i (X_{i.}-X_{..})^2$\n", "\n", "When $H_0$ is true:\n", "\n", "$\\frac{E[SS_b]}{\\sigma^2} = m-1 \\quad \\rightarrow \\quad \\frac{E[SS_b]}{m-1} = \\sigma^2$\n", "\n", "$\\frac{SS_b}{m-1}$ is an estimator of $\\sigma^2$.\n", "\n", "
<table>\n", "<tr><th>Estimators of $\\sigma^2$</th><th>Conditions</th></tr>\n", "<tr><td>$\\frac{SS_W}{\\sum_{i=1}^m n_i - m}$</td><td>Always true</td></tr>\n", "<tr><td>$\\frac{SS_b}{m-1}$</td><td>Only when $H_0$ is true</td></tr>\n", "</table>
" ], "metadata": { "id": "krC3IvY8dJZ2" } }, { "cell_type": "markdown", "source": [ "Because it can be shown that $\\frac{SS_b}{m-1}$ will tend to exceed $\\sigma^2$ when $H_0$ is not true, the test statistic is:\n", "\n", "$F_0 = \\frac{\\frac{SS_b}{m-1}}{\\frac{SS_W}{\\sum_{i=1}^m n_i - m}}$\n", "\n", "$\\\\ $\n", "\n", "Significance level = $\\alpha$\n", "\n", "$\\\\ $\n", "\n", "We accept $H_0$ if:\n", "\n", "1. $F_0 < F_{m-1,\\ \\sum_{i=1}^m n_i - m,\\ \\alpha}$\n", "\n", "2. P_value = $P(F_{m-1,\\ \\sum_{i=1}^m n_i - m} \\geq F_0) > \\alpha$" ], "metadata": { "id": "7f34W5C3fr-I" } }, { "cell_type": "markdown", "source": [ "**Summary:**\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>\n", "<tr><th>Source of Variation</th><th>Sum of Squares</th><th>Degrees of Freedom</th><th>Mean of Squares</th><th>Value of Test Statistic</th></tr>\n", "<tr><td>Between Samples</td><td>$SS_b = \\sum_{i=1}^m n_i (X_{i.}-X_{..})^2$</td><td>$m-1$</td><td>$MS_b = \\frac{SS_b}{m-1}$</td><td>$F_0 = \\frac{\\frac{SS_b}{m-1}}{\\frac{SS_W}{\\sum_{i=1}^m n_i - m}}$</td></tr>\n", "<tr><td>Within Samples</td><td>$SS_W = \\sum_{i=1}^m \\sum_{j=1}^{n_i} (X_{ij}- X_{i.})^2$</td><td>$\\sum_{i=1}^m n_i - m$</td><td>$MS_W = \\frac{SS_W}{\\sum_{i=1}^m n_i - m}$</td><td></td></tr>\n", "<tr><td>Total</td><td>$SS_T = SS_W + SS_b = \\sum_{i=1}^m \\sum_{j=1}^{n_i} (X_{ij}- X_{..})^2$</td><td>$\\sum_{i=1}^m n_i-1$</td><td></td><td></td></tr>\n", "</table>
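The same hand computation works with unequal sample sizes; this sketch uses the samples from the `f_oneway` example below and reproduces its statistic and p-value.

```python
import numpy as np
from scipy.stats import f, f_oneway

samples = [[220, 251, 226, 246, 260],
           [244, 235, 232, 242],
           [252, 272, 250]]
m = len(samples)
n_i = np.array([len(s) for s in samples])
N = n_i.sum()                                        # total number of observations

row_means = np.array([np.mean(s) for s in samples])  # X_i.
grand_mean = sum(map(sum, samples)) / N              # X_.. (weighted by n_i)

SS_b = (n_i * (row_means - grand_mean) ** 2).sum()
SS_W = sum(((np.array(s) - xb) ** 2).sum() for s, xb in zip(samples, row_means))

F0 = (SS_b / (m - 1)) / (SS_W / (N - m))
p_value = f.sf(F0, m - 1, N - m)                     # P(F_{m-1, N-m} >= F0)
print(f'F0 = {F0:.4f}, p-value = {p_value:.4f}')
print(f_oneway(*samples))                            # same statistic and p-value
```

Note that each sample mean is weighted by its own size $n_i$ in both $X_{..}$ and $SS_b$.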
" ], "metadata": { "id": "qpx-mpOhgAtR" } }, { "cell_type": "code", "source": [ "Sample1 = [220, 251, 226, 246, 260]\n", "Sample2 = [244, 235, 232, 242]\n", "Sample3 = [252, 272, 250]\n", "\n", "alpha = 0.05\n", "results = f_oneway(Sample1, Sample2, Sample3)\n", "\n", "print(results, '\\n')\n", "\n", "if results[1] < alpha:\n", " print(f'Since p_value < {alpha}, reject null hypothesis.')\n", "else:\n", " print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected.')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sRUSJO4YZkFc", "outputId": "44bd1ace-4eb4-46dd-bb68-6792dd9fc594" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "F_onewayResult(statistic=2.2667346740503254, pvalue=0.15949612861261475) \n", "\n", "Since p_value > 0.05, the null hypothesis cannot be rejected.\n" ] } ] }, { "cell_type": "markdown", "source": [ "\n", "\n", "## **7.2. Two-Way Analysis of Variance:**" ], "metadata": { "id": "6r7fJg2TiQlk" } }, { "cell_type": "markdown", "source": [ "We suppose that each data value is affected by two factors. We will refer to the first factor as the \"row\" factor, and the second factor as\n", "the \"column\" factor. we will suppose that the data $X_{ij},\\ i = 1, ... , m,\\ j = 1, ... 
, n$ are independent normal random variables with a common variance $\\sigma^2$, and we suppose that the mean value of each data point depends in an additive manner on both its row and its column.\n", "\n", "If we let $X_{ij}$ represent the value of the $j$th member of sample $i$, then the single-factor model of the previous section could be symbolically represented as $E[X_{ij}] = \\mu_i$.\n", "\n", "However, if we let $\\mu$ denote the average value of the $\\mu_i$ $(\\mu = \\frac{\\sum_{i=1}^m \\mu_i}{m})$, then we can rewrite the model as $E[X_{ij}] = \\mu + \\alpha_i$, where $\\alpha_i = \\mu_i -\\mu$.\n", "\n", "With this definition of $\\alpha_i$ as the deviation of $\\mu_i$ from the average mean value, it is easy to see that $\\sum_{i=1}^m \\alpha_i = 0$\n", "\n", "A two-factor additive model can also be expressed in terms of row and column deviations.\n", "\n", "If we let $\\mu_{ij} = E[X_{ij}]$, then the additive model supposes that for some constants $a_i,\\ i = 1, ... , m$ and $b_j,\\ j = 1, ... , n$\n", "\n", "$\\mu_{ij} = a_i + b_j$\n", "\n", "Continuing our use of the \"dot\" (or averaging) notation, we let\n", "\n", "$\\mu_{i.} = \\sum_{j=1}^n \\frac{\\mu_{ij}}{n} \\qquad \\mu_{.j} = \\sum_{i=1}^m \\frac{\\mu_{ij}}{m} \\qquad \\mu_{..} = \\sum_{i=1}^m \\sum_{j=1}^n\\frac{\\mu_{ij}}{nm}$\n", "\n", "$a_. = \\sum_{i=1}^m \\frac{a_i}{m} \\qquad b_. = \\sum_{j=1}^n \\frac{b_j}{n}$\n", "\n", "Note that:\n", "\n", "$\\mu_{i.} = \\sum_{j=1}^n \\frac{(a_i + b_j)}{n} = a_i + b_. \\qquad \\mu_{.j} = a_. + b_j \\qquad \\mu_{..} = a_. + b_.$\n", "\n", "If we now set\n", "\n", "$\\mu = \\mu_{..} = a_. + b_. \\qquad \\alpha_i = \\mu_{i.}-\\mu = a_i - a_. 
\\qquad \\beta_j = \\mu_{.j}-\\mu = b_j - b_.$\n", "\n", "then the model can be written as\n", "\n", "$\\mu_{ij} = E[X_{ij}] = \\mu + \\alpha_i + \\beta_j$\n", "\n", "The value $\\mu$ is called the grand mean, $\\alpha_i$ is the deviation from the grand mean due to row $i$, and $\\beta_j$ is the deviation from the grand mean due to column $j$.\n", "\n", "$X_{i.}=\\frac{\\sum_{j=1}^n X_{ij}}{n} \\qquad X_{.j}=\\frac{\\sum_{i=1}^m X_{ij}}{m} \\qquad X_{..}=\\frac{\\sum_{i=1}^m \\sum_{j=1}^n X_{ij}}{nm}$\n", "\n", "Unbiased estimators of $\\mu, \\alpha_i, \\beta_j$ — call them $\\widehat{\\mu},\\ \\widehat{\\alpha_i},\\ \\widehat{\\beta_j}$ — are given by\n", "\n", "$\\widehat{\\mu} = X_{..} \\qquad \\widehat{\\alpha_i} = X_{i.} - X_{..} \\qquad \\widehat{\\beta_j} = X_{.j} - X_{..}$" ], "metadata": { "id": "XqazMIWzi6Wr" } }, { "cell_type": "markdown", "source": [ "**Hypothesis Tests:**\n", "\n", "Test 1:\n", "\n", "$H_0:$ all $\\alpha_i = 0$ \n", "\n", "$H_1:$ not all the $\\alpha_i$ are equal to 0\n", "\n", "This null hypothesis states that there is no row effect, in that the value of a data point is not affected by its row factor level.\n", "\n", "$\\\\ $\n", "\n", "Test 2:\n", "\n", "$H_0:$ all $\\beta_j = 0$ \n", "\n", "$H_1:$ not all the $\\beta_j$ are equal to 0\n", "\n", "This null hypothesis states that there is no column effect, in that the value of a data point is not affected by its column factor level." ], "metadata": { "id": "0NWDjsqbwXtn" } }, { "cell_type": "markdown", "source": [ "**Error Sum of Squares:**\n", "\n", "To obtain tests for the above null hypotheses, we will apply the analysis of variance approach in which two different estimators are derived for the variance $\\sigma^2$. The first will always be a valid estimator, whereas the second will be a valid estimator only when the null hypothesis is true. 
In addition, the second estimator will tend to overestimate $\\sigma^2$ when the null hypothesis is not true.\n", "\n", "To obtain our first estimator of $\\sigma^2$, we start with the fact that:\n", "$\\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}-E[X_{ij}])^2}{\\sigma^2} = \\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}-\\mu-\\alpha_i-\\beta_j)^2}{\\sigma^2} \\sim \\chi_{nm}^2$\n", "\n", "If in the above expression we now replace the unknown parameters $\\mu, \\alpha_1, \\alpha_2, ... , \\alpha_m, \\beta_1, \\beta_2, ... , \\beta_n$ by their estimators $\\widehat{\\mu}, \\widehat{\\alpha_1}, \\widehat{\\alpha_2}, ... , \\widehat{\\alpha_m}, \\widehat{\\beta_1}, \\widehat{\\beta_2}, ... , \\widehat{\\beta_n}$, then it turns out that the resulting expression will remain chi-square but will lose 1 degree of freedom for each parameter that is estimated. Since the constraints $\\sum_i \\alpha_i = 0$ and $\\sum_j \\beta_j = 0$ leave $1 + (m-1) + (n-1) = n+m-1$ freely estimated parameters, the degrees of freedom become $nm-(n+m-1)=(n-1)(m-1)$. Therefore,\n", "\n", "$\\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}-\\widehat{\\mu}-\\widehat{\\alpha_i}-\\widehat{\\beta_j})^2}{\\sigma^2} = \\sum_{i=1}^m \\sum_{j=1}^n \\frac{(X_{ij}-X_{i.}-X_{.j}+X_{..})^2}{\\sigma^2} \\sim \\chi_{(n-1)(m-1)}^2$\n", "\n", "$SS_e = \\sum_{i=1}^m \\sum_{j=1}^n (X_{ij}-X_{i.}-X_{.j}+X_{..})^2$\n", "\n", "$\\frac{E[SS_e]}{\\sigma^2} = (n-1)(m-1) \\quad \\rightarrow \\quad \\frac{E[SS_e]}{(n-1)(m-1)} = \\sigma^2$\n", "\n", "$\\frac{SS_e}{(n-1)(m-1)}$ is an unbiased estimator of $\\sigma^2$." ], "metadata": { "id": "3PUDFUAEEbwi" } }, { "cell_type": "markdown", "source": [ "**Row Sum of Squares:**\n", "\n", "Suppose now that we want to test the null hypothesis that there is no row effect.\n", "\n", "To obtain a second estimator of $\\sigma^2$, consider the row averages $X_{i.},\\ i = 1, ... , m$. 
Note that, when $H_0$ is true, each $\\alpha_i$ is equal to 0, and so $E[X_{i.}] = \\mu+\\alpha_i = \\mu$. Because each $X_{i.}$ is the average of $n$ random variables, each having variance $\\sigma^2$, it follows that $Var(X_{i.})=\\frac{\\sigma^2}{n}$.\n", "\n", "Thus, we see that when $H_0$ is true:\n", "\n", "$\\sum_{i=1}^m \\frac{(X_{i.}-E[X_{i.}])^2}{Var(X_{i.})} = n \\sum_{i=1}^m \\frac{(X_{i.}-\\mu)^2}{\\sigma^2} \\sim \\chi_m^2$\n", "\n", "Substituting the estimator $X_{..}$ for $\\mu$ costs one degree of freedom, so when $H_0$ is true $n \\sum_{i=1}^m \\frac{(X_{i.}-X_{..})^2}{\\sigma^2} \\sim \\chi_{m-1}^2$, which leads to\n", "\n", "$SS_r = n \\sum_{i=1}^m (X_{i.}-X_{..})^2$\n", "\n", "$\\frac{E[SS_r]}{\\sigma^2} = m-1 \\quad \\rightarrow \\quad \\frac{E[SS_r]}{m-1} = \\sigma^2$\n", "\n", "$\\frac{SS_r}{m-1}$ is an estimator of $\\sigma^2$.\n", "\n", "A symmetric argument applied to the column averages $X_{.j}$ yields the column sum of squares $SS_c = m \\sum_{j=1}^n (X_{.j}-X_{..})^2$, and $\\frac{SS_c}{n-1}$ is an estimator of $\\sigma^2$ when there is no column effect (all $\\beta_j = 0$)." ], "metadata": { "id": "O-BKFeclL9bv" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "
<table>\n", "<tr><th>Estimators of $\\sigma^2$</th><th>Conditions</th></tr>\n", "<tr><td>$\\frac{SS_e}{(n-1)(m-1)}$</td><td>Always true</td></tr>\n", "<tr><td>$\\frac{SS_r}{m-1}$</td><td>Only when there is no row effect (all $\\alpha_i = 0$)</td></tr>\n", "<tr><td>$\\frac{SS_c}{n-1}$</td><td>Only when there is no column effect (all $\\beta_j = 0$)</td></tr>\n", "</table>
" ], "metadata": { "id": "dA34-5xZPGyq" } }, { "cell_type": "markdown", "source": [ "**Summary:**\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>\n", "<tr><th>Source of Variation</th><th>Sum of Squares</th><th>Degrees of Freedom</th></tr>\n", "<tr><td>Row</td><td>$SS_r = n \\sum_{i=1}^m (X_{i.}-X_{..})^2$</td><td>$m-1$</td></tr>\n", "<tr><td>Column</td><td>$SS_c = m \\sum_{j=1}^n (X_{.j}-X_{..})^2$</td><td>$n-1$</td></tr>\n", "<tr><td>Error</td><td>$SS_e = \\sum_{i=1}^m \\sum_{j=1}^n (X_{ij}-X_{i.}-X_{.j}+X_{..})^2$</td><td>$(n-1)(m-1)$</td></tr>\n", "</table>
\n", "\n", "$\\\\ $\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>\n", "<tr><th>Null Hypothesis</th><th>Test Statistic</th><th>Significance Level $\\alpha$ Test</th><th>p-value</th></tr>\n", "<tr><td>All $\\alpha_i = 0$</td><td>$F_0 = \\frac{\\frac{SS_r}{m-1}}{\\frac{SS_e}{(n-1)(m-1)}}$</td><td>Reject if $F_0 \\geq F_{m-1,(n-1)(m-1),\\alpha}$</td><td>$P(F_{m-1,(n-1)(m-1)} \\geq F_0)$</td></tr>\n", "<tr><td>All $\\beta_j = 0$</td><td>$F_0 = \\frac{\\frac{SS_c}{n-1}}{\\frac{SS_e}{(n-1)(m-1)}}$</td><td>Reject if $F_0 \\geq F_{n-1,(n-1)(m-1),\\alpha}$</td><td>$P(F_{n-1,(n-1)(m-1)} \\geq F_0)$</td></tr>\n", "</table>
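Both F tests can be carried out directly from the tables above; the sketch below uses a small hypothetical 3 x 4 layout (the numbers are invented for illustration) and forms the row and column test statistics.

```python
import numpy as np
from scipy.stats import f

# hypothetical layout: m = 3 row-factor levels, n = 4 column-factor levels
X = np.array([[15.0, 18.0, 17.0, 20.0],
              [12.0, 14.0, 13.0, 16.0],
              [19.0, 21.0, 20.0, 23.0]])
m, n = X.shape

row_means = X.mean(axis=1, keepdims=True)   # X_i.
col_means = X.mean(axis=0, keepdims=True)   # X_.j
grand = X.mean()                            # X_..

SS_r = n * ((row_means - grand) ** 2).sum()
SS_c = m * ((col_means - grand) ** 2).sum()
SS_e = ((X - row_means - col_means + grand) ** 2).sum()

df_e = (n - 1) * (m - 1)
F_row = (SS_r / (m - 1)) / (SS_e / df_e)
F_col = (SS_c / (n - 1)) / (SS_e / df_e)

print('row effect:    F =', F_row, ' p =', f.sf(F_row, m - 1, df_e))
print('column effect: F =', F_col, ' p =', f.sf(F_col, n - 1, df_e))
```

A useful sanity check: for this additive decomposition, $\sum_i \sum_j (X_{ij}-X_{..})^2 = SS_r + SS_c + SS_e$.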
" ], "metadata": { "id": "jZfYFeZSQGIh" } }, { "cell_type": "code", "source": [ "A = [9,6,8,4,6]\n", "B = [10,4,4,6,5]\n", "C = [1,2,2,3,1]\n", "\n", "data = pd.DataFrame()\n", "data['A'] = A\n", "data['B'] = B\n", "data['C'] = C\n", "\n", "# A and B enter the formula as numeric predictors here; a classical two-way\n", "# ANOVA would instead encode the row and column factors as categorical variables\n", "model = ols('C ~ A + B + A:B', data=data).fit()\n", "\n", "# note: anova_lm's keyword is 'typ' (not 'type'); typ=1 gives sequential sums of squares\n", "aov_table = anova_lm(model, typ=1)\n", "print(aov_table.round(4))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "17Z_Tu2jidrS", "outputId": "b10d2ecb-2c94-4b1a-8ee7-21f62bf1d4fb" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " df sum_sq mean_sq F PR(>F)\n", "A 1.0 1.2737 1.2737 1.1071 0.4838\n", "B 1.0 0.0253 0.0253 0.0220 0.9062\n", "A:B 1.0 0.3506 0.3506 0.3047 0.6789\n", "Residual 1.0 1.1504 1.1504 NaN NaN\n" ] } ] }, { "cell_type": "markdown", "source": [ "The p-values for A and B are greater than 0.05, which implies that neither A nor B has a statistically significant effect on C at that level." ], "metadata": { "id": "lq5KxFRFARch" } } ] }
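The `ols` call above treats A and B as numeric predictors. A classical two-way ANOVA instead arranges the observations in long format and marks the row and column factors as categorical with `C()`; below is a sketch with hypothetical data (the factor labels and values are invented for illustration).

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# hypothetical long-format data: one observation per (row, col) cell
data = pd.DataFrame({
    'row':   ['r1'] * 4 + ['r2'] * 4 + ['r3'] * 4,
    'col':   ['c1', 'c2', 'c3', 'c4'] * 3,
    'value': [15, 18, 17, 20, 12, 14, 13, 16, 19, 21, 20, 23],
})

# C() marks a variable as categorical; no interaction term is included because
# with one observation per cell it would leave no residual degrees of freedom
model = ols('value ~ C(row) + C(col)', data=data).fit()
table = anova_lm(model, typ=2)
print(table.round(4))
```

The resulting table has $m-1 = 2$ degrees of freedom for the row factor, $n-1 = 3$ for the column factor, and $(n-1)(m-1) = 6$ for the residual, matching the two-way summary table above.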