{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spurious Correlation\n", "\n", "**Coauthored by Samuel (Siyang) Li, Thomas Sargent, and Natasha Watkins**\n", "\n", "This notebook illustrates the phenomenon of **spurious correlation** between two uncorrelated but individually highly serially correlated time series\n", "\n", "The phenomenon surfaces when two conditions occur\n", "\n", "* the sample size is small\n", "\n", "* both series are highly serially correlated\n", "\n", "We'll proceed by\n", "\n", "- constructing many simulations of two uncorrelated but individually serially correlated time series\n", "\n", "- for each simulation, computing the sample correlation coefficient between the two series\n", "\n", "- forming a histogram of these correlation coefficients\n", "\n", "- taking that histogram as a good approximation of the sampling distribution of the sample correlation coefficient\n", "\n", "In more detail, we construct two time series governed by\n", "\n", "$$\n", "\\eqalign{ y_{t+1} & = \\rho y_t + \\sigma \\epsilon_{t+1} \\cr\n", " x_{t+1} & = \\rho x_t + \\sigma \\eta_{t+1}, \\quad t=0, \\ldots , T }\n", "$$\n", "\n", "where\n", "\n", "* $y_0 = 0, x_0 = 0$\n", "\n", "* $\\{\\epsilon_{t+1}\\}$ is an i.i.d. process where $\\epsilon_{t+1}$ follows a normal distribution with mean zero and variance $1$\n", "\n", "* $\\{\\eta_{t+1}\\}$ is an i.i.d. 
process where $\\eta_{t+1}$ follows a normal distribution with mean zero and variance $1$\n", "\n", "We construct the sample correlation coefficient between the time series $y_t$ and $x_t$ of length $T$\n", "\n", "The population value of the correlation coefficient is zero\n", "\n", "We want to study the distribution of the sample correlation coefficient as a function of $\\rho$ and $T$ when\n", "$\\sigma > 0$\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll begin by importing some useful modules" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as stats\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Empirical distribution of correlation coefficient r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now set up a function to generate a panel of simulations of two independent AR(1) time series with identical parameters\n", "\n", "We set the function up so that all arguments are keyword arguments with associated default values\n", "\n", "- location is the common mathematical expectation of the innovations in the two independent autoregressions\n", "\n", "- sigma is the common standard deviation of the independent innovations in the two autoregressions\n", "\n", "- rho is the common autoregression coefficient of the two AR(1) processes\n", "\n", "- sample_size_series is the length of each of the two time series\n", "\n", "- simulation is the number of simulations used to generate an empirical distribution of the correlation of the two uncorrelated time series" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def spurious_reg(rho=0, sigma=10, location=0, sample_size_series=300, simulation=5000):\n", " \"\"\"\n", " Generate two independent AR(1) time series with parameters: rho, sigma, location, \n", " 
sample_size_series (number of observations in each series), simulation. \n", " Output: displays the distribution of the empirical correlation\n", " \"\"\"\n", "\n", " def generate_time_series():\n", "     # Generate one AR(1) series of length sample_size_series\n", "\n", "     x = []  # List holding the time series\n", "     # Initial condition: a scalar draw from the stationary distribution (requires |rho| < 1)\n", "     x.append(np.random.normal(location/(1 - rho), sigma/np.sqrt(1 - rho**2)))\n", "     x_temp = x[0]\n", "     epsilon = np.random.normal(location, sigma, sample_size_series)  # Innovations\n", "     for t in range(sample_size_series - 1):\n", "         x_temp = x_temp * rho + epsilon[t]  # Next step in the time series\n", "         x.append(x_temp)\n", "     return x\n", "\n", " r_list = []  # List to store correlation coefficients\n", "\n", " for _ in range(simulation):  # Avoid shadowing the built-in round\n", "     y = generate_time_series()\n", "     x = generate_time_series()\n", "     r = stats.pearsonr(y, x)[0]  # Sample correlation coefficient\n", "     r_list.append(r)\n", "\n", " fig, ax = plt.subplots()\n", " # histplot replaces the deprecated distplot (requires seaborn >= 0.11)\n", " sns.histplot(r_list, kde=True, stat=\"density\", ax=ax)  # Plot distribution of r\n", " ax.set_xlim(-1, 1)\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison of two values of $\\rho$\n", "\n", "In the next two cells, we'll compare outcomes with a low $\\rho$ versus a high $\\rho$\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "