{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# First time here?\n", "Please see our [demo](ReadMeFirst.ipynb) on how to use the notebooks.\n", "\n", "___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Correlation\n", "\n", "Correlation is typically used to describe a linear relationship between two variables.\n", "However, in a more broad sense, it can be used to describe any kind of dependence or\n", "association between two variables.\n", "\n", "Probably the most common measure of linear correlation is the Pearson's correlation coefficient, which is defined by $\\rho_{X,Y} = \\frac{\\operatorname{cov}(X,Y)}{\\sigma_X \\sigma_Y}$. Let's see how the correlation changes for some randomly generated data based on the number of samples used, the spread in the data, and the angle...\n", "\n", "First, we need to import a few packages..." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import math\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats\n", "\n", "from ipywidgets import interact, interactive, fixed, interact_manual\n", "import ipywidgets as widgets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll define some functions that will create a randomly sampled line along the x-axis with a given y-axis width (again, randomly displaced)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def makeLine(nsamples, width):\n", " line = np.random.random(size=(nsamples, 2))\n", " scale = np.array([[10.0,0],[0,width]])\n", " shift = np.dot(np.dot(np.ones((nsamples,2)), np.identity(2)), np.diag([-5, -width/2.0]))\n", " linecoords = np.dot(line, scale) + shift\n", " return(linecoords)\n", " \n", "def rotateLine(line, angle):\n", " theta = angle / 180.0 * np.pi\n", " R = np.array([[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]])\n", " rotated_line = np.transpose(np.dot(R, np.transpose(line)))\n", " return(rotated_line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we create some functions (and global data) to let us manipulate the line" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "line = []\n", "rotated_line = []\n", "\n", "def plotLine():\n", " global rotated_line\n", " plt.plot(rotated_line[:,0], rotated_line[:,1], 'r.')\n", " plt.xlim([-8,8])\n", " plt.ylim([-8,8])\n", " (r,p) = scipy.stats.pearsonr(rotated_line[:,0], rotated_line[:,1])\n", " text = 'r = %.4f' % r\n", " plt.figtext(0.45, 0.0, text)\n", " plt.show()\n", "\n", "def adjustLine(nsamples, width):\n", " global line\n", " global rotated_line\n", " line = makeLine(nsamples, width)\n", " rotated_line = line\n", " \n", "def adjustAngle(angle):\n", " global line\n", " global rotated_line\n", " rotated_line = rotateLine(line, angle)\n", " plotLine()\n", " \n", "def adjustAll(nsamples, width, angle):\n", " adjustLine(nsamples, width)\n", " adjustAngle(angle)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we create an interactive plot showing the random data along with the Pearson correlation coefficient. Adjust the slides to change the number of samples drawn from the distribution, the width of the line, and the angle. Look at the r-value shown below the graph." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c75dcf02de7a42369af906fd8991d652", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=10, description='nsamples', max=500, min=5), FloatSlider(value=2.0, desc…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "interact(adjustAll, nsamples=widgets.IntSlider(min=5, max=500, step=1, value=10),\n", " width=widgets.FloatSlider(min=0.0, max=10.0, step=0.1,value=2.0),\n", " angle=widgets.FloatSlider(min=0.0, max=359.0, step=1.0, value=45.0));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Circular Distributions\n", "\n", "Now, let's see what happens with data drawn from a different distribution -- a circle. This data reveals the limits of using a Pearson coefficient: clearly there's a relationship between the x-value and the resulting y-value, but the correlation coefficients *should be 0*. What kinds of correlation values do you measure when you have a small amount of data? Play with the sliders, and watch the r value below the graph.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def circleDistribution(nsamples, radius, width):\n", " t = np.random.uniform(0.0, 2.0 * math.pi, size=nsamples)\n", " x = np.cos(t)*radius + (np.random.uniform(size=nsamples)*2-1)*width\n", " y = np.sin(t)*radius + (np.random.uniform(size=nsamples)*2-1)*width\n", " return(x,y)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def plotCircle(npts=1000, radius=10, width=1):\n", " (x,y) = circleDistribution(npts, radius, width)\n", " plt.plot(x,y, 'r.')\n", " (r,p) = scipy.stats.pearsonr(x,y)\n", " plt.figtext(0.45,0.0, 'r=%.3f' % r)\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's creat the plot. We're going to start with a small sample size and it's possible that you will see a correlation coefficient that suggests some degree of a linear relationship. Remember, each time you adjust a slider, a new distribution is created, so you can wiggle them around and get different r-values for comparable slider settings.\n", "\n", "Next, adjust the number of points upwards. You'll find that even though there's a clear relationship between the two variables, the r-value will be almost 0." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d85c565abea545b7a1335917670547a7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=10, description='npts', max=500, min=5, step=10), FloatSlider(value=5.0,…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "interact(plotCircle,\n", " npts=widgets.IntSlider(min=5, max=500, step=10, value=10),\n", " radius=widgets.FloatSlider(min=5.0, max=50., step=1.0, value=5.0),\n", " width=widgets.FloatSlider(min=0.0, max=10.0, step=0.1, value=1.0));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parabolic Distributions\n", "\n", "Now, let's look at data drawn from a parabolic distribution. Here, the data will be symmetric in X about 0, so again the Pearson r-value *should be 0*. What do you actually get? Could you conclude there's a linear relationship? Play with the sliders and find out: every time you move the sliders, it will generate a new data set." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def parabolicDistribution(nsamples, length, width):\n", " x = np.random.uniform(size=nsamples) * 2.0 * length - length\n", " y = x**2 + (np.random.uniform(size=nsamples) * 2.0 - 1.0) * width\n", " return(x,y)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def plotParabolicDistribution(npts, length, width):\n", " (x,y) = parabolicDistribution(npts, length, width)\n", " (r,p) = scipy.stats.pearsonr(x,y)\n", " plt.plot(x,y,'r.')\n", " plt.figtext(0.45, 0.0, 'r=%.3f' % r)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2af746d9300e47958b7a3ccab404ed51", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=10, description='npts', max=500, min=5, step=10), FloatSlider(value=20.0…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "interact(plotParabolicDistribution,\n", " npts = widgets.IntSlider(min=5, max=500, step=10, value=10),\n", " length = widgets.FloatSlider(min=1.0, max=20.0, step=0.1, value=20.0),\n", " width = widgets.FloatSlider(min=0.0, max=100.0, step=0.1, value=0.0));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 2 }