{ "cells": [ { "cell_type": "markdown", "id": "fb9d7f6e", "metadata": {}, "source": [ "(correlation)=\n", "# Correlation\n", "\n", "\n", "```{admonition} Important Readings\n", ":class: seealso\n", "- {cite}`freedman2007statistics`, Chapters 8 and 9\n", "```\n", "\n", "## Scatter Plots\n", "\n", "The relationship between two quantitative variables can be explored in a scatter plot. \n", "You need data that is paired, like in the father and son data below. Each $x,y$ pair is plotted and you might notice a trend from the shape of the scatter of points.\n", "\n", "```{list-table} Father and Son Height Data\n", ":header-rows: 1\n", ":widths: auto\n", "\n", "* - Father's Height (inches)\n", " - Son's Height (inches)\n", "* - 65.0\n", " - 59.8\n", "* - 63.3\n", " - 63.2\n", "* - 65.0\n", " - 63.3\n", "```\n", "\n", "The relationship is called an **association**. When the scatter of points slopes upward, that shows a positive association. If the scatter slopes downward, that shows a negative association. A strong association indicates that one variable can be used to predict the other. \n", "\n", "```{figure} images/fathersonscatter.svg\n", ":width: 90%\n", ":name: fathersonscatter\n", "\n", "There is a positive association between a father's height and his son's height. The orange points are for the three pairs shown in the table above. \n", "```\n", "\n", "\n", "The variable on the $x$-axis is thought of as the **independent variable** and the $y$-axis variable is the **dependent variable**. This language implies that the $x$ variable influences $y$: either (1) $y$ is predicted from $x$ or (2) a direction of causality is being hypothesized. In this class, we have learned to be cautious in claiming causality. Go ahead and choose one variable as the independent variable and the other as the dependent without too much angst or hesitation. 
Just be ready to be humbled if you ultimately find that there is no causal relationship or if you get the direction backwards.\n", "\n", "\n", "\n", "For each of these three pairs, which variable would you choose as the independent variable? \n", "\n", "1. Hours of sleep and test performance on the following day\n", "2. Public assistance (like welfare) and poverty\n", "3. Attractiveness and income\n", "\n", "\n", "```{dropdown} Sleep and test performance\n", "\n", "Sleep should be the independent variable, especially given the temporal precedence.\n", "\n", "```\n", "\n", "```{dropdown} Public assistance and poverty \n", "\n", "Either choice is justified. First, let's consider public assistance as the independent variable. Choosing it does not commit you to a stance on whether the relationship is positive or negative. Statistician Udny Yule considered spending on public assistance as the independent variable and poverty as the dependent variable in his \"investigation into the causes of changes in pauperism in England,\" finding a positive relationship ({cite}`yule1899investigation`). This is the association you would expect if you believe that assistance promotes dependency and increases poverty. Public assistance might also break cycles of poverty, leading to a negative relationship. This can still be compatible with Yule's data if you recognize that it's important to consider the other direction. \n", "\n", "Societies that experience more poverty might spend more on public assistance. Given that spending responds to poverty levels, it can also be appropriate to consider poverty as the independent variable. \n", "\n", "```\n", "\n", "```{dropdown} Attractiveness and income\n", "\n", "\n", "Either choice is justified. {cite}`monk2021beholding` considers attractiveness as the independent variable and earnings as the dependent variable in examining the \"returns to physical attractiveness.\" However, as the figure below suggests, the effect can go both ways. 
\n", "\n", "```{figure} images/TomBradyIsHandsome.png\n", ":width: 61%\n", "\n", "```\n", "\n", "\n", "\n", "## Correlation Coefficient\n", "\n", "The correlation coefficient, denoted $r$, is a units-free measure of linear association, or clustering around a line.\n", "The coefficient falls between -1 and 1. It doesn't matter which of the two variables is thought of as the independent or dependent variable. The correlation coefficient shouldn't be confused with the slope of a trend line; it's more about the tightness of the scatter around some line. The correlation will be 1 whether $x$ and $y$ are perfectly linearly related with a slope of 0.2 or a slope of 200. A correlation coefficient of 1 or -1 means that one of the two variables will perfectly predict the other in the data. More moderate correlation coefficients indicate a less predictive relationship. \n", "\n", "```{figure} images/slopecorrcomparison.svg\n", ":width: 89%\n", ":name: slopecorrcomparison\n", "\n", "The trend line on the left has a greater slope but the correlation coefficient is lower because $y$ is less predictable given $x$. \n", "```\n", "\n", "### Interactives \n", "\n", "You can drag around the data points and delete individual points in the plot below to see how the correlation coefficient responds. Notice some limitations: \n", "\n", "1. A correlation coefficient of 0.8 doesn't indicate an association twice as strong as a correlation of 0.4 in any natural sense. \n", "2. A correlation coefficient of 0.8 doesn't mean 80% of the points fall on a trend line. For now, we won't say anything much deeper than that 0.8 represents a stronger linear association than 0.4. 
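
Several properties of $r$ described in this section can be checked numerically. The sketch below is only an illustration, assuming the common textbook definition of $r$ as the average of the products of the two variables in standard units (this section has not stated a formula). The data are made up, and the helper names `standard_units` and `correlation` are hypothetical.

```python
import math


def standard_units(values):
    # Convert values to standard units: (value - mean) / SD, using the population SD.
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def correlation(x, y):
    # Assumed definition: r is the average product of x and y in standard units.
    zx, zy = standard_units(x), standard_units(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical illustrative data, not the father and son data.
x = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
shallow = [0.2 * v for v in x]   # perfect line with slope 0.2
steep = [200 * v for v in x]     # perfect line with slope 200
bumps = [3.0, -3.0, 3.0, -3.0, 3.0, -3.0]
noisy = [2 * v + b for v, b in zip(x, bumps)]  # slope 2, with scatter around the line

r_shallow = correlation(x, shallow)
r_steep = correlation(x, steep)
r_noisy = correlation(x, noisy)
r_swapped = correlation(noisy, x)                 # swap the roles of x and y
r_cm = correlation([2.54 * v for v in x], noisy)  # change units: inches to centimeters

print(round(r_shallow, 6), round(r_steep, 6))
print(round(r_noisy, 3), round(r_swapped, 3), round(r_cm, 3))
```

Under the assumed definition, both perfect lines give a correlation of 1 despite their very different slopes, the scattered data gives a value strictly between 0 and 1 even though its slope is larger than 0.2, and neither swapping the variables nor changing the units moves $r$.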
\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "f99a9a93", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "application/javascript": [ "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " const force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", "const JS_MIME_TYPE = 'application/javascript';\n", " const HTML_MIME_TYPE = 'text/html';\n", " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " const CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " const script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " function drop(id) {\n", " const view = Bokeh.index.get_by_id(id)\n", " if (view != null) {\n", " view.model.document.clear()\n", " Bokeh.index.delete(view)\n", " }\n", " }\n", "\n", " const cell = handle.cell;\n", "\n", " const id = cell.output_area._bokeh_element_id;\n", " const server_id = cell.output_area._bokeh_server_id;\n", "\n", " // Clean up Bokeh references\n", " if (id != null) {\n", " drop(id)\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd_clean, {\n", " iopub: {\n", " output: function(msg) {\n", " const id = msg.content.text.trim()\n", " drop(id)\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " 
cell.notebook.kernel.execute(cmd_destroy);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " const output_area = handle.output_area;\n", " const output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " const bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " const script_attrs = bk_div.children[0].attributes;\n", " for (let i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " const toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, 
toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " const events = require('base/js/events');\n", " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " const NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"