{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic regression" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "Think Bayes, Second Edition\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# If we're running on Colab, install empiricaldist\n", "# https://pypi.org/project/empiricaldist/\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install empiricaldist" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Get utils.py\n", "\n", "import os\n", "\n", "if not os.path.exists('utils.py'):\n", " !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from utils import set_pyplot_params\n", "set_pyplot_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generational Changes\n", "\n", "As a second example of logistic regression, we'll use data from the [General Social Survey](https://gss.norc.org/) (GSS) to describe generational changes in support for legalization of marijuana.\n", "\n", "Since 1972 the GSS has surveyed a representative sample of adults in the U.S., asking about issues like \"national spending priorities, crime and punishment, intergroup relations, and confidence in institutions\".\n", "\n", "I have selected a subset of the GSS data, resampled it to correct for stratified sampling, and made the results available in an HDF file." 
] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "The following cell downloads the data." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Load the data file\n", "\n", "import os\n", "\n", "datafile = 'gss_eda.hdf5'\n", "if not os.path.exists(datafile):\n", " !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/data/gss_eda.hdf5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use Pandas to load the data." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(64814, 169)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "gss = pd.read_hdf(datafile, 'gss')\n", "gss.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `DataFrame` with one row for each respondent and one column for each variable.\n", "\n", "The primary variable we'll explore is `grass`, which encodes each respondent's answer to this question ([details here](https://gssdataexplorer.norc.org/variables/285/vshow)):\n", "\n", "> \"Do you think the use of marijuana should be made legal or not?\"\n", "\n", "This question was asked during most years of the survey starting in 1973, so it provides a useful view of changes in attitudes over almost 50 years.\n", "\n", "Here is the distribution of responses:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NaN 27268\n", "2.0 25662\n", "1.0 11884\n", "Name: grass, dtype: int64" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss['grass'].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value 1.0 represents \"yes\"; 2.0 represents \"no\"; `NaN` represents people who were not asked the question and a small number of respondents who did not respond 
or said \"I don't know\".\n", "\n", "To explore generational changes in the responses, we will look at the level of support for legalization as a function of birth year, which is encoded in a variable called `cohort`. Here's a summary of this variable." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 64586.000000\n", "mean 1948.846069\n", "std 21.262659\n", "min 1883.000000\n", "25% 1934.000000\n", "50% 1951.000000\n", "75% 1964.000000\n", "max 2000.000000\n", "Name: cohort, dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss['cohort'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The oldest GSS respondent was born in 1883; the youngest was born in 2000.\n", "\n", "Before we analyze this data, I will select the subset of respondents with valid data for `grass` and `cohort`:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(37427, 169)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid = gss.dropna(subset=['grass', 'cohort']).copy()\n", "valid.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are about 37,000 respondents with the data we need.\n", "\n", "I'll recode the values of `grass` so `1` means yes and `0` means no." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 25572\n", "1.0 11855\n", "Name: y, dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid['y'] = valid['grass'].replace(2, 0)\n", "valid['y'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for this problem, I'm going to represent the data in a different format. 
Rather than one row for each respondent, I am going to group the respondents by birth year and record the number of respondents in each group, `count`, and the number who support legalization, `sum`." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>sum</th><th>count</th></tr>\n",
"<tr><th>cohort</th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1884.0</th><td>0.0</td><td>1</td></tr>\n",
"<tr><th>1886.0</th><td>0.0</td><td>3</td></tr>\n",
"<tr><th>1887.0</th><td>1.0</td><td>9</td></tr>\n",
"<tr><th>1888.0</th><td>0.0</td><td>3</td></tr>\n",
"<tr><th>1889.0</th><td>1.0</td><td>14</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td></tr>\n",
"<tr><th>1996.0</th><td>40.0</td><td>47</td></tr>\n",
"<tr><th>1997.0</th><td>28.0</td><td>41</td></tr>\n",
"<tr><th>1998.0</th><td>11.0</td><td>17</td></tr>\n",
"<tr><th>1999.0</th><td>13.0</td><td>17</td></tr>\n",
"<tr><th>2000.0</th><td>9.0</td><td>13</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>116 rows × 2 columns</p>
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>sum</th><th>count</th><th>x</th></tr>\n",
"<tr><th>cohort</th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1884.0</th><td>0.0</td><td>1</td><td>-64.724243</td></tr>\n",
"<tr><th>1886.0</th><td>0.0</td><td>3</td><td>-62.724243</td></tr>\n",
"<tr><th>1887.0</th><td>1.0</td><td>9</td><td>-61.724243</td></tr>\n",
"<tr><th>1888.0</th><td>0.0</td><td>3</td><td>-60.724243</td></tr>\n",
"<tr><th>1889.0</th><td>1.0</td><td>14</td><td>-59.724243</td></tr>\n",
"</tbody>\n",
"</table>\n", "
<table>\n",
"<thead>\n",
"<tr><th>Intercept</th><th>-0.950</th><th>-0.946</th><th>-0.942</th><th>-0.938</th><th>-0.934</th><th>-0.930</th><th>-0.926</th><th>-0.922</th><th>-0.918</th><th>-0.914</th><th>...</th><th>-0.786</th><th>-0.782</th><th>-0.778</th><th>-0.774</th><th>-0.770</th><th>-0.766</th><th>-0.762</th><th>-0.758</th><th>-0.754</th><th>-0.750</th></tr>\n",
"<tr><th>Slope</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0.0250</th><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>...</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td></tr>\n",
"<tr><th>0.0252</th><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>...</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td></tr>\n",
"<tr><th>0.0254</th><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>...</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td></tr>\n",
"<tr><th>0.0256</th><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>...</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td></tr>\n",
"<tr><th>0.0258</th><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>...</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td><td>0.000384</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 51 columns</p>
\n", "