{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Political Alignment and Polarization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "This is the second in a series of notebooks that make up a [case study in exploratory data analysis](https://allendowney.github.io/PoliticalAlignmentCaseStudy/).\n",
    "This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This chapter and the next make up a case study that uses data from the General Social Survey (GSS) to explore political beliefs and political alignment (conservative, moderate, or liberal) in the United States.\n",
    "\n",
    "In this chapter, we:\n",
    "\n",
    "1. Compare the distributions of political alignment from 1974 and 2021.\n",
    "\n",
    "2. Plot the mean and standard deviation of responses over time as a way of quantifying shifts in political alignment and polarization.\n",
    "\n",
    "3. Use local regression to plot a smooth line through noisy data.\n",
    "\n",
    "4. Use cross tabulation to compute the fraction of respondents in each category over time.\n",
    "\n",
    "5. Plot the results using a custom color palette.\n",
    "\n",
    "As an exercise, you will look at changes in political party affiliation over the same period.\n",
    "\n",
    "In the next chapter, we'll use the same dataset to explore the relationship between political alignment and other attitudes and beliefs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "The following cell installs the `empiricaldist` library if necessary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import empiricaldist\n",
    "except ImportError:\n",
    "    !pip install empiricaldist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "If everything we need is installed, the following cell should run without error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from empiricaldist import Pmf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "## Loading the data\n",
    "\n",
    "In the previous notebook, we downloaded GSS data, loaded and cleaned it, resampled it to correct for stratified sampling, and then saved the data in an HDF file, which is much faster to load.  In this and the following notebooks, we'll download the HDF file and load it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "The following cell downloads the file if necessary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "from os.path import basename, exists\n",
    "\n",
    "\n",
    "def download(url):\n",
    "    filename = basename(url)\n",
    "    if not exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print(\"Downloaded \" + local)\n",
    "\n",
    "\n",
    "download(\n",
    "    \"https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_pacs_resampled.hdf\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the repository for this book, you'll find an HDF file that contains the GSS data, which I have cleaned and resampled to correct for stratified sampling.\n",
    "The file contains three resamplings; we'll use the first, `gss0`, to get started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "datafile = \"gss_pacs_resampled.hdf\"\n",
    "gss = pd.read_hdf(datafile, \"gss0\")\n",
    "gss.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Political Alignment\n",
    "\n",
    "The people surveyed as part of the GSS were asked about their \"political alignment\", which is where they place themselves on a spectrum from liberal to conservative.\n",
    "\n",
    "The variable `polviews` contains responses to the following question (see <https://gssdataexplorer.norc.org/variables/178/vshow>):\n",
    "\n",
    "> We hear a lot of talk these days about liberals and conservatives. \n",
    "I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the valid responses:\n",
    "\n",
    "```\n",
    "1\tExtremely liberal\n",
    "2\tLiberal\n",
    "3\tSlightly liberal\n",
    "4\tModerate\n",
    "5\tSlightly conservative\n",
    "6\tConservative\n",
    "7\tExtremely conservative\n",
    "```\n",
    "\n",
    "To see how the responses have changed over time, we'll inspect them at the beginning and end of the observation period.\n",
    "First we'll select the column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "polviews = gss[\"polviews\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can compute a Boolean Series that's `True` for responses from 1974."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "year74 = gss[\"year\"] == 1974"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can select the responses from 1974."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "polviews74 = polviews[year74]"
   ]
  },
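  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If Boolean indexing is new to you, here's a small sketch of the same two steps with made-up data.\n",
    "The names `toy_rows`, `toy_mask`, and `toy_selected` are hypothetical, and the values are invented for illustration; they are not GSS responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up data with the same column names as the GSS (not real responses)\n",
    "toy_rows = pd.DataFrame({\"year\": [1974, 1974, 2021], \"polviews\": [4.0, 2.0, 6.0]})\n",
    "\n",
    "# Comparing a column to a value yields a Boolean Series, one element per row\n",
    "toy_mask = toy_rows[\"year\"] == 1974\n",
    "\n",
    "# Indexing with the Boolean Series keeps only the rows where it is True\n",
    "toy_selected = toy_rows[\"polviews\"][toy_mask]\n",
    "toy_selected"
   ]
  },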
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the following function to count the number of times each response occurs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def values(series):\n",
    "    \"\"\"Count the values and sort.\n",
    "\n",
    "    series: pd.Series\n",
    "\n",
    "    returns: series mapping from values to frequencies\n",
    "    \"\"\"\n",
    "    return series.value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the responses from 1974."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "values(polviews74)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here are the responses from 2021."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "year21 = gss[\"year\"] == 2021\n",
    "polviews21 = polviews[year21]\n",
    "values(polviews21)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at a table of counts, we can get a sense of what the distribution looks like, but in the next section we'll get a better sense by plotting it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing Distributions\n",
    "\n",
    "To visualize these distributions, we'll use a Probability Mass Function (PMF), which is similar to a histogram, but there are two differences:\n",
    "\n",
    "* In a histogram, values are often put in bins, with more than one value in each bin. In a PMF each value gets its own bin.\n",
    "\n",
    "* A histogram computes a count, that is, how many times each value appears; a PMF computes a probability, that is, what fraction of the time each value appears. \n",
    "\n",
    "We'll use the `Pmf` class from `empiricaldist` to compute a PMF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from empiricaldist import Pmf\n",
    "\n",
    "pmf74 = Pmf.from_seq(polviews74)\n",
    "pmf74"
   ]
  },
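  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what a PMF contains, here's a minimal sketch that builds one by hand from made-up responses, using only Pandas.\n",
    "The names `toy_responses`, `toy_counts`, and `toy_pmf` are hypothetical and the data is invented; I'm assuming this is roughly the computation behind `Pmf.from_seq`, not showing its actual implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up responses on the 7-point scale (not GSS data)\n",
    "toy_responses = pd.Series([4, 4, 2, 6, 4, 5, 3, 4, 6, 1])\n",
    "\n",
    "# Counts, as in a histogram with one bin per value\n",
    "toy_counts = toy_responses.value_counts().sort_index()\n",
    "\n",
    "# Fractions, as in a PMF: each count divided by the total\n",
    "toy_pmf = toy_counts / toy_counts.sum()\n",
    "toy_pmf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fractions add up to 1, which is what makes it possible to compare years with different numbers of respondents."
   ]
  },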
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "The following cell defines a function I use to decorate the axes in plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "def decorate(**options):\n",
    "    \"\"\"Decorate the current axes.\n",
    "\n",
    "    Call decorate with keyword arguments like\n",
    "    decorate(title='Title',\n",
    "             xlabel='x',\n",
    "             ylabel='y')\n",
    "\n",
    "    The keyword arguments can be any of the axis properties\n",
    "    https://matplotlib.org/api/axes_api.html\n",
    "    \"\"\"\n",
    "    ax = plt.gca()\n",
    "    ax.set(**options)\n",
    "    handles, labels = ax.get_legend_handles_labels()\n",
    "    if handles:\n",
    "        ax.legend(handles, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the distribution from 1974:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "pmf74.bar(label=\"1974\", color=\"C0\", alpha=0.7)\n",
    "\n",
    "decorate(\n",
    "    xlabel=\"Political view on a 7-point scale\",\n",
    "    ylabel=\"Fraction of respondents\",\n",
    "    title=\"Distribution of political views\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And from 2021:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "pmf21 = Pmf.from_seq(polviews21)\n",
    "pmf21.bar(label=\"2021\", color=\"C1\", alpha=0.7)\n",
    "\n",
    "decorate(\n",
    "    xlabel=\"Political view on a 7-point scale\",\n",
    "    ylabel=\"Fraction of respondents\",\n",
    "    title=\"Distribution of political views\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In both cases, the most common response is `4`, which is the code for \"moderate\".  Few respondents describe themselves as \"extremely\" liberal or conservative.\n",
    "So maybe we're not so polarized after all."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "To make it easier to compare the distributions, I'll plot them side by side.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "d = dict(pmf74=pmf74, pmf21=pmf21)\n",
    "\n",
    "df = pd.DataFrame(d)\n",
    "df.plot(kind=\"bar\")\n",
    "\n",
    "decorate(\n",
    "    xlabel=\"Political view on a 7-point scale\",\n",
    "    ylabel=\"Fraction of respondents\",\n",
    "    title=\"Distribution of political views\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "Now we can see the changes in the distribution more clearly.  It looks like the fraction of people at the extremes (1 and 7) has increased, and the fraction who are slightly liberal (3) or slightly conservative (5) has decreased."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** To summarize these changes, we can compare the mean and standard deviation of `polviews` in 1974 and 2021.\n",
    "\n",
    "The mean of the responses measures the balance of people in the population with liberal or conservative leanings.  If the mean increases over time, that might indicate a shift in the population toward conservatism.\n",
    "\n",
    "The standard deviation measures the dispersion of views in the population; if it increases over time, that might indicate an increase in polarization.\n",
    "\n",
    "Compute the mean and standard deviation of `polviews74` and `polviews21`.\n",
    "\n",
    "What do they indicate about changes over this interval?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting a Time Series\n",
    "\n",
    "At this point we have looked at the endpoints, 1974 and 2021, but we don't know what happened in between.\n",
    "To see how the distribution changed over time, we can group by year and compute the mean of `polviews` during each year.\n",
    "We can use `groupby` to group the respondents by year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss_by_year = gss.groupby(\"year\")\n",
    "gss_by_year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a `DataFrameGroupBy` object that represents a collection of groups."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "We can loop through the groups and display the number of respondents in each:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "for year, group in gss_by_year:\n",
    "    print(year, len(group))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many ways the `DataFrameGroupBy` behaves like a `DataFrame`.  We can use the bracket operator to select a column: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "polviews_by_year = gss_by_year[\"polviews\"]\n",
    "polviews_by_year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A column from a `DataFrameGroupBy` is a `SeriesGroupBy`.  If we invoke `mean` on it, the result is a `Series` that contains the mean of `polviews` for each year of the survey."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_series = polviews_by_year.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here's what it looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_series.plot(color=\"C2\", label=\"polviews\")\n",
    "decorate(xlabel=\"Year\", ylabel=\"Mean (7 point scale)\", title=\"Mean of polviews\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like the mean increased between 1974 and 2000, has decreased since then, and has ended up almost where it started.\n",
    "The difference between the highest and lowest points is only 0.3 points on a 7-point scale, which is a modest effect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_series.max() - mean_series.min()"
   ]
  },
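  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the `groupby` steps on a small scale, here's a sketch with made-up data.\n",
    "The names `toy_gss` and `toy_means` are hypothetical, and the values are invented for illustration, including one missing response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up responses (not GSS data); NaN stands for a missing response\n",
    "toy_gss = pd.DataFrame(\n",
    "    {\n",
    "        \"year\": [1974, 1974, 2000, 2000, 2021, 2021],\n",
    "        \"polviews\": [3.0, 5.0, 4.0, 6.0, 4.0, np.nan],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Group rows by year, select one column, and compute the mean of each group\n",
    "toy_means = toy_gss.groupby(\"year\")[\"polviews\"].mean()\n",
    "toy_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is indexed by year, and the missing value in 2021 is left out of the mean rather than treated as zero."
   ]
  },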
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** The standard deviation quantifies the spread of the distribution, which is one way to measure polarization.\n",
    "Plot the standard deviation of `polviews` for each year of the survey from 1972 to 2021.\n",
    "Does it show evidence of increasing polarization?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Smoothing the Curve\n",
    "\n",
    "In the previous section we plotted the mean and standard deviation of `polviews` over time.\n",
    "In both plots, the values are highly variable from year to year.\n",
    "We can use **local regression** to compute a smooth line through these data points.  \n",
    "\n",
    "The following function takes a Pandas Series and uses an algorithm called LOWESS to compute a smooth line.\n",
    "LOWESS stands for \"locally weighted scatterplot smoothing\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.nonparametric.smoothers_lowess import lowess\n",
    "\n",
    "\n",
    "def make_lowess(series):\n",
    "    \"\"\"Use LOWESS to compute a smooth line.\n",
    "\n",
    "    series: pd.Series\n",
    "\n",
    "    returns: pd.Series\n",
    "    \"\"\"\n",
    "    y = series.values\n",
    "    x = series.index.values\n",
    "\n",
    "    smooth = lowess(y, x)\n",
    "    index, data = np.transpose(smooth)\n",
    "\n",
    "    return pd.Series(data, index=index)"
   ]
  },
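  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give a sense of what local regression does, here's a simplified sketch that estimates the smooth value at one point by fitting a weighted line to its nearest neighbors.\n",
    "The function `local_linear` and its `frac` parameter are names I made up for this illustration, and the data is invented.\n",
    "This is not the StatsModels implementation; as far as I know, the real `lowess` function also runs additional iterations to reduce the influence of outliers, which this sketch omits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def local_linear(x, y, x0, frac=2 / 3):\n",
    "    \"\"\"Estimate y at x0 with a weighted linear fit (hypothetical helper).\n",
    "\n",
    "    x, y: NumPy arrays of the same length\n",
    "    x0: location where we want the smoothed value\n",
    "    frac: fraction of the data used for the local fit\n",
    "    \"\"\"\n",
    "    k = int(np.ceil(frac * len(x)))\n",
    "    dist = np.abs(x - x0)\n",
    "    nearest = np.argsort(dist)[:k]\n",
    "    h = dist[nearest].max()\n",
    "\n",
    "    # Tricube weights: close to 1 near x0, falling to 0 at distance h\n",
    "    weights = (1 - (dist[nearest] / h) ** 3) ** 3\n",
    "\n",
    "    # polyfit applies its weights to the residuals before squaring,\n",
    "    # so pass the square root to get a weighted least squares fit\n",
    "    slope, intercept = np.polyfit(x[nearest], y[nearest], 1, w=np.sqrt(weights))\n",
    "    return slope * x0 + intercept\n",
    "\n",
    "\n",
    "# With data that falls exactly on a line, the smooth values match the data\n",
    "toy_x = np.arange(20.0)\n",
    "toy_y = 2 * toy_x + 1\n",
    "toy_smooth = np.array([local_linear(toy_x, toy_y, x0) for x0 in toy_x])\n",
    "np.allclose(toy_smooth, toy_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With noisy data, a smaller value of `frac` uses fewer neighbors for each fit, so the line follows the data more closely; a larger value yields a smoother line."
   ]
  },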
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the following function to plot data points and the smoothed line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_series_lowess(series, color):\n",
    "    \"\"\"Plots a series of data points and a smooth line.\n",
    "\n",
    "    series: pd.Series\n",
    "    color: string or tuple\n",
    "    \"\"\"\n",
    "    series.plot(linewidth=0, marker=\"o\", color=color, alpha=0.5)\n",
    "    smooth = make_lowess(series)\n",
    "    smooth.plot(label=\"\", color=color)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows the mean of `polviews` and a smooth line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_series = gss_by_year[\"polviews\"].mean()\n",
    "plot_series_lowess(mean_series, \"C2\")\n",
    "decorate(ylabel=\"Mean (7 point scale)\", title=\"Mean of polviews\", xlabel=\"Year\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One reason the PMFs for 1974 and 2021 did not look very different is that the mean seems to have gone up (more conservative) and then down again (more liberal).\n",
    "Generally, it looks like the U.S. has been trending toward liberal for the last 20 years, or more, at least in the sense of how people describe themselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Use `plot_series_lowess` to plot the standard deviation of `polviews` with a smooth line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross Tabulation\n",
    "\n",
    "In the previous sections, we treated `polviews` as a numerical quantity, so we were able to compute means and standard deviations.\n",
    "But the responses are really categorical, which means that each value represents a discrete category, like \"liberal\" or \"conservative\".\n",
    "In this section, we'll treat `polviews` as a categorical variable.  Specifically, we'll compute the number of respondents in each category for each year, and plot changes over time.\n",
    "\n",
    "Pandas provides a function called `crosstab` that computes a **cross tabulation**, which is like a two-dimensional PMF.\n",
    "It takes two `Series` objects as arguments and returns a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "year = gss[\"year\"]\n",
    "column = gss[\"polviews\"]\n",
    "\n",
    "xtab = pd.crosstab(year, column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the first few lines from the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "xtab.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It contains one row for each value of `year` and one column for each value of `polviews`.  Reading the first row, we see that in 1974, 31 people gave response 1, \"extremely liberal\", 201 people gave response 2, \"liberal\", and so on.\n",
    "\n",
    "The number of respondents varies from year to year, so we need to normalize the results, which means computing for each year the *fraction* of respondents in each category, rather than the count.\n",
    "\n",
    "`crosstab` takes an optional argument that normalizes each row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "xtab_norm = pd.crosstab(year, column, normalize=\"index\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's what that looks like for the 7-point scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "xtab_norm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the numbers in the table, it's hard to see what's going on.\n",
    "In the next section, we'll plot the results."
   ]
  },
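  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check what `normalize=\"index\"` does, here's a small sketch with made-up data, where the counts are easy to verify by hand.\n",
    "The names `toy_survey`, `toy_xtab`, and `toy_xtab_norm` are hypothetical, and the responses are invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up responses (not GSS data): four from 1974 and two from 2021\n",
    "toy_survey = pd.DataFrame(\n",
    "    {\n",
    "        \"year\": [1974, 1974, 1974, 1974, 2021, 2021],\n",
    "        \"polviews\": [1, 4, 4, 7, 4, 7],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Counts: one row per year, one column per response\n",
    "toy_xtab = pd.crosstab(toy_survey[\"year\"], toy_survey[\"polviews\"])\n",
    "\n",
    "# Fractions: each row is divided by its total, so each row adds up to 1\n",
    "toy_xtab_norm = pd.crosstab(\n",
    "    toy_survey[\"year\"], toy_survey[\"polviews\"], normalize=\"index\"\n",
    ")\n",
    "toy_xtab_norm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the made-up data, two of the four responses from 1974 are `4`, so the normalized table contains 0.5 in that position."
   ]
  },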
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "To make the results easier to interpret, I'm going to replace the numeric codes 1-7 with strings.  First I'll make a dictionary that maps from numbers to strings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "# recode the 7 point scale with words\n",
    "d7 = {\n",
    "    1: \"Extremely liberal\",\n",
    "    2: \"Liberal\",\n",
    "    3: \"Slightly liberal\",\n",
    "    4: \"Moderate\",\n",
    "    5: \"Slightly conservative\",\n",
    "    6: \"Conservative\",\n",
    "    7: \"Extremely conservative\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "Then we can use the `replace` method like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "polviews7 = gss[\"polviews\"].replace(d7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "We can use `values` to confirm that the values in `polviews7` are strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "values(polviews7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "If we make the cross tabulation again, we can see that the column names are strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "xtab_norm = pd.crosstab(year, polviews7, normalize=\"index\")\n",
    "xtab_norm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "We are almost ready to plot the results, but first we need some colors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "## Color Palettes\n",
    "\n",
    "Seaborn provides a variety of color palettes,  [which you can read about here](https://seaborn.pydata.org/tutorial/color_palettes.html).\n",
    "\n",
    "To represent political views, I'll use a diverging palette from blue to red."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "palette = sns.color_palette(\"RdBu_r\", 7)\n",
    "sns.palplot(palette)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "The middle color of this palette is nearly white, so I'll replace it with purple.  Here's the modified diverging palette."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "palette[3] = \"purple\"\n",
    "sns.palplot(palette)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "A feature of this color map is that the colors are meaningful, at least in countries that use blue, purple, and red for these points on the political spectrum.\n",
    "A drawback of this color map is that some of the colors are indistinguishable to people who are [color blind](https://davidmathlogic.com/colorblind).\n",
    "\n",
    "Now I'll make a dictionary that maps from the responses to the corresponding colors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "columns = [\n",
    "    \"Extremely liberal\",\n",
    "    \"Liberal\",\n",
    "    \"Slightly liberal\",\n",
    "    \"Moderate\",\n",
    "    \"Slightly conservative\",\n",
    "    \"Conservative\",\n",
    "    \"Extremely conservative\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "outputs": [],
   "source": [
    "color_map = dict(zip(columns, palette))\n",
    "\n",
    "for key, value in color_map.items():\n",
    "    print(key, value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting a Cross Tabulation\n",
    "\n",
    "To see how the fraction of people with each political alignment has changed over time,\n",
    "we'll use `plot_series_lowess` to plot the columns from `xtab_norm`.\n",
    "\n",
    "Here are the 7 categories plotted as a function of time.\n",
    "The `bbox_to_anchor` argument passed to `plt.legend` puts the legend outside the axes of the figure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for name, column in xtab_norm.items():\n",
    "    plot_series_lowess(column, color_map[name])\n",
    "\n",
    "decorate(\n",
    "    xlabel=\"Year\",\n",
    "    ylabel=\"Proportion\",\n",
    "    title=\"Fraction of respondents with each political view\",\n",
    ")\n",
    "\n",
    "plt.legend(bbox_to_anchor=(1.02, 1.02))\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This way of looking at the results suggests that changes in political alignment during this period have generally been slow and small.\n",
    "The fraction of self-described moderates has not changed substantially.\n",
    "The fraction of conservatives increased, but seems to be decreasing now; the fraction of liberals seems to be increasing.\n",
    "\n",
    "The fraction of people at the extremes has increased, but it is hard to see clearly in this figure.\n",
    "We can get a better view by plotting just the extremes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "selected_columns = [\"Extremely liberal\", \"Extremely conservative\"]\n",
    "\n",
    "for name, column in xtab_norm.items():\n",
    "    if name not in selected_columns:\n",
    "        continue\n",
    "    plot_series_lowess(column, color_map[name])\n",
    "\n",
    "decorate(\n",
    "    xlabel=\"Year\",\n",
    "    ylabel=\"Proportion\",\n",
    "    ylim=[0, 0.057],\n",
    "    title=\"Fraction of respondents with extreme political views\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I used `ylim` to set the limits of the y-axis so it starts at zero, to avoid making the changes seem bigger than they are.\n",
    "\n",
    "This figure shows that the fraction of people who describe themselves as \"extreme\" has increased from about 2.5% to about 5%.\n",
    "In relative terms, that's a big increase.  But in absolute terms these tails of the distribution are still small."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Let's do a similar analysis with `partyid`, which encodes responses to the question:\n",
    "\n",
    ">Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?\n",
    "\n",
    "The valid responses are:\n",
    "\n",
    "```\n",
    "0\tStrong democrat\n",
    "1\tNot str democrat\n",
    "2\tInd,near dem\n",
    "3\tIndependent\n",
    "4\tInd,near rep\n",
    "5\tNot str republican\n",
    "6\tStrong republican\n",
    "7\tOther party\n",
    "```\n",
    "\n",
    "You can read the codebook for `partyid`  at <https://gssdataexplorer.norc.org/variables/141/vshow>.\n",
    "In the notebook for this chapter, there are some suggestions to get you started."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "Here are the steps I suggest:\n",
    "\n",
    "1) If you have not already saved this notebook, you should do that first.  If you are running on Colab, select \"Save a copy in Drive\" from the File menu.\n",
    "\n",
    "2) Now, before you modify this notebook, make *another* copy and give it an appropriate name.\n",
    "\n",
    "3) Search and replace `polviews` with `partyid` (use \"Edit->Find and replace\").\n",
    "\n",
    "4) Run the notebook from the beginning and see what other changes you have to make.\n",
    "\n",
    "You will have to make changes in `d7` and `columns`.  Otherwise you might get a message like \n",
    "\n",
    "`TypeError: '<' not supported between instances of 'float' and 'str'`\n",
    "\n",
    "Also, you might have to drop \"Other party\" or change the color palette.\n",
    "\n",
    "And you should change the titles of the figures.\n",
    "\n",
    "\n",
    "What changes in party affiliation do you see over the last 50 years?  Are things going in the directions you expected?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This chapter uses some tools we have seen before, like the `Pmf` object and the `groupby` method.\n",
    "And it introduces two new tools: local regression for computing a smooth curve through noisy data, and cross tabulation for counting the number of people, or fraction, in each group over time.\n",
    "\n",
    "Now that we have a sense of how political alignment has changed, in the next chapter we'll explore the relationship between political alignment and other beliefs and attitudes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "Political Alignment Case Study\n",
    "\n",
    "Copyright 2020 Allen B. Downey\n",
    "\n",
    "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}