{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How popular is the President?\n", "> \"Experimenting with a Gaussian Process to model presidential popularity across time\"\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- author: Alexandre Andorra\n", "- categories: [popularity, Macron, Gaussian processes, polls]\n", "- image: images/gp-popularity.png" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: This is a blog post detailing how the GP model of popularity is built. To access the interactive dashboard of the model's predictions, click [here](https://share.streamlit.io/alexandorra/pollsposition_website/main/gp-popularity-app.py).\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "I like working on time series, because they usually relate to something concrete. I've also long been intrigued by [Gaussian Processes](https://en.wikipedia.org/wiki/Gaussian_process) -- they have a mathematical beauty and versatility that I've always found appealing, if only because you can parametrize the model in ways that make it interpretable.\n", "\n", "But... they are hard to fit -- the cost of each likelihood and gradient evaluation scales with the cube of the number of data points. And in the Bayesian framework, we're trying to estimate the _whole_ distribution of outcomes, not only _one_ single point, which adds to the challenge.\n", "\n", "One thing I've learned so far in my open-source programming journey is not to be afraid of what you're afraid of -- to which you'll legitimately answer: \"wait, what??\". Let me rephrase: if a method scares you, the best way to understand it is to work on an example where you need it. This will dispel (part of) the magic behind it and help you cross a threshold in your understanding.\n", "\n", "So that's what I did with Gaussian Processes! 
That all came from a simple question: how does the popularity of French presidents evolve within a term and across terms? I often hear people frenetically commenting on the latest popularity poll (frequently the same people who will later complain that \"polls are always wrong\", but that's another story), and in these cases I'm always worried that we're reacting to noise -- maybe it's just natural that a president experiences a dip in popularity in the middle of his term?\n", "\n", "To answer this question, I compiled all the popularity opinion polls of French presidents since the presidential term was shortened to 5 years (in 2002). Let's see what the data look like, before diving into the Gaussian Process model.\n", "\n", "## Show me the data!\n", "\n", "Here are the packages we'll need:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc3 as pm\n", "import xarray as xr\n", "from scipy.special import expit as logistic" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# hide\n", "RANDOM_SEED = 8927\n", "np.random.seed(RANDOM_SEED)\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's load up the data in a dataframe called `d`. You'll notice that, in addition to the polling data, the dataframe also contains the quarterly unemployment rate in France (downloaded from the [French statistical office](https://www.insee.fr/fr/statistiques/serie/001688526#Telechargement)). 
As this variable is usually well correlated with politicians' and parties' popularity, we will use it as a predictor in our model.\n", "\n", "As I'll explain below, we're computing the popularity every month, but since unemployment data are released quarterly, we just forward-fill the unemployment values when they are missing -- which is, I insist, an assumption, and as such it should be tested and played with, to check its impact on the model's inferences (there are Binder and Google Colab links at the top of the page, so feel free to do so!). I could also use more intricate techniques to forecast unemployment, but that would be overkill for our purpose here." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# hide\n", "PARTIES = {\n", " \"chirac2\": \"right\",\n", " \"sarkozy\": \"right\",\n", " \"hollande\": \"left\",\n", " \"macron\": \"center\",\n", "}\n", "\n", "# collapse-show\n", "all_presidents = pd.read_csv(\n", " \"https://raw.githubusercontent.com/AlexAndorra/pollsposition_models/master/data/raw_popularity_presidents.csv\",\n", " header=0,\n", " index_col=0,\n", " parse_dates=True,\n", ")\n", "\n", "# restrict data to after the switch to 5-year term\n", "# (explicit copy, so that the assignment below does not trigger a SettingWithCopyWarning)\n", "d = all_presidents.loc[all_presidents.index >= pd.to_datetime(\"2002-05-05\")].copy()\n", "\n", "# convert to proportions\n", "d[[\"approve_pr\", \"disapprove_pr\"]] = d[[\"approve_pr\", \"disapprove_pr\"]] / 100\n", "d = d.rename(columns={\"approve_pr\": 
\"p_approve\", \"disapprove_pr\": \"p_disapprove\"})\n", "\n", "# raw monthly average to get fixed time intervals\n", "# TODO: replace by poll aggregation\n", "d = d.groupby(\"president\").resample(\"M\").mean().reset_index(level=0).sort_index()\n", "\n", "d[\"party\"] = d.president.replace(PARTIES)\n", "\n", "ELECTION_FLAGS = (\n", " (d.index.year == 2002) & (d.index.month == 5)\n", " | (d.index.year == 2007) & (d.index.month == 5)\n", " | (d.index.year == 2012) & (d.index.month == 5)\n", " | (d.index.year == 2017) & (d.index.month == 5)\n", ")\n", "d[\"election_flag\"] = 0\n", "d.loc[ELECTION_FLAGS, \"election_flag\"] = 1\n", "\n", "# convert to number of successes\n", "d[\"N_approve\"] = d.samplesize * d[\"p_approve\"]\n", "d[\"N_disapprove\"] = d.samplesize * d[\"p_disapprove\"]\n", "d[[\"N_approve\", \"N_disapprove\"]] = d[[\"N_approve\", \"N_disapprove\"]].round().astype(int)\n", "\n", "# compute total trials\n", "d[\"N_total\"] = d.N_approve + d.N_disapprove" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# hide\n", "unemp = pd.read_csv(\n", " \"https://raw.githubusercontent.com/AlexAndorra/pollsposition_models/master/data/predictors/chomage_national_trim.csv\",\n", " sep=\";\",\n", " skiprows=2,\n", ").iloc[:, [0, 1]]\n", "unemp.columns = [\"date\", \"unemployment\"]\n", "unemp = unemp.sort_values(\"date\")\n", "\n", "# index the unemployment series by quarterly periods:\n", "unemp.index = pd.period_range(start=unemp.date.iloc[0], periods=len(unemp), freq=\"Q\")\n", "unemp = unemp.drop(\"date\", axis=1)\n", "\n", "d = d.reset_index()\n", "\n", "# add quarters to main dataframe:\n", "d.index = pd.DatetimeIndex(d[\"index\"].values).to_period(\"Q\")\n", "d.index.name = \"quarter\"\n", "\n", "# merge with unemployment:\n", "d = d.join(unemp).reset_index().set_index(\"index\")\n", "d.index.name = \"month\"\n", "# forward-fill the months whose quarter has no unemployment value\n", "d[\"unemployment\"] = d.unemployment.ffill()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "
| month | level_0 | president | samplesize | p_approve | p_disapprove | party | election_flag | N_approve | N_disapprove | N_total | unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002-05-31 | 2002Q2 | chirac2 | 964.250000 | 0.502500 | 0.442500 | right | 1 | 485 | 427 | 912 | 7.5 |
| 2002-06-30 | 2002Q2 | chirac2 | 970.000000 | 0.505000 | 0.425000 | right | 0 | 490 | 412 | 902 | 7.5 |
| 2002-07-31 | 2002Q3 | chirac2 | 947.333333 | 0.533333 | 0.406667 | right | 0 | 505 | 385 | 890 | 7.5 |
| 2002-08-31 | 2002Q3 | chirac2 | 1028.000000 | 0.520000 | 0.416667 | right | 0 | 535 | 428 | 963 | 7.5 |
| 2002-09-30 | 2002Q3 | chirac2 | 1017.500000 | 0.525000 | 0.420000 | right | 0 | 534 | 427 | 961 | 7.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-12-31 | 2020Q4 | macron | 1156.500000 | 0.388333 | 0.590000 | center | 0 | 449 | 682 | 1131 | 7.7 |
| 2021-01-31 | 2021Q1 | macron | 1184.181818 | 0.380000 | 0.589091 | center | 0 | 450 | 698 | 1148 | 7.7 |
| 2021-02-28 | 2021Q1 | macron | 1128.625000 | 0.405000 | 0.576250 | center | 0 | 457 | 650 | 1107 | 7.7 |
| 2021-03-31 | 2021Q1 | macron | 1100.545455 | 0.375455 | 0.582727 | center | 0 | 413 | 641 | 1054 | 7.7 |
| 2021-04-30 | 2021Q2 | macron | 1001.666667 | 0.350000 | 0.596667 | center | 0 | 351 | 598 | 949 | 7.7 |

228 rows × 11 columns
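
The quarterly-to-monthly step described earlier (each month inherits the unemployment rate of its quarter, and months whose quarter has no value keep the last known one) can be sketched in plain Python, without pandas, to make the logic explicit. This is only an illustrative sketch: the rates are hypothetical toy values rather than the INSEE series, and `quarter_label` is a helper introduced here, not something from the notebook.

```python
from datetime import date

# Hypothetical quarterly rates (toy values, NOT the INSEE series);
# 2021Q1 is deliberately missing, as if it had not been released yet.
quarterly_unemployment = {"2020Q3": 9.0, "2020Q4": 8.0}


def quarter_label(day: date) -> str:
    """Return the quarter a date falls in, e.g. '2020Q4' (hypothetical helper)."""
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"


# Month-end dates in chronological order, like the index of a monthly resample.
months = [
    date(2020, 9, 30),
    date(2020, 10, 31),
    date(2020, 11, 30),
    date(2020, 12, 31),
    date(2021, 1, 31),
    date(2021, 2, 28),
]

monthly_unemployment = {}
last_known = None
for month in months:
    rate = quarterly_unemployment.get(quarter_label(month))
    if rate is None:
        # No value for this quarter: carry the last known value forward.
        rate = last_known
    else:
        last_known = rate
    monthly_unemployment[month] = rate

for month, rate in monthly_unemployment.items():
    print(month, quarter_label(month), rate)
```

Swapping the carry-forward branch for another rule (say, a linear extrapolation) is one simple way to test how sensitive the model's inferences are to this assumption.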
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| baseline | -0.54 | 0.11 | -0.75 | -0.33 | 0.00 | 0.00 | 5069.93 | 4597.28 | 1.0 |
| honeymoon | 0.37 | 0.14 | 0.12 | 0.63 | 0.00 | 0.00 | 12074.93 | 5840.39 | 1.0 |
| log_unemp_effect | -0.11 | 0.07 | -0.23 | 0.02 | 0.00 | 0.00 | 5899.44 | 5286.19 | 1.0 |
| amplitude_trend | 0.43 | 0.06 | 0.32 | 0.56 | 0.00 | 0.00 | 2604.24 | 3248.72 | 1.0 |
| ls_trend | 6.25 | 1.01 | 4.43 | 8.14 | 0.02 | 0.01 | 2835.64 | 4307.11 | 1.0 |
| theta_offset | 51.70 | 6.70 | 39.10 | 64.32 | 0.07 | 0.05 | 8635.55 | 5894.67 | 1.0 |
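
To get a feel for the size of `baseline` and `honeymoon`, here is a rough back-of-the-envelope conversion to approval rates. It rests on assumptions that should be checked against the model definition: that both parameters live on the logit scale (which the `expit` import at the top suggests, but this table alone does not establish), and that the time trend and unemployment terms are set to zero. It also plugs in posterior means only, so it ignores all the uncertainty in the table.

```python
import math


def logistic(x: float) -> float:
    """Inverse logit, the same function as scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-x))


# Posterior means copied from the summary table above.
baseline = -0.54
honeymoon = 0.37

# ASSUMPTION: parameters are on the logit scale and every other term is zero.
p_baseline = logistic(baseline)
p_honeymoon = logistic(baseline + honeymoon)

print(f"baseline approval:        {p_baseline:.3f}")
print(f"approval in election month: {p_honeymoon:.3f}")
```

Under those assumptions, the baseline corresponds to an approval rate of about 37%, and the honeymoon effect lifts it to about 46%.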
| | president | sondage | samplesize | method | p_approve | p_disapprove |
|---|---|---|---|---|---|---|
| 2002-05-15 | chirac2 | Ifop | 924 | phone | 0.51 | 0.44 |
| 2002-05-20 | chirac2 | Kantar | 972 | face to face | 0.50 | 0.48 |
| 2002-05-23 | chirac2 | BVA | 1054 | phone | 0.52 | 0.37 |
| 2002-05-26 | chirac2 | Ipsos | 907 | phone | 0.48 | 0.48 |
| 2002-06-16 | chirac2 | Ifop | 974 | phone | 0.49 | 0.43 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-03-29 | macron | Kantar | 1000 | internet | 0.36 | 0.58 |
| 2021-03-30 | macron | Yougov | 1068 | internet | 0.30 | 0.61 |
| 2021-04-07 | macron | Elabe | 1003 | internet | 0.33 | 0.63 |
| 2021-04-10 | macron | Ipsos | 1002 | internet | 0.37 | 0.58 |
| 2021-04-26 | macron | Kantar | 1000 | internet | 0.35 | 0.58 |

1083 rows × 6 columns
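
As a sanity check on the monthly aggregation, the four raw polls from May 2002 shown above can be averaged by hand and converted to counts. This is a plain-Python sketch of the same arithmetic as the pandas code (monthly mean, then `samplesize * p` rounded to an integer); the variable names are only illustrative.

```python
from statistics import mean

# The four May 2002 polls from the raw table above:
# (samplesize, p_approve, p_disapprove)
may_2002_polls = [
    (924, 0.51, 0.44),   # Ifop
    (972, 0.50, 0.48),   # Kantar
    (1054, 0.52, 0.37),  # BVA
    (907, 0.48, 0.48),   # Ipsos
]

# Raw monthly averages, as in the groupby/resample step.
samplesize = mean(poll[0] for poll in may_2002_polls)
p_approve = mean(poll[1] for poll in may_2002_polls)
p_disapprove = mean(poll[2] for poll in may_2002_polls)

# Convert proportions to (rounded) numbers of respondents.
n_approve = round(samplesize * p_approve)
n_disapprove = round(samplesize * p_disapprove)
n_total = n_approve + n_disapprove

print(samplesize, round(p_approve, 4), round(p_disapprove, 4))
print(n_approve, n_disapprove, n_total)
```

The result (964.25, 0.5025, 0.4425, then 485, 427 and 912) matches the 2002-05-31 row of the monthly dataframe. Note that `N_total` is smaller than `samplesize` because respondents who neither approve nor disapprove are left out of the trials.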
\n", "