{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn import (\n", " linear_model, metrics, neural_network, pipeline, model_selection\n", ")\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "import statsmodels.api as sm\n", "import linearmodels\n", "\n", "import seaborn as sns\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In this project, I'm further developing my ECON 490 thesis paper, in which I found that centrist Democratic candidates to the House of Representatives were much more likely, and very liberal candidates were much less likely, to win their elections than the baseline group. \n", "\n", "What I'm interested in now is not statistical significance, but seeing if I can generalize the models, test for overfitting, and get a good balance of variables to include to balance the various prediction metrics.\n", "\n", "It will be structured in the following manner:\n", "\n", "\n", "1. Introducing the dataset\n", " - Discussion of variables\n", " \n", "2. Creating the initial models\n", " - Linear probability model\n", " - Logit probability model\n", " - The \"Unfrazzled no-hassle lasso progressive regression digression\"\n", " - Testing accuracy\n", " - Testing precision\n", " - Testing recall\n", " \n", "3. Improving the models\n", " - Region\n", " - Race\n", " - Candidate\n", " - Urbanization\n", " - Testing accuracy\n", " - Picking the best model\n", "\n", "4. Counterfactual Predictions\n", " - What if all Democrats ran as centrists?\n", " - What if all Democrats ran as liberals?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phase 1: Introducing the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final dataset created for my paper was `a_m_d_11`, standing for \"aligned model dataset version 11.\" It included data from multiple sources, ranging from an MIT Election Data + Science Lab dataset (acting as the core), to economic and demographic data from the US Census, to political leaning data from the Cook Partisan Voter Index, urbanization data, and others. It is printed below, with a list of columns. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " states po dist cand cand_votes tot_votes leaning \\\n", "0 ALABAMA AL 1 ROBERT KENNEDY JR. 89226 242617 15 \n", "1 ALABAMA AL 2 TABITHA ISNER 86931 226230 16 \n", "2 ALABAMA AL 3 MALLORY HAGAN 83996 231915 16 \n", "\n", " median_income mean_income unemployment_1 ... DICREX RICDIN RICDEX \\\n", "0 47952 66085 6.4 ... 0 1 0 \n", "1 46817 63939 7.0 ... 0 1 0 \n", "2 46576 62954 6.9 ... 0 1 0 \n", "\n", " DEXRIN DINREX DEXREX DINRIN median_income_std two_party_vote \\\n", "0 0 0 0 0 -0.897518 242454 \n", "1 0 0 0 0 -0.963925 225810 \n", "2 0 0 0 0 -0.978026 231766 \n", "\n", " two_party_vote_share \n", "0 0.368012 \n", "1 0.384974 \n", "2 0.362417 \n", "\n", "[3 rows x 62 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_m_d_11 = pd.read_csv(\"https://docs.google.com/spreadsheets/d/e/2PACX-1vRy5Ir4TrmGh77L131Xhkb22GIVWVF-4PF1eNjPOcOOpgtKyfyDcp4P2d8X4olqfifyJqo4SwTqSdH1/pub?gid=422096178&single=true&output=csv\")\n", "\n", "a_m_d_11.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "win = a_m_d_11[\"win\"]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['states', 'po', 'dist', 'cand', 'cand_votes', 'tot_votes', 'leaning',\n", " 'median_income', 'mean_income', 'unemployment_1', 'unemployment_2',\n", " 'white_pct', 'black_pct', 'native_pct', 'asian_pct', 'hispanic_pct',\n", " 'blue_dog', 'justice_dem', 'dem_vote_pct', 'win', 'south', 'north_east',\n", " 'mid_west', 'west', 'Incumbent', 'Open_Seat', 'urb_cluster',\n", " 'Very low density', 'Low density', 'Medium density\\t', 'High density',\n", " 'pure_rural', 'rural_suburban_mix', 'sparse_suburban', 'dense_suburban',\n", " 'urban_suburban_mix', 'pure_urban', 'high_ed_pct', 'state_dist', 'stcd',\n", " 'inc', 'pwin', 'fr', 'po1', 'po2', 'redist', 'dexp', 'rexp',\n", " 'total_spending', 'dem_spending_share', 'abnormal', 'DICRIN', 
'DICREX',\n", " 'RICDIN', 'RICDEX', 'DEXRIN', 'DINREX', 'DEXREX', 'DINRIN',\n", " 'median_income_std', 'two_party_vote', 'two_party_vote_share'],\n", " dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_m_d_11.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, much of the dataset is vestigial and not used in the models below. Some features and variables are worth explaining:\n", "\n", "- The data are collected at the district and candidate level. Most variables describe the congressional district, but some, such as `Incumbent` and the set of indicators like `DICRIN`, `DICREX`, etc., describe whether the Democrat in question is an incumbent representative, and the political experience level of the Democrat and their Republican opponent. \n", "\n", "- `blue_dog` and `justice_dem` are my indicators of political ideology. This was the main point of interest for my paper: did an extreme political ideology hurt or help candidates in the general election? `blue_dog` equals 1 if a candidate was a member of the Blue Dog Coalition (a group of centrist Democrats) and 0 if not, while `justice_dem` equals 1 if they were a member of the Justice Democrats (a very liberal political organization) and 0 if not. None were members of both; those who were members of neither were assumed to lie in between. \n", "\n", "Some evidence of this lies below in a plot created for my paper: \n", "\n", "![Party and Caucus](https://i.imgur.com/hDqMzie.png)\n", "\n", "The main separation between the Democratic and Republican parties is along the horizontal axis, which shows how liberal or conservative (i.e. left or right wing) candidates are on issues of economic redistribution. As can be seen, Blue Dogs are the right-most Democrats, while Justice Democrats are among the left-most. 
\n", "\n", "Four Justice Democrats are actually assigned moderate positions, but this is because they voted against the Democratic Party from the left, and therefore alongside Republicans, which the scaling reads as moderation. [The people who created DW-NOMINATE, the scaling system used here, consider it a limitation of the model.](https://voteview.com/articles/ocasio_cortez)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `leaning` is taken from the Cook Partisan Voting Index, which averages each district's two-party presidential vote over the past two elections and compares it with the national result, giving an approximate partisan leaning. It controls for how liberal or conservative a district is on average. Positive values indicate a lean towards Republicans; negative values indicate a lean towards Democrats. \n", "\n", "- `median_income_std` is household median income by district, standardized. \n", "\n", "- `incumbent` equals 1 if the Democrat is an incumbent representative and 0 if not. Later, I will use the `DICREX`, `DICRIN` etc. variables instead, which control for every combination of Democrat or Republican incumbent, and the experience level of either or both challengers (if an open seat). \n", "\n", "- I will also introduce `white_pct`, `black_pct`, and `hispanic_pct`, which indicate the racial breakdown of a district.\n", "\n", "- `west`, `south`, and `mid_west` indicate the federal region of a district. 
`north_east` is omitted to prevent perfect multicollinearity.\n", "\n", "- Lastly, the `density` variables indicate how much of a district is composed of different levels of density." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phase 2: Creating the Initial Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Train/Test subsets of data, testing for overfitting" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Baseline feature matrix: ideology indicators, partisan lean, income, incumbency\n", "X = pd.DataFrame({\"blue_dog\": a_m_d_11[\"blue_dog\"],\n", " \"justice_dem\": a_m_d_11[\"justice_dem\"],\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"]\n", "})\n", "\n", "X = sm.add_constant(X)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, win, test_size=0.50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will be running a linear probability model (an LPM, using OLS), a logistic regression (logit), and a linear probability model using Lasso regression.\n", "\n", "Note: The division between training and testing datasets is random every time. Your results may vary slightly from what is printed in the cells, though I have made an effort to reference variables rather than typed-in numbers where possible. 
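
One way to make the split repeatable is to pass a fixed seed. The sketch below is illustrative only: `toy_X`, `toy_y`, and the seed 42 are assumptions standing in for `X`, `win`, and whatever seed you prefer. It shows that `random_state` pins the partition and that `stratify` keeps the win rate similar in both halves.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in data; the notebook itself would pass X and win instead.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({'leaning': rng.integers(-30, 30, size=100)})
toy_y = pd.Series((toy_X['leaning'] < 0).astype(int), name='win')


def split_once(seed):
    # random_state fixes the shuffle; stratify keeps the class balance in both halves
    return model_selection.train_test_split(
        toy_X, toy_y, test_size=0.50, random_state=seed, stratify=toy_y
    )


first = split_once(42)
second = split_once(42)

# The same seed gives the same training rows on every call
same_rows = first[0].index.equals(second[0].index)
print(same_rows)
```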
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "const 0.000000\n", "blue_dog 0.284747\n", "justice_dem -0.179746\n", "leaning -0.017555\n", "median_income_std 0.085229\n", "incumbent 0.297261\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lpm_train = linear_model.LinearRegression()\n", "lpm_train.fit(X_train, Y_train)\n", "lpm_coefs_1 = pd.Series(dict(zip(list(X_train), lpm_train.coef_)))\n", "lpm_coefs_1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1.77021401e-05, 1.23859734e+00, -3.08421434e-01, -4.41878976e-01,\n", " 7.28653367e-01, 9.68214737e-01])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_train = linear_model.LogisticRegression(solver=\"lbfgs\")\n", "logit_train.fit(X_train, Y_train)\n", "logit_coefs_1 = pd.Series(dict(zip(list(X_train), logit_train.coef_.round(3))))\n", "\n", "\n", "def ltc_returner(x):\n", " for number in x:\n", " return number\n", "\n", "\n", "logit_coef = ltc_returner(logit_train.coef_)\n", "logit_coef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Unfrazzled, No-Hassle, Lasso Progressive Regression Digression\n", "\n", "In the interest of making one very simple model to compare the more elaborate specifications against, I've run a linear probability model using Lasso regression, which shrinks effect sizes and in practice sets many coefficients equal to zero. I run it below as `lasso_model_1`." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "const 0.000000\n", "blue_dog 0.000000\n", "justice_dem -0.000000\n", "leaning -0.020769\n", "median_income_std 0.000000\n", "incumbent 0.000000\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lasso_model_1 = linear_model.Lasso()\n", "lasso_model_1.fit(X_train, Y_train)\n", "\n", "lasso_coefs_1 = pd.Series(dict(zip(list(X_train), lasso_model_1.coef_)))\n", "lasso_coefs_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Results" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " lpm logit lasso\n", "const 0.000000 0.000018 0.000000\n", "blue_dog 0.284747 1.238597 0.000000\n", "justice_dem -0.179746 -0.308421 -0.000000\n", "leaning -0.017555 -0.441879 -0.020769\n", "median_income_std 0.085229 0.728653 0.000000\n", "incumbent 0.297261 0.968215 0.000000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coefs = [\"const\", \"blue_dog\", \"justice_dem\", \"leaning\", \"median_income_std\", \"incumbent\"]\n", "\n", "coef_results = pd.DataFrame({#\"coefs\": coefs,\n", " \"lpm\": lpm_coefs_1,\n", " \"logit\": logit_coef,\n", " #\"logit\": [-1.0000e-04, 1.7362e+00, -3.2600e-01, -4.8070e-01, 6.8980e-01, 8.0640e-01],\n", " \"lasso\": lasso_coefs_1})\n", "coef_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results thus far?\n", "\n", "1. LPM and Logit models do show coefficient results on all but `const`\n", "2. BUT: Lasso shows only `leaning` as having any coefficient\n", "\n", "### How do their accuracies look?" 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LPM training dataset accuracy is: 0.7120267695136987\n", "LPM testing dataset accuracy is: 0.6805910921563088\n" ] } ], "source": [ "# Note: LinearRegression.score() returns R^2, so these figures are not classification accuracy\n", "lpm_train_accuracy = lpm_train.score(X_train, Y_train)\n", "lpm_test_accuracy = lpm_train.score(X_test, Y_test)\n", "\n", "print(f\"LPM training dataset accuracy is: {lpm_train_accuracy}\")\n", "print(f\"LPM testing dataset accuracy is: {lpm_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logit training dataset accuracy is: 0.9489795918367347\n", "Logit testing dataset accuracy is: 0.9693877551020408\n" ] } ], "source": [ "# LogisticRegression.score() returns mean classification accuracy\n", "logit_train_accuracy = logit_train.score(X_train, Y_train)\n", "logit_test_accuracy = logit_train.score(X_test, Y_test)\n", "\n", "print(f\"Logit training dataset accuracy is: {logit_train_accuracy}\")\n", "print(f\"Logit testing dataset accuracy is: {logit_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lasso training dataset accuracy is: 0.6069187590068059\n", "Lasso testing dataset accuracy is: 0.5832168022216864\n" ] } ], "source": [ "# Note: Lasso.score() also returns R^2, not classification accuracy\n", "lasso_train_accuracy = lasso_model_1.score(X_train, Y_train)\n", "lasso_test_acuracy = lasso_model_1.score(X_test, Y_test)\n", "\n", "\n", "print(f\"Lasso training dataset accuracy is: {lasso_train_accuracy}\")\n", "print(f\"Lasso testing dataset accuracy is: {lasso_test_acuracy}\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " LPM Logit Lasso\n", "Type \n", "Training 0.667 0.944 0.555\n", "Testing 0.725 0.969 0.636" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_df = pd.DataFrame({\"Type\": [\"Training\", \"Testing\"],\n", " \"LPM\": [0.667, 0.725],\n", " \"Logit\": [0.944, 0.969],\n", " \"Lasso\": [0.555, 0.636]})\n", "accuracy_df = accuracy_df.set_index(\"Type\")\n", "accuracy_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question: What is the accuracy of just guessing the most likely outcome for both datasets?\n", "\n", "Because this is a binary dependent variable, it's worth asking what your baseline accuracy is if you simply guess the more common outcome variable every single time." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average value in the training dataset is: 0.5051020408163265\n", "Average value in the testing dataset is: 0.5051020408163265\n" ] } ], "source": [ "training_mean = Y_train.mean()\n", "testing_mean = Y_test.mean()\n", "\n", "print(f\"Average value in the training dataset is: {training_mean}\")\n", "print(f\"Average value in the testing dataset is: {testing_mean}\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Set up so that if modal_training_value >= 0.5, go with it, if not, go with 1 - modal_training_value\n", "# likewise for modal_testing_value\n", "\n", "modal_training_value = Y_train.mean()\n", "modal_testing_value = Y_test.mean()\n", "\n", "\n", "if training_mean < 0.5:\n", " modal_training_value = 1 - modal_training_value\n", " \n", "if modal_testing_value < 0.5:\n", " modal_testing_value = 1 - modal_testing_value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if 1 makes up 49.5% of the training results, guessing 0 every time will give you an accuracy of (1 - 0.495) = 0.505. 
Because we want to improve on just guessing the most common result every time, we need to get at least this accuracy value for a model to be any use. \n", "\n", "I'll plot the \"guess the same value every time\" accuracy levels and those of the models, and see what we get." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "Types = [\"LPM Train\", \"LPM Test\", \n", " \"Logit Train\", \"Logit Test\",\n", " \"Lasso Train\", \"Lasso Test\"]\n", "\n", "Accuracy = [lpm_train_accuracy, lpm_test_accuracy, logit_train_accuracy, logit_test_accuracy, lasso_train_accuracy, lasso_test_acuracy]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "\n", "plt.stem(Accuracy)\n", "plt.xticks(ticks = range(len(Types)), \n", " labels = [\"LPM Train\", \"LPM Test\", \"Logit Train\", \"Logit Test\", \"Lasso Train\", \"Lasso Test\"], \n", " )\n", "plt.ylabel(\"Accuracy Rating\")\n", "plt.xlabel(\"Models\")\n", "plt.title(\"Accuracy Ratings of LPM, Logit, and Lasso models on Training, Testing Datasets\")\n", "plt.ylim([0.4, 1])\n", "\n", "# Green Line is accuracy of guessing most common result (0) in Training dataset\n", "plt.axhline(y=modal_training_value, color='g', linestyle='-')\n", "\n", "# Yellow line is accuracy of guessing most common result (1) in Testing dataset\n", "plt.axhline(y=modal_testing_value, color='y', linestyle='-')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Green Line is accuracy of guessing most common result in the Training dataset\n", "#### Yellow line is accuracy of guessing most common result in the Testing dataset\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A note on false positives versus false negatives\n", "\n", "Due to the nature of this experiment, we're wanting to predict if candidates are elected or not. A bit more than half were: `win.mean()` equals ~0.505. The breakdowns for the testing and training datasets will vary slightly, however. \n", "\n", "- A False Positive is when the model predicts 1 but the result is 0; ie. predicting a win but the candidate lost. \n", "- A False Negative is when the model predicts 0 but the result is 1; ie. predicting a loss but the candidate won.\n", "\n", "In my opinion, we aren't particularly concerned with one of these more than the other. We aren't trying to catch a small number of disease cases out of a large number of healthy subjects (in which case a False Negative could kill someone!) 
or, say, deciding where to invest a lot of money to excavate a valuable resource (where a false positive causes a large amount of waste). \n", "\n", "Given that, I think a good approach is to make as accurate a model as we can, without overfitting. As said in the Classification lecture (lecture 22), when testing accuracy is equal to or greater than training accuracy, we may be underfitting, and can use a more powerful model. \n", "\n", "So, the next step will be developing the `LPM` and `Logit` models, adding some more variables to increase accuracy without overfitting. Our \"unfrazzled no-hassle lasso progressive regression digression\" was fun, but `Lasso` performed little better than just guessing the most common result. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Precision and Recall\n", "\n", "While I think Accuracy should be the main focus, there's no harm in seeing if there's any notable trend in other classification metrics.\n", "\n", "Precision: Number of true positive predictions divided by the total number of positive predictions. In other words, of the times the model guesses positive, how often was it correct?\n", "\n", "Recall: Number of true positives divided by number of actual positives. In other words, of all positive result cases, how many did we predict?\n", "\n", "\n", "I will assign \"Positive\" predictions as those with predicted probability equal to or greater than 0.5, and \"Negative\" predictions as those with predicted probability less than 0.5."
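, "

A small worked example of these definitions, using made-up probabilities and outcomes (`probs` and `truth` are hypothetical, not model output). It applies the 0.5 threshold, computes precision and recall by hand, and prints the scikit-learn values alongside for comparison.

```python
import numpy as np
from sklearn import metrics

# Hypothetical predicted probabilities and true outcomes
probs = np.array([0.9, 0.7, 0.5, 0.4, 0.2, 0.6])
truth = np.array([1, 1, 0, 1, 0, 0])

# A probability of exactly 0.5 counts as a positive prediction
labels = (probs >= 0.5).astype(int)

true_pos = int(np.sum((labels == 1) & (truth == 1)))
precision_by_hand = true_pos / int(np.sum(labels == 1))  # correct share of predicted wins
recall_by_hand = true_pos / int(np.sum(truth == 1))      # share of actual wins that were caught

print(precision_by_hand, metrics.precision_score(truth, labels))
print(recall_by_hand, metrics.recall_score(truth, labels))
```
"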
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def round_series(x):\n", "    # Turn predicted probabilities into 0/1 labels; 0.5 and above counts as a win\n", "    X_round = []\n", "    for prediction in x:\n", "        if float(prediction) < 0.5:\n", "            X_round.append(0)\n", "        else:\n", "            X_round.append(1)\n", "    return X_round" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "lpm_train_round = round_series(lpm_train.predict(X_train))\n", "lpm_test_round = round_series(lpm_train.predict(X_test))\n", "\n", "\n", "logit_train_round = round_series(logit_train.predict(X_train))\n", "logit_test_round = round_series(logit_train.predict(X_test))\n", "\n", "lasso_train_round = round_series(lasso_model_1.predict(X_train))\n", "lasso_test_round = round_series(lasso_model_1.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " Loss 0.89 0.96 0.93 97\n", " Win 0.96 0.89 0.92 99\n", "\n", " accuracy 0.92 196\n", " macro avg 0.93 0.92 0.92 196\n", "weighted avg 0.93 0.92 0.92 196\n", "\n" ] } ], "source": [ "report_lpm = metrics.classification_report(\n", " Y_train, lpm_train_round,\n", " target_names = [\"Loss\", \"Win\"])\n", "\n", "print(report_lpm)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " Loss 0.94 0.96 0.95 97\n", " Win 0.96 0.94 0.95 99\n", "\n", " accuracy 0.95 196\n", " macro avg 0.95 0.95 0.95 196\n", "weighted avg 0.95 0.95 0.95 196\n", "\n" ] } ], "source": [ "report_log = metrics.classification_report(\n", " Y_train, logit_train_round,\n", " target_names = [\"Loss\", \"Win\"])\n", "\n", "print(report_log)\n" ] }, { "cell_type": "code", 
"execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " Loss 0.86 0.98 0.92 97\n", " Win 0.98 0.85 0.91 99\n", "\n", " accuracy 0.91 196\n", " macro avg 0.92 0.91 0.91 196\n", "weighted avg 0.92 0.91 0.91 196\n", "\n" ] } ], "source": [ "report_lasso = metrics.classification_report(\n", " Y_train, lasso_train_round,\n", " target_names = [\"Loss\", \"Win\"])\n", "\n", "print(report_lasso)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### So, when we round our predictions, `lpm` and `lasso` both do a lot better. \n", "\n", "However, `logit` still gets the highest precision and recall values; therefore also the greateset f1-score. \n", "\n", "Due to the poor showings of the `lasso` and `lpm` models in un-rounded accuracy tests, I'm going to discard them for now and move forward with just the `logit` model. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions from Phase 2:\n", "\n", "- Logit model is far more accurate than the LPM, which is more accurate than the Lasso model\n", "\n", "- All three have training accuracy values slightly lower than their testing dataset accuracy levels: this implies they could be slightly underfitted.\n", "\n", "- When un-rounded, the Lasso model is barely any more accurate than just guessing the most common outcome variable!\n", "\n", "- Rounding predictions to 0 or 1 improves results for `lpm` and `lasso`, but `logit` is still reliably the best." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phase 3: Improving on `logit` performance\n", "\n", "\n", "In my project, I used a more extensive specification for my models, also counting a variety of democraphic controls. 
Including all of these would lead to problems with a singular matrix when splitting into testing and training datasets - while my sample includes nearly all House of Representatives elections in the 2018 Midterms, splitting it up makes the sample size too small. It also relies on a large number of dummy variables, which can become linearly dependent easily. \n", "\n", "So, I'm going to take the above specification, and add one set of controls at a time. These are:\n", "\n", "- Federal Region\n", "- Urbanization\n", "- Racial composition\n", "- An expanded set of candidate traits\n", "\n", "We're going to see how the additions impact prediction quality. I want to increase Accuracy on both the testing and training datasets, but not have testing accuracy significantly exceed training accuracy. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First setting up the dataframes, splitting into training/testing sets" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "X_reg = pd.DataFrame({\"blue_dog\": a_m_d_11[\"blue_dog\"],\n", " \"justice_dem\": a_m_d_11[\"justice_dem\"],\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"],\n", " \"west\": a_m_d_11[\"west\"],\n", " \"south\": a_m_d_11[\"south\"],\n", " \"mid_west\": a_m_d_11[\"mid_west\"]\n", "})\n", "\n", "X_reg = sm.add_constant(X_reg)\n", "\n", "X_race = pd.DataFrame({\"blue_dog\": a_m_d_11[\"blue_dog\"],\n", " \"justice_dem\": a_m_d_11[\"justice_dem\"],\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"white_pct\": a_m_d_11[\"white_pct\"],\n", " \"black_pct\": a_m_d_11[\"black_pct\"],\n", " \"hispanic_pct\": a_m_d_11[\"hispanic_pct\"]\n", "})\n", "\n", "X_race = sm.add_constant(X_race)\n", "\n", "X_inc = pd.DataFrame({\"blue_dog\": 
a_m_d_11[\"blue_dog\"],\n", " \"justice_dem\": a_m_d_11[\"justice_dem\"],\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"DICRIN\": a_m_d_11[\"DICRIN\"],\n", " \"DICREX\": a_m_d_11[\"DICREX\"],\n", " \"RICDIN\": a_m_d_11[\"RICDIN\"],\n", " \"RICDEX\": a_m_d_11[\"RICDEX\"],\n", " \"DEXRIN\": a_m_d_11[\"DEXRIN\"],\n", " \"DINREX\": a_m_d_11[\"DINREX\"],\n", " \"DEXREX\": a_m_d_11[\"DEXREX\"]\n", "})\n", "\n", "X_inc = sm.add_constant(X_inc)\n", "\n", "\n", "X_urb_rates = pd.DataFrame({\"blue_dog\": a_m_d_11[\"blue_dog\"],\n", " \"justice_dem\": a_m_d_11[\"justice_dem\"],\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"],\n", " \"very_low_urb\": a_m_d_11[\"Very low density\"],\n", " \"low_urb\": a_m_d_11[\"Low density\"],\n", " \"medium_urb\": a_m_d_11[\"Medium density\\t\"],\n", "})\n", "\n", "X_urb_rates = sm.add_constant(X_urb_rates)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = model_selection.train_test_split(X_reg, win, test_size=0.50)\n", "X_race_train, X_race_test, Y_race_train, Y_race_test = model_selection.train_test_split(X_race, win, test_size=0.50)\n", "X_inc_train, X_inc_test, Y_inc_train, Y_inc_test = model_selection.train_test_split(X_inc, win, test_size=0.50)\n", "X_urb_train, X_urb_test, Y_urb_train, Y_urb_test = model_selection.train_test_split(X_urb_rates, win, test_size=0.50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next, running each model" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " const blue_dog justice_dem leaning median_income_std incumbent \\\n", "337 1.0 0 0 33 -0.657341 0 \n", "15 1.0 0 0 17 -1.305439 0 \n", "195 1.0 0 0 21 -1.004296 0 \n", "\n", " west south mid_west \n", "337 0 1 0 \n", "15 0 1 0 \n", "195 0 1 0 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_reg_train.head(3)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1.14732251e-05, 2.17198287e+00, -6.22078636e-01, -4.64422689e-01,\n", " 7.31777926e-01, 4.32037303e-01, 5.22654345e-01, -2.51564305e-01,\n", " -5.52370326e-02])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_reg = linear_model.LogisticRegression(solver=\"lbfgs\", max_iter = 10000)\n", "logit_reg.fit(X_reg_train, Y_reg_train)\n", "logit_reg_coefs_1 = pd.Series(dict(zip(list(X_reg_train), logit_reg.coef_.round(3))))\n", "\n", "logit_reg_coef = ltc_returner(logit_reg.coef_)\n", "logit_reg_coef" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-2.23021598e-04, 1.66864934e+00, -1.91795510e-02, -7.36959710e-01,\n", " 1.21627510e-01, 8.73802471e-02, -2.08008741e-01, -3.18615176e-01,\n", " -6.70188931e-02])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_race = linear_model.LogisticRegression(solver=\"lbfgs\", max_iter = 10000)\n", "logit_race.fit(X_race_train, Y_race_train)\n", "logit_race_coefs_1 = pd.Series(dict(zip(list(X_race_train), logit_race.coef_.round(3))))\n", "\n", "logit_race_coef = ltc_returner(logit_race.coef_)\n", "logit_race_coef" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.25785253e-05, 1.62711114e+00, -4.82173774e-01, -5.36744134e-01,\n", " 1.11570439e+00, 4.15048688e-01, 1.91961162e-02, -9.64318399e-01,\n", " 
9.19636266e-02, 1.29191678e-01, 2.15100091e-01, -9.18830610e-03])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_inc = linear_model.LogisticRegression(solver=\"lbfgs\", max_iter = 10000)\n", "logit_inc.fit(X_inc_train, Y_inc_train)\n", "# coef_ has shape (1, n_features); take row 0 so each column name pairs with one coefficient\n", "logit_inc_coefs_1 = pd.Series(dict(zip(list(X_inc_train), logit_inc.coef_[0].round(3))))\n", "\n", "logit_inc_coef = ltc_returner(logit_inc.coef_)\n", "logit_inc_coef" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 4.08474405e-04, 2.06151737e+00, -6.03152616e-01, -3.83338636e-01,\n", " 7.36228519e-01, 7.10858951e-01, 9.76413930e-02, -5.84026819e-01,\n", " -3.70695454e-02])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_urb = linear_model.LogisticRegression(solver=\"lbfgs\", max_iter = 10000)\n", "logit_urb.fit(X_urb_train, Y_urb_train)\n", "# coef_ has shape (1, n_features); take row 0 so each column name pairs with one coefficient\n", "logit_urb_coefs_1 = pd.Series(dict(zip(list(X_urb_train), logit_urb.coef_[0].round(3))))\n", "\n", "logit_urb_coef = ltc_returner(logit_urb.coef_)\n", "logit_urb_coef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## With the models fitted, I find their accuracies." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logit Region training dataset accuracy is: 0.9693877551020408\n", "Logit Region testing dataset accuracy is: 0.9489795918367347\n" ] } ], "source": [ "reg_train_accuracy = logit_reg.score(X_reg_train, Y_reg_train)\n", "reg_test_accuracy = logit_reg.score(X_reg_test, Y_reg_test)\n", "\n", "print(f\"Logit Region training dataset accuracy is: {reg_train_accuracy}\")\n", "print(f\"Logit Region testing dataset accuracy is: {reg_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logit Race training dataset accuracy is: 0.9591836734693877\n", "Logit Race testing dataset accuracy is: 0.9489795918367347\n" ] } ], "source": [ "race_train_accuracy = logit_race.score(X_race_train, Y_race_train)\n", "race_test_accuracy = logit_race.score(X_race_test, Y_race_test)\n", "\n", "print(f\"Logit Race training dataset accuracy is: {race_train_accuracy}\")\n", "print(f\"Logit Race testing dataset accuracy is: {race_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logit Incumbent training dataset accuracy is: 0.9693877551020408\n", "Logit Incumbent testing dataset accuracy is: 0.9591836734693877\n" ] } ], "source": [ "inc_train_accuracy = logit_inc.score(X_inc_train, Y_inc_train)\n", "inc_test_accuracy = logit_inc.score(X_inc_test, Y_inc_test)\n", "\n", "print(f\"Logit Incumbent training dataset accuracy is: {inc_train_accuracy}\")\n", "print(f\"Logit Incumbent testing dataset accuracy is: {inc_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logit Urban training dataset accuracy is: 0.9336734693877551\n", "Logit Urban testing dataset accuracy 
is: 0.9744897959183674\n" ] } ], "source": [ "urb_train_accuracy = logit_urb.score(X_urb_train, Y_urb_train)\n", "urb_test_accuracy = logit_urb.score(X_urb_test, Y_urb_test)\n", "\n", "print(f\"Logit Urban training dataset accuracy is: {urb_train_accuracy}\")\n", "print(f\"Logit Urban testing dataset accuracy is: {urb_test_accuracy}\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "Types = [\"Region Tr.\", \"Region Test\", \n", " \"Race Tr.\", \"Race Test\",\n", " \"Candidate Tr.\", \"Candidate Test\",\n", " \"Urbanization Tr.\", \"Urbanization Test\"]\n", "\n", "Accuracy = [reg_train_accuracy, reg_test_accuracy, race_train_accuracy, race_test_accuracy, \n", " inc_train_accuracy, inc_test_accuracy, urb_train_accuracy, urb_test_accuracy]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Accuracy Ratings of Regional, Race Proportion, Candidate Characteristics, and Urbanization \\n models on Training, Testing Datasets')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.stem(Accuracy)\n", "plt.xticks(ticks = range(len(Types)), \n", " labels = Types, rotation = 45)\n", "\n", "plt.ylim([0.9, 1])\n", "plt.ylabel(\"Accuracy Rating\")\n", "plt.xlabel(\"Models\")\n", "plt.title(\"Accuracy Ratings of Regional, Race Proportion, Candidate Characteristics, and Urbanization \\n models on Training, Testing Datasets\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the y-axis runs from 0.9 to 1, so the accuracy ratings are all closer together than they appear. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are unimpressive: accuracy is likely only slightly higher on average than for the basic logit model, and many of the models will likely score higher on the training data than on the testing data. This suggests the extra variables mostly cause the models to overfit to noise in the training data. \n", "\n", "Most likely, there is little accuracy left to gain from adding more variables, because the variables already included in the basic logit model have high predictive power. \n", "\n", "So, I will stick with the basic logit model from this point onwards." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part Four: Counterfactuals\n", "\n", "\n", "A big motivator for the paper was seeing whether candidate ideology (i.e., whether a Democratic candidate was very moderate and centrist, very liberal, or in between) has a causal effect on election win probability. In my paper, I found large and significant coefficients on the `Blue_Dog` and `Justice_Democrat` variables (signifying whether a candidate was a centrist or a liberal, respectively), comparable in size to other results found in the literature. 
I will assume this change in win probability was causal.\n", "\n", "So, now I want to play around with it, and see what differences we might expect in the US House of Representatives if more candidates had run as liberals or centrists. " ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of Democrats who won in the real election was: 198\n" ] } ], "source": [ "print(\"The number of Democrats who won in the real election was:\", win.sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prediction: If all Democrats ran as Blue Dogs or Justice Democrats\n", "\n", "First, I will take the full dataset for `X`, set every `blue_dog` value to 1, and set every `justice_dem` value to 0. This implies the whole Democratic Party running as centrists.\n", "\n", "In order for `predict` to work, there have to be some rows with different values in each column. I set row 391's Blue Dog value to 0, and row 390's Blue Dog value to 0 and Justice Democrat value to 1, so that we don't have perfect multicollinearity. Both of these were in fairly Republican-leaning districts, so I wouldn't expect either to be winnable. " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "X_BD = pd.DataFrame({\"blue_dog\": 1,\n", " \"justice_dem\": 0,\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"]\n", "})\n", "\n", "# Alter two rows so that neither ideology column is constant across the sample\n", "X_BD.at[391, \"blue_dog\"] = 0\n", "X_BD.at[390, \"blue_dog\"] = 0\n", "X_BD.at[390, \"justice_dem\"] = 1\n", "\n", "X_BD = sm.add_constant(X_BD)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
 " <tr style=\"text-align: right;\"><th></th><th>const</th><th>blue_dog</th><th>justice_dem</th><th>leaning</th><th>median_income_std</th><th>incumbent</th></tr>\n",
 " </thead>\n", " <tbody>\n",
 " <tr><th>387</th><td>1.0</td><td>1</td><td>0</td><td>13</td><td>0.410261</td><td>0</td></tr>\n",
 " <tr><th>388</th><td>1.0</td><td>1</td><td>0</td><td>8</td><td>-0.224496</td><td>0</td></tr>\n",
 " <tr><th>389</th><td>1.0</td><td>1</td><td>0</td><td>8</td><td>-0.484215</td><td>0</td></tr>\n",
 " <tr><th>390</th><td>1.0</td><td>0</td><td>1</td><td>7</td><td>-0.182897</td><td>0</td></tr>\n",
 " <tr><th>391</th><td>1.0</td><td>0</td><td>0</td><td>25</td><td>-0.059912</td><td>0</td></tr>\n",
 " </tbody>\n", "</table>
\n", "
" ], "text/plain": [ " const blue_dog justice_dem leaning median_income_std incumbent\n", "387 1.0 1 0 13 0.410261 0\n", "388 1.0 1 0 8 -0.224496 0\n", "389 1.0 1 0 8 -0.484215 0\n", "390 1.0 0 1 7 -0.182897 0\n", "391 1.0 0 0 25 -0.059912 0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_BD.tail()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "logit_bd_predictions = logit_train.predict(X_BD)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "pred_bd_wins = logit_bd_predictions.mean() * 392" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "dem_wins = win.mean()*392" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, for the liberal case, I just do the inverse of the above, and set all candidates as Justice Democrats. The same candidates (index values 390 and 391) are changed again to make the prediction work." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "X_JD = pd.DataFrame({\"blue_dog\": 0,\n", " \"justice_dem\": 1,\n", " \"leaning\": a_m_d_11[\"leaning\"],\n", " \"median_income_std\": a_m_d_11[\"median_income_std\"],\n", " \"incumbent\": a_m_d_11[\"Incumbent\"]\n", "})\n", "\n", "X_JD.at[391, \"blue_dog\"] = 1\n", "X_JD.at[390, \"justice_dem\"] = 0\n", "X_JD.at[391, \"justice_dem\"] = 0\n", "\n", "X_JD = sm.add_constant(X_JD)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
 " <tr style=\"text-align: right;\"><th></th><th>const</th><th>blue_dog</th><th>justice_dem</th><th>leaning</th><th>median_income_std</th><th>incumbent</th></tr>\n",
 " </thead>\n", " <tbody>\n",
 " <tr><th>387</th><td>1.0</td><td>0</td><td>1</td><td>13</td><td>0.410261</td><td>0</td></tr>\n",
 " <tr><th>388</th><td>1.0</td><td>0</td><td>1</td><td>8</td><td>-0.224496</td><td>0</td></tr>\n",
 " <tr><th>389</th><td>1.0</td><td>0</td><td>1</td><td>8</td><td>-0.484215</td><td>0</td></tr>\n",
 " <tr><th>390</th><td>1.0</td><td>0</td><td>0</td><td>7</td><td>-0.182897</td><td>0</td></tr>\n",
 " <tr><th>391</th><td>1.0</td><td>1</td><td>0</td><td>25</td><td>-0.059912</td><td>0</td></tr>\n",
 " </tbody>\n", "</table>
\n", "
" ], "text/plain": [ " const blue_dog justice_dem leaning median_income_std incumbent\n", "387 1.0 0 1 13 0.410261 0\n", "388 1.0 0 1 8 -0.224496 0\n", "389 1.0 0 1 8 -0.484215 0\n", "390 1.0 0 0 7 -0.182897 0\n", "391 1.0 1 0 25 -0.059912 0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_JD.tail()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "logit_jd_predictions = logit_train.predict(X_JD)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "pred_jd_wins = logit_jd_predictions.mean() * 392" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of winning Democrats in the sample is: 198\n", "The number of predicted Democrat wins, if all ran as centrists, is 209.0\n", "The number of predicted Democrat wins, if all ran as liberals, is 182.0\n" ] } ], "source": [ "print(f\"The number of winning Democrats in the sample is: {round(dem_wins)}\")\n", "print(f\"The number of predicted Democrat wins, if all ran as centrists, is {pred_bd_wins}\")\n", "print(f\"The number of predicted Democrat wins, if all ran as liberals, is {pred_jd_wins}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The real-life outcome of the election was the Democrats winning 235 of 435 seats. The sample was reduced to 392 because 43 seats had abnormal elections: candidates running unopposed, against a weak independent or third-party candidate, or with many candidates running under a [Jungle Primary ruleset](https://en.wikipedia.org/wiki/Nonpartisan_blanket_primary). These were removed; most had lopsided results and would likely not change.\n", "\n", "\n", "The output above shows how many seats the Democrats in the sample are expected to win in each case. 
I'd expect the centrist case to increase their hold on the House without reaching a supermajority, and the liberal case to weaken their hold without costing them the chamber. To see the difference, subtract 198 from the number of wins in the centrist or liberal case; in the run shown above, that is 209 - 198 = 11 more seats in the centrist case and 182 - 198 = 16 fewer in the liberal case.\n", "\n", "Due to the heavily nationalized media environment, candidates often struggle to distinguish themselves from their party. Because of this, I would expect the whole Democratic Party moving towards the center to have a much larger effect. The predictions here essentially take the effect for a candidate on the margin, one who changes from a normal Democrat to a centrist, and apply it to everyone. The party changing how it is perceived by voters would cause an additional increase in electability.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }