{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Failure of Machine Learning to infer causal effects\n\nMachine Learning models are great for measuring statistical associations.\nUnfortunately, unless we're willing to make strong assumptions about the data,\nthose models are unable to infer causal effects.\n\nTo illustrate this, we will simulate a situation in which we try to answer one\nof the most important questions in economics of education: **what is the causal\neffect of earning a college degree on hourly wages?** Although the answer to\nthis question is crucial to policy makers, [Omitted-Variable Biases](https://en.wikipedia.org/wiki/Omitted-variable_bias) (OVB) prevent us from\nidentifying that causal effect.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset: simulated hourly wages\n\nThe data generating process is laid out in the code below. Work experience in\nyears and a measure of ability are drawn from Normal distributions; the\nhourly wage of one of the parents is drawn from Beta distribution. We then\ncreate an indicator of college degree which is positively impacted by ability\nand parental hourly wage. Finally, we model hourly wages as a linear function\nof all the previous variables and a random component. Note that all variables\nhave a positive effect on hourly wages.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nimport pandas as pd\n\nn_samples = 10_000\nrng = np.random.RandomState(32)\n\nexperiences = rng.normal(20, 10, size=n_samples).astype(int)\nexperiences[experiences < 0] = 0\nabilities = rng.normal(0, 0.15, size=n_samples)\nparent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)\nparent_hourly_wages[parent_hourly_wages < 0] = 0\ncollege_degrees = (\n 9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7\n).astype(int)\n\ntrue_coef = pd.Series(\n {\n \"college degree\": 2.0,\n \"ability\": 5.0,\n \"experience\": 0.2,\n \"parent hourly wage\": 1.0,\n }\n)\nhourly_wages = (\n true_coef[\"experience\"] * experiences\n + true_coef[\"parent hourly wage\"] * parent_hourly_wages\n + true_coef[\"college degree\"] * college_degrees\n + true_coef[\"ability\"] * abilities\n + rng.normal(0, 1, size=n_samples)\n)\n\nhourly_wages[hourly_wages < 0] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Description of the simulated data\n\nThe following plot shows the distribution of each variable, and pairwise\nscatter plots. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next section, we train predictive models; we therefore separate the\ntarget column from the other features and split the data into a training and\na testing set.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n\ntarget_name = \"hourly wage\"\nX, y = df.drop(columns=target_name), df[target_name]\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Income prediction with fully observed variables\n\nFirst, we train a predictive model, a\n:class:`~sklearn.linear_model.LinearRegression` model. In this experiment,\nwe assume that all variables used by the true generative model are available.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import r2_score\n\nfeatures_names = [\"experience\", \"parent hourly wage\", \"college degree\", \"ability\"]\n\nregressor_with_ability = LinearRegression()\nregressor_with_ability.fit(X_train[features_names], y_train)\ny_pred_with_ability = regressor_with_ability.predict(X_test[features_names])\nR2_with_ability = r2_score(y_test, y_pred_with_ability)\n\nprint(f\"R2 score with ability: {R2_with_ability:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model predicts hourly wages well, as shown by the high R2 score. We\nplot the model coefficients to show that we exactly recover the values of\nthe true generative model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nmodel_coef = pd.Series(regressor_with_ability.coef_, index=features_names)\ncoef = pd.concat(\n    [true_coef[features_names], model_coef],\n    keys=[\"Coefficients of true generative model\", \"Model coefficients\"],\n    axis=1,\n)\nax = coef.plot.barh()\nax.set_xlabel(\"Coefficient values\")\nax.set_title(\"Coefficients of the linear regression including the ability feature\")\n_ = plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Income prediction with partial observations\n\nIn practice, intellectual abilities are not observed or are only estimated\nfrom proxies that inadvertently measure education as well (e.g. by IQ tests).\nOmitting the \"ability\" feature from a linear model then inflates the estimate\nof the \"college degree\" coefficient via a positive OVB.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "features_names = [\"experience\", \"parent hourly wage\", \"college degree\"]\n\nregressor_without_ability = LinearRegression()\nregressor_without_ability.fit(X_train[features_names], y_train)\ny_pred_without_ability = regressor_without_ability.predict(X_test[features_names])\nR2_without_ability = r2_score(y_test, y_pred_without_ability)\n\nprint(f\"R2 score without ability: {R2_without_ability:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In terms of R2 score, the predictive power of our model is similar when we\nomit the ability feature. We now check whether the coefficients of the model\ndiffer from those of the true generative model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model_coef = pd.Series(regressor_without_ability.coef_, index=features_names)\ncoef = pd.concat(\n    [true_coef[features_names], model_coef],\n    keys=[\"Coefficients of true generative model\", \"Model coefficients\"],\n    axis=1,\n)\nax = coef.plot.barh()\nax.set_xlabel(\"Coefficient values\")\n_ = ax.set_title(\"Coefficients of the linear regression excluding the ability feature\")\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compensate for the omitted variable, the model inflates the coefficient of\nthe college degree feature. Therefore, interpreting this coefficient value\nas a causal effect of the true generative model is incorrect.\n\n" ] }
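, { "cell_type": "markdown", "metadata": {}, "source": [ "The size of this inflation is consistent with the classic omitted-variable-bias\ndecomposition: the bias on each included coefficient is approximately the true\ncoefficient of the omitted variable multiplied by the coefficient obtained when\nregressing the omitted variable on the included features. The following cell is\na minimal sketch of that check; note that it uses the true coefficients, which\nwould be unknown in practice.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch of the omitted-variable-bias decomposition (illustration only).\n# Regress the omitted \"ability\" variable on the included features; scaling\n# the resulting coefficients by the true effect of ability predicts, up to\n# sampling noise, the bias observed on each included coefficient.\nability_model = LinearRegression()\nability_model.fit(X_train[features_names], X_train[\"ability\"])\n\npredicted_bias = true_coef[\"ability\"] * pd.Series(\n    ability_model.coef_, index=features_names\n)\nobserved_bias = model_coef - true_coef[features_names]\nprint(\n    pd.concat(\n        [predicted_bias, observed_bias],\n        keys=[\"predicted bias\", \"observed bias\"],\n        axis=1,\n    )\n)" ] }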
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lessons learned\n\nMachine learning models are not designed for the estimation of causal\neffects. While we showed this with a linear model, OVB can affect any type of\nmodel.\n\nWhenever interpreting a coefficient or a change in predictions brought about\nby a change in one of the features, it is important to keep in mind\npotentially unobserved variables that could be correlated with both the\nfeature in question and the target variable. Such variables are called\n[Confounding Variables](https://en.wikipedia.org/wiki/Confounding). In\norder to still estimate a causal effect in the presence of confounding,\nresearchers usually conduct experiments in which the treatment variable (e.g.\ncollege degree) is randomized. When an experiment is prohibitively expensive\nor unethical, researchers can sometimes use other causal inference techniques\nsuch as [Instrumental Variables](https://en.wikipedia.org/wiki/Instrumental_variables_estimation) (IV)\nestimations, sketched below.\n\n" ] }
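, { "cell_type": "markdown", "metadata": {}, "source": [ "To give a flavor of the IV approach, the cell below sketches a two-stage least\nsquares (2SLS) estimation on an extended version of our simulation. The\ninstrument, a hypothetical \"college proximity\" score, is an assumption made\nfor this sketch only: it shifts the chance of earning a degree, has no direct\neffect on hourly wages, and is independent of ability. Under these assumptions,\n2SLS should recover an estimate close to the true causal effect of 2.0.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Hypothetical IV illustration: extend the simulation with an instrument\n# (\"college proximity\") that affects degree attainment but not wages.\nproximity = rng.normal(0, 1, size=n_samples)\niv_degrees = (\n    9 * abilities\n    + 0.02 * parent_hourly_wages\n    + 2 * proximity\n    + rng.randn(n_samples)\n    > 0.7\n).astype(int)\niv_wages = (\n    true_coef[\"experience\"] * experiences\n    + true_coef[\"parent hourly wage\"] * parent_hourly_wages\n    + true_coef[\"college degree\"] * iv_degrees\n    + true_coef[\"ability\"] * abilities\n    + rng.normal(0, 1, size=n_samples)\n)\n\nexog = np.column_stack([experiences, parent_hourly_wages])\n\n# First stage: predict the endogenous regressor (degree) from the\n# instrument and the exogenous controls; ability is deliberately omitted.\nfirst_stage = LinearRegression().fit(np.column_stack([proximity, exog]), iv_degrees)\ndegrees_hat = first_stage.predict(np.column_stack([proximity, exog]))\n\n# Second stage: regress wages on the predicted degree and the same controls.\nsecond_stage = LinearRegression().fit(np.column_stack([degrees_hat, exog]), iv_wages)\nprint(f\"2SLS estimate of the college degree effect: {second_stage.coef_[0]:.2f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }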