{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Common pitfalls in the interpretation of coefficients of linear models\n\nIn linear models, the target value is modeled as a linear combination of the\nfeatures (see the `linear_model` User Guide section for a description of a\nset of linear models available in scikit-learn). Coefficients in multiple linear\nmodels represent the relationship between the given feature, $X_i$ and the\ntarget, $y$, assuming that all the other features remain constant\n([conditional dependence](https://en.wikipedia.org/wiki/Conditional_dependence)). This is different\nfrom plotting $X_i$ versus $y$ and fitting a linear relationship: in\nthat case all possible values of the other features are taken into account in\nthe estimation (marginal dependence).\n\nThis example will provide some hints in interpreting coefficient in linear\nmodels, pointing at problems that arise when either the linear model is not\nappropriate to describe the dataset, or when features are correlated.\n\n
> **Note:** Keep in mind that the features $X$ and the outcome $y$ are in general the result of a data generating process that is unknown to us. Machine learning models are trained to approximate the unobserved mathematical function that links $X$ to $y$ from sample data. As a result, any interpretation made about a model may not necessarily generalize to the true data generating process. This is especially true when the model is of bad quality or when the sample data is not representative of the population.
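The distinction between conditional and marginal dependence can be illustrated with a minimal synthetic sketch (not part of this example's dataset; the variable names and coefficient values below are made up for illustration). When two features are correlated, the coefficient from a multivariate fit can even have the opposite sign to the slope of a univariate fit on the same feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_samples = 10_000

# Two correlated features: x2 is x1 plus noise.
x1 = rng.normal(size=n_samples)
x2 = x1 + rng.normal(scale=0.5, size=n_samples)
X = np.column_stack([x1, x2])

# Hypothetical data generating process: y = 2 * x1 - 1 * x2 + noise.
y = 2 * x1 - 1 * x2 + rng.normal(scale=0.1, size=n_samples)

# Multivariate fit: each coefficient reflects conditional dependence,
# i.e. the effect of one feature with the other held constant.
multi = LinearRegression().fit(X, y)

# Univariate fit on x2 alone: the slope reflects marginal dependence,
# averaging over the values that x1 takes along with x2.
uni = LinearRegression().fit(X[:, [1]], y)

print("conditional coefficients:", multi.coef_)  # close to [2, -1]
print("marginal slope for x2:", uni.coef_)       # positive, sign flipped
```

Here the conditional coefficient of `x2` is negative by construction, while its marginal slope is positive because `x2` carries information about `x1`, whose effect on `y` is strongly positive.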
Why does the plot above suggest that an increase in age leads to a decrease in wage? Why does the `initial pairplot