{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Ordinary Least Squares Example\n\nThis example shows how to use the ordinary least squares (OLS) model\ncalled :class:`~sklearn.linear_model.LinearRegression` in scikit-learn.\n\nFor this purpose, we use a single feature from the diabetes dataset and try to\npredict the diabetes progression using this linear model. We therefore load the\ndiabetes dataset and split it into training and test sets.\n\nThen, we fit the model on the training set and evaluate its performance on the test\nset and finally visualize the results on the test set.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data Loading and Preparation\n\nLoad the diabetes dataset. For simplicity, we only keep a single feature in the data.\nThen, we split the data and target into training and test sets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import load_diabetes\nfrom sklearn.model_selection import train_test_split\n\nX, y = load_diabetes(return_X_y=True)\nX = X[:, [2]]  # Use only one feature\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20, shuffle=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Linear regression model\n\nWe create a linear regression model and fit it on the training data. Note that by\ndefault, an intercept is added to the model. We can control this behavior by setting\nthe `fit_intercept` parameter.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.linear_model import LinearRegression\n\nregressor = LinearRegression().fit(X_train, y_train)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Model evaluation\n\nWe evaluate the model's performance on the test set using the mean squared error\nand the coefficient of determination.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import mean_squared_error, r2_score\n\ny_pred = regressor.predict(X_test)\n\nprint(f\"Mean squared error: {mean_squared_error(y_test, y_pred):.2f}\")\nprint(f\"Coefficient of determination: {r2_score(y_test, y_pred):.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Plotting the results\n\nFinally, we visualize the results on the train and test data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(ncols=2, figsize=(10, 5), sharex=True, sharey=True)\n\nax[0].scatter(X_train, y_train, label=\"Train data points\")\nax[0].plot(\n    X_train,\n    regressor.predict(X_train),\n    linewidth=3,\n    color=\"tab:orange\",\n    label=\"Model predictions\",\n)\nax[0].set(xlabel=\"Feature\", ylabel=\"Target\", title=\"Train set\")\nax[0].legend()\n\nax[1].scatter(X_test, y_test, label=\"Test data points\")\nax[1].plot(X_test, y_pred, linewidth=3, color=\"tab:orange\", label=\"Model predictions\")\nax[1].set(xlabel=\"Feature\", ylabel=\"Target\", title=\"Test set\")\nax[1].legend()\n\nfig.suptitle(\"Linear Regression\")\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nThe trained model corresponds to the estimator that minimizes the mean squared error\nbetween the predicted and the true target values on the training data. We therefore\nobtain an estimator of the conditional mean of the target given the data.\n\nNote that in higher dimensions, minimizing only the squared error might lead to\noverfitting. Therefore, regularization techniques are commonly used to prevent this\nissue, such as those implemented in :class:`~sklearn.linear_model.Ridge` or\n:class:`~sklearn.linear_model.Lasso`.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.21"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}