{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Measurement Errors in Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lets use simulated data here. \n", "First we will model the distance of 100 supernovae (for a particular cosmology) as a function of redshift.\n", "\n", "We rely on that astroML has a common API with scikit-learn, extending the functionality of the latter.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astropy.cosmology import LambdaCDM\n", "from astroML.datasets import generate_mu_z\n", "\n", "z_sample, mu_sample, dmu = generate_mu_z(100, random_state=0)\n", "\n", "cosmo = LambdaCDM(H0=70, Om0=0.30, Ode0=0.70, Tcmb0=0)\n", "z = np.linspace(0.01, 2, 1000)\n", "mu_true = cosmo.distmod(z)\n", "\n", "plt.errorbar(z_sample, mu_sample, dmu, fmt='.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple linear regression\n", "\n", "Regression defined as the relation between a dependent variable, $y$, and a set of independent variables, $x$, \n", "that describes the expectation value of y given x: $ E[y|x] $\n", "\n", "We will start with the most familiar linear regression, a straight-line fit to data.\n", "A straight-line fit is a model of the form\n", "$$\n", "y = ax + b\n", "$$\n", "where $a$ is commonly known as the *slope*, and $b$ is commonly known as the *intercept*.\n", "\n", "We can use Scikit-Learn's LinearRegression estimator to fit this data and construct the best-fit line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression as LinearRegression_sk \n", "\n", "linear_sk = LinearRegression_sk()\n", "linear_sk.fit(z_sample[:,None], mu_sample)\n", "\n", "mu_fit_sk = linear_sk.predict(z[:, None])\n", "\n", "#------------------------------------------------------------\n", "# Plot the results\n", "fig = plt.figure(figsize=(8, 6))\n", "\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(z, mu_fit_sk, '-k')\n", "ax.plot(z, mu_true, '--', c='gray')\n", "ax.errorbar(z_sample, mu_sample, dmu, fmt='.k', ecolor='gray', lw=1)\n", "\n", "ax.set_xlim(0.01, 1.8)\n", "ax.set_ylim(36.01, 48)\n", "\n", "ax.set_ylabel(r'$\\mu$')\n", "ax.set_xlabel(r'$z$')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Measurement errors\n", "\n", "Modifications to LinearRegression in astroML take measurement errors into account on the dependent variable." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroML.linear_model import LinearRegression\n", "\n", "linear = LinearRegression()\n", "linear.fit(z_sample[:,None], mu_sample, dmu)\n", "\n", "mu_fit = linear.predict(z[:, None])\n", "\n", "#------------------------------------------------------------\n", "# Plot the results\n", "fig = plt.figure(figsize=(8, 6))\n", "\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(z, mu_fit_sk, '-k')\n", "ax.plot(z, mu_fit, '-k', color='red')\n", "\n", "ax.plot(z, mu_true, '--', c='gray')\n", "ax.errorbar(z_sample, mu_sample, dmu, fmt='.k', ecolor='gray', lw=1)\n", "\n", "ax.set_xlim(0.01, 1.8)\n", "ax.set_ylim(36.01, 48)\n", "\n", "ax.set_ylabel(r'$\\mu$')\n", "ax.set_xlabel(r'$z$')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basis function regression\n", "\n", "If we consider a function in terms of the sum of bases (this can be polynomials, Gaussians, quadratics, cubics) then we can solve for the coefficients using regression. \n", "\n", "### Polynomial basis functions\n", "\n", "polynomial regression: $$𝑦=π‘Ž_0+π‘Ž_1π‘₯+π‘Ž_2π‘₯^2+π‘Ž_3π‘₯^3+β‹―$$\n", "\n", "Notice that this is still a linear modelβ€”the linearity refers to the fact that the coefficients $π‘Ž_𝑛$ never multiply or divide each other. What we have effectively done here is to take our one-dimensional $π‘₯$ values and projected them into a higher dimension, so that a linear fit can fit more complicated relationships between $π‘₯$ and $𝑦$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroML.linear_model import PolynomialRegression\n", "\n", "# 2nd degree polynomial regression\n", "polynomial = PolynomialRegression(degree=2)\n", "polynomial.fit(z_sample[:,None], mu_sample, dmu)\n", "\n", "mu_fit_poly = polynomial.predict(z[:, None])\n", "\n", "#------------------------------------------------------------\n", "# Plot the results\n", "fig = plt.figure(figsize=(8, 6))\n", "\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(z, mu_fit_poly, '-k', color='red')\n", "\n", "ax.plot(z, mu_true, '--', c='gray')\n", "ax.errorbar(z_sample, mu_sample, dmu, fmt='.k', ecolor='gray', lw=1)\n", "\n", "ax.set_xlim(0.01, 1.8)\n", "ax.set_ylim(36.01, 48)\n", "\n", "ax.set_ylabel(r'$\\mu$')\n", "ax.set_xlabel(r'$z$')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gaussian basis functions\n", "\n", "Of course, other basis functions are possible.\n", "For example, one useful pattern is to fit a model that is not a sum of polynomial bases, but a sum of Gaussian bases. E.g. we could substitute $π‘₯^2$ for Gaussians (where we fix $𝜎$ and $πœ‡$ and fit for the amplitude) as long as the attribute we are fitting for is linear. 
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroML.linear_model import BasisFunctionRegression\n", "\n", "#------------------------------------------------------------\n", "# Define our number of Gaussians\n", "nGaussians = 10\n", "basis_mu = np.linspace(0, 2, nGaussians)[:, None]\n", "basis_sigma = 3 * (basis_mu[1] - basis_mu[0])\n", "\n", "gauss_basis = BasisFunctionRegression('gaussian', mu=basis_mu, sigma=basis_sigma)\n", "gauss_basis.fit(z_sample[:, None], mu_sample, dmu)\n", "\n", "mu_fit_gauss = gauss_basis.predict(z[:, None])\n", "\n", "#------------------------------------------------------------\n", "# Plot the results\n", "fig = plt.figure(figsize=(8, 6))\n", "\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(z, mu_fit_gauss, '-', color='red')\n", "\n", "ax.plot(z, mu_true, '--', c='gray')\n", "ax.errorbar(z_sample, mu_sample, dmu, fmt='.k', ecolor='gray', lw=1)\n", "\n", "ax.set_xlim(0.01, 1.8)\n", "ax.set_ylim(36.01, 48)\n", "\n", "ax.set_ylabel(r'$\\mu$')\n", "ax.set_xlabel(r'$z$')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regularization and cross-validation\n", "\n", "As the complexity of a model increases, it fits the training data more and more closely.\n", "\n", "This does not, however, give a better description of the underlying relation: we are overfitting the data (the model has high variance; a small change in a training point can change the model dramatically).\n", "\n", "We can evaluate this using a training set (50-70% of the sample), a cross-validation set (15-25%) and a test set (15-25%). See book sub-chapter 8.11 and e.g. figures [8.13](http://www.astroml.org/book_figures/chapter8/fig_cross_val_B.html) and [8.14](http://www.astroml.org/book_figures/chapter8/fig_cross_val_C.html).\n", "\n", "For cases where we are concerned about overfitting, we can apply constraints (usually on smoothness, the number of coefficients, or the size of the coefficients). See book sub-chapter 8.3, and e.g. figure [8.4](http://www.astroml.org/book_figures/chapter8/fig_rbf_ridge_mu_z.html).\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }