{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Measurement errors in both dependent and independent variables\n", "\n", "Regression defined as the relation between a dependent variable, $\\eta_i$, and a set of independent \n", "variables, $\\xi_i$, that describes the expectation value of $\\eta_i$ given $\\xi_i$. \n", "There may be intrinsic scatter, $\\epsilon_i$, too.\n", "\n", "In most cases, however, what we observe are values $x_i$ and $y_i$ and \n", "associated measurement errors, $\\epsilon_{x,i}$ and $\\epsilon_{y,i}$. \n", "Following [Kelly 2007](https://iopscience.iop.org/article/10.1086/519947/pdf), \n", "we can write the regression relationship as:\n", "\n", "$$ \\eta_i = \\alpha + \\beta \\xi_i + \\epsilon_i $$ and \n", "$$ x_i = \\xi_i + \\epsilon_{x,i}$$\n", "$$y_i = \\eta_i + \\epsilon_{y,i}$$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Data sets used in the examples below\n", "\n", "Use simulation data from [Kelly 2007](https://iopscience.iop.org/article/10.1086/519947/pdf). \n", "This simulator, called `simulation_kelly` is available from `astroML.datasets`.\n", "\n", "The function returns the $\\xi_i$, $\\eta_i$, $x_i$, $y_i$, $\\epsilon_{x,i}$, $\\epsilon_{y,i}$ and \n", "the input regression coefficients $\\alpha$ and $\\beta$ and intrinsic scatter $\\epsilon$. \n", "A total of ``size`` values generated, measurement errors are scaled by parameters ``scalex`` and\n", " ``scaley`` following secion 7.1 in [Kelly 2007](https://iopscience.iop.org/article/10.1086/519947/pdf)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from astroML.datasets import simulation_kelly\n", "\n", "ksi, eta, xi, yi, xi_error, yi_error, alpha_in, beta_in = simulation_kelly(size=100, scalex=0.2, scaley=0.2,\n", " alpha=2, beta=1, epsilon=(0, 0.75))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Plot the generated data\n", "\n", "Plot both the simulated $\\xi$, $\\eta$ dataset as well the generated observed values of $x_i$ and $y_i$ \n", "including their measurement errors for different values of the scale parameters ``scalex`` and ``scaley``." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "%matplotlib notebook\n", "\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "\n", "ksi_0 = np.arange(np.min(ksi) - 0.5, np.max(ksi) + 0.5)\n", "eta_0 = alpha_in + ksi_0 * beta_in\n", "\n", "figure = plt.figure(figsize=(15, 12))\n", "figure.subplots_adjust(left=0.1, right=0.95,\n", " bottom=0.1, top=0.95,\n", " hspace=0.1, wspace=0.15)\n", "ax = figure.add_subplot(221)\n", "ax.scatter(ksi, eta)\n", "ax.set_xlabel(r'$\\xi$')\n", "ax.set_ylabel(r'$\\eta$')\n", "\n", "ax.plot(ksi_0, eta_0, color='orange')\n", "\n", "for scalex, scaley, axn in [(0.1, 0.1, 222), (0.3, 0.5, 224), (0.2, 0.2, 223)]:\n", " _, _, xi, yi, xi_error, yi_error, _, _ = simulation_kelly(size=100, scalex=scalex, scaley=scaley,\n", " alpha=alpha_in, beta=beta_in, ksi=ksi, eta=eta) \n", " ax = figure.add_subplot(axn)\n", "\n", " ax.scatter(xi[0], yi, alpha=0.5)\n", " ax.errorbar(xi[0], yi, xerr=xi_error[0], yerr=yi_error, alpha=0.3, ls='')\n", " ax.set_xlabel(f'x, error scaler: {scalex}')\n", " ax.set_ylabel(f'y, error scaler: {scaley}')\n", "\n", " x0 = np.arange(np.min(xi) - 0.5, np.max(xi) + 0.5)\n", " y0 = alpha_in + x0 * beta_in\n", " ax.plot(x0, y0, color='orange')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression with uncertainties in both dependent and independent axes\n", " \n", "The class ``LinearRegressionwithErrors`` can be used to take into account measurement errors in \n", "both the dependent and independent variables. \n", "The implementation relies on the [PyMC3](https://docs.pymc.io/) and [Theano](http://deeplearning.net/software/theano/)\n", "packages. \n", "\n", "Note: The first initialization of the fitter is expected to take a couple of minutes, as ``Theano`` \n", "performs some code compilation for the underlying model. Sampling for consecutive runs is expected\n", "to start up significantly faster.\n", " \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from astroML.linear_model import LinearRegressionwithErrors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true }, "scrolled": true }, "outputs": [], "source": [ "linreg_xy_err = LinearRegressionwithErrors()\n", "linreg_xy_err.fit(xi, yi, yi_error, xi_error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot the results\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from astroML.plotting import plot_regressions, plot_regression_from_trace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_regressions(ksi, eta, xi[0], yi, xi_error[0], yi_error, add_regression_lines=True, alpha_in=alpha_in, beta_in=beta_in)\n", "plot_regression_from_trace(linreg_xy_err, (xi, yi, xi_error, yi_error), ax=plt.gca(), chains=50)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multivariate regression\n", "For multivariate data (where we fit a hyperplane rather than a straight line) we simply extend the description\n", "of the regression function to multiple dimensions. The formalism used in the previous example becomes:\n", "\n", "\n", "$$ \\eta_i = \\alpha + \\beta^T \\xi_i + \\epsilon_i $$ \n", "\n", "where both $\\beta^T$ and $\\xi_i$ are now N-element vectors.\n", "\n", "### Generate a dataset:\n", "\n", "We use the same function as above to generate 100 datapoints in 2 dimensions. Note that the size of the ``beta`` \n", "parameter needs to match the dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "ksi2, eta2, xi2, yi2, xi_error2, yi_error2, alpha_in2, beta_in2 = simulation_kelly(size=100, scalex=0.2, scaley=0.2,\n", " alpha=2, beta=[0.5, 1],\n", " multidim=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previously used ``LinearRegressionwithErrors`` class can be used with multidimensional data, thus the fitting is done the exact same way as before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "linreg_xy_err2 = LinearRegressionwithErrors()\n", "linreg_xy_err2.fit(xi2, yi2, yi_error2, xi_error2)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "source": [ "There are several ways to explore the fits, in the following we show a few ways to plot this dataset. As in this example the fitted hyperplane was 2D, we can use a 3D plot to show both the fit and the underlying regession we used to generate the data from. Other plotting libraries can also be used to e.g. create pairplots of the parameters (e.g. Arviz' ``plot_pair`` function, or Seaborn's ``jointplot``)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x0 = np.linspace(np.min(xi2[0])-0.5, np.max(xi2[0])+0.5, 50)\n", "x1 = np.linspace(np.min(xi2[1])-0.5, np.max(xi2[1])+0.5, 50)\n", "x0, x1 = np.meshgrid(x0, x1)\n", "\n", "y0 = alpha_in + x0 * beta_in2[0] + x1 * beta_in2[1]\n", "\n", "y_fitted = linreg_xy_err2.coef_[0] + x0 * linreg_xy_err2.coef_[1][0] + x1 * linreg_xy_err2.coef_[1][1]\n", "\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits.mplot3d import Axes3D\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "ax.set_xlabel('xi2[0] ')\n", "ax.set_ylabel('xi2[1] ')\n", "ax.set_zlabel('yi2 ')\n", "ax.scatter(xi2[0], xi2[1], yi2, s=20)\n", "ax.plot_surface(x0, x1, y0, alpha=0.2, facecolor='blue', label='True regression')\n", "ax.plot_surface(x0, x1, y_fitted, alpha=0.2, facecolor='red', label='Fitted regression')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampler statistics and traceplots\n", "\n", "The PyMC3 trace is available in the ``.trace`` attribute of the class instances, after we performed the fit. This can be then used for checking for convergence, and generating statistics for the samples. We would refer to use the tools provided by PyMC3, e.g. the ``traceplot()`` that takes the trace object as its input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pymc3 as pm\n", "pm.traceplot(linreg_xy_err2.trace)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 2 }