{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic Frequentist Fitting\n",
"\n",
"This introductory tutorial on frequentist fitting focuses on _parameter estimation by maximum likelihood model fitting_ and should be completed with reference to [these companion slides](https://github.com/capprogram/2021bootcamp/blob/master/BasicStatsIII.pdf). The slides also discuss interpreting χ2, which you may wish to explore further by completing the [Interpreting Chi-Squared](https://github.com/capprogram/2021bootcamp/blob/master/interpchi2/README.md) tutorial before this one (optional).\n",
"\n",
"Author: Sheila Kannappan (with prior major contributions from Kathleen Eckert, Rohan Isaac, and Amy Oldenberg)
\n",
"Last Modified: June 2021"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why do we fit models to data?\n",
"\n",
"- To understand the underlying distribution of the data\n",
"- To test hypotheses or competing theoretical models\n",
"- To predict values for future observations\n",
"\n",
"In general, any of these goals will often involve _parameter estimation_. In this tutorial we will go through the basics of parameter estimation using frequentist \"maximum likelihood\" model fitting.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Least Squares Fitting\n",
"\n",
"\"Least squares\" fitting is based on the assumption that all the measurement uncertainties σ in a data set are the same, i.e., follow the same Gaussian distribution. In the case of a linear model, the least squares fit gives the slope & y-intercept parameters that minimize the mean square residuals between the data and the model, where the residuals are measured in the y-direction in the case of \"forward\" fitting. Here the mean square residuals are given by:
where xi is the independent variable, yi is the dependent variable, and α and β are the slope and y-intercept parameter values. Minimizing the mean square residuals is equivalent to minimizing the rms (root mean square) residuals, and also equivalent to minimizing the χ2, because of the fact that all σ's have been assumed to be identical.\n",
"\n",
"Least squares fitting falls within the broader category of parameter estimation known as *Maximum Likelihood Estimation* (MLE). In this method, we measure the likelihood for a given model, typically using the χ2 statistic:\n",
"\n",
"The likelihood is given by:
where
\n",
"\n",
"To find the MLE solution to our model, we maximize the likelihood function by taking the derivative with respect to each parameter (the slope and y-intercept in the case of a linear fit) and by solving for where each derivative is equal to 0. To simplify the math, we first take the natural log of the likelihood function. For least squares fitting (wherein all σ values are taken to be equal), it is possible to obtain an analytic solution for the slope and y-intercept of a linear model, as shown below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analytic solution to the linear least squares fitting problem\n",
"\n",
"Take the natural log of the likelihood function
\n",
"\n",
"Take the derivatives of ln(L) with respect to α and β and set those equations to 0:
\n",
"
and
.\n",
"\n",
"If we assume the $\\sigma$i's are all the same, then we have two equations for two unknowns to solve:\n",
"\n",
"**eqn 1:**
\n",
"\n",
"**eqn 2:**
\n",
"\n",
"Multiply eqn 1 by N and multiply eqn 2 by
to get:\n",
"\n",
"**eqn 1**
\n",
"\n",
"**eqn 2**
\n",
"\n",
"Now we can set these two equations equal to each other:\n",
"\n",
"
\n",
"\n",
"Solving for α and dividing the top and bottom by N2:\n",
"\n",
"
\n",
"\n",
"where the bar over the variable signifies the mean value of that quantity.\n",
"\n",
"Now we can go back and solve for β:\n",
"\n",
"from **eqn 2**
\n",
"\n",
"____\n",
"\n",
"For more complicated functions or if the uncertainties are not uniform, setting the derivatives of the likelihood equal to zero may not lead to equations that are so easily solved analytically. Thus for more complicated MLE problems we typically use programs such as `np.polyfit` or `mpfit` to determine parameters numerically."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tutorial:\n",
"\n",
"In the code below we create fake data with random errors around a line of known slope and y-intercept. We then compute the maximum likelihood estimated slope and y-intercept for the fake data. Fill in lines ending with \"?\" and answer questions by putting your own comments in the code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import numpy.random as npr\n",
"# ipython \"magic\" to enable static plot output directly to notebook\n",
"%matplotlib inline\n",
"\n",
"# Generate fake data set to start with\n",
"alphatrue=2. # slope\n",
"betatrue=5. # intercept\n",
"errs=2.5 # sigma (amplitude of errors)\n",
"\n",
"narr=50 # number of data points\n",
"xvals = np.arange(narr) + 1. # xvals range from 1-51\n",
"yvals = alphatrue*xvals + betatrue + npr.normal(0,errs,narr) # yvals \n",
"# What aspect of a real data set does npr.normal emulate here?\n",
"\n",
"# What assumption is made here in the unweighted least squares approach?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot fake data\n",
"plt.plot(xvals,yvals,'b*',markersize=10)\n",
"plt.xlabel(\"x-values\")\n",
"plt.ylabel(\"y-values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analytic Linear Fit\n",
"Read over the derivation of the analytic solution to the linear least squares fitting problem (above). Add code below to compute the estimated slope and y-intercept based on the fake data set. Plot the maximum likelihood (\"best fit\") solution on top of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Determine slope & y-intercept using least squares analytic solution \n",
"\n",
"#alphaest=(np.mean(xvals)*np.mean(yvals)-np.mean(xvals*yvals)) / \\\n",
"# (np.mean(xvals)**2 -np.mean(xvals**2)) # from derivation\n",
"#betaest= ? # calculate estimate of y-intercept from derivation\n",
"\n",
"# Why must we use alphaest rather than alphatrue above?\n",
"\n",
"# The MLE values of the slope and y-intercept are equivalent to the \"least\n",
"# squares\" fit results.\n",
"print(\"analytical MLE slope = %0.7f\" %alphaest)\n",
"print(\"analytical MLE y-intercept = %0.7f\" %?)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replot fake data\n",
"plt.plot(xvals,yvals,'b*',markersize=10)\n",
"plt.xlabel(\"x-values\")\n",
"plt.ylabel(\"y-values\")\n",
"\n",
"# Overplot the MLE (\"best fit\") solution\n",
"yfitvals=xvals*alphaest+betaest\n",
"plt.plot(xvals,yfitvals,'r')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analytic uncertainties\n",
"For linear least squares fitting, we can obtain analytic formulae for the uncertainties on the slope and y-intercept estimates, which have been provided below. (See http://mathworld.wolfram.com/LeastSquaresFitting.html for the full derivation.)
\n",
"\\right&space;)^2}{(N-2)\\sum(x_i-\\bar{x})^2}}\"/)
\n",
"\\right)^2}{N-2}\\right)&space;\\left(\\frac{1}{N}+\\frac{(\\bar{x})^2}{\\sum(x_i-\\bar{x})^2}\\right)}\"/)
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute analytic uncertainties for the slope and y-intercept\n",
"\n",
"alphaunc = np.sqrt(np.sum((yvals - (alphaest*xvals+betaest))**2) / ((narr-2.)*(np.sum((xvals-np.mean(xvals))**2))))\n",
"betaunc = np.sqrt((np.sum((yvals - (alphaest*xvals+betaest))**2) / (narr-2.)) * ((1./narr) + (np.mean(xvals)**2)/np.sum((xvals-np.mean(xvals))**2)) )\n",
"\n",
"print(\"analytic MLE uncertainty on alpha is %0.7f\" % (alphaunc))\n",
"print(\"analytic MLE uncertainty on beta is %0.7f\" % (betaunc))\n",
"\n",
"print(\"fractional uncertainty on alpha is %0.7f\" % (alphaunc/alphaest))\n",
"print(\"fractional uncertainty on beta is %0.7f\" % (betaunc/betaest))\n",
"\n",
"# Which parameter has larger fractional uncertainty?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What if you don't want to do all that algebra (or you can't)?\n",
"\n",
"Most MLE problems do not have analytic solutions, so must be treated numerically. Read up on [np.polyfit](http://docs.scipy.org/doc/numpy/reference/generated/numpy.polynomial.polynomial.polyfit.html) to learn how to fit a polynomial. For more complicated functions, you may wish to try [scipy.optimize.curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) or the [python version of mpfit](https://code.google.com/archive/p/astrolibpy/downloads) (original documentation [here](https://pages.physics.wisc.edu/~craigm/idl/fitqa.html))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use `np.polyfit` to compute the slope and y-offset for the same fake \n",
"# data using numerical maximum likelihood estimation. \n",
"\n",
"# third parameter is order of fit, 1 for linear\n",
"pfit = np.polyfit(xvals,yvals,1) # returns coeff. of highest order term first\n",
"\n",
"print(\" \") # whitespace for readability\n",
"print(\"np.polyfit MLE slope = %0.7f\" %pfit[0])\n",
"print(\"np.polyfit MLE y-intercept = %0.7f\" %pfit[1])\n",
"\n",
"# Do you get the same result as in the analytic case? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: In the above example we assumed that the σ values on all data points were the same. This simplified assumption is often not the case. If the uncertainties differ, then we must include each data point's uncertainty within the MLE calculation. Although `np.polyfit` does not automatically take an array of uncertainties on the y value, we can input an optional weight vector: `fit=np.polyfit(xvals, yvals, 1, w=1/sig)` where `sig` is an array containing the uncertainty on each data point. Here the input is `1/sig` rather than `1/sig**2` as you might expect. The `np.polyfit` function squares the weight value within the source code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numerically estimating uncertainties\n",
"\n",
"A quick method to determine uncertainties numerically is to have `np.polyfit` compute the covariance matrix:\n",
"
which is the inverse of the Hessian Matrix, consisting of second derivatives of the log likelihood with respect to different model parameters. When these matrices get tricky to compute, we can resort to even more approximate numerical techniques such as the bootstrap (which is another tutorial in the bootcamp).\n",
"\n",
"Below we add `cov=\"True\"` to the `np.polyfit` function call to compute the covariance matrix numerically. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# calculate and print parameter uncertainties from the diagonal terms\n",
"# of the covariance matrix, which is the inverse of the Hessian matrix\n",
"# and can be computed in np.polyfit by setting cov='True'\n",
"\n",
"pfit,covp = np.polyfit(xvals, yvals, 1, cov='True') # returns coeff. of highest power first\n",
"\n",
"print(\"slope is %0.7f +- %0.7f\" % (pfit[0], np.sqrt(covp[0,0])))\n",
"print(\"intercept is %0.7f +- %0.7f\" % (pfit[1], np.sqrt(covp[1,1])))\n",
"\n",
"# How are the errors related to the terms of the covariance matrix?\n",
"\n",
"# Are the uncertainties the same as in the analytic solution?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's compare the analytically and numerically derived uncertainties for different sizes of data set. Try narr=10 and narr=100 in the code below, which recaps our calculations thus far and computes the fractional difference in uncertainties (if any -- this depends on your version of numpy)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate fake data set to start with\n",
"alphatrue=2. # slope\n",
"betatrue=5. # intercept\n",
"errs=2.5 # sigma (amplitude of errors)\n",
"\n",
"narr=50 # number of data points\n",
"xvals = np.arange(narr) + 1. # xvals range from 1-51\n",
"yvals = alphatrue*xvals + betatrue + npr.normal(0,errs,narr) # yvals \n",
"\n",
"# Compute analytic parameters and uncertainties\n",
"alphaest=(np.mean(xvals)*np.mean(yvals)-np.mean(xvals*yvals)) / \\\n",
" (np.mean(xvals)**2 -np.mean(xvals**2)) # from derivation\n",
"betaest= np.mean(yvals) - alphaest * np.mean(xvals) # calculate estimate of y-intercept from derivation\n",
"alphaunc = np.sqrt(np.sum((yvals - (alphaest*xvals+betaest))**2) / ((narr-2.)*(np.sum((xvals-np.mean(xvals))**2))))\n",
"betaunc = np.sqrt((np.sum((yvals - (alphaest*xvals+betaest))**2) / (narr-2.)) * ((1./narr) + (np.mean(xvals)**2)/np.sum((xvals-np.mean(xvals))**2)) )\n",
"\n",
"print(\"analytic slope is %0.7f +- %0.7f\" % (alphaest, alphaunc))\n",
"print(\"analytic y-intercept is %0.7f +- %0.7f\" % (betaest, betaunc))\n",
"\n",
"# Compute numerical parameters and uncertainties \n",
"pfit,covp = np.polyfit(xvals, yvals, 1, cov='True') # returns coeff. of highest power first\n",
"\n",
"print(\"numerical slope is %0.7f +- %0.7f\" % (pfit[0], np.sqrt(covp[0,0])))\n",
"print(\"numerical y-intercept is %0.7f +- %0.7f\" % (pfit[1], np.sqrt(covp[1,1])))\n",
"\n",
"# Try changing narr to 10 or 100 in the code above. Print out the fractional \n",
"# difference in the analytically and numerically derived uncertainties.\n",
"\n",
"fracdiffalpha = (np.sqrt(covp[0,0]) - alphaunc)/alphaunc\n",
"fracdiffbeta = (np.sqrt(covp[1,1]) - betaunc)/betaunc\n",
"\n",
"print(\"fractional difference in uncertainty for slope %0.7f and intercept %0.7f\" % (fracdiffalpha,fracdiffbeta))\n",
"\n",
"# What happens to the uncertainties if you increase/decrease the number of points used in the fit (try N=100, N=10) ?\n",
"\n",
"# What happens to the percentage difference between the analytical and numerical methods for computing the uncertanties if you increase/decrease the number of points (try N=100, N=10)?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Solutions to the tutorial (with answers to the questions) can be loaded by executing the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load paramfit1_soln"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}