{ "cells": [ { "cell_type": "markdown", "id": "798a4df1", "metadata": {}, "source": [ "Sascha Spors,\n", "Professorship Signal Theory and Digital Signal Processing,\n", "Institute of Communications Engineering (INT),\n", "Faculty of Computer Science and Electrical Engineering (IEF),\n", "University of Rostock,\n", "Germany\n", "\n", "# Data Driven Audio Signal Processing - A Tutorial with Computational Examples\n", "\n", "Winter Semester 2023/24 (Master Course #24512)\n", "\n", "- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture\n", "- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise\n", "\n", "Feel free to contact lecturer frank.schultz@uni-rostock.de" ] }, { "cell_type": "markdown", "id": "01338f2a", "metadata": {}, "source": [ "# Linear Regression with Ordinary Least Squares (OLS)\n", "\n", "- toy example for linear regression using ordinary least squares (OLS)\n", "- we use package `statsmodels`\n", "- we check these results with manual calculations\n", "- very useful textbooks that might be referenced in the code below\n", " - [FHT96] Ludwig Fahrmeir, Alfred Hamerle, Gerhard Tutz (1996): \"Multivariate statistische Verfahren\", 2nd ed., de Gruyter. https://doi.org/10.1515/9783110816020\n", " - [FKLM21] Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx (2021): \"Regression\", 2nd ed., Springer. https://doi.org/10.1007/978-3-662-63882-8\n", " - [DB18] Annette J. Dobson, Adrian G. Barnett (2018): \"An Introduction to Generalized Linear Models\", 4th ed., CRC Press. https://doi.org/10.1201/9781315182780\n", " - [MT11] Henrik Madsen, Poul Thyregod (2011): \"Introduction to General and Generalized Linear Models\", CRC Press. https://doi.org/10.1201/9781439891148\n", " - [Agresti15] Alan Agresti (2015): \"Foundations of Linear and Generalized Models\", Wiley. 
https://www.wiley.com/en-us/Foundations+of+Linear+and+Generalized+Linear+Models-p-9781118730034\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f12c6bfd", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from numpy.linalg import inv, pinv\n", "from scipy.stats import f, t\n", "from scipy.linalg import svdvals\n", "from statsmodels.api import OLS\n", "from statsmodels.graphics.gofplots import qqplot\n", "\n", "\n", "rng = np.random.default_rng(1234) # to reproduce results" ] }, { "cell_type": "code", "execution_count": null, "id": "27f61068", "metadata": {}, "outputs": [], "source": [ "def my_r2(X, theta_hat, y):\n", " \"\"\"(Adjusted) coefficient of determination R^2.\n", "\n", " equals the squared empirical correlation coefficient between y and yhat\n", " \"\"\"\n", " N = X.shape[0] # number of samples / observations\n", " p = X.shape[1] # number of model parameters (including intercept beta[0])\n", "\n", " yhat = X @ theta_hat # do regression / prediction\n", "\n", " # sum of squares T.otal:\n", " SST = np.sum((y - np.mean(y)) ** 2)\n", " # sum of squares due to R.egression:\n", " SSR = np.sum((yhat - np.mean(y)) ** 2)\n", " # sum of squared E.rrors:\n", " SSE = np.sum((y - yhat) ** 2)\n", " # SST = SSR + SSE holds (numerical errors might corrupt this equality)\n", "\n", " # R2 = SSR / SST is the ratio between regression and total, 0<=R2<=1\n", " # rearranged R2 = (SST - SSE) / SST = 1**2 - SSE / SST\n", " # p.58 in [MT11], p.111 in [DB18], p.108 (1.34)-(1.36) in [FHT96],\n", " # p.125 in [FKLM21], Ch.2.4.6 in [Agresti15]\n", " R2 = 1**2 - SSE / SST\n", "\n", " # R2 should be adjusted by number of samples and model complexity\n", " # note that this equation holds for models that include an intercept term\n", " # p.58 in [MT11], p.163 in [FKLM21], Ch.2.4.6 in [Agresti15]\n", " # R2adj = 1 - (1 - R2) * (N - 1) / (N - p)\n", " # or rewritten to see the adjustments in a more convenient way\n", " R2adj = 1**2 - (SSE / (N - p)) / (SST / (N - 1))\n", " return (R2, R2adj)\n", "\n", "\n", "def my_deviance_for_normal_pdf(X, beta_hat, y):\n", " \"\"\"Scaled deviance for normally distributed data and errors.\n", "\n", " - is basically the sum of squared errors (SSE)\n", " - cf. p.89ff in [DB18], p.133 in [Agresti15], [MT11]\n", " - note that deviances are different for other distributions, but\n", " the same statistical concepts using deviances hold\n", " - so instead of thinking of this as SSE we should get used to deviances\n", " \"\"\"\n", " D = np.sum((y - X @ beta_hat) ** 2) # SSE\n", " # SSE can also be calculated with the dot product:\n", " # tmp = y - X @ beta_hat\n", " # D = np.squeeze(tmp[:, None].T @ tmp[:, None])\n", " return D\n", "\n", "\n", "alpha = 0.05 # standard alpha error for H0 rejection" ] }, 
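{ "cell_type": "code", "execution_count": null, "id": "1a2b3c4d", "metadata": {}, "outputs": [], "source": [ "# added sketch (not part of the original toy example): quick sanity check of\n", "# the claim in my_r2() that SST = SSR + SSE holds for OLS with an intercept,\n", "# demonstrated on a tiny made-up dataset\n", "X_tiny = np.column_stack((np.ones(5), np.arange(5.0)))\n", "y_tiny = np.array([0.9, 2.1, 2.9, 4.2, 4.8])\n", "beta_tiny = pinv(X_tiny) @ y_tiny\n", "yhat_tiny = X_tiny @ beta_tiny\n", "SST_tiny = np.sum((y_tiny - np.mean(y_tiny)) ** 2)\n", "SSR_tiny = np.sum((yhat_tiny - np.mean(y_tiny)) ** 2)\n", "SSE_tiny = np.sum((y_tiny - yhat_tiny) ** 2)\n", "print(np.allclose(SST_tiny, SSR_tiny + SSE_tiny)) # expect True up to numerics\n", "print(my_r2(X_tiny, beta_tiny, y_tiny)) # (R2, R2adj) of the tiny example" ] }, 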
{ "cell_type": "markdown", "id": "2440b232", "metadata": {}, "source": [ "## Create some data of the real world and add measurement noise\n", "\n", "- create design matrix X with a constant intercept column and several (nonlinear) functions of x, cf. the code below. This yields the\n", " - independent variables, aka\n", " - exog\n", " - right hand side\n", " - regressors\n", " - design\n", " - explanatory variable\n", " - **features**\n", "- define some true beta coefficients for the **real** world phenomenon\n", "- create outcome y as a linear combination of the columns in X with **added noise**. This yields the\n", " - dependent variables, aka\n", " - endog\n", " - left hand side\n", " - regressand\n", " - outcome\n", " - response variable\n", "\n", "$$\\mathbf{y} = \\mathbf{X} \\mathbf{\\beta} + \\mathbf{noise}$$\n", "\n", "- note that the vector $\\mathbf{noise}$ might partly *live* in the column space of $\\mathbf{X}$ and if so it will influence the estimated solution $\\hat{\\mathbf{\\beta}}$, see the quick check after the code cell below" ] }, { "cell_type": "code", "execution_count": null, "id": "b0a8dc36", "metadata": {}, "outputs": [], "source": [ "N = 2**5 # number of observations / samples\n", "\n", "noise = rng.normal(loc=0, scale=1.5, size=N)\n", "print(\"noise: mean=\", np.mean(noise), \", std=\", np.std(noise, ddof=1))\n", "\n", "x = np.linspace(1e-16, 200, N)\n", "X = np.column_stack((x, 500 * np.abs(np.sin(x)), np.abs(np.log(x))))\n", "print(X.shape)\n", "beta = np.array([1e0, 1e-1, 1e-2, 1e-3]) # some nice numbers for true beta\n", "# add bias column to design matrix\n", "X = np.hstack((np.ones((X.shape[0], 1)), X))\n", "hasconst = True\n", "\n", "\n", "debug_flag = False\n", "if debug_flag: # works well for N=2**4\n", " X = np.column_stack((x**1, x**3, x**5))\n", " print(X.shape)\n", " beta = np.array([0, 1e0, 0, 1e-2])\n", " X = np.hstack((np.ones((X.shape[0], 1)), X))\n", " noise = rng.normal(loc=0, scale=0.5, size=N)\n", "\n", "\n", "# generate 'real world' data with design matrix, add noise\n", "y = np.dot(X, beta) + noise\n", "print(X.shape)\n", "print(y.shape)\n", "\n", "\n", "if debug_flag:\n", " print(\"debug_flag\", debug_flag)\n", " # we might want to get another design matrix to debug F/p and t/p in detail\n", " X = np.column_stack((x**0.125, x**0.25, x**0.75))\n", " X = np.hstack((np.ones((X.shape[0], 1)), X))\n", "\n", "print(svdvals(X))" ] }, 
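{ "cell_type": "code", "execution_count": null, "id": "2b3c4d5e", "metadata": {}, "outputs": [], "source": [ "# added sketch: quantify how much of the noise vector 'lives' in the column\n", "# space of X, cf. the note above; the hat matrix H = X @ pinv(X) is the\n", "# orthogonal projector onto that column space, so H @ noise is the noise\n", "# component that an OLS fit cannot separate from the true model\n", "H = X @ pinv(X) # projection / hat matrix\n", "noise_in_colspace = H @ noise\n", "energy_fraction = np.linalg.norm(noise_in_colspace) ** 2 / np.linalg.norm(noise) ** 2\n", "print(\"fraction of noise energy in column space of X:\", energy_fraction)" ] }, 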
{ "cell_type": "markdown", "id": "d3c658e8", "metadata": {}, "source": [ "## Choose a model type, define the model, train the model with the above data\n", "\n", "Choosing the model type here is a bit pointless, as we intentionally deal with OLS. However, we could ask ourselves whether weighted least squares, LS with L1 / L2 regularization, or other GLMs would do a good job here.\n", "\n", "We assume that the outcome originates from the above data synthesis\n", "$$\\mathbf{y} = \\mathbf{X} \\mathbf{\\beta} + \\mathbf{noise}$$\n", "We could set up a design matrix $\\mathbf{X}_f$ for the linear model\n", "$$\\hat{\\mathbf{y}} = \\mathbf{X}_f \\hat{\\mathbf{\\beta}}$$\n", "such that\n", "$$||\\mathbf{y} - \\hat{\\mathbf{y}}||_2^2$$\n", "is minimized by choosing / calculating the 'right' coefficients $\\hat{\\mathbf{\\beta}}$.\n", "Note the distinction between $\\mathbf{X}$ (the real-world features) and $\\mathbf{X}_f$ (the features that we think will explain the world and that we have chosen carefully in advance). Below, we will use $\\mathbf{X} = \\mathbf{X}_f$ as the starting point.\n", "\n", "There are two major concepts for solving the problem\n", "$$\\mathrm{min}_{\\mathrm{wrt}\\, \\beta}||\\mathbf{y} - \\hat{\\mathbf{y}}||_2^2$$\n", "- one assuming that data, estimators, residuals and so on are normally distributed\n", "- one not assuming this and hoping that the solution still provides good results\n", "\n", "The first concept solves the problem by maximum likelihood estimation (MLE); for the second we can trust linear algebra and solve an over-determined set of linear equations, which is precisely what OLS does. It turns out that MLE yields the same result as OLS under normal distribution assumptions. This means that if our data approximately fulfill the normal distribution assumptions and we derive the solution via OLS, we are very close to the MLE solution. We should read up on this in the textbooks mentioned above!\n", "\n", "The solution (assuming $\\mathbf{X}$ has full column rank) is given as\n", "$$\\hat{\\mathbf{\\beta}} = (\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T} \\mathbf{y}$$\n", "\n", "The matrix $(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T}$ is a left inverse of $\\mathbf{X}$, i.e. multiplying $\\mathbf{X}$ from the left yields\n", "$$(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T} \\mathbf{X} = \\mathbf{I}$$\n", "\n", "For the predictions\n", "$$\\hat{\\mathbf{y}} = \\mathbf{X}\\cdot\\hat{\\mathbf{\\beta}} = \\mathbf{X}\\cdot(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T} \\mathbf{y}$$\n", "we find that the hat matrix $\\mathbf{X}(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T}$ is a projection matrix which projects $\\mathbf{y}$ onto $\\hat{\\mathbf{y}}$, i.e. onto the column space of $\\mathbf{X}$.\n", "\n", "The projection matrix $\\mathbf{I} - \\mathbf{X}(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T}$ projects $\\mathbf{y}$ onto the left null space of $\\mathbf{X}$, which is orthogonal to the column space. This results in the residual vector\n", "$$\\mathbf{e} = \\mathbf{y} - \\hat{\\mathbf{y}} = (\\mathbf{I} - \\mathbf{X}(\\mathbf{X}^\\mathrm{T}\\mathbf{X})^{-1}\\mathbf{X}^\\mathrm{T}) \\mathbf{y}$$\n", "\n", "By that, we know $\\mathbf{e} \\perp \\hat{\\mathbf{y}}$. Very often $||\\mathbf{e}||_2^2$ is termed the sum of squared errors (SSE) in the context of OLS. We will see other sums of squares below.\n", "\n", "Training/Fitting...TBD...this is probably a nice homework assignment" ] }, { "cell_type": "code", "execution_count": null, "id": "cec971e2", "metadata": {}, "outputs": [], "source": [ "model = OLS(y, X, hasconst=hasconst)\n", "results = model.fit() # this solves the OLS problem, we fit / train the model\n", "# to estimate the unknown model parameters beta_hat\n", "print(results.summary()) # get some useful information about this model" ] }, { "cell_type": "markdown", "id": "ea1b6730", "metadata": {}, "source": [ "## Comparison of manual calculation vs. statsmodels results" ] }, { "cell_type": "code", "execution_count": null, "id": "287a3058", "metadata": {}, "outputs": [], "source": [ "# number of observations\n", "X.shape[0], results.nobs" ] }, { "cell_type": "code", "execution_count": null, "id": "363edf01", "metadata": {}, "outputs": [], "source": [ "# df of model parameters\n", "# we count all beta coefficients except beta[0], which is considered as the const / intercept term\n", "df_model = X.shape[1] - 1\n", "df_model, results.df_model" ] }, { "cell_type": "code", "execution_count": null, "id": "0f1e69b1", "metadata": {}, "outputs": [], "source": [ "# df of residuals\n", "# with the default N=32 observations, we spend one df for the intercept (the mean level of y)\n", "# and three df for the beta coeffs 1,2,3. Thus, df of 28 is remaining\n", "df_resid = X.shape[0] - X.shape[1]\n", "df_resid, results.df_resid" ] }, { "cell_type": "code", "execution_count": null, "id": "0a15d3d8", "metadata": {}, "outputs": [], "source": [ "# check coefficients\n", "beta_hat = pinv(X) @ y\n", "print(beta_hat) # cf. with 'coef' in the summary table above\n", "np.allclose(beta_hat, results.params)" ] }, 
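{ "cell_type": "code", "execution_count": null, "id": "3c4d5e6f", "metadata": {}, "outputs": [], "source": [ "# added sketch: numerically check the closed-form expressions from the\n", "# markdown above; the SVD-based pinv(X) @ y and the textbook normal\n", "# equations solution inv(X.T @ X) @ X.T @ y should agree for full column\n", "# rank X, with pinv usually being the numerically safer choice\n", "beta_hat_ne = inv(X.T @ X) @ X.T @ y # normal equations solution\n", "print(np.max(np.abs(beta_hat_ne - beta_hat))) # expect a tiny deviation only\n", "# the hat matrix projects y onto the column space of X and is idempotent\n", "H = X @ inv(X.T @ X) @ X.T\n", "print(np.max(np.abs(H @ H - H))) # expect a tiny deviation only\n", "# residuals are orthogonal to the predictions\n", "print(np.dot(y - H @ y, H @ y)) # expect a value close to zero" ] }, 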
{ "cell_type": "code", "execution_count": null, "id": "7842da25", "metadata": {}, "outputs": [], "source": [ "# check prediction\n", "yhat = X @ beta_hat # predict outcomes on design matrix data\n", "np.allclose(yhat, results.predict(X))" ] }, { "cell_type": "code", "execution_count": null, "id": "28a74e13", "metadata": {}, "outputs": [], "source": [ "plt.plot(y, \"C0o-\", label=\"y\")\n", "plt.plot(yhat, \"C1o:\", ms=3, label=\"yhat\")\n", "plt.xlabel(\"sample index\")\n", "plt.legend()\n", "plt.grid(True)" ] }, { "cell_type": "code", "execution_count": null, "id": "80e855cd", "metadata": {}, "outputs": [], "source": [ "# check residuals\n", "resid = y - yhat\n", "np.allclose(resid, results.resid)" ] }, { "cell_type": "code", "execution_count": null, "id": "06520707", "metadata": {}, "outputs": [], "source": [ "# check Rsquared, the coefficient of determination, which equals the squared\n", "# empirical correlation coefficient between y and yhat\n", "R2, R2adj = my_r2(X, beta_hat, y)\n", "print(\"R-squared:\", R2, \"Adj. R-squared:\", R2adj)\n", "print(\"R-squared:\", results.rsquared, \"Adj. R-squared:\", results.rsquared_adj)\n", "np.allclose(R2, results.rsquared), np.allclose(R2adj, results.rsquared_adj)" ] }, { "cell_type": "code", "execution_count": null, "id": "779007be", "metadata": {}, "outputs": [], "source": [ "# overall F-test (goodness-of-fit test)\n", "# the central F distribution is used for this hypothesis test\n", "# it checks if there is a linear relationship between the outcome and ANY of the\n", "# regressors (i.e. the explanatory variables, features), i.e. H0 is\n", "# beta_hat[1]=beta_hat[2]=beta_hat[...] = 0, thus under H0 we have the\n", "# intercept-only model beta_hat[0] = np.mean(y)\n", "# H1: at least ONE regressor in X[:,1:] explains the data in a statistically\n", "# significant way\n", "# see p.109 (1.38) in [FHT96], p.147 in [FKLM21]\n", "\n", "# do regression, calc usual sums of squares\n", "yhat = X @ beta_hat # do regression / prediction\n", "# sum of squares Total\n", "SST = np.sum((y - np.mean(y)) ** 2)\n", "# sum of squares due to Regression model\n", "# my_deviance_for_normal_pdf(X, beta_hat, np.mean(y))\n", "SSR = np.sum((yhat - np.mean(y)) ** 2)\n", "# sum of squared Errors\n", "SSE = np.sum((y - yhat) ** 2) # my_deviance_for_normal_pdf(X, beta_hat, y)\n", "\n", "# get the degrees of freedom\n", "p = X.shape[1]\n", "q = 1 # number of parameters in the H0 (intercept-only) model\n", "dfn, dfd = p - q, N - p # numerator df (model), denominator df (residual)\n", "\n", "# we should avoid this due to numerical precision issues:\n", "Fval = R2 / (1 - R2) * dfd / dfn\n", "# we'd better use the equivalent, numerically more stable\n", "# and probably more intuitive:\n", "Fval = SSR / SSE * dfd / dfn\n", "# or even better, because F-values are signal-to-noise ratios, where the SS\n", "# are normalized by their corresponding dfs:\n", "Fval = (SSR / dfn) / (SSE / dfd)\n", "\n", "probF = f.sf(Fval, dfn, dfd) # get the probability for this F value and the dfs\n", "print(\"F\", Fval, results.fvalue)\n", "print(\"probability for F\", probF, results.f_pvalue)\n", "np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)\n", "print(\"reject H0?\", probF < alpha) # if True we can reject H0" ] }, { "cell_type": "code", "execution_count": null, "id": "84d97a05", "metadata": {}, "outputs": [], "source": [ "# we could consider this overall F-test (goodness-of-fit test) as a very\n", "# special case of deviance statistics, which actually are likelihood ratio tests\n", "# cf. Ch.5.7.1 in [DB18], Ch.3.5 in [MT11]\n", "D0 = SST # deviance of the restricted H0 model (intercept only), i.e. yhat = np.mean(y)\n", "D1 = my_deviance_for_normal_pdf(X, beta_hat, y) # deviance of the regression model\n", "Fval = ((D0 - D1) / (dfn)) / (D1 / (dfd))\n", "\n", "probF = f.sf(Fval, dfn, dfd) # get the probability for this F value and the dfs\n", "print(\"F\", Fval, results.fvalue)\n", "print(\"probability for F\", probF, results.f_pvalue)\n", "np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)" ] }, 
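{ "cell_type": "code", "execution_count": null, "id": "4d5e6f7a", "metadata": {}, "outputs": [], "source": [ "# added sketch: the same nested-model comparison can be delegated to\n", "# statsmodels by fitting the restricted H0 (intercept-only) model and calling\n", "# compare_f_test(); assuming this API is available in the installed\n", "# statsmodels version, F and p should match the overall F-test from above\n", "results_h0 = OLS(y, X[:, :1], hasconst=True).fit() # intercept-only model\n", "f_cmp, p_cmp, df_diff = results.compare_f_test(results_h0)\n", "print(\"F\", f_cmp, \"probability for F\", p_cmp, \"df difference\", df_diff)\n", "np.allclose(f_cmp, results.fvalue), np.allclose(p_cmp, results.f_pvalue)" ] }, 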
{ "cell_type": "code", "execution_count": null, "id": "5622d4b5", "metadata": {}, "outputs": [], "source": [ "# log-likelihood\n", "# cf. https://github.com/statsmodels/statsmodels/blob/main/statsmodels/regression/linear_model.py\n", "# line 949ff\n", "# cf. p.54 eq. (2.6) in [FHT96] and p.89 in [Agresti15] for the log-likelihood case\n", "\n", "# here rather the profile log-likelihood is given, as we don't know the true sigma\n", "# cf. p.89 in [Agresti15] last equation on this page, where the likelihood is given\n", "# for estimated beta AND sigma\n", "nobs = X.shape[0]\n", "nobs2 = X.shape[0] / 2\n", "ssr = np.sum((y - yhat) ** 2)\n", "llf = -nobs2 * np.log(2 * np.pi) - nobs2 * np.log(ssr / nobs) - nobs2\n", "print(llf)\n", "np.allclose(llf, results.llf)" ] }, { "cell_type": "code", "execution_count": null, "id": "5fb52b00", "metadata": {}, "outputs": [], "source": [ "# Akaike information criterion\n", "# cf. p.146 in [Agresti15], p.165 in [DB18]\n", "# https://en.wikipedia.org/wiki/Akaike_information_criterion\n", "AIC = -2 * (llf - X.shape[1])\n", "print(AIC)\n", "np.allclose(AIC, results.aic)" ] }, { "cell_type": "code", "execution_count": null, "id": "18986152", "metadata": {}, "outputs": [], "source": [ "# Bayesian information criterion\n", "# cf. p.165 in [DB18], p.165 in [FKLM21]\n", "BIC = -2 * llf + X.shape[1] * np.log(X.shape[0])\n", "print(BIC)\n", "np.allclose(BIC, results.bic)" ] }, { "cell_type": "code", "execution_count": null, "id": "5eb74af7", "metadata": {}, "outputs": [], "source": [ "# standardized residuals\n", "# cf. Ch.2.5. in [Agresti15], p.136ff in [FKLM21], Ch.6.2.6 in [DB18]\n", "H = X @ pinv(X) # hat matrix, this is the matrix that maps y to yhat\n", "# i.e. it projects y onto the column space of the design matrix X, resulting in yhat\n", "\n", "e = y - yhat # residuals\n", "p = X.shape[1] # number of model parameters, so N - p is the df of the residuals\n", "sg2_e = np.sum(e**2) / (N - p)\n", "sg_e = np.sqrt(sg2_e)\n", "\n", "std_res = e / (sg_e * np.sqrt(1 - np.diag(H))) # standardized residuals\n", "fig, ax = plt.subplots(figsize=(4, 4))\n", "qqplot(std_res, ax=ax, line=\"45\")\n", "ax.axis(\"equal\")\n", "ax.grid(True)\n", "ax.set_title(\"standardized residuals vs. standard normal PDF\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4b8ec8d9", "metadata": {}, "outputs": [], "source": [ "print(results.summary())" ] }, { "cell_type": "code", "execution_count": null, "id": "6c07d9c5", "metadata": {}, "outputs": [], "source": [ "# test of significance for coefficient beta_hat[i]\n", "# H0: beta_hat[i] = 0, H1: beta_hat[i] != 0\n", "e = y - yhat # residuals\n", "p = X.shape[1]\n", "sg2_e = np.sum(e**2) / (N - p) # unbiased estimator of residuals' variance\n", "sg_e = np.sqrt(sg2_e) # estimator of residuals' standard deviation\n", "\n", "# estimator for covariance matrix, cf. p.132 in [FKLM21]\n", "Cov_beta_hat = sg2_e * np.linalg.inv(X.T @ X)\n", "std_err = np.sqrt(np.diag(Cov_beta_hat))\n", "print(\"std_err\", std_err)\n", "tval = beta_hat / std_err\n", "print(\"\\nt values:\", tval)\n", "probt = t.sf(np.abs(tval), N - p) * 2\n", "print(\"\\nprob > |t|:\", probt)\n", "print(\"\\nreject H0?\", probt < alpha) # probt is two-sided, if True we can reject H0\n", "np.allclose(sg2_e, results.scale), np.allclose(\n", " tval, results.tvalues\n", "), np.allclose(probt, results.pvalues)" ] }, 
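{ "cell_type": "code", "execution_count": null, "id": "5e6f7a8b", "metadata": {}, "outputs": [], "source": [ "# added sketch: the manual covariance estimate and standard errors can also be\n", "# compared against statsmodels directly; cov_params() and bse should match\n", "# Cov_beta_hat and std_err from the cell above\n", "np.allclose(Cov_beta_hat, results.cov_params()), np.allclose(std_err, results.bse)" ] }, 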
{ "cell_type": "code", "execution_count": null, "id": "3891bde4", "metadata": {}, "outputs": [], "source": [ "# TBD: confidence intervals (CIs) for the estimated coefficients" ] }, { "cell_type": "markdown", "id": "4bf605e0", "metadata": {}, "source": [ "## Copyright\n", "\n", "- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)\n", "- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)\n", "- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)\n", "- feel free to use the notebooks for your own purposes\n", "- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), GitHub URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year." ] } ], "metadata": { "kernelspec": { "display_name": "myddasp", "language": "python", "name": "myddasp" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }