{ "cells": [ { "cell_type": "markdown", "id": "4ff5888c", "metadata": {}, "source": [ "Sascha Spors,\n", "Professorship Signal Theory and Digital Signal Processing,\n", "Institute of Communications Engineering (INT),\n", "Faculty of Computer Science and Electrical Engineering (IEF),\n", "University of Rostock,\n", "Germany\n", "\n", "# Data Driven Audio Signal Processing - A Tutorial with Computational Examples\n", "\n", "Master Course #24512\n", "\n", "- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture\n", "- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise\n", "\n", "Feel free to contact lecturer frank.schultz@uni-rostock.de" ] }, { "cell_type": "markdown", "id": "dcd5fabd", "metadata": {}, "source": [ "# Principal Component Analysis (PCA)\n", "\n", "- example in 2D\n", "- data matrix with N samples in F=2 features" ] }, { "cell_type": "code", "execution_count": null, "id": "b3a028b6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy.linalg import svd, diagsvd\n", "from statsmodels.multivariate.pca import PCA" ] }, { "cell_type": "markdown", "id": "48443673", "metadata": {}, "source": [ "## Create Data and Plot in Original Coordinate System\n", "\n", "We intentionally create a mean-free data matrix **X**, then SVD and PCA exhibit the same concepts." ] }, { "cell_type": "code", "execution_count": null, "id": "09b253b3", "metadata": {}, "outputs": [], "source": [ "# be careful when changing the seed\n", "# SVD U and V vectors might in this case need reflections\n", "# to match with statsmodels' results\n", "# also when reflections occur the mapping of original data -> PC scores\n", "# is not anymore simply explained by just rotating the data set\n", "# hence, we should elaborate on the with_reflection = True case\n", "with_reflection = True\n", "if with_reflection:\n", " rng = np.random.default_rng(6)\n", " # index for specific data points to plot with specific colors\n", " # this helps to identify the reflection\n", " di1 = 12\n", " di2 = 80\n", "else:\n", " rng = np.random.default_rng(1)\n", " # index for specific data points to plot with specific colors\n", " di1 = 12\n", " di2 = 91\n", "\n", "# construct 2 features from normal PDF\n", "mean = [0, 0]\n", "cov = [[1.3, 0.99], [0.99, 1]]\n", "\n", "N = 200 # no of samples\n", "x, y = rng.multivariate_normal(mean, cov, N).T\n", "X = np.array([x, y]).T\n", "# the PCA assumes mean-free columns, so we explicitly design X this way\n", "X = X - np.mean(X, axis=0)\n", "\n", "# we could also normalize by standard deviation,\n", "# then each column of X has unit variance and total variance of X is 2: \n", "if False:\n", " X = X / np.std(X, axis=0, ddof=1)\n", " print(np.var(X, axis=0, ddof=1), np.sum(np.var(X, axis=0, ddof=1)) )\n", "\n", "F = X.shape[1]\n", "\n", "print(\"X.shape\", X.shape) # (200, 2)\n", "print(\"rank\", np.linalg.matrix_rank(X))\n", "print(\"condition number\", np.linalg.cond(X))\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(X[:, 0], X[:, 1], \"x\", color=\"gray\")\n", "ax.plot(X[di1, 0], X[di1, 1], \"C1x\", ms=10, mew=3)\n", "ax.plot(X[di2, 0], X[di2, 1], \"C3x\", ms=10, mew=3)\n", "\n", "ax.axis(\"square\")\n", "ax.set_xlim(-5, 5)\n", "ax.set_ylim(-5, 5)\n", "ax.set_xlabel(\"original feature 1\")\n", "ax.set_ylabel(\"original feature 2\")\n", "ax.set_title(\"data in original coordinate system\")\n", "ax.grid(True)\n", "plt.savefig('slides/pca_2d_original_data.pdf', dpi=600, bbox_inches='tight')" ] }, { "cell_type": "markdown", "id": "35eff801", "metadata": {}, "source": [ "## Calculate Principal Component Analysis (PCA)\n", "\n", "We use abbreviation PC for **principal component**." ] }, { "cell_type": "code", "execution_count": null, "id": "5ba3fd23", "metadata": {}, "outputs": [], "source": [ "# we work on matrix X directly (so standardize=False), and we already\n", "# made it mean free above (so demean=False), normalize=False to give us\n", "# data that is nicely connected to SVD data\n", "# PCA from statsmodels.multivariate.pca\n", "pca = PCA(X, ncomp=2, standardize=False, demean=False, normalize=False)\n", "# SVD\n", "[U, s, Vh] = svd(X, full_matrices=False)\n", "S = diagsvd(s, F, F)\n", "V = Vh.conj().T\n", "\n", "pcs = U @ S # known as PC, PC signals, PC factors, PC scores, PC features\n", "pcl = V # known as PC loadings, PC coefficients\n", "# note that sometimes V.T is called loadings, coefficients\n", "# check if statsmodels pca and our manual SVD-based PCA produce same results:\n", "print(np.allclose(pca.scores, pcs))\n", "print(np.allclose(pca.coeff.T, V))" ] }, { "cell_type": "markdown", "id": "7c67af6f", "metadata": {}, "source": [ "## Covariance Matrix Metrics\n", "\n", "For mean free data matrix **X** whole PCA information is in the SVD, but can also be derived with a covariance matrix mindset." ] }, { "cell_type": "code", "execution_count": null, "id": "6b0ee307", "metadata": {}, "outputs": [], "source": [ "covX = X.T @ X\n", "covX_N_1 = covX / (N-1)\n", "\n", "# eigval_covX are not necessarily sorted, eigvec_covX might exhibit reflections\n", "# eigval_covX_N_1 are not necessarily sorted, eigvec_covX_N_1 might exhibit reflections\n", "\n", "[eigval_covX, eigvec_covX] = np.linalg.eig(covX)\n", "[eigval_covX_N_1, eigvec_covX_N_1] = np.linalg.eig(covX_N_1)" ] }, { "cell_type": "code", "execution_count": null, "id": "d26f32b8", "metadata": {}, "outputs": [], "source": [ "# should be similar vectors but with potential reflections\n", "eigvec_covX_N_1, eigvec_covX, V" ] }, { "cell_type": "code", "execution_count": null, "id": "bf4a4f1e", "metadata": {}, "outputs": [], "source": [ "# squared singular values == not necessarily sorted eig values of covX:\n", "s**2, eigval_covX" ] }, { "cell_type": "markdown", "id": "f6933ac4", "metadata": {}, "source": [ "## Variances of PC Scores" ] }, { "cell_type": "markdown", "id": "7547342f", "metadata": {}, "source": [ "We get variances of the PC scores / PC signals." ] }, { "cell_type": "code", "execution_count": null, "id": "504c6d59", "metadata": {}, "outputs": [], "source": [ "print(np.var(pcs[:, 0], ddof=1))\n", "print(np.var(pcs[:, 1], ddof=1))\n", "var_pcs = np.var(pcs, axis=0, ddof=1)\n", "# variance of PC scores == sorted eig values of covX_N_1:\n", "var_pcs, eigval_covX_N_1" ] }, { "cell_type": "markdown", "id": "296bc4d3", "metadata": {}, "source": [ "We compare with the variances of the original features." ] }, { "cell_type": "code", "execution_count": null, "id": "cf66919f", "metadata": {}, "outputs": [], "source": [ "print(np.var(X[:, 0], ddof=1))\n", "print(np.var(X[:, 1], ddof=1))\n", "var_X = np.var(X, axis=0, ddof=1)" ] }, { "cell_type": "markdown", "id": "d2c1c5ae", "metadata": {}, "source": [ "We don't lose variance, we just distribute it over the signals (vectors) in another way." ] }, { "cell_type": "code", "execution_count": null, "id": "043c8d2b", "metadata": {}, "outputs": [], "source": [ "np.sum(var_pcs), np.sum(var_X)" ] }, { "cell_type": "markdown", "id": "f9102f1c", "metadata": {}, "source": [ "1st PC signal explains about 94 % of total variance." ] }, { "cell_type": "code", "execution_count": null, "id": "118f2da4", "metadata": {}, "outputs": [], "source": [ "var_pcs[0] / np.sum(var_pcs) * 100" ] }, { "cell_type": "markdown", "id": "0ffc492b", "metadata": {}, "source": [ "2nd PC signal explains the remaining about 6 % of total variance." ] }, { "cell_type": "code", "execution_count": null, "id": "2607502c", "metadata": {}, "outputs": [], "source": [ "var_pcs[1] / np.sum(var_pcs) * 100" ] }, { "cell_type": "markdown", "id": "54c0056b", "metadata": {}, "source": [ "Subsequently explained variance by PC scores" ] }, { "cell_type": "code", "execution_count": null, "id": "c5ef215b", "metadata": {}, "outputs": [], "source": [ "np.cumsum(var_pcs) / np.sum(var_pcs) * 100" ] }, { "cell_type": "markdown", "id": "3a625cb0", "metadata": {}, "source": [ "PC scores exhibit **sorted variances** compared to the original features.\n", "In this 2D example this is not too obvious, as original features also seem to be sorted from high to low variance." ] }, { "cell_type": "code", "execution_count": null, "id": "8ae0d7ef", "metadata": {}, "outputs": [], "source": [ "var_X / np.sum(var_pcs) * 100" ] }, { "cell_type": "markdown", "id": "c6447f10", "metadata": {}, "source": [ "The standard deviation - i.e. sqrt(variance) - of the orthogonal PC scores is helpful for the following." ] }, { "cell_type": "code", "execution_count": null, "id": "631cfe02", "metadata": {}, "outputs": [], "source": [ "std_pcs = np.std(pcs, axis=0, ddof=1)\n", "std_pcs, np.sqrt(var_pcs)" ] }, { "cell_type": "markdown", "id": "8f3ca3f8", "metadata": {}, "source": [ "## Directions of the PCs" ] }, { "cell_type": "code", "execution_count": null, "id": "d0cde702", "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(X[:, 0], X[:, 1], \"x\", color=\"gray\")\n", "ax.plot(X[di1, 0], X[di1, 1], \"C1x\", mew=3)\n", "ax.plot(X[di2, 0], X[di2, 1], \"C3x\", mew=3)\n", "\n", "# draw directions of PC axis\n", "# length follows 3*sigma rule for normal distribution\n", "# note that V not necessarily spans a right-hand system\n", "col = ['C1', 'C3']\n", "for i in range(F):\n", " ax.plot(\n", " [0, 3*std_pcs[i] * V[0, i]],\n", " [0, 3*std_pcs[i] * V[1, i]],\n", " col[i], lw=3,\n", " label=f\"{'direction of PC '}{i+1}\" f\"{' score, col '}{i+1}\" f\"{' in V, right sing vec '}{i+1}\")\n", "\n", "plt.axis(\"square\")\n", "plt.xlim(-5, 5)\n", "plt.ylim(-5, 5)\n", "plt.xlabel(\"original feature 1\")\n", "plt.ylabel(\"original feature 2\")\n", "plt.title(\"data in original coordinate system\")\n", "plt.legend()\n", "plt.grid(True)\n", "plt.savefig('slides/pca_2d_original_data_with_pcdir.pdf', dpi=600, bbox_inches='tight')" ] }, { "cell_type": "markdown", "id": "887835b8", "metadata": {}, "source": [ "## Plot in PC Coordinate System" ] }, { "cell_type": "code", "execution_count": null, "id": "90017cb1", "metadata": {}, "outputs": [], "source": [ "[U_pcs, s_pcs, Vh_pcs] = svd(pcs, full_matrices=False)\n", "S_pcs = diagsvd(s_pcs, F, F)\n", "V_pcs = Vh_pcs.conj().T\n", "\n", "# make sure that correct U/V pair for pcs holds by\n", "# introducing corresponding reflections\n", "for i in range(F):\n", " if not np.allclose((U_pcs @ S_pcs)[:, i], pcs[:, i]):\n", " U_pcs[:,i] *= -1\n", " V_pcs[:,i] *= -1\n", "# then V_pcs indicates the correct directions in the PC coordinate system\n", "print(np.allclose(U_pcs @ S_pcs, pcs))" ] }, { "cell_type": "code", "execution_count": null, "id": "17bb5b9c", "metadata": {}, "outputs": [], "source": [ "Nt = 2**6\n", "t = np.arange(Nt) / Nt * 2 * np.pi\n", "ellipse_x, ellipse_y = np.cos(t), np.sin(t)" ] }, { "cell_type": "code", "execution_count": null, "id": "250e1ab0", "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "\n", "plt.plot(pcs[:, 0], pcs[:, 1], \"x\", color=\"gray\")\n", "plt.plot(pcs[di1, 0], pcs[di1, 1], \"C1x\", mew=3)\n", "plt.plot(pcs[di2, 0], pcs[di2, 1], \"C3x\", mew=3)\n", "\n", "# draw directions of PC axis\n", "# length follows 3*sigma rule for normal distribution\n", "col = ['C1', 'C3']\n", "for i in range(F):\n", " ax.plot(\n", " [0, 3*std_pcs[i] * V_pcs[0, i]],\n", " [0, 3*std_pcs[i] * V_pcs[1, i]],\n", " col[i], lw=3,\n", " label=f\"{'direction of PC '}{i+1}\" f\"{', length = 3 std(PC '}{i+1}\" f\"{')'}\")\n", "\n", "plt.plot(1*std_pcs[0]*ellipse_x, 1*std_pcs[1] *\n", " ellipse_y, 'k--', label=r'3$\\sigma$ ellipse')\n", "plt.plot(3*std_pcs[0]*ellipse_x, 3*std_pcs[1] *\n", " ellipse_y, 'k:', label=r'1$\\sigma$ ellipse')\n", "\n", "plt.axis(\"square\")\n", "plt.xlim(-5, 5)\n", "plt.ylim(-5, 5)\n", "plt.xlabel(\"PC 1\")\n", "plt.ylabel(\"PC 2\")\n", "plt.title(\"data in PC coordinate system\")\n", "plt.legend()\n", "plt.grid(True)\n", "plt.savefig('slides/pca_2d_pc_data.pdf', dpi=600, bbox_inches='tight')" ] }, { "cell_type": "markdown", "id": "83515671", "metadata": {}, "source": [ "For rank R data matrix, the data is rotated (and potentially reflected in some axes) such that **along the axis of PC 1 most variance occurs** (most data spread), whereas along the axis of the R-th PC (last PC, here in the case it is PC 2) fewest variance occurs.\n", "Generally, var(PC 1)>var(PC 2)>...>var(PC R) just as the sorting of the singular values in the SVD. Recall how we calculated `pcs` and var of `pcs`..." ] }, { "cell_type": "markdown", "id": "8b19e684", "metadata": {}, "source": [ "## Truncated SVD\n", "\n", "We do **rank reduction / low-rank approximation**. Hence, we keep number of features, but the data is simplified in this feature space." ] }, { "cell_type": "code", "execution_count": null, "id": "ecd2c8d3", "metadata": {}, "outputs": [], "source": [ "r_des = 1\n", "\n", "# SVD mindset\n", "# sum of outer products, i.e. sum of rank-1 matrices\n", "X_rank_red = np.zeros((N, F))\n", "for i in range(r_des):\n", " X_rank_red += s[i] * np.outer(U[:, i], V[:, i])\n", "\n", "# PCA mindset\n", "# we might also use the PC signals and set intended PC loading column to zero\n", "X_rank_red2 = np.zeros((N, F))\n", "pcl_rank_red = np.copy(pcl)\n", "pcl_rank_red[:, r_des:] = 0\n", "X_rank_red2 = pcs @ pcl_rank_red.conj().T\n", "print(np.allclose(X_rank_red, X_rank_red2))\n", "pcl_rank_red" ] }, { "cell_type": "code", "execution_count": null, "id": "1ac13ecf", "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "\n", "ax.plot(X_rank_red2[:, 0], X_rank_red2[:, 1], \"x\", color=\"gray\")\n", "ax.plot(X_rank_red2[di1, 0], X_rank_red2[di1, 1], \"C1x\", mew=3)\n", "ax.plot(X_rank_red2[di2, 0], X_rank_red2[di2, 1], \"C3x\", mew=3)\n", "\n", "# draw directions of PC axis\n", "# length follows 3*sigma rule for normal distribution\n", "# note that V not necessarily spans a right-hand system\n", "col = ['C1', 'C3']\n", "for i in range(F):\n", " ax.plot(\n", " [0, 3*std_pcs[i] * V[0, i]],\n", " [0, 3*std_pcs[i] * V[1, i]],\n", " col[i], lw=1,\n", " label=f\"{'direction of PC '}{i+1}\" f\"{' score, col '}{i+1}\" f\"{' in V, right sing vec '}{i+1}\")\n", "\n", "ax.axis(\"square\")\n", "ax.set_xlim(-5, 5)\n", "ax.set_ylim(-5, 5)\n", "ax.set_xlabel(\"new feature 1 after rank reduction\")\n", "ax.set_ylabel(\"new feature 2\")\n", "ax.set_title(\"rank-{0:d} approximation of data\".format(r_des))\n", "ax.legend()\n", "ax.grid(True)\n", "plt.savefig('slides/pca_2d_truncated_svd.pdf', dpi=600, bbox_inches='tight')" ] }, { "cell_type": "markdown", "id": "4a621c65", "metadata": {}, "source": [ "## Dimensionality Reduction after PCA\n", "\n", "We do reduction of matrix size, so we reduce the number of columns, i.e. number of features. We do this for the orthogonal PC scores (the weighted column space of **X**)." ] }, { "cell_type": "code", "execution_count": null, "id": "17684636", "metadata": {}, "outputs": [], "source": [ "dim_des = 1\n", "# PCA mindset\n", "X_dim_red = np.zeros((N, dim_des))\n", "X_dim_red = pcs[:, :dim_des]\n", "print('original matrix shape ', X.shape)\n", "print('matrix shape after dim red', X_dim_red.shape)\n", "\n", "X_dim_red_plot = np.zeros((N, F))\n", "X_dim_red_plot[:, :dim_des] = pcs[:, :dim_des]\n", "\n", "# check with SVD mindset\n", "print(np.allclose((U @ S)[:, :dim_des], X_dim_red))\n", "print(np.allclose(X @ V[:, :dim_des], X_dim_red))" ] }, { "cell_type": "code", "execution_count": null, "id": "dd7e64c7", "metadata": {}, "outputs": [], "source": [ "# the dimensionality reduction actually yields a matrix with smaller\n", "# dimension, cf. shape of X_dim_red\n", "# for convenience we plot data here in 2D plot\n", "# note however that X_dim_red_plot[:,1] is precisely zero if\n", "# dim_des = 1 was chosen\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.plot(X_dim_red_plot[:, 0], X_dim_red_plot[:, 1], \"x\", color=\"gray\")\n", "ax.plot(X_dim_red_plot[di1, 0], X_dim_red_plot[di1, 1], \"C1x\", mew=3)\n", "ax.plot(X_dim_red_plot[di2, 0], X_dim_red_plot[di2, 1], \"C3x\", mew=3)\n", "\n", "# draw directions of PC axis\n", "# length follows 3*sigma rule for normal distribution\n", "col = ['C1', 'C3']\n", "for i in range(F):\n", " ax.plot(\n", " [0, 3*std_pcs[i] * V_pcs[0, i]],\n", " [0, 3*std_pcs[i] * V_pcs[1, i]],\n", " col[i], lw=1,\n", " label=f\"{'direction of PC '}{i+1}\" f\"{', length = 3 std(PC '}{i+1}\" f\"{')'}\")\n", "\n", "ax.axis(\"square\")\n", "ax.set_xlim(-5, 5)\n", "ax.set_ylim(-5, 5)\n", "ax.set_xlabel(\"PC 1\")\n", "ax.set_ylabel(\"PC 2\")\n", "ax.set_title(\"PCA data dimensionality reduction to {0:d}D\".format(r_des))\n", "ax.legend()\n", "ax.grid(True)\n", "plt.savefig('slides/pca_2d_dim_red.pdf', dpi=600, bbox_inches='tight')\n", "\n", "print(\"reduced to dimension\", X_dim_red.shape)" ] }, { "cell_type": "markdown", "id": "4a6f7a5a", "metadata": {}, "source": [ "For both cases, (i) truncated SVD / rank reduction / low-rank approximation and (ii) dimensionality reduction, the variance sorted, orthogonal characteristics of the PC scores helps to encode the data in terms of maximum feature independence and explained variance. " ] }, { "cell_type": "markdown", "id": "651b1eff", "metadata": {}, "source": [ "## Copyright\n", "\n", "- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)\n", "- feel free to use the notebooks for your own purposes\n", "- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)\n", "- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)\n", "- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year." ] } ], "metadata": { "kernelspec": { "display_name": "myddasp", "language": "python", "name": "myddasp" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }