{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Principal Component Analysis\n", "\n", "
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
\n", "\n", "Kamila Zdybał, 2020\n", "\n", "Université libre de Bruxelles, kamila.zdybal@ulb.ac.be
\n", "Science Docs, kamila.zdybal@gmail.com
\n", "\n", "***\n", "\n", "In this notebook we present the code behind the tutorial that can be [downloaded here](https://github.com/kamilazdybal/ulb-atm-phd/raw/master/PCA/PCA.pdf).\n", "\n", "▷ We start with performing PCA on a synthetic data set to show all steps graphically.\n", "\n", "▷ Then we move on to performing local PCA on portions of the data set identified by K-Means clustering algorithm.\n", "\n", "▷ Finally, we present visually the low-dimensional approximation of the original data matrix.\n", "\n", "The PCA implementation that we use in this notebook comes from the *sklearn* library: [`sklearn.decomposition.PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).\n", "\n", "***\n", "\n", "## Initial settings" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn.cluster import KMeans\n", "from sklearn.decomposition import PCA\n", "\n", "# Styles:\n", "lineColour = '#f44242'\n", "PCColour = '#002f72'\n", "scoresColour = '#ff7d14'\n", "cluster_colors = ['#0e7da7', '#ceca70', '#b45050', '#2d2d54']\n", "newPointColour = '#02111b'\n", "dataPointSize = 5\n", "data_point = 1\n", "new_point = 1\n", "ln = 1\n", "font_title = 18\n", "font_text = 16\n", "font_axis = 16\n", "font_legend = 16\n", "save_plots = False\n", "\n", "# Fonts:\n", "csfont = {'fontname':'Charter', 'fontweight':'regular'}\n", "hfont = {'fontname':'Charter', 'fontweight':'bold'}\n", "\n", "def axes_in_charter(figureSubplot, font_axis):\n", " \n", " for label in (figureSubplot.get_xticklabels()):\n", " label.set_fontname('Charter')\n", " label.set_fontweight('regular')\n", " label.set_fontsize(font_axis)\n", "\n", " for label in (figureSubplot.get_yticklabels()):\n", " label.set_fontname('Charter')\n", " label.set_fontweight('regular')\n", " label.set_fontsize(font_axis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## PCA in steps\n", "\n", "Generate a synthetic data set:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "Np = 100\n", "q = 1\n", "x = np.linspace(3, 6, Np)\n", "y = 0.8*x + 1*np.random.rand(Np)\n", "Dataset = np.column_stack((x, y))\n", "Dataset_proc = Dataset - np.mean(Dataset, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original data set:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "figure1 = plt.figure(figsize=(4, 4))\n", "figureSubplot = plt.subplot(1,1,1)\n", "plt.scatter(Dataset[:,0], Dataset[:,1], color=lineColour, marker='.', linewidth=ln)\n", "plt.axis('equal')\n", "plt.yticks([3, 4, 5, 6]), plt.xticks([3, 4, 5, 6])\n", "axes_in_charter(figureSubplot, font_axis)\n", "if save_plots==True: plt.savefig('plots/python-raw-data.pdf', dpi = 500, bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original data set, centered:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "figure2 = plt.figure(figsize=(4, 4))\n", "figureSubplot = plt.subplot(1,1,1)\n", "plt.scatter(Dataset_proc[:,0], Dataset_proc[:,1], color=lineColour, marker='.', linewidth=ln)\n", "plt.axis('equal')\n", "plt.yticks([-1, 0, 1]), plt.xticks([-1, 0, 1])\n", "axes_in_charter(figureSubplot, font_axis)\n", "if save_plots==True: plt.savefig('plots/python-data-centered.pdf', dpi = 500, bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform PCA:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pca = PCA()\n", "pca.fit(Dataset)\n", "scores = pca.transform(Dataset)\n", "PCs = pca.components_\n", "eigvals = pca.explained_variance_ratio_\n", "\n", "Dataset_projected = np.dot(Dataset_proc, np.transpose(np.mat(PCs[0,:])))\n", "Dataset_approx = np.dot(pca.transform(Dataset)[:,:q], pca.components_[:q,:]) + np.mean(Dataset, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Principal Components identified on the original data set:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "figure3 = plt.figure(figsize=(4, 4))\n", "figureSubplot = plt.subplot(1,1,1)\n", "plt.scatter(Dataset_proc[:,0], Dataset_proc[:,1], color=lineColour, marker='.', linewidth=ln)\n", "plt.quiver(PCs[0,0], PCs[0,1], scale=70*(1-eigvals[0]), color=PCColour, width=0.01)\n", "plt.quiver(PCs[1,0], PCs[1,1], scale=10*(1-eigvals[1]), color=PCColour, width=0.01)\n", "plt.axis('equal')\n", "plt.yticks([-1, 0, 1]), plt.xticks([-1, 0, 1])\n", "axes_in_charter(figureSubplot, font_axis)\n", "if save_plots==True: plt.savefig('plots/python-PCs.pdf', dpi = 500, bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA-transformed data set:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQoAAAD+CAYAAADCv0tyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAARDklEQVR4nO3df4wcd3nH8c9zNiYnUmRZOYVKcL02CsJUrRx0V1QZ0kLAdVJVrUggopSIVsJV+YdUqppWRvSPyErjf7CQ2kRp1Ua0ASWhSAQVaCmBnholxefE6g8QQgHHqUSlQyhtnB+N7X36x+w6e+PdfWZ2v9+Zub33S7LWO7s3892Znc88850fa+4uAJhkoe0GAOg+ggJAiKAAECIoAIQICgCh3W03oOyqq67ylZWVtpsB7EinTp36kbsvlYd3LihWVla0sbHRdjOAHcnMnhk1nF0PACGCAkCIoAAQIigAhAgKACGCAkAoW1CY2V4zO2FmXzWzk2b2LTM7nGt6APLJWVE8JGnJ3Q+7+5qkz0r6kpldl3GaADLIGRTXSHp+6PmnJZ2XdEvGaQLIIOeZmfslXRg8cfeemZ2T9LqM0wSQQbaKwt1fcffe4LmZvUXSkqRHyu81syNmtmFmG5ubm7maBGBKTR71OC7p8+7+aPkFd7/P3VfdfXVp6bLrUQC0rJGLwszsE5LeIOndTUwPQFrZKwozOyrpJkmH3P1c7ukBSC/neRS7zexeSddJusHdn+sPvzvXNAHkkWXXw8z2SnpY0qKkY5J+ycwGL39M0h05pgsgj1x9FL8h6T39/3850zQANCTLroe73+/uNu5fjmkCyIeLwgCECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECAoAIYICQIigABAiKACECIo2nT0prZ8oHnP+DTCj3W03YMc6e1K6/33SxfPSrtdIH/mCtLyW/m9SO3tSOvOYtHJw+mmnGAcaRVC05cxjxQrvF6WL/efRSjPN31QRrbiD1xf3SV85OltQdSHsUBtB0ZaVg8WKclHF48rB6n9zwSWzYsWdVbTiDr9uJnmv+DdtUOUKO2RFH8VA0/v+y2vFSnnDH1Xfqi6vSTcekxYWipX1K0frt7f8ObesuOeL58OGX+/1JFuQbFf1cCsbhN0s40DjqCik6fsLZt3PXl6r/7cv/Vhy72/Vz9fbIo/6nFFlU379xmNFG6LPPW7+DAJymnlH30ZrCAqpfjnc5n72NLssA6M+5/W3T15xp1mxo/kzTUC2Mc8JpksICqn+ytfmfvYsW+RxnzNaceuu2DnmT9PznE7XLQgKqf7KN8tWPYVptsiDv5s2ZOrIMX+anufz0umaqCoyd0/Yqtmtrq76xsZG282IpSpL57W8zfG5mpxXqSqKNpfvFJ/BzE65+2p5OBXFtKbdqg+b5/I2xfxpYpyTpjVr9VVevlU7glNJWBURFG2qsyDntfKItPm5o2CK2ja8fC+49Pd3FEesmtooJNxdIyjaVHVBznPlMUmXP3eVtg0v3xQnq9WVsE+KoGhT1QU5Lx1rdXX5c49q22D4YFkOL9/y6e/ljULVyqluhZVod42gaFuVBZmqhNxuuy9tH12apNy2xX2jK4zh5Xv1/tHzv2rl1GKFRVBsBynOZkxxQVfTmjqcW9dgng53TlapfsZtFKpWTi1WWASFlLbsy7XVnvVsxjb2kce1KWXp3HSVNDxPF3ZJ132wGB5VP5PaWbVyarHC2jlBMW5BpSz7utb5NrwF8oXiYjLZ1i/Zdjw3Idf4qtiyVb8obXxGOv1gMe1x1U+VU9qrVE4tVlg7IygmLaiUZV/5cNg3jkvv+sP2wmL4svSFBekXf09afP2rX7KmV7TUpXPV8U3aSNQZLg3N054kL/4NLs67/vbpvztVK8YmzyUZsjMuM590KXXVy56rvG/wHi1I6knfXy9WxLZuW7e8tvWy9H/9i61f/ugS89RSX2JeZXyDMPz6n25dFnWHDwy26qu3Sbv2VPssUTujWxx04PaHWSsKM/uwpI9LelHSayT9ibv/Y85pjjRp3y5l2Td4zzeOFyExzaXgqU26LL3poympS+cq4xu3Na87vDzd5TXpwK2zf+46Nw5qcXc2W1CY2QckfUrSqrufMbO3SVo3s/e6++O5pjvWgVtffcxZ9i2vFbsbzzwx/ph5k1KE5CR1v8ipS+dofOM+f93h00y7ynujYOrIuSRZgsLMTNJdku519zOS5O5PmtkXJd0p6T1JJzhpizb4Il94pSjBf/LnZ5/R0Ra0iU6nVFvxWVfcjnyRxxr3+esOz6XujYNa2uhkuXrUzH5W0n9I+pXhXQ0z+11Jfy7p9e7+wqi/rX31aLRFWz8h/dNdknrF84Xd0u88Mv0XoMlScNYjNbnb0UZbovZ02TSdp1VeT6jpq0ff3H98tjT8WRU9fddKOj3UuCOSjkjS8vJyvSlFW7SVg0Ul0esHhfdm2+pV2YKmuqX9rEdqUkh1aK+p9nTVpHZHVV1LRzqG5TrqcWX/8eXS8JdLr0uS3P0+d19199WlpaV6U4p6lJfXpF+9u6gkbKHoqZ6lfKvSgz2p17yqFEdqxqnTi17lyMjy2vhDg6k1faQmle3a7r5cFcW5/uMVpeGD588nm1KVLdrabePPs089vVRb+1ydkHW3yCsHizMQL/aKx5T7yNNUXh3ZZ69tu7a7L1dQfLf/+EZJ3xka/iYVs+p7SadW9WhEqi3epHGl/EKkOFJT1pWTwqbdhZgmJLvQp9H0LlpiWYLC3b9tZk9Leqekrw299A5Jj7r7izmm2wk5DjkOAiOFxX396z5MUk96+p+lH/xLsXu2dtvl7z/zmNS7KMmLx3KFNO1KeOax4kiUekVg1am86oRkl/o0Bu0e7Ppto8DIecLVHZLuMbO/dPezZnZA0q9LOpRk7NN+QZvYunT1kOPZk8UVpN7rh0X/FOTeheLuS1fvv3w6kyqkWW71trhPl45EqZfmV89G2VJB9YrrMtpcObsUXDVkCwp3/zszu0LSF8zsBUl7JN3s7k/MPPJpZ3YXFlKVoMq1P3tppelJWig6dz04GjSokE4/OGF8U9zq7aUfvzp9Wyie53Cpj6VfFT31udG7crOqugHq+nknY2Q9hdvdH5D0QPIRTzuzy+Xu6Qfbu0R50sqUa3+2HEBv/6j0+D3FyhodDTr9YNHuwZWSy2tbx1f3MvaVg8U0c5+9urxWXAq+8RmN3X2aVZ0N0Dbt1NyeV49OO7PL5e6Tn+2vJBmqi1FbmDplcI5j56MCaP+NcSCNC+bh8b30v/3QUbxMBvPm7R+V/vvfpbf+2uQTjRb3zXb36gO3vhp0OVbOOhuubdqpuT2DIprZ48rA4XJXVuyby9OXgOO2ME2VwZOUA6hKIEWHaqXi83qvOLntxmPjxzl8Sr36ux3PPHF5/8io9+3aM12gp145y9+vcfNn0u+vRm3owpGaIdszKKTxM3tSGThc7i7sKr7YvYvjzw+YpUd/3BY4dxmcQ+VzR/oBPKm/YfBeDfWNjLrCtur76nyGOkdK6p6yXp4/s/SHdaEvrWT7BsU4k8rA4QW6uE/68h+reNMIde58Vf5STdoC5y6Dc0l17sjgvRdclyqFhV3Sc/9VzMvyPBx+XxPzK1rukzYCVd5XntaoQOpgh+f8BUX0pR0s0PUTk88PqLqgx915eTiQBqfrll9LVVa2XaaWP5M0/jyB8rz54b8Vu2Cn/nZrR+nw0ZZzm9KVS83splW5dijF/S3DyrdbHZ7zFxRVV8QUl/dG1YsU38J9Vl0pU4dPJoraM/z5B4E9bsUcVF8Lu7b+fS5VNjQpbnQ06QzZDnZ4zl9QSNVWxGhhVFlY0ZeqiRIyxzRmqVDqtmfSPNwyrtKNbHOtPFWWe9Wgr7K7Nti1+v560ambY2OSwHwGRVXRwqjy+qQvVRMlZOppzFqh1G3PpHl4aWUacSPbSW2aJehy7caVxzv43F26beIEOzsoUpgUJk2UkKmnMWuFMk17xs3D4X6Kpz5X7KJUOUeja0cbJvVldem2iRMQFLk1UUKmnEaKCiVlewbjqnoj21mCLteuYtUjcR3pjxiFoMBWXf3iVg2fWYIu165i1SNxHZblnpmzqH3PTKBsO/RRdNS4e2YSFAAuGRcUO+OXwgDMhKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgChbEFhZr9lZv9gZt80s/80s2Nm9tpc0wOQz+4cIzWz35T0N5L2uPt5M/tpSackvVbSH+SYJoB8clUU10j6P3c/L0nu/gNJD0v6YKbpAcgoS1C4+52S9pYG/4+k1+WYHoC8svVRuPvLpUHXS3pk1HvN7IiZbZjZxubmZq4mAZhSI0c9zOx9kq6VdHTU6+5+n7uvuvvq0tJSE00CUEOlzkwze67C2464+0Mj/vbnJP2ZpJvd/dma7QPQAZWCwt3L/Q2VmNnbVHRifsjdvznNOAC0L+d5FDdJekDSLe7+aH/YYTP75VzTBJBHlqAws4+rOI/iuKSr+wFxWNIRSSs5pgkgnywnXEk60X/8qxGvjTzyAaC7sgSFu1uO8QJoBxeFAQgRFABCBAWAEEEBIGTu3nYbtjCzTUnP1PiTqyT9KFNzUA3LoH2plsFPuftl11F0LijqMrMNd19tux07GcugfbmXAbseAEIEBYDQPATFfW03ACyDDsi6DLZ9HwWA/OahogCQGUEBIERQAAjNXVDww0PNMbMP92+KvG5mj5vZobbbtJOY2V4zO2FmXzWzk2b2rf59X5LLdT+KVvDDQ80xsw9I+pSkVXc/07/t4bqZvdfdH2+5eTvFQ5I23f2wJJnZ7ZK+ZGa/4O5PpZzQvFUU/PBQA8zMJN0l6V53PyNJ7v6kpC9KurPFpu0010h6fuj5pyWdl3RL6gnNVVDww0ONeaukn5G0Xhq+LuldZsb8bsZ+SR8bPHH3nqRzyvB9n6ugkOr98BCm9ub+Y/nnF55V8Z26ttnm7Ezu/ko/HCRJZvYWSUvK8H2fqz6KsqEfHnp/222ZM1f2H8uh/HLpdTTruKTPD+56n1Kng4IfHuqsc/3HK0rDB8+fFxplZp+Q9AZJ784x/k4HBT881Fnf7T++UdJ3hoa/SdJFSd9rvEU7mJkdlXSTpEPufi56/zTmro+CHx7Kz92/LelpSe8svfQOSY+6+4vNt2rnMbPdZnavpOsk3eDuz/WH3518WvN0UVj/h4c+qeKciR8OvXRE0iPufn8b7ZpHZnazpHtUnEdx1swOqDjqccjdn2i3dfPPzPaqqJoXJR2TNLwiP+zuP5F0enMWFJM+zG8TFGmZ2Yck/b6kFyTtkfRJd/9au63aGczsI5L+etzrqX9bZ66CAkAec9dHASA9ggJAiKAAECIoAIQICgAhggJAiKAAECIoAIQICgCh/wdZT5gNJThYwAAAAABJRU5ErkJggg==\n", "text/plain": [ "