{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Week 9 \n", "### Clustering, Latent Variable models and some Portfolio optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Possible Additional reading: \n", "\n", "- Image Segmentation using K-means Clustering Algorithm and\n", "Subtractive Clustering Algorithm\n", "- Quantitative equity portfolio management: modern techniques and applications\n", "- Financial Econometrics: From Basics to Advanced Modeling Techniques" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### I. Clustering (continued)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### I.1. Back to K-means: Quantization and Segmentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering algorithms can be used in a variety of applications. An important example of such applications is quantization. \n", "\n", "\n", "__Exercise I.1.1__ Load each of the images 'roadSignEasy', 'roadSignMedium' and 'roadSignHarder' shown below and try to use K-means in the RGB space first to extract the letters from the road sign. Try to use your own K means code. Display the resulting black and white image." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Drawing\"\n", "\n", "image credit: [Shouse California Law Group](https://www.shouselaw.com/) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put your code here\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise I.1.2__ Try to define additional features to improve the segmentation. You can for example work in a 5D space not only including the RGB triples but also the (X,Y) location of each pixel. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Bonus__ : Use your own image" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put your code here \n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II. Latent Variable models " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.1 Principal component Analysis, Warm up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In principal component Analysis, we are interested in capturing the directions in which most of the variation occurs within a given dataset. \n", "\n", "__Exercise II.1.a__ To get some intuition on PCA, generate 3D data points along a plane of your choice. To do this, first fix the equation of the plane. Then use the equation you chose to generate points on the plane and perturb the points using small random Gaussian noise. Plot the resultin the points." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "# put the code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.1.b__ now that you have the noisy points, compute the first, second and third principal direction and represent them on your 3D scatter plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put the code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.1.c__ Project the points onto the the principal plane. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.2 Principal Component Analysis: Image compression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.2.a__ As a second exercise, we will study the use of PCA for the compression of images. Load and display the image \"shapes.png\". Start by computing, sorting and plotting the singular values of the matrix. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numpy import linalg as LA\n", "\n", "\n", "# Your answer\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.2.b__ For any given matrix $\\boldsymbol X$ and number of components $k$, the approximation of $\\boldsymbol X$ through the first $k$ principal components can be obtained by computing the SVD and retaining only the $k$ largest singular values and the corresponding singular vectors. Concretely, if the SVD of $\\boldsymbol X$ is given by $\\boldsymbol U\\boldsymbol\\Sigma \\boldsymbol V^T$, the approximation can be computed as $\\boldsymbol U_k\\boldsymbol\\Sigma_k \\boldsymbol V_k^T$, where $\\boldsymbol U_k$ encodes the first $k$ columns of $\\boldsymbol U$, $\\boldsymbol V_k$ encodes the first $k$ columns of $\\boldsymbol V$ and $\\boldsymbol\\Sigma_k$ is the $k$ by $k$ diagonal matrix retaining the $k$ largest singular values of $\\boldsymbol X$. \n", "\n", "Compute the compressed/approximated image for various values of $k$ (let's say $5$, $10$, $30$ and $50$) and display the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numpy import linalg as LA\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# put your code here\n", "\n", "\n" ] },
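{ "cell_type": "markdown", "metadata": {}, "source": [ "A possible sketch of the truncation step described above (assuming the image is loaded with `plt.imread('shapes.png')` and averaged to a single grayscale channel; the variable names are only illustrative):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Load the image and, if it has several channels, average them to grayscale\n", "img = plt.imread('shapes.png')\n", "if img.ndim == 3:\n", "    img = img[:, :, :3].mean(axis=2)\n", "\n", "U, S, Vt = np.linalg.svd(img, full_matrices=False)\n", "\n", "plt.figure(figsize=(10, 3))\n", "for i, k in enumerate([5, 10, 30, 50]):\n", "    # rank-k approximation  U_k Sigma_k V_k^T\n", "    approx = np.dot(U[:, :k] * S[:k], Vt[:k, :])\n", "    plt.subplot(1, 4, i + 1)\n", "    plt.imshow(approx, cmap=plt.cm.gray)\n", "    plt.title('k = %d' % k)\n", "    plt.axis('off')\n", "plt.show()" ] },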
{ "cell_type": "markdown", "metadata": {}, "source": [ "### II.3. Towards Portfolio optimization\n", "(The exercise is inspired by [quantopian](https://www.quantopian.com/about))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Drawing\"\n", "\n", "image credit: [https://emerj.com/](https://emerj.com/ai-future-outlook/machine-learning-finance-interviews-podcasts/) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.3.a__ In this exercise, we will study how PCA can be used to optimize portfolios. Start by downloading the data for the following 10 stocks: \n", "\n", "IBM, MSFT, FB, T, INTC, ABX, NEM, AU, AEM, GFI. \n", "\n", "5 of those stocks come from tech companies, the remaining 5 come from gold mining companies.\n", "\n", "Load the stock prices between 2015-09-01 and 2016-11-01 and plot the results.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import yfinance as yf\n", "\n", "start_date = '2015-09-01' \n", "end_date = '2016-11-01'\n", "\n", "# put your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What you should obtain:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\"Drawing\"\n", "\n", "image credit: [https://emerj.com/](https://emerj.com/ai-future-outlook/machine-learning-finance-interviews-podcasts/) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.3.b__ Store the stocks in a single matrix with rows = number of dates, columns = number of stocks in your portfolio. Then compute the PCA decomposition using the PCA class from Scikit-learn. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.3.c__ The PCA implementation from Scikit-learn comes with an attribute that enables you to determine the fraction of the total variance explained by the components you extract. Compute the percentage of the variance explained by each component. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put your result here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.3.d__ The principal components can be understood as encoding some underlying statistical factors that \"drive\" the return of the portfolio. We can then look at how much of the closing price evolution of each stock is driven by the hidden factors. To see this, we can, as we did for the image, compute the representation of each stock as a combination of the factors. Those representations are known as \"factor returns\". Plot those returns.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put the code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.3.e__\n", "\n", "Aside from the factor returns, we might want to investigate the factor exposures of each stock in the portfolio, which are essentially given by the principal components. Those exposures indicate how strongly each stock is influenced by the underlying factors. Plot those values in the (PCA1, PCA2) plane. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put your code here\n", "\n" ] },
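{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible sketch for Exercises II.3.b-e (assuming the closing prices have been collected in a pandas DataFrame called `prices`, with one row per date and one column per stock; the variable names are only illustrative, and this is just one reading of the factor returns/exposures terminology above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "import matplotlib.pyplot as plt\n", "\n", "# Assumed setup: `prices` is a (number of dates) x (number of stocks) DataFrame of closing prices\n", "X_stocks = prices.values\n", "\n", "n_factors = 2\n", "pca = PCA(n_components=n_factors)\n", "factor_returns = pca.fit_transform(X_stocks)  # II.3.d: one time series per hidden factor\n", "\n", "# II.3.c: fraction of the total variance explained by the retained components\n", "print(pca.explained_variance_ratio_)\n", "print(pca.explained_variance_ratio_.sum())\n", "\n", "# Plot the factor returns over time\n", "plt.figure(figsize=(12, 4))\n", "plt.plot(factor_returns)\n", "plt.legend(['factor %d' % (i + 1) for i in range(n_factors)])\n", "plt.show()\n", "\n", "# II.3.e: factor exposures = principal components, one point per stock in the (PCA1, PCA2) plane\n", "exposures = pca.components_  # shape (n_factors, n_stocks)\n", "plt.figure(figsize=(6, 6))\n", "plt.scatter(exposures[0], exposures[1])\n", "for i, name in enumerate(prices.columns):\n", "    plt.annotate(name, (exposures[0, i], exposures[1, i]))\n", "plt.xlabel('PCA1')\n", "plt.ylabel('PCA2')\n", "plt.show()" ] },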
{ "cell_type": "markdown", "metadata": {}, "source": [ "### II.4. EigenFaces. \n", "\n", "Now that we have a better understanding of PCA, we can go back to the face dataset. Use the lines below to load the face images from the 'Labeled Faces in the Wild' dataset. Start by computing the decomposition of each face on the first $150$ principal faces. Then learn a Support Vector Classifier on the resulting dataset, using an RBF (radial basis function) kernel with parameters $C=1000.0$ and $\\gamma=0.005$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from time import time\n", "import logging\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.datasets import fetch_lfw_people\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.decomposition import PCA\n", "from sklearn.svm import SVC\n", "\n", "lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)\n", "\n", "n_samples, h, w = lfw_people.images.shape\n", "X = lfw_people.data\n", "n_features = X.shape[1]\n", "\n", "\n", "y = lfw_people.target\n", "target_names = lfw_people.target_names\n", "n_classes = target_names.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### II.5. Independent component analysis: The cocktail party " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we will use another approach to dimensionality reduction, known as independent component analysis (ICA). ICA is particularly useful in speech separation or, more generally, source separation. In the classical version of this problem, known as \"the cocktail party problem\", one is interested in recovering two distinct signals from their mixtures. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Drawing\"\n", "\n", "image credit: [The Conversation](https://en.wikipedia.org/wiki/The_Conversation) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.5.a.__ Using the FastICA transform from scikit-learn, recover the two speeches from the mixed1.wav and mixed2.wav files which are given on GitHub. \n", "\n", "(Hint: start by storing the two signals 'samples1' and 'samples2' into a single matrix, then pass this matrix as an input to the FastICA method of Scikit-learn.) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import wave\n", "import pylab\n", "import matplotlib\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import signal\n", "from scipy.io import wavfile\n", "\n", "from sklearn.decomposition import FastICA, PCA\n", "\n", "###############################################################################\n", "\n", "\n", "# read data from wav files\n", "sample_rate1, samples1 = wavfile.read('mixed1.wav')\n", "sample_rate2, samples2 = wavfile.read('mixed2.wav')\n", "\n", "print('sample_rate1', sample_rate1)\n", "print('sample_rate2', sample_rate2)\n", "\n", "# Use FastICA\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise II.5.b.__ Once you have recovered the original signals, plot them as time series. Then use the lines below to store them in new .wav files. Then play them and compare them with the mixed signals."
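, "\n", "A minimal sketch of the recovery and plotting steps (one possible approach, reusing the variables from the starter cell above; it produces the `recovered` array expected by the cell below) could be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stack the two mixed signals as columns, unmix them with FastICA,\n", "# and plot the recovered sources as time series.\n", "X_mix = np.c_[samples1, samples2].astype(np.float64)\n", "\n", "ica = FastICA(n_components=2, random_state=0)\n", "recovered = ica.fit_transform(X_mix)  # columns = estimated independent sources\n", "\n", "fig, axes = plt.subplots(2, 1, figsize=(12, 4), sharex=True)\n", "axes[0].plot(recovered[:, 0])\n", "axes[0].set_title('recovered source 1')\n", "axes[1].plot(recovered[:, 1])\n", "axes[1].set_title('recovered source 2')\n", "plt.tight_layout()\n", "plt.show()"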
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "# write data to wav files\n", "scaled1 = np.int16(recovered[:,0]/np.max(np.abs(recovered[:,0])) * 32767)\n", "wavfile.write('recovered-1.wav', sample_rate1, scaled1)\n", "\n", "scaled2 = np.int16(recovered[:,1]/np.max(np.abs(recovered[:,1])) * 32767)\n", "wavfile.write('recovered-2.wav', sample_rate2, scaled2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### III Manifold Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we will get familiar with the most popular manifold learning methods (see http://www.augustincosse.com/wp-content/uploads/2018/11/slides10.pdf for a review of the theory) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__III.1. Getting some intuition: the moving ball__ Consider the sequence of frames defined below. Those frames are encoded as columns of the data matrix. Use the MDS and then ISOMAP algorithms to get an intuition on the trajectory followed by the white ball." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeMAAACDCAYAAAC+9HPWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi40LCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcv7US4rQAAAvhJREFUeJzt3cFu4jAUQNF41P//Zc+KDVWpgYRL\nyTlbohI1Kpfn4jDmnBsA0PlXnwAAnJ0YA0BMjAEgJsYAEBNjAIiJMQDExBgAYmIMADExBoDY1yuf\nbIzhdl9vYM459vg5rud72Ot6bptr+i78jX6WletpMgaAmBgDQEyMASAmxgAQE2MAiIkxAMTEGABi\nYgwAMTEGgJgYA0BMjAEgJsYAEBNjAIiJMQDExBgAYmIMADExBoCYGANATIwBICbGABATYwCIiTEA\nxMQYAGJiDACxr/oEALZt2+acPz42xnjhmcDrmYwBIGYyBlK3JuLrY0zIfCqTMZBZCfEzx8NfIcYA\nEBNjAIiJMQDExBgAYmIMADFbm2AHv33K15Yc4BaTMQDExBietHrTCntkv7t3xcAKA5/KMjU86JG4\nzjkF5crl9+He1JyZyRgAYiZj4C2YfjkzkzEAxMQYAGJiDAAxMQaAmBgDQEyM4UFjDDetAHYhxvCk\nlcA+Em7gPMQYAGJu+gE7MPUCzzAZA0BMjAEgJsYAEBNjAIiJMQDExBgAYmIMADExBoCYGANATIwB\nICbGABATYwCIiTEAxHxrEwCHm3PefPzs33xmMgbgUL+FePWYTybGABCzTH0nSy0Aa+6ddi/Hn/F1\n1GS8aM5pqQWAQ4gxAMTEeMEjSy0mZABWiTEAxMQYAGJiDAAxW5sAOMRli9LqZ2jOuKXpwmQMADEx\nBuBQKxPvmafibRNjAMj5n/GCMcZd+4bP/g4P4JrXxdtMxovGGJZaADiEGANAzDL1nUy+AOzNZAwA\nMTEGgJgYA0BMjAEgJsYAEBNjAIiJMQDExBgAYmIMADExBoCYGANATIwBICbGABATYwCIiTEAxMQY\nAGJiDACxMeeszwEATs1kDAAxMQaAmBgDQEyMASAmxgAQE2MAiIkxAMTEGABiYgwAMTEGgJgYA0BM\njAEgJsYAEBNjAIiJMQDExBgAYmIMADExBoCYGANATIwBICbGABATYwCIiTEAxP4Dbo5lI76EgT0A\nAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "radius = 15\n", "\n", "radius2 = 5\n", "\n", "\n", "theta = np.linspace(0, 2*np.pi, num=50)\n", "\n", "\n", "simple_movie = np.zeros((64,64,50))\n", "\n", "\n", "for k in range(0,50):\n", "\n", " \n", " xpos = np.rint(32 + radius*np.cos(theta[k]))\n", " ypos = np.rint(32 + radius*np.sin(theta[k]))\n", " \n", " for i in range(0,simple_movie.shape[0]):\n", " for j in range(0,simple_movie.shape[1]):\n", " \n", " if (i-xpos)**2 + (j-ypos)**2 < radius2**2:\n", " \n", " simple_movie[i,j,k] = 1\n", "\n", "plt.figure(1, figsize=(8, 3))\n", "plt.subplot(141) \n", "plt.imshow(simple_movie[:,:,1],interpolation='nearest',cmap=plt.cm.gray)\n", "plt.axis('off')\n", "plt.subplot(142) \n", "plt.imshow(simple_movie[:,:,15],interpolation='nearest',cmap=plt.cm.gray)\n", "plt.axis('off')\n", "plt.subplot(143) \n", "plt.imshow(simple_movie[:,:,30],interpolation='nearest',cmap=plt.cm.gray)\n", "plt.axis('off')\n", "plt.subplot(144) \n", "plt.imshow(simple_movie[:,:,45],interpolation='nearest',cmap=plt.cm.gray)\n", "plt.axis('off')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise III.2.__ Try fancies trajectories. Ex let the ball move along a 8 shape. Then apply MDS and ISOMAP." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }