{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Workshop on Data Science in Africa\n",
      "##Dedan Kimathi University of Technology, Nyeri, Kenya\n",
      "###June 15th -19th, 2015\n",
      "\n",
      "##Clustering, Ciira Maina\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Introduction\n",
      "\n",
      "This notebook will allow you to run the clustering examples presented in the clustering lecture. We will cover three clustering algorithms:\n",
      "\n",
      "1. K-means clustering\n",
      "2. Gaussian Mixture Models\n",
      "3. Hierachical clustering"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##K-means Clustering\n",
      "\n",
      "As discussed in the lecture, K-means works by iterating through two steps:\n",
      "\n",
      "1. Assigning data points to closest center\n",
      "2. Recomputing cluster centers based on current assignment\n",
      "\n",
      "When the cluster assignments don't change, the algorithm has converged.\n",
      "\n",
      "###Example\n",
      "\n",
      "We will consider an example with 2D data generated from a 2 component Gaussian Mixture Model. It is often important to test algorithms with toy data to ensure our implementations are correct.\n",
      "\n",
      "####Generating the toy data\n",
      "\n",
      "We will generate and plot 20 data points from a mixture of 2 Gaussians (10 data points from each component). "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pylab as plt\n",
      "\n",
      "#Set the random seed for reproducability\n",
      "np.random.seed(123)\n",
      "#Generate the data\n",
      "X1=np.random.multivariate_normal(np.array([1,1]),np.diag([.1,.1]),10)#10 points from the first GMM component\n",
      "X2=np.random.multivariate_normal(np.array([3,3]),np.diag([.1,.1]),10)#10 points from the second GMM component\n",
      "X=np.concatenate((X1,X2))\n",
      "\n",
      "#Plot the data\n",
      "%matplotlib inline\n",
      "plt.plot(X[:,0],X[:,1],'bo',markersize=10)\n",
      "plt.xlim([0,5]);\n",
      "plt.ylim([0,5]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the two distinct clusters are clearly visible. If we run the K-mean algorithm on these data we can recover these clusters. We write a function that implements the two steps in the K-means algorithm. The inputs to this algorithm are the data and the current cluster centers. It returns a cluster assignment based on the input centers, the cost after the assignment step and new cluster centers based on the new assignment. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def kmeans(data,centers):\n",
      "\n",
      "\tnew_cent=np.zeros(centers.shape)\n",
      "\t#assignment step\n",
      "\tdist=np.zeros((data.shape[0],centers.shape[0]))\n",
      "\tfor i in range(centers.shape[0]):\n",
      "\t\tdist[:,i]=np.sum((data-centers[i,:])**2,1)\n",
      "\n",
      "\tassign=np.argmin(dist,1)\n",
      "\n",
      "\t#compute new centers and cost\n",
      "\tcost=0\n",
      "\tfor i in range(centers.shape[0]):\n",
      "\t\tnew_cent[i,:]=np.mean(data[assign==i,:],0)\n",
      "\t\tcost+=np.sum(np.sum((data[assign==i,:]-centers[i,:])**2,1))\n",
      "\t\n",
      "\treturn new_cent,cost,assign"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us run through one iteration of the K-means algorithm."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#We need a value of K\n",
      "K=2 # we assume there are two clusters\n",
      "\n",
      "#random starting points\n",
      "centers=np.random.multivariate_normal(np.array([2,2]),np.diag([1,1]),K)\n",
      "\n",
      "#run one iteration of K-means\n",
      "new_cent,cost,assign=kmeans(X,centers)\n",
      "\n",
      "#plot the assignment based on input cluster centers\n",
      "col=['r','g'] #colour code for the two clusters\n",
      "plt.plot(X[assign==0,0],X[assign==0,1],col[0]+'o',markersize=10)\n",
      "plt.plot(X[assign==1,0],X[assign==1,1],col[1]+'o',markersize=10)\n",
      "plt.plot(centers[0,0],centers[0,1],col[0]+'x',markersize=10,mew=2) #input center\n",
      "plt.plot(centers[1,0],centers[1,1],col[1]+'x',markersize=10,mew=2) #input center\n",
      "plt.xlim([0,5]);\n",
      "plt.ylim([0,5]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Plot the new centers\n",
      "centers=new_cent # this will be the centers in the next iteration\n",
      "plt.plot(X[assign==0,0],X[assign==0,1],col[0]+'o',markersize=10)\n",
      "plt.plot(X[assign==1,0],X[assign==1,1],col[1]+'o',markersize=10)\n",
      "plt.plot(centers[0,0],centers[0,1],col[0]+'x',markersize=10,mew=2)\n",
      "plt.plot(centers[1,0],centers[1,1],col[1]+'x',markersize=10,mew=2) \n",
      "plt.xlim([0,5]);\n",
      "plt.ylim([0,5]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a for loop that calls the function kmeans for a number of iterations (say 4) and monitor any change in cluster assignment."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "##Code box for your solution"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a while loop that allows you to terminate the algorithm when the cluster assignment doesn't change."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "##Code box for your solution"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### K-means for Image compression (Vector Quantization)\n",
      "\n",
      "The K-means algorithm can be used for image compression. In this example we will demonstrate it on an image of Mzee Jomo Kenyatta. We divide the image into 2-by-2 blocks which we treat as vectors in $\\mathbf{R}^4$. We then cluster these blocks and use the assigned cluster centers to represent the image.\n",
      "\n",
      "We will use the Python Imaging Library (PIL) to access the image. We will also use the scikit-learn toolbox of implementations of machine learning algorithms for K-means."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import PIL\n",
      "from sklearn.cluster import KMeans\n",
      "import matplotlib.cm as cm #colour map to display grayscale image\n",
      "\n",
      "#Obtain the image\n",
      "img=PIL.Image.open('data/jomo.jpg')\n",
      "\n",
      "#Display the image\n",
      "#img.show()\n",
      "plt.imshow(np.asarray(img),cmap = cm.Greys_r)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "A=np.asarray(img)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We need to get the 2-by-2 blocks from the image array and flatten them to vectors in $\\mathbf{R}^4$ before running K means."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#get the vectors\n",
      "px=img.size[0]\n",
      "vq=np.zeros(((px*px)/4,4))\n",
      "k=0\n",
      "for i in range(0,A.shape[0],2):\n",
      "\tfor j in range(0,A.shape[1],2):\n",
      "\t\tvq[k,:]=np.ndarray.flatten(A[i:i+2,j:j+2])\n",
      "\t\tk+=1\n",
      "\n",
      "kmeans = KMeans(init='k-means++', n_clusters=2, n_init=10)\n",
      "kmeans.fit(vq)\n",
      "cent=kmeans.cluster_centers_"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 68
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Plot the quantized image"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#quantized image\n",
      "img_mtx2=np.zeros(img.size)\n",
      "k=0\n",
      "for i in range(0,img.size[0],2):\n",
      "\tfor j in range(0,img.size[1],2):\n",
      "\t\timg_mtx2[i:i+2,j:j+2]=np.reshape(np.round(cent[kmeans.predict(vq[k,:])]),(2,2))\n",
      "\t\tk+=1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 69
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.imshow(img_mtx2,cmap = cm.Greys_r)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 70
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Gaussian Mixture Models\n",
      "\n",
      "In this example we will use GMMs to cluster the energy of speech frames and use this information to determine whether a speech frame corresponds to speech or silence. Let's start by using the scipy library to plot a mixture of Gaussians."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.stats import norm\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 71
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's plot the pdf of a Gaussian with mean zero and variance 1. Vary the parameters of the distribution and see what happens."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x=np.linspace(-5,5,100)\n",
      "plt.plot(x,norm.pdf(x,0,1),linewidth=3)\n",
      "plt.xlabel(r'$x$',fontsize=16)\n",
      "plt.ylabel(r'$p(x)$',fontsize=16)\n",
      "plt.title(r'Univariate Gaussian with $\\mu=0$ and $\\sigma=1$',fontsize=16)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 75
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let us plot a GMM with three components."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.plot(x,0.4*norm.pdf(x,-3,.5),'r',linewidth=1)\n",
      "plt.plot(x,0.2*norm.pdf(x,1,1),'g',linewidth=1)\n",
      "plt.plot(x,0.4*norm.pdf(x,3,.5),'b',linewidth=1)\n",
      "plt.plot(x,0.4*norm.pdf(x,-3,.5)+0.2*norm.pdf(x,1,1)+0.4*norm.pdf(x,3,.5),'m',linewidth=2)\n",
      "plt.xlabel(r'$x$',fontsize=16)\n",
      "plt.ylabel(r'$p(x)$',fontsize=16)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 76
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Voice Activity Detection\n",
      "\n",
      "We now experiment with voice activity detection by modelling the log energy of speech frames using a GMM. We will use scipy to access the speech sample stored in a WAV file"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import scipy.io.wavfile\n",
      "\n",
      "#open the wav file\n",
      "x=scipy.io.wavfile.read('data/digit.wav')\n",
      "freq=x[0]#sampling frequency\n",
      "signal=x[1]/np.float(2**15);#normalize to range -1 to 1, The numbers in x[1] are 16 bit signed integers"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us plot this speech signal which contains the first seven digits in Kiswahili."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.plot(np.arange(len(signal))*(1./freq),signal)\n",
      "plt.xlabel('Time (seconds)')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 81
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the location of the seven digits can be clearly see and correspond to areas of high energy in the signal. We divide the signal into segments with 256 samples each and determine the energy. We then plot a histogram of the log energy. It is clear that the data is multimodal"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blk_size=256\n",
      "num_blk=int(np.floor(len(signal)/blk_size))\n",
      "energy=np.zeros(num_blk)\n",
      "\n",
      "for i in range(num_blk):\n",
      "\tenergy[i]=np.sum(signal[i*blk_size:(i+1)*blk_size]**2)\n",
      "    \n",
      "plt.hist(np.log(energy),50)\n",
      "plt.xlabel('Logarithm of block energy')\n",
      "\t"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 86
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can try to model the data using a single Gaussian."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mu, std = norm.fit(np.log(energy))\n",
      "plt.hist(np.log(energy), bins=50, normed=True)\n",
      "xmin, xmax = plt.xlim()\n",
      "x = np.linspace(xmin, xmax, 100)\n",
      "p = norm.pdf(x, mu, std)\n",
      "plt.plot(x, p, 'r', linewidth=3)\n",
      "plt.xlabel('Logarithm of block energy')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 87
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that this is not a good model of the data. We then fit a GMM with 2 components using the Expextation maximization implementation in scikit-learn.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import mixture\n",
      "\n",
      "g = mixture.GMM(n_components=2)\n",
      "g.fit(np.log(energy)) \n",
      "p2 = g.weights_[0]*norm.pdf(x, g.means_[0], np.sqrt(g.covars_[0]))+g.weights_[1]*norm.pdf(x, g.means_[1], np.sqrt(g.covars_[1]))\n",
      "plt.hist(np.log(energy), bins=50, normed=True)\n",
      "plt.plot(x, p2, 'r', linewidth=3)\n",
      "plt.xlabel('Logarithm of block energy')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 89
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try and fit GMMs with more components and see what happens."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can compute the probability of each frame belonging to each cluster. To get the cluster corresponding to speech, we determine the cluster with the highest mean."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "speech_comp=np.argmax(g.means_)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we compute the probability of each frame belonging to each of the clusters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pred_prob=g.predict_proba(np.log(energy))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can classify a segment as being speech if the probability of the segment belonging to the cluster with the highest mean is above a given threshold."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#set threshold\n",
      "thresh=0.5\n",
      "#determine speech blocks\n",
      "speech=pred_prob[:,speech_comp]>thresh\n",
      "\n",
      "#Now plot the speech signal and indicate the regions classified as speech\n",
      "plt.figure()\n",
      "plt.plot(np.arange(len(signal))*(1./freq),signal)\n",
      "for i in range(num_blk):\n",
      "\tif speech[i]:\n",
      "\t\tplt.plot(np.array([i*blk_size,(i+1)*blk_size])*(1./freq),[0,0],'r-',linewidth=4)\n",
      "\n",
      "plt.xlabel('Time (seconds)')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What do you think about this classification? Try and change the threshold and the number of clusters. How does this change the classification. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Hierachical Clustering\n",
      "\n",
      "This is an approach to clustering that yields a hierarchy of clusters. Clusters in one level of the hierarchy are formed by merging clusters in the lower level and at the lowest level of the hierarchy each datum is in its own cluster. We will demonstrate hierarchical clustering using the scipy module scipy.cluster.hierarchy"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import scipy.cluster.hierarchy"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We then generate 6 data points from a 2 mixture GMM. Three data points from each component. We plot the data using the index of the data point."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.random.seed(123)\n",
      "X1=np.random.multivariate_normal(np.array([1,1]),np.diag([.1,.1]),3)\n",
      "X2=np.random.multivariate_normal(np.array([3,3]),np.diag([.1,.1]),3)\n",
      "X=np.concatenate((X1,X2))\n",
      "\n",
      "%matplotlib inline\n",
      "for i in range(X.shape[0]):\n",
      "    plt.text(X[i,0],X[i,1],str(i),color='b',fontsize=16 )\n",
      "    \n",
      "plt.xlim([0,5]);\n",
      "plt.ylim([0,5]) ;  "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We then perform hierarchical clustering using single linkage."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#hierarchical clustering\n",
      "Z=scipy.cluster.hierarchy.single(X)\n",
      "\n",
      "#print Z to show the steps in the clustering process\n",
      "print Z"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The variable $Z$ summarises the steps covered in the lecture. They are\n",
      "\n",
      "1. Merge 0 and 2 to form a new cluster and label it 6 \n",
      "2. Merge 3 and 5 to form a new cluster and label it 7\n",
      "3. Merge 4 and 7 to form a cluster with the objects $\\{3,4,5\\}$ and label it 8\n",
      "4. Merge 1 and 6 to form a cluster with the objects $\\{0,1,2\\}$ and label it 9\n",
      "5. Merge 8 and 9 to form a cluster with all the objects\n",
      "\n",
      "We can plot a dendogram."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "scipy.cluster.hierarchy.dendrogram(Z,leaf_font_size=16)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}