{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A Network Tour of Data Science\n",
"### Xavier Bresson, Winter 2016/17\n",
"## Assignment 1 : Unsupervised Clustering with the Normalized Association"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Load libraries\n",
"\n",
"# Math\n",
"import numpy as np\n",
"\n",
"# Visualization \n",
"%matplotlib notebook \n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams.update({'figure.max_open_warning': 0})\n",
"from mpl_toolkits.axes_grid1 import make_axes_locatable\n",
"from scipy import ndimage\n",
"\n",
"# Print output of LFR code\n",
"import subprocess\n",
"\n",
"# Sparse matrix\n",
"import scipy.sparse\n",
"import scipy.sparse.linalg\n",
"\n",
"# 3D visualization\n",
"import pylab\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"from matplotlib import pyplot\n",
"\n",
"# Import data\n",
"import scipy.io\n",
"\n",
"# Import functions in lib folder\n",
"import sys\n",
"sys.path.insert(1, 'lib')\n",
"\n",
"# Import helper functions\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"from lib.utils import construct_kernel\n",
"from lib.utils import compute_kernel_kmeans_EM\n",
"from lib.utils import compute_purity\n",
"\n",
"# Import distance function\n",
"import sklearn.metrics.pairwise\n",
"\n",
"# Remove warnings\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 1:** Write down the mathematical relationship between Normalized Cut (NCut) and Normalized Association (NAssoc) for K clusters. It is not necessary to provide details.\n",
"\n",
"The Normalized Cut problem is defined as:
\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) := \\sum_{k=1}^K \\frac{Cut(S_k,S_k^c)}{Vol(S_k)} \\ \\textrm{ s.t. } \\ \\cup_{k=1}^{K} S_k = V, \\ S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k' \\quad\\quad\\quad(1)\n",
"$$\n",
"\n",
"and the Normalized Association problem is defined as:
\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}):= \\sum_{k=1}^K \\frac{Assoc(S_k,S_k)}{Vol(S_k)} \\ \\textrm{ s.t. } \\ \\cup_{k=1}^{K} S_k = V, \\ S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k' .\n",
"$$\n",
"\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with the Assoc operator as:
\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} \\\\\n",
"Assoc(S_k,S_k) = \\sum_{i\\in S_k, j\\in S_k} W_{ij} \\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in S_k^c=V\\setminus S_k} W_{ij} = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = Vol(S_k) - Assoc(S_k,S_k) \n",
"$$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q1:** Your answer here.\n",
"\n",
"We have
\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{Vol(S_k) - Assoc(S_k,S_k)}{Vol(S_k)} = 1- \\frac{Assoc(S_k,S_k)}{Vol(S_k)}\n",
"$$\n",
"\n",
"and
\n",
"\n",
"$$\n",
"\\sum_{k=1}^K \\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = K - \\sum_{k=1}^K \\frac{Assoc(S_k,S_k)}{Vol(S_k)}\n",
"$$\n",
"\n",
"The relationship between Normalized Cut (NCut) and Normalized Association (NAssoc) for K clusters is thus
\n",
"\n",
"$$\n",
"NCut(\\{S_k\\}) = K - NAssoc(\\{S_k\\}).\n",
"$$\n"
]
},
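{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not part of the assignment):* the short sketch below builds a small synthetic weighted graph and an arbitrary partition into $K$ clusters, and checks numerically that $NCut(\\{S_k\\}) + NAssoc(\\{S_k\\}) = K$. The toy graph, the partition, and all variable names are assumptions made only for this check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sanity check (illustrative only): NCut({S_k}) + NAssoc({S_k}) = K\n",
"# on a small synthetic graph with an arbitrary partition.\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"n_toy, K_toy = 12, 3\n",
"W_toy = rng.rand(n_toy, n_toy)\n",
"W_toy = 0.5 * (W_toy + W_toy.T) # symmetric weights\n",
"np.fill_diagonal(W_toy, 0) # no self-loops\n",
"labels = np.arange(n_toy) % K_toy # arbitrary partition into K_toy non-empty clusters\n",
"\n",
"ncut, nassoc = 0.0, 0.0\n",
"for k in range(K_toy):\n",
"    idx = (labels == k)\n",
"    vol = W_toy[idx, :].sum() # Vol(S_k) = sum_{i in S_k, j in V} W_ij\n",
"    assoc = W_toy[np.ix_(idx, idx)].sum() # Assoc(S_k,S_k)\n",
"    cut = vol - assoc # Cut(S_k,S_k^c) = Vol(S_k) - Assoc(S_k,S_k)\n",
"    ncut += cut / vol\n",
"    nassoc += assoc / vol\n",
"\n",
"print('NCut + NAssoc =', ncut + nassoc, ' (should equal K =', K_toy, ')')"
]
},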
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 2:** Using the relationship between NCut and NAssoc from Q1, it is therefore equivalent to maximize NAssoc by minimizing or maximizing NCut? That is \n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"$$\n",
"\\Updownarrow\n",
"$$\n",
"\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"or\n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"It is not necessary to provide details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q2:** Your answer here.\n",
"\n",
"As $\\min F \\Leftrightarrow \\max -F$, we have equivalence between the max NAssoc problem:\n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"and the min NCut problem:\n",
"\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 3:** Solving the NCut problem in Q2 is NP-hard => let us consider a spectral relaxation of NCut. Write down the Spectral Matrix A of NCut that satisfies the equivalent functional optimization problem of Q2: \n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S, \\quad\\quad\\quad(3)\n",
"$$\n",
"\n",
"where\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=???\n",
"$$\n",
"\n",
"It is not necessary to provide details.\n",
"\n",
"*Hint:* Let us introduce the indicator matrix $F$ of the clusters $S_k$ such that:\n",
"\n",
"$$\n",
"F_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"1 & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with $F$ as:\n",
"\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k}\\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k} - F_{\\cdot,k}^\\top W F_{\\cdot,k} = F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} \\quad\n",
"$$\n",
"\n",
"We thus have\n",
"\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } \n",
"$$\n",
"\n",
"\n",
"Set $\\hat{F}_{\\cdot,k}=D^{1/2}F_{\\cdot,k}$ and observe that\n",
"\n",
"$$\n",
"\\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top D^{-1/2}(D - W)D^{-1/2} \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top (I - D^{-1/2}WD^{-1/2}) \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } ,\n",
"$$\n",
"\n",
"with $L_N=I - D^{-1/2}WD^{-1/2}$ is the normalized graph Laplacian. Set $Y_{\\cdot,k}=\\frac{\\hat{F}_{\\cdot,k}}{\\|\\hat{F}_{\\cdot,k}\\|_2}$:\n",
"\n",
"$$\n",
"\\frac{ \\hat{F}_{\\cdot,k}^\\top L_N \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = Y_{\\cdot,k}^\\top L_N Y_{\\cdot,k} \\quad\\quad\\quad(2)\n",
"$$\n",
"\n",
"\n",
"Using (2), we can rewrite (1) as a functional optimization problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S,\n",
"$$\n",
"\n",
"where\n",
"\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=???\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q3:** Let us introduce the indicator matrix $F$ of the clusters $S_k$ such that:\n",
"\n",
"$$\n",
"F_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"1 & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with $F$ as:\n",
"\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k}\\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k} - F_{\\cdot,k}^\\top W F_{\\cdot,k} = F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} \\quad\n",
"$$\n",
"\n",
"We thus have\n",
"\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } \n",
"$$\n",
"\n",
"\n",
"Set $\\hat{F}_{\\cdot,k}=D^{1/2}F_{\\cdot,k}$ and observe that\n",
"\n",
"$$\n",
"\\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top D^{-1/2}(D - W)D^{-1/2} \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top (I - D^{-1/2}WD^{-1/2}) \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } ,\n",
"$$\n",
"\n",
"where $L_N=I - D^{-1/2}WD^{-1/2}$ is the normalized graph Laplacian. Set $Y_{\\cdot,k}=\\frac{\\hat{F}_{\\cdot,k}}{\\|\\hat{F}_{\\cdot,k}\\|_2}$, we have:\n",
"\n",
"$$\n",
"\\frac{ \\hat{F}_{\\cdot,k}^\\top L_N \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = Y_{\\cdot,k}^\\top L_N Y_{\\cdot,k} \\quad\\quad\\quad(2)\n",
"$$\n",
"\n",
"\n",
"Using (2), we can rewrite (1) as a functional optimization problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S,\n",
"$$\n",
"\n",
"where\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=L_N=I-D^{-1/2}WD^{-1/2}.\n",
"$$"
]
},
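{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not part of the assignment):* the sketch below builds $L_N$ for a small synthetic graph, forms the indicator-based $Y \\in Ind_S$ defined above, and checks numerically that $tr(Y^\\top L_N Y)$ equals $NCut(\\{S_k\\})$ computed directly from the Cut and Vol definitions. The toy graph and partition are arbitrary choices made only for this check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: tr(Y^T L_N Y) = NCut({S_k}) when Y is in Ind_S.\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(1)\n",
"n_toy, K_toy = 10, 2\n",
"W_toy = rng.rand(n_toy, n_toy)\n",
"W_toy = 0.5 * (W_toy + W_toy.T) # symmetric weights\n",
"np.fill_diagonal(W_toy, 0) # no self-loops\n",
"labels = np.arange(n_toy) % K_toy # arbitrary partition\n",
"\n",
"d_toy = W_toy.sum(axis=1) # degrees D_ii\n",
"Dm12 = np.diag(1.0 / np.sqrt(d_toy)) # D^{-1/2}\n",
"L_N = np.eye(n_toy) - Dm12.dot(W_toy).dot(Dm12) # normalized graph Laplacian\n",
"\n",
"# Y in Ind_S: Y_ik = sqrt(D_ii / Vol(S_k)) if i in S_k, 0 otherwise\n",
"Y_toy = np.zeros((n_toy, K_toy))\n",
"ncut_direct = 0.0\n",
"for k in range(K_toy):\n",
"    idx = (labels == k)\n",
"    vol = d_toy[idx].sum() # Vol(S_k)\n",
"    Y_toy[idx, k] = np.sqrt(d_toy[idx] / vol)\n",
"    cut = vol - W_toy[np.ix_(idx, idx)].sum() # Cut(S_k,S_k^c)\n",
"    ncut_direct += cut / vol\n",
"\n",
"print('tr(Y^T L_N Y) =', np.trace(Y_toy.T.dot(L_N).dot(Y_toy)))\n",
"print('NCut (direct) =', ncut_direct)"
]
},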
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 4:** Drop the cluster indicator constraint $Y\\in Ind_S$ in Q3, how do you compute the solution $Y^\\star$ of (3)? Why the first column of $Y^\\star$ is not relevant for clustering?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q4:** Your answer here.\n",
"\n",
"Dropping the constraint $Y\\in Ind_S$ in (3) leads to a standard spectral relaxation problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K,\n",
"$$\n",
"\n",
"which solution $Y^\\star$ is given by the $K$ smallest eigenvectors/eigenvalues of $A=L_N=I-D^{-1/2}WD^{-1/2}$. Note that the first column of $Y^\\star$ is the constant signal $y_1=\\frac{1}{\\sqrt{n}}1_{n\\times 1}$ associated to the smallest eigenvalue of $L_N$, which has value $\\lambda_1=0$."
]
},
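{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not part of the assignment):* a quick numerical check, on an arbitrary toy graph, that $D^{1/2}1$ is an eigenvector of $L_N$ with eigenvalue $0$, which is why the first column of $Y^\\star$ carries no cluster information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: D^{1/2} 1 is an eigenvector of L_N with eigenvalue 0.\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(2)\n",
"n_toy = 8\n",
"W_toy = rng.rand(n_toy, n_toy)\n",
"W_toy = 0.5 * (W_toy + W_toy.T) # symmetric weights\n",
"np.fill_diagonal(W_toy, 0) # no self-loops\n",
"d_toy = W_toy.sum(axis=1) # degrees D_ii\n",
"Dm12 = np.diag(1.0 / np.sqrt(d_toy)) # D^{-1/2}\n",
"L_N = np.eye(n_toy) - Dm12.dot(W_toy).dot(Dm12) # normalized graph Laplacian\n",
"\n",
"v = np.sqrt(d_toy) # D^{1/2} 1\n",
"v = v / np.linalg.norm(v) # unit-norm candidate first eigenvector\n",
"print('||L_N v|| =', np.linalg.norm(L_N.dot(v))) # ~0 up to round-off\n",
"\n",
"lamb_toy, U_toy = np.linalg.eigh(L_N)\n",
"print('smallest eigenvalue of L_N =', lamb_toy[0]) # ~0"
]
},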
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 5:** Plot in 3D the 2nd, 3rd, 4th columns of $Y^\\star$.
\n",
"Hint: Compute the degree matrix $D$.
\n",
"Hint: You may use function *D_sqrt_inv = scipy.sparse.diags(d_sqrt_inv.A.squeeze(), 0)* for creating $D^{-1/2}$.
\n",
"Hint: You may use function *I = scipy.sparse.identity(d.size, dtype=W.dtype)* for creating a sparse identity matrix.
\n",
"Hint: You may use function *lamb, U = scipy.sparse.linalg.eigsh(A, k=4, which='SM')* to perform the eigenvalue decomposition of A.
\n",
"Hint: You may use function *ax.scatter(Xdisp, Ydisp, Zdisp, c=Cgt)* for 3D visualization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Load dataset: W is the Adjacency Matrix and Cgt is the ground truth clusters\n",
"mat = scipy.io.loadmat('datasets/mnist_2000_graph.mat')\n",
"W = mat['W']\n",
"n = W.shape[0]\n",
"Cgt = mat['Cgt'] - 1; Cgt = Cgt.squeeze()\n",
"nc = len(np.unique(Cgt))\n",
"print('Number of nodes =',n)\n",
"print('Number of classes =',nc);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Your code here\n",
"\n",
"# Construct Spectal Matrix A\n",
"d = W.sum(axis=0) + 1e-6 # degree vector\n",
"d = 1.0 / np.sqrt(d)\n",
"Dinv = scipy.sparse.diags(d.A.squeeze(), 0)\n",
"I = scipy.sparse.identity(d.size, dtype=W.dtype)\n",
"A = I - Dinv* (W* Dinv)\n",
"\n",
"# Compute K smallest eigenvectors/eigenvalues of A\n",
"lamb, U = scipy.sparse.linalg.eigsh(A, k=4, which='SM')\n",
"\n",
"# Sort eigenvalue from smallest to largest values\n",
"idx = lamb.argsort() # increasing order\n",
"lamb, U = lamb[idx], U[:,idx]\n",
"print(lamb)\n",
"\n",
"# Y*\n",
"Y = U\n",
"\n",
"# Plot in 3D the 2nd, 3rd, 4th columns of Y*\n",
"Xdisp = Y[:,1]\n",
"Ydisp = Y[:,2]\n",
"Zdisp = Y[:,3]\n",
"\n",
"# 2D Visualization\n",
"plt.figure(14)\n",
"size_vertex_plot = 10\n",
"plt.scatter(Xdisp, Ydisp, s=size_vertex_plot*np.ones(n), c=Cgt)\n",
"plt.title('2D Visualization')\n",
"plt.show()\n",
"\n",
"# 3D Visualization\n",
"fig = pylab.figure(15)\n",
"ax = Axes3D(fig)\n",
"ax.scatter(Xdisp, Ydisp, Zdisp, c=Cgt)\n",
"pylab.title('3D Visualization')\n",
"pyplot.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 6:** Solve the unsupervised clustering problem for MNIST following the popular technique of [Ng, Jordan, Weiss, “On Spectral Clustering: Analysis and an algorithm”, 2002], i.e.
\n",
"(1) Compute $Y^\\star$? solution of Q4.
\n",
"(2) Normalize the rows of $Y^\\star$? with the L2-norm.
\n",
"Hint: You may use function X = ( X.T / np.sqrt(np.sum(X**2,axis=1)+1e-10) ).T for the L2-normalization of the rows of X.
\n",
"(3) Run standard K-Means on normalized $Y^\\star$? to get the clusters, and compute the clustering accuracy. You should get more than 50% accuracy. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your code here\n",
"# Normalize the rows of Y* with the L2 norm, i.e. ||y_i||_2 = 1\n",
"Y = ( Y.T / np.sqrt(np.sum(Y**2,axis=1)+1e-10) ).T "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Your code here\n",
"# Run standard K-Means\n",
"Ker = construct_kernel(Y,'linear') # Compute linear Kernel for standard K-Means\n",
"Theta = np.ones(n) # Equal weight for each data\n",
"[C_kmeans,En_kmeans] = compute_kernel_kmeans_EM(nc,Ker,Theta,10)\n",
"acc= compute_purity(C_kmeans,Cgt,nc)\n",
"print('accuracy standard kmeans=',acc)"
]
}
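,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional cross-check (not required by the assignment):* one can also run scikit-learn's K-Means directly on the row-normalized $Y^\\star$ and measure purity with the same helper function. The `n_init` and `random_state` values below are arbitrary illustration choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Optional cross-check (illustrative): sklearn K-Means on the row-normalized Y*\n",
"from sklearn.cluster import KMeans\n",
"\n",
"C_sklearn = KMeans(n_clusters=nc, n_init=10, random_state=0).fit_predict(Y)\n",
"acc_sklearn = compute_purity(C_sklearn, Cgt, nc)\n",
"print('accuracy sklearn kmeans=', acc_sklearn)"
]
}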
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}