{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A Network Tour of Data Science\n",
"### Xavier Bresson, Winter 2016/17\n",
"## Assignment 1 : Unsupervised Clustering with the Normalized Association"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Load libraries\n",
"\n",
"# Math\n",
"import numpy as np\n",
"\n",
"# Visualization \n",
"%matplotlib notebook \n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams.update({'figure.max_open_warning': 0})\n",
"from mpl_toolkits.axes_grid1 import make_axes_locatable\n",
"from scipy import ndimage\n",
"\n",
"# Print output of LFR code\n",
"import subprocess\n",
"\n",
"# Sparse matrix\n",
"import scipy.sparse\n",
"import scipy.sparse.linalg\n",
"\n",
"# 3D visualization\n",
"import pylab\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"from matplotlib import pyplot\n",
"\n",
"# Import data\n",
"import scipy.io\n",
"\n",
"# Import functions in lib folder\n",
"import sys\n",
"sys.path.insert(1, 'lib')\n",
"\n",
"# Import helper functions\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"from lib.utils import construct_kernel\n",
"from lib.utils import compute_kernel_kmeans_EM\n",
"from lib.utils import compute_purity\n",
"\n",
"# Import distance function\n",
"import sklearn.metrics.pairwise\n",
"\n",
"# Remove warnings\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 1:** Write down the mathematical relationship between Normalized Cut (NCut) and Normalized Association (NAssoc) for K clusters. It is not necessary to provide details.\n",
"\n",
"The Normalized Cut problem is defined as:
\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) := \\sum_{k=1}^K \\frac{Cut(S_k,S_k^c)}{Vol(S_k)} \\ \\textrm{ s.t. } \\ \\cup_{k=1}^{K} S_k = V, \\ S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k' \\quad\\quad\\quad(1)\n",
"$$\n",
"\n",
"and the Normalized Association problem is defined as:
\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}):= \\sum_{k=1}^K \\frac{Assoc(S_k,S_k)}{Vol(S_k)} \\ \\textrm{ s.t. } \\ \\cup_{k=1}^{K} S_k = V, \\ S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k' .\n",
"$$\n",
"\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with the Assoc operator as:
\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} \\\\\n",
"Assoc(S_k,S_k) = \\sum_{i\\in S_k, j\\in S_k} W_{ij} \\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in S_k^c=V\\setminus S_k} W_{ij} = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = Vol(S_k) - Assoc(S_k,S_k) \n",
"$$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q1:** Your answer here.\n",
"\n",
"We have
\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{Vol(S_k) - Assoc(S_k,S_k)}{Vol(S_k)} = 1- \\frac{Assoc(S_k,S_k)}{Vol(S_k)}\n",
"$$\n",
"\n",
"and
\n",
"\n",
"$$\n",
"\\sum_{k=1}^K \\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = K - \\sum_{k=1}^K \\frac{Assoc(S_k,S_k)}{Vol(S_k)}\n",
"$$\n",
"\n",
"The relationship between Normalized Cut (NCut) and Normalized Association (NAssoc) for K clusters is thus
\n",
"\n",
"$$\n",
"NCut(\\{S_k\\}) = K - NAssoc(\\{S_k\\}).\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 2:** Using the relationship between NCut and NAssoc from Q1, it is therefore equivalent to maximize NAssoc by minimizing or maximizing NCut? That is \n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"$$\n",
"\\Updownarrow\n",
"$$\n",
"\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"or\n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"It is not necessary to provide details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q2:** Your answer here.\n",
"\n",
"As $\\min F \\Leftrightarrow \\max -F$, we have equivalence between the max NAssoc problem:\n",
"\n",
"$$\n",
"\\max_{\\{S_k\\}}\\ NAssoc(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$\n",
"\n",
"and the min NCut problem:\n",
"\n",
"$$\n",
"\\min_{\\{S_k\\}}\\ NCut(\\{S_k\\}) \\ \\textrm{ s.t. } \\cup_{k=1}^{K} S_k = V, \\quad S_k \\cap S_{k'}=\\emptyset, \\ \\forall k \\not= k'\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 3:** Solving the NCut problem in Q2 is NP-hard => let us consider a spectral relaxation of NCut. Write down the Spectral Matrix A of NCut that satisfies the equivalent functional optimization problem of Q2: \n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S, \\quad\\quad\\quad(3)\n",
"$$\n",
"\n",
"where\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=???\n",
"$$\n",
"\n",
"It is not necessary to provide details.\n",
"\n",
"*Hint:* Let us introduce the indicator matrix $F$ of the clusters $S_k$ such that:\n",
"\n",
"$$\n",
"F_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"1 & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with $F$ as:\n",
"\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k}\\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k} - F_{\\cdot,k}^\\top W F_{\\cdot,k} = F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} \\quad\n",
"$$\n",
"\n",
"We thus have\n",
"\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } \n",
"$$\n",
"\n",
"\n",
"Set $\\hat{F}_{\\cdot,k}=D^{1/2}F_{\\cdot,k}$ and observe that\n",
"\n",
"$$\n",
"\\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top D^{-1/2}(D - W)D^{-1/2} \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top (I - D^{-1/2}WD^{-1/2}) \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } ,\n",
"$$\n",
"\n",
"with $L_N=I - D^{-1/2}WD^{-1/2}$ is the normalized graph Laplacian. Set $Y_{\\cdot,k}=\\frac{\\hat{F}_{\\cdot,k}}{\\|\\hat{F}_{\\cdot,k}\\|_2}$:\n",
"\n",
"$$\n",
"\\frac{ \\hat{F}_{\\cdot,k}^\\top L_N \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = Y_{\\cdot,k}^\\top L_N Y_{\\cdot,k} \\quad\\quad\\quad(2)\n",
"$$\n",
"\n",
"\n",
"Using (2), we can rewrite (1) as a functional optimization problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S,\n",
"$$\n",
"\n",
"where\n",
"\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=???\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q3:** Let us introduce the indicator matrix $F$ of the clusters $S_k$ such that:\n",
"\n",
"$$\n",
"F_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"1 & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"We may rewrite the Cut operator and the Volume operator with $F$ as:\n",
"\n",
"$$\n",
"Vol(S_k) = \\sum_{i\\in S_k, j\\in V} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k}\\\\\n",
"Cut(S_k,S_k^c) = \\sum_{i\\in S_k, j\\in V} W_{ij} - \\sum_{i\\in S_k, j\\in S_k} W_{ij} = F_{\\cdot,k}^\\top D F_{\\cdot,k} - F_{\\cdot,k}^\\top W F_{\\cdot,k} = F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} \\quad\n",
"$$\n",
"\n",
"We thus have\n",
"\n",
"$$\n",
"\\frac{Cut(S_k,S_k^c)}{Vol(S_k)} = \\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } \n",
"$$\n",
"\n",
"\n",
"Set $\\hat{F}_{\\cdot,k}=D^{1/2}F_{\\cdot,k}$ and observe that\n",
"\n",
"$$\n",
"\\frac{ F_{\\cdot,k}^\\top (D - W) F_{\\cdot,k} }{ F_{\\cdot,k}^\\top D F_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top D^{-1/2}(D - W)D^{-1/2} \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = \\frac{ \\hat{F}_{\\cdot,k}^\\top (I - D^{-1/2}WD^{-1/2}) \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } ,\n",
"$$\n",
"\n",
"where $L_N=I - D^{-1/2}WD^{-1/2}$ is the normalized graph Laplacian. Set $Y_{\\cdot,k}=\\frac{\\hat{F}_{\\cdot,k}}{\\|\\hat{F}_{\\cdot,k}\\|_2}$, we have:\n",
"\n",
"$$\n",
"\\frac{ \\hat{F}_{\\cdot,k}^\\top L_N \\hat{F}_{\\cdot,k} }{ \\hat{F}_{\\cdot,k}^\\top \\hat{F}_{\\cdot,k} } = Y_{\\cdot,k}^\\top L_N Y_{\\cdot,k} \\quad\\quad\\quad(2)\n",
"$$\n",
"\n",
"\n",
"Using (2), we can rewrite (1) as a functional optimization problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K \\textrm{ and } Y \\in Ind_S,\n",
"$$\n",
"\n",
"where\n",
"\n",
"$$\n",
"Y \\in Ind_S \\ \\textrm{ reads as } \\ Y_{ik} = \n",
"\\left\\{\n",
"\\begin{array}{ll}\n",
"\\big(\\frac{D_{ii}}{Vol(S_k)}\\big)^{1/2} & \\textrm{if} \\ i \\in S_k\\\\\n",
"0 & \\textrm{otherwise}\n",
"\\end{array}\n",
"\\right..\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"A=L_N=I-D^{-1/2}WD^{-1/2}.\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 4:** Drop the cluster indicator constraint $Y\\in Ind_S$ in Q3, how do you compute the solution $Y^\\star$ of (3)? Why the first column of $Y^\\star$ is not relevant for clustering?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer to Q4:** Your answer here.\n",
"\n",
"Dropping the constraint $Y\\in Ind_S$ in (3) leads to a standard spectral relaxation problem:\n",
"\n",
"$$\n",
"\\min_{Y}\\ tr( Y^\\top A Y) \\ \\textrm{ s.t. } \\ Y^\\top Y = I_K,\n",
"$$\n",
"\n",
"which solution $Y^\\star$ is given by the $K$ smallest eigenvectors/eigenvalues of $A=L_N=I-D^{-1/2}WD^{-1/2}$. Note that the first column of $Y^\\star$ is the constant signal $y_1=\\frac{1}{\\sqrt{n}}1_{n\\times 1}$ associated to the smallest eigenvalue of $L_N$, which has value $\\lambda_1=0$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 5:** Plot in 3D the 2nd, 3rd, 4th columns of $Y^\\star$.
\n",
"Hint: Compute the degree matrix $D$.
\n",
"Hint: You may use function *D_sqrt_inv = scipy.sparse.diags(d_sqrt_inv.A.squeeze(), 0)* for creating $D^{-1/2}$.
\n",
"Hint: You may use function *I = scipy.sparse.identity(d.size, dtype=W.dtype)* for creating a sparse identity matrix.
\n",
"Hint: You may use function *lamb, U = scipy.sparse.linalg.eigsh(A, k=4, which='SM')* to perform the eigenvalue decomposition of A.
\n",
"Hint: You may use function *ax.scatter(Xdisp, Ydisp, Zdisp, c=Cgt)* for 3D visualization."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of nodes = 2000\n",
"Number of classes = 10\n"
]
}
],
"source": [
"# Load dataset: W is the Adjacency Matrix and Cgt is the ground truth clusters\n",
"mat = scipy.io.loadmat('datasets/mnist_2000_graph.mat')\n",
"W = mat['W']\n",
"n = W.shape[0]\n",
"Cgt = mat['Cgt'] - 1; Cgt = Cgt.squeeze()\n",
"nc = len(np.unique(Cgt))\n",
"print('Number of nodes =',n)\n",
"print('Number of classes =',nc);"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 1.34828146e-05 2.81922200e-02 3.81623410e-02 5.08202654e-02]\n"
]
},
{
"data": {
"application/javascript": [
"/* Put everything inside the global mpl namespace */\n",
"window.mpl = {};\n",
"\n",
"mpl.get_websocket_type = function() {\n",
" if (typeof(WebSocket) !== 'undefined') {\n",
" return WebSocket;\n",
" } else if (typeof(MozWebSocket) !== 'undefined') {\n",
" return MozWebSocket;\n",
" } else {\n",
" alert('Your browser does not have WebSocket support.' +\n",
" 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n",
" 'Firefox 4 and 5 are also supported but you ' +\n",
" 'have to enable WebSockets in about:config.');\n",
" };\n",
"}\n",
"\n",
"mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n",
" this.id = figure_id;\n",
"\n",
" this.ws = websocket;\n",
"\n",
" this.supports_binary = (this.ws.binaryType != undefined);\n",
"\n",
" if (!this.supports_binary) {\n",
" var warnings = document.getElementById(\"mpl-warnings\");\n",
" if (warnings) {\n",
" warnings.style.display = 'block';\n",
" warnings.textContent = (\n",
" \"This browser does not support binary websocket messages. \" +\n",
" \"Performance may be slow.\");\n",
" }\n",
" }\n",
"\n",
" this.imageObj = new Image();\n",
"\n",
" this.context = undefined;\n",
" this.message = undefined;\n",
" this.canvas = undefined;\n",
" this.rubberband_canvas = undefined;\n",
" this.rubberband_context = undefined;\n",
" this.format_dropdown = undefined;\n",
"\n",
" this.image_mode = 'full';\n",
"\n",
" this.root = $('