{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [NTDS'18] tutorial 2: build a graph from features\n", "[ntds'18]: https://github.com/mdeff/ntds_2018\n", "\n", "[Benjamin Ricaud](https://people.epfl.ch/benjamin.ricaud), [EPFL LTS2](https://lts2.epfl.ch), with contributions from [Michaël Defferrard](http://deff.ch) and [Effrosyni Simou](https://lts4.epfl.ch/simou).\n", "\n", "* Dataset: [Iris](https://archive.ics.uci.edu/ml/datasets/Iris)\n", "* Tools: [pandas](https://pandas.pydata.org), [numpy](http://www.numpy.org), [scipy](https://www.scipy.org), [matplotlib](https://matplotlib.org), [networkx](https://networkx.github.io), [gephi](https://gephi.org/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The line below is a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that allows plots to appear in the notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing is always to import the packages we'll use." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from scipy.spatial.distance import pdist, squareform\n", "from matplotlib import pyplot as plt\n", "import networkx as nx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tutorials on pandas can be found at:\n", "* \n", "* \n", "\n", "Tutorials on numpy can be found at:\n", "* \n", "* \n", "* \n", "\n", "A tutorial on networkx can be found at:\n", "* " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import and explore the data\n", "\n", "We will play with the famous Iris dataset. This dataset can be found in many places on the net and was first released at . 
For example, it is stored on [Kaggle](https://www.kaggle.com/uciml/iris/), with many demos and Jupyter notebooks you can test (have a look at the \"kernels\" tab).\n", "\n", "![Iris by Za, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=144395](figures/iris_germanica.jpg)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Id</th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n", "<tr><th>1</th><td>2</td><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n", "<tr><th>2</th><td>3</td><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>Iris-setosa</td></tr>\n", "<tr><th>3</th><td>4</td><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>Iris-setosa</td></tr>\n", "<tr><th>4</th><td>5</td><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species\n", "0 1 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 2 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 3 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris = pd.read_csv('data/iris.csv')\n", "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The description of the entries is given here:\n", "https://www.kaggle.com/uciml/iris/home" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris['Species'].unique()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Id</th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>count</th><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td></tr>\n", "<tr><th>mean</th><td>75.500000</td><td>5.843333</td><td>3.054000</td><td>3.758667</td><td>1.198667</td></tr>\n", "<tr><th>std</th><td>43.445368</td><td>0.828066</td><td>0.433594</td><td>1.764420</td><td>0.763161</td></tr>\n", "<tr><th>min</th><td>1.000000</td><td>4.300000</td><td>2.000000</td><td>1.000000</td><td>0.100000</td></tr>\n", "<tr><th>25%</th><td>38.250000</td><td>5.100000</td><td>2.800000</td><td>1.600000</td><td>0.300000</td></tr>\n", "<tr><th>50%</th><td>75.500000</td><td>5.800000</td><td>3.000000</td><td>4.350000</td><td>1.300000</td></tr>\n", "<tr><th>75%</th><td>112.750000</td><td>6.400000</td><td>3.300000</td><td>5.100000</td><td>1.800000</td></tr>\n", "<tr><th>max</th><td>150.000000</td><td>7.900000</td><td>4.400000</td><td>6.900000</td><td>2.500000</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm\n", "count 150.000000 150.000000 150.000000 150.000000 150.000000\n", "mean 75.500000 5.843333 3.054000 3.758667 1.198667\n", "std 43.445368 0.828066 0.433594 1.764420 0.763161\n", "min 1.000000 4.300000 2.000000 1.000000 0.100000\n", "25% 38.250000 5.100000 2.800000 1.600000 0.300000\n", "50% 75.500000 5.800000 3.000000 4.350000 1.300000\n", "75% 112.750000 6.400000 3.300000 5.100000 1.800000\n", "max 150.000000 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a graph from the features\n", "\n", "We are going to build a graph from this data. The idea is to represent iris samples (rows of the table) as nodes, with connections depending on their physical similarity.\n", "\n", "The main question is how to define the notion of similarity between the flowers. For that, we need to introduce a measure of similarity. It should use the properties of the flowers and provide a positive real value for each pair of samples. The value should be larger for more similar samples.\n", "\n", "Let us separate the data into two parts: physical properties and labels." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "features = iris.loc[:, ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]\n", "species = iris.loc[:, 'Species']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td></tr>\n", "<tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td></tr>\n", "<tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td></tr>\n", "<tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td></tr>\n", "<tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Iris-setosa\n", "1 Iris-setosa\n", "2 Iris-setosa\n", "3 Iris-setosa\n", "4 Iris-setosa\n", "Name: Species, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "species.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Similarity, distance and edge weight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can define many similarity measures. One of the most intuitive and perhaps the easiest to program relies on the notion of distance. If a distance between samples is defined, we can compute the weight accordingly: if the distance is short, which means the nodes are similar, we want a strong edge (large weight)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Different distances\n", "The cosine distance is a good candidate for high-dimensional data. It is defined as follows:\n", "$$d(u,v) = 1 - \\frac{u \\cdot v} {\\|u\\|_2 \\|v\\|_2},$$\n", "where $u$ and $v$ are two feature vectors.\n", " \n", "The distance equals one minus the cosine of the angle formed by the two vectors (0 if they point in the same direction, 1 if orthogonal, 2 if they point in opposite directions).\n", "\n", "Alternatives are the distances induced by the [$p$-norms](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) (or $\\ell_p$-norms), defined as\n", "$$d(u,v) = \\|u - v\\|_p,$$\n", "of which the Euclidean distance is a special case with $p=2$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pdist` function from `scipy` computes the pairwise distances. By default, it uses the Euclidean distance. 
`features.values` is a numpy array extracted from the Pandas dataframe. Very handy." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#from scipy.spatial.distance import pdist, squareform\n", "pdist?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "distances = pdist(features.values, metric='euclidean')\n", "# other metrics: 'cosine', 'cityblock', 'minkowski'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a distance, we can compute the weights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Distance to weights\n", "A common function used to turn distances into edge weights is the Gaussian function:\n", "$$\\mathbf{W}(u,v) = \\exp \\left( \\frac{-d^2(u, v)}{\\sigma^2} \\right),$$\n", "where $\\sigma$ is the parameter which controls the width of the Gaussian.\n", " \n", "The function giving the weights should be positive and monotonically decreasing with respect to the distance. It should take its maximum value when the distance is zero, and tend to zero as the distance increases. Note that distances are non-negative by definition. So any function $f : \\mathbb{R}^+ \\rightarrow [0,C]$ that satisfies $f(0)=C$ and $\\lim_{x \\rightarrow +\\infty}f(x)=0$ and is *strictly* decreasing should be suitable. The choice of the function depends on the data.\n", "\n", "Some examples:\n", "* A simple linear function $\\mathbf{W}(u,v) = \\frac{d_{max} - d(u, v)}{d_{max} - d_{min}}$. 
As the cosine distance is bounded by $[0,2]$, a suitable linear function for it would be $\\mathbf{W}(u,v) = 1 - d(u,v)/2$.\n", "* A triangular kernel: a straight line between the points $(0,1)$ and $(t_0,0)$, and equal to 0 after this point.\n", "* The logistic kernel $\\left(e^{d(u,v)} + 2 + e^{-d(u,v)} \\right)^{-1}$.\n", "* An inverse function $(\\epsilon+d(u,v))^{-n}$, with $n \\in \\mathbb{N}^{+*}$ and $\\epsilon \\in \\mathbb{R}^+$.\n", "* You can find some more [here](https://en.wikipedia.org/wiki/Kernel_%28statistics%29).\n", " " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Let us use the Gaussian function\n", "kernel_width = distances.mean()\n", "weights = np.exp(-distances**2 / kernel_width**2)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Turn the condensed vector of pairwise weights into a symmetric matrix (zero diagonal).\n", "adjacency = squareform(weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, you may need to compute additional features before feeding them to a machine learning algorithm or another data processing step. With Pandas, it is as simple as this:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>SepalLengthSquared</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>26.01</td></tr>\n", "<tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>24.01</td></tr>\n", "<tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>22.09</td></tr>\n", "<tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>21.16</td></tr>\n", "<tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>25.00</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " SepalLengthSquared \n", "0 26.01 \n", "1 24.01 \n", "2 22.09 \n", "3 21.16 \n", "4 25.00 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute a new column using the existing ones.\n", "features['SepalLengthSquared'] = features['SepalLengthCm']**2\n", "features.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Coming back to the weight matrix, we have obtained a full matrix but we may not need all the connections (reducing the number of connections saves some space and computations!). We can sparsify the graph by removing the values (edges) below some fixed threshold. Let us see what kind of threshold we could use:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAEICAYAAABWJCMKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAFcpJREFUeJzt3X+0nVV95/H3RxAdBxRsAmIIBG2cGl0julLUZafShcOPaMWOVWFVDZSZaAuOrcwP/NGB0VqxHXXVGaqDQxagFaStlqi0FFFBR1GCVeTHOKYYIQZJEEQUaw3znT/Ojh7Cvbnn3nvuvbns92uts87z7Gc/z7P3uTfnc/d+nnOSqkKS1J9HLHQDJEkLwwCQpE4ZAJLUKQNAkjplAEhSpwwASeqUAaCxS/L+JH8wpmMdmuSHSfZq659N8m/Hcex2vL9JsnZcx5vGef8wyV1JvjuH5/hhkieNWLeS/OJctUV7JgNA05Jkc5IfJ7kvyfeTfCHJa5P87Hepql5bVW8b8Vgv2F2dqrqtqvatqgfG0Pazk3xol+MfX1UXzvbY02zHcuAMYFVVPWGuztNet1tne5wkJyf5/DjapD2LAaCZ+PWq2g84DDgH+M/A+eM+SZK9x33MPcRhwPeqattCN0R9MwA0Y1V1b1VtAF4BrE3ydIAkFyT5w7a8JMkn2mjh7iSfS/KIJB8EDgU+3qYq/lOSFW0q4tQktwGfHiobDoMnJ/lyknuTXJbk8e1cRyXZMtzGnaOMJMcBbwJe0c73tbb9Z1NKrV1vSfLtJNuSXJTkcW3bznasTXJbm75582SvTZLHtf23t+O9pR3/BcCVwBNbOy6YYN+rk7y0Lf9KO++atv6CJF8dqvvbSW5Jck+SK5IcNrTtZ9M6SX4hyceT/CDJdW0Kate/6l+Q5JvtWOdm4KnA+4HntvZ+vx1vTZKb20jwO0n+w2SvhfZcBoBmraq+DGwB/tUEm89o25YCBzF4E66qehVwG4PRxL5V9cdD+zwfeCpw7CSnfDXw28ATgR3Ae0do498CfwR8pJ3vGRNUO7k9fg14ErAv8D92qfMrwL8Ajgb+S3uDnMh/Bx7XjvP81uZTqupTwPHA1taOkyfY92rgqLb8q8Ct7Rg7168GSPISBq/nv2Hw+n4OuHiS9pwL/Ah4ArC2PXb1IuCXgWcALweOrapbgNcCX2zt3b/VPR94TRsJPh349CTn1R7MANC4bAUeP0H5T4GDgcOq6qdV9bma+guozq6qH1XVjyfZ/sGqurGqfgT8AfDynReJZ+m3gHdX1a1V9UPgjcCJu4w+/mtV/biqvgZ8jcGb5YO0trwCeGNV3VdVm4F3Aa8asR1X8+A3/HcMrT+/bQd4DfCOqrqlqnYwCLgjhkcBQ+15KXBWVd1fVTcDE133OKeqvl9VtwGfAY7YTRt/CqxK8tiquqeqvjJi37QHMQA0LsuAuyco/xNgE/B3SW5NcuYIx7p9Gtu/DTwSWDJSK3fvie14w8fem8HIZafhu3buZzBK2NUSYJ8JjrVsxHZ8EXhKkoMYvAlfBCxPsgQ4Erim1TsM+NM2vfZ9Bq9/JjjP0taP4ddtotd4lL7t9FJgDfDtNmX13JF6pj2KAaBZS/LLDN50HnKnSPsL+IyqehLw68Abkhy9c/Mkh5xqhLB8aPlQBn+N3sVgiuMxQ+3ai8Gb36jH3crgTXX42DuAO6fYb1d3tTbteqzvjLJzVd0PXA+8Hrixqv4J+ALwBuAfququVvV2BtMw+w89/llVfWGXQ25v/ThkqGw5o3vI61ZV11XVCcCBwF8Dl07jeNpDGACasSSPTfIi4BLgQ1X19QnqvCjJLyYJ8APggfaAwRvrSPep7+KVSVYleQzwVuAv222i/xd4dJIXJnkk8BbgUUP73QmsyNAtq7u4GPj9JIcn2ZefXzPYMZ3GtbZcCrw9yX5tSuYNwId2v+eDXA2czs+nez67yzoMLs6+McnT4Gc
Xnl82SXs+Cpyd5DFJfonBNYlR3QkckmSfdp59kvxWksdV1U/5+c9Vi4wBoJn4eJL7GPwF+mbg3cApk9RdCXwK+CGDqY0/q6rPtm3vAN7SpjCmcxfJB4ELGExZPBr49zC4Kwn4XeB/Mfhr+0cMLkDv9Bft+XtJJpqzXt+OfQ3wLeAfgddNo13DXtfOfyuDkdGH2/FHdTWwHz+f7tl1nar6GPBO4JIkPwBuZHCBeSKnM7go/V0GfbwY+MmIbfk0cBPw3SQ7Rx+vAja3874WeOWIx9IeJP6HMFJ/krwTeEJVzfunoLXncAQgdSDJLyX5l+3e/iOBU4GPLXS7tLAerp+0lPRg+zGY9nkisI3BbamXLWiLtOCcApKkTjkFJEmd2qOngJYsWVIrVqxY6GZI0qJy/fXX31VVS6eqt0cHwIoVK9i4ceNCN0OSFpUk3566llNAktQtA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUqT36k8CzteLMTy7IeTef88IFOa8kTYcjAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdWrKAEiyPMlnktyS5KYkr2/lZyf5TpKvtseaoX3emGRTkm8kOXao/LhWtinJmXPTJUnSKPYeoc4O4Iyq+kqS/YDrk1zZtr2nqv7bcOUkq4ATgacBTwQ+leQpbfO5wL8GtgDXJdlQVTePoyOSpOmZMgCq6g7gjrZ8X5JbgGW72eUE4JKq+gnwrSSbgCPbtk1VdStAkktaXQNAkhbAtK4BJFkBPBP4Uis6PckNSdYnOaCVLQNuH9ptSyubrHzXc6xLsjHJxu3bt0+neZKkaRg5AJLsC/wV8HtV9QPgfcCTgSMYjBDetbPqBLvXbsofXFB1XlWtrqrVS5cuHbV5kqRpGuUaAEkeyeDN/8+r6qMAVXXn0PYPAJ9oq1uA5UO7HwJsbcuTlUuS5tkodwEFOB+4parePVR+8FC13wBubMsbgBOTPCrJ4cBK4MvAdcDKJIcn2YfBheIN4+mGJGm6RhkBPA94FfD1JF9tZW8CTkpyBINpnM3AawCq6qYklzK4uLsDOK2qHgBIcjpwBbAXsL6qbhpjXyRJ0zDKXUCfZ+L5+8t3s8/bgbdPUH757vaTJM0fPwksSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkTk0ZAEmWJ/lMkluS3JTk9a388UmuTPLN9nxAK0+S9ybZlOSGJM8aOtbaVv+bSdbOXbckSVMZZQSwAzijqp4KPAc4Lckq4EzgqqpaCVzV1gGOB1a2xzrgfTAIDOAs4NnAkcBZO0NDkjT/pgyAqrqjqr7Slu8DbgGWAScAF7ZqFwIvacsnABfVwLXA/kkOBo4Frqyqu6vqHuBK4Lix9kaSNLK9p1M5yQrgmcCXgIOq6g4YhESSA1u1ZcDtQ7ttaWWTle96jnUMRg4ceuih02meJI3dijM/uSDn3XzOC+f8HCNfBE6yL/BXwO9V1Q92V3WCstpN+YMLqs6rqtVVtXrp0qWjNk+SNE0jBUCSRzJ48//zqvpoK76zTe3Qnre18i3A8qHdDwG27qZckrQARrkLKMD5wC1V9e6hTRuAnXfyrAUuGyp/dbsb6DnAvW2q6ArgmCQHtIu/x7QySdICGOUawPOAVwFfT/LVVvYm4Bzg0iSnArcBL2vbLgfWAJuA+4FTAKrq7iRvA65r9d5aVXePpReSpGmbMgCq6vNMPH8PcPQE9Qs4bZJjrQfWT6eBkqS54SeBJalTBoAkdcoAkKROGQC
S1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROTRkASdYn2ZbkxqGys5N8J8lX22PN0LY3JtmU5BtJjh0qP66VbUpy5vi7IkmajlFGABcAx01Q/p6qOqI9LgdIsgo4EXha2+fPkuyVZC/gXOB4YBVwUqsrSVoge09VoaquSbJixOOdAFxSVT8BvpVkE3Bk27apqm4FSHJJq3vztFssSRqL2VwDOD3JDW2K6IBWtgy4fajOllY2WflDJFmXZGOSjdu3b59F8yRJuzPTAHgf8GTgCOAO4F2tPBPUrd2UP7Sw6ryqWl1Vq5cuXTrD5kmSpjLlFNBEqurOnctJPgB8oq1uAZYPVT0E2NqWJyuXJC2AGY0Akhw8tPobwM47hDYAJyZ5VJLDgZXAl4HrgJVJDk+yD4MLxRtm3mxJ0mxNOQJIcjFwFLAkyRbgLOCoJEcwmMbZDLwGoKpuSnIpg4u7O4DTquqBdpzTgSuAvYD1VXXT2HsjSRrZKHcBnTRB8fm7qf924O0TlF8OXD6t1kmS5oyfBJakThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMz+ioISZpPK8785EI34WHJEYAkdcoAkKROGQCS1CkDQJI6ZQBIUqe8C2gOLOQdC5vPeeGCnVvS4uIIQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1asoASLI+ybYkNw6VPT7JlUm+2Z4PaOVJ8t4km5LckORZQ/usbfW/mWTt3HRHkjSqUUYAFwDH7VJ2JnBVVa0ErmrrAMcDK9tjHfA+GAQGcBbwbOBI4KydoSFJWhhTBkBVXQPcvUvxCcCFbflC4CVD5RfVwLXA/kkOBo4Frqyqu6vqHuBKHhoqkqR5NNNrAAdV1R0A7fnAVr4MuH2o3pZWNln5QyRZl2Rjko3bt2+fYfMkSVMZ90XgTFBWuyl/aGHVeVW1uqpWL126dKyNkyT93EwD4M42tUN73tbKtwDLh+odAmzdTbkkaYHMNAA2ADvv5FkLXDZU/up2N9BzgHvbFNEVwDFJDmgXf49pZZKkBbL3VBWSXAwcBSxJsoXB3TznAJcmORW4DXhZq345sAbYBNwPnAJQVXcneRtwXav31qra9cKyJGkeTRkAVXXSJJuOnqBuAadNcpz1wPpptU6SNGf8JLAkdWrKEYAk7bTizE8udBM0Ro4AJKlTBoAkdcoAkKROGQCS1CkDQJI65V1ADzMLdZfG5nNeuCDnlTRzjgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpU/6PYBoL/ycyafFxBCBJnZrVCCDJZuA+4AFgR1WtTvJ44CPACmAz8PKquidJgD8F1gD3AydX1Vdmc36pRws12tLDzzhGAL9WVUdU1eq2fiZwVVWtBK5q6wDHAyvbYx3wvjGcW5I0Q3MxBXQCcGFbvhB4yVD5RTVwLbB/koPn4PySpBHMNgAK+Lsk1ydZ18oOqqo7ANrzga18GXD70L5bWtmDJFmXZGOSjdu3b59l8yRJk5ntXUDPq6qtSQ4Erkzyf3ZTNxOU1UMKqs4DzgNYvXr1Q7ZLksZjViOAqtranrcBHwOOBO7cObXTnre16luA5UO7HwJsnc35JUkzN+MRQJJ/Djyiqu5ry8cAbwU2AGuBc9rzZW2XDcDpSS4Bng3cu3OqSJop74iRZm42U0AHAR8b3N3J3sCHq+pvk1wHXJrkVOA24GWt/uUMbgHdxOA20FNmcW5J0izNOACq6lbgGROUfw84eoLyAk6b6fkkSePlJ4ElqVM
GgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUqXkPgCTHJflGkk1Jzpzv80uSBuY1AJLsBZwLHA+sAk5Ksmo+2yBJGpjvEcCRwKaqurWq/gm4BDhhntsgSQL2nufzLQNuH1rfAjx7uEKSdcC6tvrDJN+YxfmWAHfNYv/Fprf+gn3uRXd9zjtn1efDRqk03wGQCcrqQStV5wHnjeVkycaqWj2OYy0GvfUX7HMv7PPcmO8poC3A8qH1Q4Ct89wGSRLzHwDXASuTHJ5kH+BEYMM8t0GSxDxPAVXVjiSnA1cAewHrq+qmOTzlWKaSFpHe+gv2uRf2eQ6kqqauJUl62PGTwJLUKQNAkjq16ANgqq+WSPKoJB9p27+UZMX8t3K8RujzG5LcnOSGJFclGeme4D3ZqF8hkuQ3k1SSRX/L4Ch9TvLy9rO+KcmH57uN4zbC7/ahST6T5O/b7/eahWjnuCRZn2Rbkhsn2Z4k722vxw1JnjXWBlTVon0wuJD8D8CTgH2ArwGrdqnzu8D72/KJwEcWut3z0OdfAx7Tln+nhz63evsB1wDXAqsXut3z8HNeCfw9cEBbP3Ch2z0PfT4P+J22vArYvNDtnmWffxV4FnDjJNvXAH/D4DNUzwG+NM7zL/YRwChfLXECcGFb/kvg6CQTfSBtsZiyz1X1maq6v61ey+DzFovZqF8h8jbgj4F/nM/GzZFR+vzvgHOr6h6Aqto2z20ct1H6XMBj2/LjWOSfI6qqa4C7d1PlBOCiGrgW2D/JweM6/2IPgIm+WmLZZHWqagdwL/AL89K6uTFKn4edyuAviMVsyj4neSawvKo+MZ8Nm0Oj/JyfAjwlyf9Ocm2S4+atdXNjlD6fDbwyyRbgcuB189O0BTPdf+/TMt9fBTFuU361xIh1FpOR+5PklcBq4Plz2qK5t9s+J3kE8B7g5Plq0DwY5ee8N4NpoKMYjPI+l+TpVfX9OW7bXBmlzycBF1TVu5I8F/hg6/P/m/vmLYg5ff9a7COAUb5a4md1kuzNYNi4uyHXnm6kr9NI8gLgzcCLq+on89S2uTJVn/cDng58NslmBnOlGxb5heBRf7cvq6qfVtW3gG8wCITFapQ+nwpcClBVXwQezeCL4h6u5vTrcxZ7AIzy1RIbgLVt+TeBT1e7urJITdnnNh3yPxm8+S/2eWGYos9VdW9VLamqFVW1gsF1jxdX1caFae5YjPK7/dcMLviTZAmDKaFb57WV4zVKn28DjgZI8lQGAbB9Xls5vzYAr253Az0HuLeq7hjXwRf1FFBN8tUSSd4KbKyqDcD5DIaJmxj85X/iwrV49kbs858A+wJ/0a5331ZVL16wRs/SiH1+WBmxz1cAxyS5GXgA+I9V9b2Fa/XsjNjnM4APJPl9BlMhJy/mP+iSXMxgCm9Ju65xFvBIgKp6P4PrHGuATcD9wCljPf8ifu0kSbOw2KeAJEkzZABIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkTv1/HEtQwvaQS64AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(weights)\n", "plt.title('Distribution of weights')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Let us choose a threshold of 0.6.\n", "# Too high, we will have disconnected components\n", "# Too low, the graph will have too many connections\n", "adjacency[adjacency < 0.6] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Remark: The distances presented here do not work well for categorical data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph visualization\n", "\n", "To conclude, let us visualize the graph. We will use the python module networkx." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# A simple command to create the graph from the adjacency matrix.\n", "graph = nx.from_numpy_array(adjacency)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us try some direct visualizations using networkx." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/michael/.conda/envs/ntds_2018/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)\n", " if cb.is_numlike(alpha):\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "nx.draw_spectral(graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh! It seems to be separated into 3 parts! Are they related to the 3 different species of iris?\n", "\n", "Let us try another [layout algorithm](https://en.wikipedia.org/wiki/Graph_drawing#Layout_methods), where the edges are modeled as springs." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/michael/.conda/envs/ntds_2018/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)\n", " if cb.is_numlike(alpha):\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "nx.draw_spring(graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the graph to disk in the `gexf` format, readable by gephi and other tools that manipulate graphs. You may now explore the graph using gephi and compare the visualizations." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "nx.write_gexf(graph,'iris.gexf')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }