{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Measuring differentiation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import scipy\n", "import numpy as np\n", "from sklearn.neighbors import KernelDensity\n", "from sklearn.decomposition import PCA\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.cluster import MeanShift, estimate_bandwidth\n", "\n", "import pandas as pd\n", "\n", "from scipy import stats\n", "from scipy.stats import beta\n", "from math import sin\n", "from random import randint\n", "\n", "import matplotlib.pyplot as plt\n", "import itertools as it\n", "\n", "import plotly\n", "import plotly.plotly as py\n", "import plotly.graph_objs as go\n", "from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n", "\n", "from ipywidgets import interact, interactive, fixed, interact_manual\n", "import ipywidgets as widgets\n", "init_notebook_mode(connected=True)\n", "\n", "import collections\n", "\n", "def recursively_default_dict():\n", "    return collections.defaultdict(recursively_default_dict)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do you measure the overlap of two distributions?\n", "\n", "This is a frequently asked question, one on which a remarkable number of studies have been published. 
When faced with a structured data set, the degree, location and timing of contact between the populations of interest can often answer our questions directly, or point the way to further research.\n", "\n", "Existing metrics for this purpose usually take into account the means, standard deviations and sizes of the populations involved, the combinations of which are matched against known distributions in order to extract their significance (Fisher's or Student's t distributions are the usual resort). \n", "\n", "The reliance on summary statistics reflects the computing power available to the researchers who developed these metrics, which, for most, was very limited. What I propose here is a fairly brute-force approach to the matter, but one that takes advantage of our modern tools and processors. \n", "\n", "\n", "We will generate three clouds of points (the possible output of dimensionality reduction on a very tidy data set). Two of these populations will keep a distance proportional to the sine of an 'X' variable (time, physical location, your choice). " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a2030d5e8c6945cf9e05042ec59a2862", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "interactive(children=(IntSlider(value=15, description='x', max=30), Output()), _dom_classes=('widget-interact',))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## creating a stable pop:\n", "cov_pop1= [[2,0,0],[0,1,0],[0,0,1]]\n", "cov_pop2= [[1,0,0],[0,.3,0],[0,0,.1]]\n", "cov_pop3= [[1,0,0],[0,.3,0],[0,0,.1]]\n", "\n", "### Number of values:\n", "Sizes= [170,130,30]\n", "labels= [2,0,1]\n", "label_vector= np.repeat(np.array(labels),Sizes)\n", "\n", "def jumpingJack(x):\n", "    height= sin(x) * 5\n", "    \n", "    mean_pop1= [2,0,7]\n", "    mean_pop3= [0,10,height]\n", "    mean_pop2= [0,10,-height]\n", "    \n", "    \n", "    Pop1= np.random.multivariate_normal(mean_pop1, cov_pop1, Sizes[0])\n", "    Pop2= np.random.multivariate_normal(mean_pop2, cov_pop2, Sizes[1])\n", "\n", "    feature= np.vstack((Pop1,Pop2))\n", "    \n", "    Pop3= np.random.multivariate_normal(mean_pop3, cov_pop3, Sizes[2])\n", "\n", "    data= np.vstack((feature,Pop3))\n", "\n", "    fig_data= [go.Scatter3d(\n", "        x = data[[x for x in range(sum(Sizes)) if label_vector[x] == i],0],\n", "        y = data[[x for x in range(sum(Sizes)) if label_vector[x] == i],1],\n", "        z = data[[x for x in range(sum(Sizes)) if label_vector[x] == i],2],\n", "        mode= \"markers\",\n", "        marker= {\n", "            'line': {'width': 0},\n", "            'size': 4,\n", "            'symbol': 'circle',\n", "            \"opacity\": .8\n", "        },\n", "        name= str(i)\n", "    ) for i in labels]\n", "\n", "\n", "\n", "    layout = go.Layout(\n", "        margin=dict(\n", "            l=0,\n", "            r=0,\n", "            b=0,\n", "            t=0\n", "        )\n", "    )\n", "    fig = go.Figure(data=fig_data, layout=layout)\n", "    iplot(fig)\n", "\n", "interact(jumpingJack,x=(0,30,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have three sets of points, two of which see their distribution overlap as a function of 'X'. 
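\n", "\n", "Before formalizing anything, the basic ingredient can be shown in one dimension. The sketch below is a minimal illustration, not the notebook's actual pipeline: it fits scikit-learn's KernelDensity to two 1-D samples (with an arbitrary bandwidth of 0.5 and an arbitrary density cutoff of 0.01), scores a shared grid, and counts grid points that are dense under both distributions:\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.neighbors import KernelDensity\n", "\n", "A = np.random.normal(0, 1, 200).reshape(-1, 1)\n", "B = np.random.normal(1.5, 1, 200).reshape(-1, 1)\n", "grid = np.linspace(-5, 6.5, 300).reshape(-1, 1)\n", "\n", "kde_A = KernelDensity(bandwidth=0.5).fit(A)\n", "kde_B = KernelDensity(bandwidth=0.5).fit(B)\n", "\n", "# grid points with non-negligible density under each sample\n", "dense_A = np.exp(kde_A.score_samples(grid)) > 0.01\n", "dense_B = np.exp(kde_B.score_samples(grid)) > 0.01\n", "\n", "shared = np.sum(dense_A & dense_B)\n", "print(shared / dense_A.sum(), shared / dense_B.sum())\n", "```\n", "\n", "The same counting logic, extended to three dimensions and to normalized densities, is what follows.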
\n", "\n", "What I wanted out of this was something akin to conditional probabilities:\n", "- P(A|B): what proportion of distribution A also lies within distribution B?\n", "- P(B|A): what proportion of distribution B also lies within distribution A?\n", "- O - P(A U B): what proportion of the space spanned by the two distributions is covered by neither?\n", "\n", "And once again I was brought back to KDE. \n", "- O, the whole of the space occupied by the two distributions, we approximate using a meshgrid spanning the min and max of each cloud in each dimension. \n", "\n", "- The coarseness of this grid we set as P. Then N, the total number of points in the grid, equals P ** (number of dimensions).\n", "\n", "Then, for every pair of populations, we determine which grid points have a significant density relative to each distribution. I set this threshold to .005, and normalized densities against those of the actual data points each estimate was drawn from.\n", "\n", "- P(A) = number of grid points where La >= threshold.\n", "- P(B) = number of grid points where Lb >= threshold.\n", "- P(A int B) = number of grid points where La >= threshold and Lb >= threshold.\n", "\n", "It then becomes a matter of P(A|B) = P(A int B) / P(B), likewise for P(B|A), and Empty = N - P(A) - P(B) + P(A int B).\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "## Function to calculate overlap given a coordinates matrix and a dictionary of indices.\n", "def extract_profiles(global_data,target_ind_dict,threshold,P):\n", "    ## estimate the bandwidth\n", "    params = {'bandwidth': np.linspace(np.min(global_data), np.max(global_data),20)}\n", "    grid = GridSearchCV(KernelDensity(algorithm = \"ball_tree\",breadth_first = False), params,verbose=0)\n", "\n", "    ## fit a cross-validated KDE to each population.\n", "    combine= {}\n", "    for bull in target_ind_dict.keys():\n", "        grid.fit(global_data[target_ind_dict[bull],:])\n", "        combine[bull]= grid.best_estimator_ \n", "\n", "    Stats= 
recursively_default_dict()\n", "\n", " for combo in it.combinations(target_ind_dict.keys(),2):\n", " pop1= combo[0]\n", " pop2= combo[1]\n", "\n", " All_coords= [x for x in it.chain(*[target_ind_dict[z] for z in combo])]\n", "\n", " Quanted_set= global_data[All_coords,:]\n", "\n", " i_coords, j_coords, z_coords = np.meshgrid(np.linspace(min(Quanted_set[:,0]),max(Quanted_set[:,0]),P),\n", " np.linspace(min(Quanted_set[:,1]),max(Quanted_set[:,1]),P),\n", " np.linspace(min(Quanted_set[:,2]),max(Quanted_set[:,2]),P), indexing= 'ij')\n", "\n", "\n", " traces= [x for x in it.product(range(P),range(P),range(P))]\n", "\n", " background= np.array([i_coords,j_coords,z_coords])\n", "\n", " background= [background[:,c[0],c[1],c[2]] for c in traces]\n", "\n", " background=np.array(background)\n", "\n", " pop1_fist= combine[pop1].score_samples(background)\n", " #pop1_fist= np.exp(pop1_fist)\n", " P_dist_pop1= combine[pop1].score_samples(global_data[target_ind_dict[pop1],:])\n", " pop1_fist = scipy.stats.norm(np.mean(P_dist_pop1),np.std(P_dist_pop1)).cdf(pop1_fist)\n", " pop1_fist= [int(x >= threshold) for x in pop1_fist]\n", " \n", " pop2_fist= combine[pop2].score_samples(background)\n", " #pop2_fist= np.exp(pop2_fist)\n", " P_dist_pop2= combine[pop2].score_samples(global_data[target_ind_dict[pop2],:])\n", " pop2_fist = scipy.stats.norm(np.mean(P_dist_pop2),np.std(P_dist_pop2)).cdf(pop2_fist)\n", " pop2_fist= [int(x >= threshold) for x in pop2_fist]\n", "\n", " \n", " pop1_and_2= len([x for x in range(background.shape[0]) if pop1_fist[x] == 1 and pop2_fist[x] == 1])\n", " pop1_I_pop2= pop1_and_2 / float(sum(pop1_fist))\n", " pop2_I_pop1= pop1_and_2 / float(sum(pop2_fist))\n", " \n", " empty_space= 1 - (sum(pop1_fist) + sum(pop2_fist) - pop1_and_2) / background.shape[0]\n", " \n", " Stats[combo][pop1]= pop1_I_pop2\n", " Stats[combo][pop2]= pop2_I_pop1\n", " Stats[combo]['empty']= empty_space\n", " \n", " return Stats\n", "\n", "## Generating sets of coordinates given X.\n", "def 
jumpingJack(x):\n", " height= sin(x) * 5\n", " \n", " mean_pop1= [-1,0,4]\n", " mean_pop3= [0,10,height]\n", " mean_pop2= [0,10,-height]\n", " \n", " \n", " Pop1= np.random.multivariate_normal(mean_pop1, cov_pop1, Sizes[0])\n", " Pop2= np.random.multivariate_normal(mean_pop2, cov_pop2, Sizes[1])\n", "\n", " feature= np.vstack((Pop1,Pop2))\n", " \n", " Pop3= np.random.multivariate_normal(mean_pop3, cov_pop3, Sizes[2])\n", "\n", " data= np.vstack((feature,Pop3))\n", " return data\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Varying 'X'. This can take a few minutes.\n", "\n", "Windows= recursively_default_dict()\n", "\n", "target_indx= {z:[x for x in range(len(label_vector)) if label_vector[x] == z] for z in list(set(label_vector))}\n", "threshold= .005\n", "P= 30\n", "\n", "for window in np.arange(0.0, 30, .1):\n", " data = jumpingJack(window)\n", " \n", " profiles= extract_profiles(data,target_indx,threshold,P)\n", " \n", " Windows[window]= profiles\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9f9ae3ee0a934fe1bebcfc7157723754", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "interactive(children=(Dropdown(description='tup', options=('01', '02', '12'), value='01'), Output()), _dom_classes=('widget-interact',))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot overlaps\n", "\n", "def plot_proximities(tup):\n", "    Xs= []\n", "    pop1= []\n", "    labels= []\n", "    \n", "    for window in Windows.keys():\n", "        Xs.extend([window] * 3)\n", "        pop1.append(Windows[window][tuple((int(tup[0]),int(tup[1])))][int(tup[0])])\n", "        pop1.append(Windows[window][tuple((int(tup[0]),int(tup[1])))][int(tup[1])])\n", "        pop1.append(Windows[window][tuple((int(tup[0]),int(tup[1])))]['empty'])\n", "        labels.extend(['P2|1','P1|2','leftover space'])\n", "    \n", "    coords= {i:[x for x in range(len(labels)) if labels[x] == i] for i in list(set(labels))}\n", "    \n", "    datum= np.array([Xs,pop1]).T\n", "    \n", "    fig_data= [go.Scatter(\n", "        x= datum[coords[i],0],\n", "        y= datum[coords[i],1],\n", "        mode= 'lines',\n", "        name= i\n", "    ) for i in coords.keys()\n", "    ]\n", "    layout = go.Layout(\n", "        margin=dict(\n", "            l=0,\n", "            r=0,\n", "            b=0,\n", "            t=0\n", "        ),\n", "        title= 'Overlap between populations along X',\n", "        yaxis=dict(\n", "            range=[0, 1]\n", "        )\n", "    )\n", "    fig = go.Figure(data=fig_data, layout=layout)\n", "    iplot(fig)\n", "\n", "\n", "\n", "interact(plot_proximities, tup=['01','02','12'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }