{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# _MiSTree Tutorial 2_ - Minimum Spanning Trees\n", "\n", "## (1) _Basic Usage_\n", "\n", "To construct the minimum spanning tree (MST) from a data set we will usually\n", "interact with the ``get_mst`` class. Unless you need to do something more sophisticated\n", "with the MST you will not need to use the internal functions that are used by the class.\n", "\n", "To initiate the class we will run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from __future__ import print_function\n", "import numpy as np\n", "import matplotlib.pylab as plt\n", "import mistree as mist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (1.1) _Initialising_\n", "#### _In 2D_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(1000)\n", "y = np.random.random_sample(1000)\n", "\n", "mst = mist.GetMST(x=x, y=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _In 3D_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(1000)\n", "y = np.random.random_sample(1000)\n", "z = np.random.random_sample(1000)\n", "\n", "mst = mist.GetMST(x=x, y=y, z=z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _In Tomographic Coordinates_\n", "\n", "We generate a uniform random distribution on the sphere." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "phi = 360.*np.random.random_sample(1000)\n", "theta = np.rad2deg(np.arccos(1.-2.*np.random.random_sample(1000)))\n", "\n", "mst = mist.GetMST(phi=phi, theta=theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _In Tomographic Celestial Coordinates_\n", "\n", "Once again using a uniform random distribution on the sphere." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ra = 360.*np.random.random_sample(1000)\n", "dec = np.rad2deg(np.arccos(1.-2.*np.random.random_sample(1000))) - 90.\n", "\n", "mst = mist.GetMST(ra=ra, dec=dec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _In Spherical Polar Coordinates_\n", "\n", "This generates a uniform distribution of points with a sphere of radius 10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "phi = 360.*np.random.random_sample(1000)\n", "theta = np.rad2deg(np.arccos(1.-2.*np.random.random_sample(1000)))\n", "r = 10.*(np.random.random_sample(1000))**(1./3.)\n", "\n", "mst = mist.GetMST(phi=phi, theta=theta, r=r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _In Spherical Celestial Coordinates_\n", "\n", "This generates a uniform distribution of points with a sphere of radius 10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ra = 360.*np.random.random_sample(1000)\n", "dec = np.rad2deg(np.arccos(1.-2.*np.random.random_sample(1000))) - 90.\n", "r = 10.*np.random.random_sample(1000)**(1./3.)\n", "\n", "mst = mist.GetMST(ra=ra, dec=dec, r=r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (1.2) _Measure MST statistics_\n", "\n", "And to construct the MST and output the MST statistics: degree (d), edge length (l),\n", "branch length (b) and branch shape (s):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(1000)\n", "y = np.random.random_sample(1000)\n", "\n", "mst = mist.GetMST(x=x, y=y)\n", "\n", "d, l, b, s = mst.get_stats()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would also like the edge (``l_index``) and branch index (``b_index``),\n", "this can be done in two ways:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d, l, b, s, l_index, b_index = mst.get_stats(include_index=True)\n", "\n", "# alternatively:\n", "\n", "d, l, b, s = mst.get_stats()\n", "l_index = mst.edge_index\n", "b_index = mst.branch_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The edge index (``l_index``) is a 2 dimensional array, indicating the pair of nodes\n", "that make up each edge. The branch index are list of the member edges in each branches.\n", "\n", "### (1.3) _Plotting the MST_\n", "\n", "#### _Plotting Edges_\n", "\n", "To plot the MST, i.e. the nodes and edges you can use the following piece of python code\n", "where we plot a set of 2D random points:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(100)\n", "y = np.random.random_sample(100)\n", "\n", "mst = mist.GetMST(x=x, y=y)\n", "d, l, b, s, l_index, b_index = mst.get_stats(include_index=True)\n", "\n", "plt.figure(figsize=(7., 7.))\n", "\n", "# plotting nodes:\n", "plt.scatter(x, y, s=10, color='r')\n", "\n", "# plotting MST edges:\n", "plt.plot([x[l_index[0]], x[l_index[1]]],\n", " [y[l_index[0]], y[l_index[1]]],\n", " color='k')\n", "\n", "plt.xlim(0., 1.)\n", "plt.ylim(0., 1.)\n", "plt.xlabel(r'$X$', size=16)\n", "plt.ylabel(r'$Y$', size=16)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Plotting Branches_\n", "\n", "If you would also like to plot branches then you can use the following piece of python code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(7., 7.))\n", "\n", "# plotting nodes:\n", "plt.scatter(x, y, s=10, color='r')\n", "\n", "# plotting branches:\n", "for i in range(0, len(b_index)):\n", " plt.plot([x[l_index[0][b_index[i][0]]], x[l_index[1][b_index[i][0]]]],\n", " [y[l_index[0][b_index[i][0]]], y[l_index[1][b_index[i][0]]]],\n", " color='C0', linestyle=':')\n", " plt.plot([x[l_index[0][b_index[i][1:-1]]], x[l_index[1][b_index[i][1:-1]]]],\n", " [y[l_index[0][b_index[i][1:-1]]], y[l_index[1][b_index[i][1:-1]]]],\n", " color='C0')\n", " plt.plot([x[l_index[0][b_index[i][-1]]], x[l_index[1][b_index[i][-1]]]],\n", " [y[l_index[0][b_index[i][-1]]], y[l_index[1][b_index[i][-1]]]],\n", " color='C0', linestyle=':')\n", "\n", "# ploting MST edges:\n", "plt.plot([x[l_index[0]], x[l_index[1]]],\n", " [y[l_index[0]], y[l_index[1]]],\n", " color='grey', linewidth=2, alpha=0.25)\n", "\n", "plt.plot([], [], color='C0', label=r'$Branch$ $Mid$')\n", "plt.plot([], [], color='C0', label=r'$Branch$ $End$', linestyle=':')\n", "plt.plot([], [], color='grey', alpha=0.25, label=r'$MST$ $Edges$')\n", "plt.xlim(0., 1.)\n", "plt.ylim(0., 1.)\n", "plt.xlabel(r'$X$', size=16)\n", "plt.ylabel(r'$Y$', size=16)\n", "plt.legend(loc='best')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (2) _Binning and Plotting_\n", "\n", "### (2.1) _Quick Bin and Plot_\n", "\n", "A very simple plot of the MST summary statistics can be generated using:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(1000)\n", "y = np.random.random_sample(1000)\n", "z = np.random.random_sample(1000)\n", "\n", "mst = mist.GetMST(x=x, y=y, z=z)\n", "d, l, b, s = mst.get_stats()\n", "\n", "# begins by binning the data and storing this in a dictionary.\n", "hmst = mist.HistMST()\n", "hmst.setup()\n", "mst_dict = hmst.get_hist(d, l, b, s)\n", "\n", "# plotting which takes as input the dictionary created before.\n", "pmst = mist.PlotHistMST()\n", "pmst.read_mst(mst_dict)\n", "pmst.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first ``HistMST`` class bins the data and stores it as a dictionary and the\n", "``PlotHistMST`` class is used make the plot.\n", "\n", "### (2.1) _Binning_\n", "\n", "Once we have created the data set we need to bin the data. This is done by first\n", "initialising the ``HistMST`` class and then setting it up. The most simple case\n", "(using the default settings) is shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmst = mist.HistMST()\n", "hmst.setup()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make the following changes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# to bin in log_10(l) and log_10(b) rather than just l and b:\n", "hmst.setup(uselog=True)\n", "\n", "# to bin using s rather than sqrt(1-s)\n", "hmst.setup(use_sqrt_s=False)\n", "\n", "# to output the unnormalised histograms (i.e. just counts)\n", "hmst.setup(usenorm=False)\n", "\n", "# to change the range of the binning, the number of bins, etc:\n", "\n", "# for the degree, although this is rarely necessary, please ensure the minimum\n", "# and maximum are half integers and the number of bins is equal to maximum-minimum.\n", "hmst.setup(d_min=0.5, d_max=6.5, num_d_bins=6) # these are the default values.\n", "\n", "# for edge lengths, note the default values are l_min=0., l_max=1.05*l.max()\n", "# and ``num_l_bins=100``.\n", "hmst.setup(l_min=0., l_max=10., num_l_bins=100)\n", "\n", "# for edge lengths, note the default value are b_min=0. and b_max=1.05*b.max()\n", "# and ``num_b_bins=100``.\n", "hmst.setup(b_min=0., b_max=10., num_b_bins=100)\n", "\n", "# for branch shape in either projections the range can be altered by changing the following,\n", "# however it will rarely be necessary to change from the default values of s_min=0 and s_max=1.,\n", "# but you may want to alter the binning (default is 50).\n", "hmst.setup(s_min=0., s_max=1., num_s_bins=50)\n", "\n", "# if you are instead using $log_{10}l$ and $log_{10}b$ then you would specify the range\n", "# by using the following but note the binning still uses num_l_bins and num_b_bins.\n", "hmst.setup(logl_min=-2., logl_max=4., logb_min=-1, logb_max=5.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once this is done we can actually pass the MST statistics to the class and create a dictionary\n", "of the binned statistics:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmst = mist.HistMST()\n", "hmst.setup(uselog=True)\n", "mst_dict = hmst.get_hist(d, l, b, s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dictionary created is stored with the following entries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(mst_dict.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ``uselog`` : stores whether the bins for l and b are in logs.\n", "- ``use_sqrt_s`` : stores whether the the bins for s are in s or sqrt(1-s)\n", "- ``usenorm`` : stores whether the histograms are normalised.\n", "- ``isgroup`` : stores whether the histogram come from a group of histograms (discussed later)\n", "- ``x_d`` : bin centres for degree\n", "- ``y_d`` : bin heights for degree\n", "- ``x_l`` : bin centres for edge length\n", "- ``y_l`` : bin heights for edge length\n", "- ``x_b`` : bin centres for branch length\n", "- ``y_b`` : bin heights for branch length\n", "- ``x_s`` : bin centres for branch shape\n", "- ``y_s`` : bin heights for branch shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if we want to instead store the ensemble mean and standard deviation of a group of MSTs we would\n", "add the individual MST to ``HistMST`` class in the following way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmst = mist.HistMST()\n", "hmst.setup(uselog=True)\n", "hmst.start_group() # this tells HistMST to store the individual binned MST statistics\n", "\n", "for i in range(0, 10):\n", "\n", " # Read or measure MST statistics, we will use the default levy flight distribution here\n", " \n", " x, y, z = mist.get_levy_flight(50000)\n", " \n", " mst = mist.GetMST(x=x, y=y, z=z)\n", " d, l, b, s = mst.get_stats()\n", " # we use it just as we did before, where the outputted dictionary is for that single\n", " # realisation\n", "\n", " mst_dict = hmst.get_hist(d, l, b, s)\n", " \n", " print(i+1, '/ 10')\n", "\n", "# to output the mean and standard deviation of the ensemble histograms.\n", "\n", "mst_dict_group = hmst.end_group()\n", "\n", "# you must use hmst.start_group() to start collecting MST statistics from another group\n", "# otherwise this will continue collecting histograms for the current group" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly to before the dictionary contains many of the same elements with some additional ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(mst_dict_group.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ``y_d_std`` : standard deviation for the bin heights for degree\n", "- ``y_l_std`` : standard deviation for the bin heights for edge length\n", "- ``y_b_std`` : standard deviation for the bin heights for branch length\n", "- ``y_s_std`` : standard deviation for the bin heights for branch shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes the assumption that the counts follow a Gaussian distribution, since these\n", "are counts this actually follows a discrete Poisson distribution but for large values\n", "a Gaussian is an appropriate approximation (usually greater than 50 should be okay).\n", "This is important to consider if you use these summary statistics in regions where\n", "the counts are small." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (2.2) _Plotting_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can choose to use this as well or use the default matplotlib fonts.\n", "\n", "Once we have the binned MST dictionary we can plot it very simply using ``PlotHistMST`` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pmst = mist.PlotHistMST()\n", "pmst.read_mst(mst_dict)\n", "pmst.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To alter how the plot looks we can alter the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pmst = mist.PlotHistMST()\n", "pmst.read_mst(mst_dict, color='Dodgerblue', linewidth=2., linestyle='--', alpha=0.8,\n", " label='Levy Flight')\n", "pmst.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To change from the default box binned plots to smooth lines (excluding degree):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pmst = mist.PlotHistMST()\n", "pmst.read_mst(mst_dict)\n", "pmst.plot(usebox=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing randoms points, a Levy-Flight distribution and adjusted Levy-Flight distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We first create a random distribution\n", "x_r = 75.*np.random.random_sample(50000)\n", "y_r = 75.*np.random.random_sample(50000)\n", "z_r = 75.*np.random.random_sample(50000)\n", "\n", "# a levy flight distribution\n", "x_lf, y_lf, z_lf = mist.get_levy_flight(50000)\n", "\n", "# an adjusted levy flight distribution\n", "x_alf, y_alf, z_alf = mist.get_adjusted_levy_flight(50000)\n", "\n", "# then construct and measure the MST for each distribution\n", "mst = mist.GetMST(x=x_r, y=y_r, z=z_r)\n", "d_r, l_r, b_r, s_r = mst.get_stats()\n", "\n", "mst = mist.GetMST(x=x_lf, y=y_lf, z=z_lf)\n", "d_lf, l_lf, b_lf, s_lf = mst.get_stats()\n", "\n", "mst = mist.GetMST(x=x_alf, y=y_alf, z=z_alf)\n", "d_alf, l_alf, b_alf, s_alf = mst.get_stats()\n", "\n", "# bin the MST statistics\n", "hmst = mist.HistMST()\n", "hmst.setup(uselog=True)\n", "hist_alf = hmst.get_hist(d_alf, l_alf, b_alf, s_alf)\n", "hist_lf = hmst.get_hist(d_lf, l_lf, b_lf, s_lf)\n", "hist_r = hmst.get_hist(d_r, l_r, b_r, s_r)\n", "\n", "# and plot it\n", "pmst = mist.PlotHistMST()\n", "pmst.read_mst(hist_r, label='Randoms')\n", "pmst.read_mst(hist_lf, label='Levy Flight')\n", "pmst.read_mst(hist_alf, label='Adjusted Levy Flight')\n", "pmst.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot difference subplots:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pmst = mist.PlotHistMST()\n", "pmst.read_mst(hist_lf, label='Levy Flight')\n", "pmst.read_mst(hist_alf, label='Adjusted Levy Flight')\n", "pmst.plot(usecomp=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally plotting the histogram for a group works in the very same way except we\n", "pass the dictionary of a group. The final plot has 1 sigma error bars." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmst = mist.HistMST()\n", "hmst.setup(uselog=True)\n", "\n", "hist_lf = hmst.get_hist(d_lf, l_lf, b_lf, s_lf)\n", "\n", "hmst.start_group()\n", "\n", "for i in range(0, 10):\n", "\n", " x_alf, y_alf, z_alf = mist.get_adjusted_levy_flight(50000)\n", "\n", " mst = mist.GetMST(x=x_alf, y=y_alf, z=z_alf)\n", " d_alf, l_alf, b_alf, s_alf = mst.get_stats()\n", "\n", " _hist_alf = hmst.get_hist(d_alf, l_alf, b_alf, s_alf)\n", " \n", " print(i+1, '/ 10')\n", "\n", "hist_alf_group = hmst.end_group()\n", "\n", "pmst = mist.PlotHistMST()\n", "pmst.read_mst(hist_lf, label='Levy Flight')\n", "pmst.read_mst(hist_alf_group, label='Adjusted Levy Flight')\n", "pmst.plot(usecomp=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (3) _Advanced Usage_\n", "\n", "\n", "### (3.1) _k Nearest Neighbours_\n", "\n", "The k-nearest neighbour graph is a spanning graph which is passed on to the\n", "``scipy`` kruskal algorithm. The actual graph is constructed using the ``scikit-learn``\n", "``kneighbors_graph`` and by default will include the nearest 20 neighbours to\n", "each node. We can specify the number of nearest neighbours (we will set this to 30)\n", "in the following way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(1000)\n", "y = np.random.random_sample(1000)\n", "\n", "mst = mist.GetMST(x=x, y=y) # Assuming our input data set is 2D.\n", "mst.define_k_neighbours(30)\n", "d, l, b, s = mst.get_stats()\n", "\n", "# or directly:\n", "\n", "mst = mist.GetMST(x=x, y=y) # Assuming our input data set is 2D.\n", "d, l, b, s = mst.get_stats(k_neighbours=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: changing ``k`` to larger values will result in longer computation time to construct\n", "the MST.\n", "\n", "### (3.2) _Scale Cuts_\n", "\n", "In cosmological data sets we often need to remove small scales due to numerical\n", "simulation or observational limitations. To remove this we carry out\n", "the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.random_sample(100000)\n", "y = np.random.random_sample(100000)\n", "\n", "mst = mist.GetMST(x=x, y=y)\n", "mst.scale_cut(0.002)\n", "d, l, b, s = mst.get_stats()\n", "\n", "# or directly:\n", "\n", "mst = mist.GetMST(x=x, y=y)\n", "d, l, b, s = mst.get_stats(scale_cut_length=0.002)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 2 }