{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CSNAnalysis Tutorial\n", "### A brief introduction to the use of the CSNAnalysis package\n", "---\n", "**Updated Aug 19, 2020**\n", "*Dickson Lab, Michigan State University*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "The CSNAnalysis package is a set of tools for network-based analysis of molecular dynamics trajectories.\n", " It provides a simple interface between enhanced sampling algorithms\n", " (e.g. WExplore implemented in `wepy`), molecular clustering programs (e.g. `MSMBuilder`), graph analysis packages (e.g. `networkx`) and graph visualization programs (e.g. `Gephi`).\n", "\n", "### What are conformation space networks?\n", "\n", "A conformation space network (CSN) is a visualization of a free energy landscape, where each node is a cluster of molecular conformations, and the edges show which conformations can directly interconvert during a molecular dynamics simulation. A CSN can be thought of as a visual representation of a transition matrix, where the nodes represent the row / column indices and the edges show the non-zero off-diagonal elements. `CSNAnalysis` offers a concise set of tools for the creation, analysis and visualization of CSNs.\n", "\n", "**This tutorial will give quick examples for the following use cases:**\n", "\n", "1. Initializing CSN objects from count matrices\n", "2. Trimming CSNs\n", "3. Obtaining steady-state weights from a transition matrix\n", "   * By eigenvalue\n", "   * By iterative multiplication\n", "4. Computing committor probabilities to an arbitrary set of basins\n", "5. 
Exporting gexf files for visualization with the Gephi program" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started\n", "\n", "Clone the CSNAnalysis repository:\n", "\n", "```\n", "git clone https://github.com/ADicksonLab/CSNAnalysis.git\n", "```\n", "\n", "Navigate to the repository directory and install using pip:\n", "\n", "```\n", "cd CSNAnalysis\n", "pip install --user -e .\n", "```\n", "\n", "Go to the examples directory and open this notebook (`examples.ipynb`):\n", "\n", "```\n", "cd examples; jupyter notebook\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dependencies\n", "\n", "I highly recommend using Anaconda and working in a `python3` environment. CSNAnalysis uses the packages `numpy`, `scipy` and `networkx`. If these are installed, the following lines of code should run without error:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import networkx as nx\n", "import scipy.sparse  # also binds `scipy`; needed for scipy.sparse.load_npz below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `CSNAnalysis` was installed (i.e. added to your `sys.path`), then this should also work:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from csnanalysis.csn import CSN\n", "from csnanalysis.matrix import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook also uses `matplotlib` to visualize output." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! 
Now let's load in the count matrix that we'll use for all the examples here:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "count_mat = scipy.sparse.load_npz('matrix.npz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background: Sparse matrices" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "It's worth knowing a little about sparse matrices before we start. If we have a huge $N$ by $N$ matrix, where $N > 1000$, but most of the elements are zero, it is more efficient to store the data as a sparse matrix." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scipy.sparse.coo.coo_matrix" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(count_mat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`coo_matrix` refers to \"coordinate format\", where the matrix is essentially a set of lists of matrix \"coordinates\" (rows, columns) and data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0 382.0\n", "0 651 2.0\n", "0 909 2.0\n", "0 920 2.0\n", "0 1363 1.0\n", "0 1445 2.0\n", "0 2021 5.0\n", "0 2022 7.0\n", "0 2085 4.0\n", "0 2131 1.0\n" ] } ], "source": [ "rows = count_mat.row\n", "cols = count_mat.col\n", "data = count_mat.data\n", "\n", "for r,c,d in zip(rows[0:10],cols[0:10],data[0:10]):\n", " print(r,c,d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it can be treated like a normal matrix ($4000$ by $4000$ in this case):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4000, 4000)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_mat.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It only needs to store non-zero 
elements, which are much fewer than $4000^2$:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "44163" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**OK, let's get started building a Conformation Space Network!**\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1) Initializing CSN objects from count matrices\n", "\n", "To get started we need a count matrix, which can be a `numpy` array, or a `scipy.sparse` matrix, or a list of lists:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "our_csn = CSN(count_mat,symmetrize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any of the `CSNAnalysis` functions can be queried using \"?\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "CSN?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `our_csn` object now holds three different representations of our data. 
The original counts can now be found in `scipy.sparse` format:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<4000x4000 sparse matrix of type ''\n", "\twith 62280 stored elements in COOrdinate format>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.countmat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A transition matrix has been computed from this count matrix by normalizing each column: \n", "\\begin{equation}\n", "t_{ij} = \\frac{c_{ij}}{\\sum_k c_{kj}}\n", "\\end{equation}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<4000x4000 sparse matrix of type ''\n", "\twith 62280 stored elements in COOrdinate format>" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.transmat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where the elements in each column sum to one:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[1., 1., 1., ..., 1., 1., 1.]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.transmat.sum(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, the data has been stored in a `networkx` directed graph:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which holds the nodes and edges of our CSN and can be passed to other `networkx` functions. 
For example, we can calculate the shortest path between nodes 0 and 10:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1445, 2125, 2043, 247, 1780, 10]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nx.shortest_path(our_csn.graph,0,10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2) Trimming CSNs\n", "\n", "A big benefit of coupling the count matrix, transition matrix and graph representations is that elements can be \"trimmed\" from all three simultaneously. The `trim` function will eliminate nodes that are not connected to the main component (by inflow, outflow, or both), and can also eliminate nodes that do not meet a minimum count requirement:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "our_csn.trim(by_inflow=True, by_outflow=True, min_count=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trimmed graph, count matrix and transition matrix are stored as `our_csn.trim_graph`, `our_csn.trim_countmat` and `our_csn.trim_transmat`, respectively." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2282" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.trim_graph.number_of_nodes()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2282, 2282)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.trim_countmat.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2282, 2282)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.trim_transmat.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3) Obtaining steady-state weights from the transition matrix\n", "\n", "Now that we've ensured that our transition matrix is fully-connected, we can compute its equilibrium weights. This is implemented in two ways.\n", "\n", "First, we can compute the eigenvector of the transition matrix with eigenvalue one:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "wt_eig = our_csn.calc_eig_weights()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can exhibit some instability, especially for low-weight states, so we can also calculate weights by iterative multiplication of the transition matrix, which can take a little longer:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "wt_mult = our_csn.calc_mult_weights()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.scatter(wt_eig,wt_mult)\n", "plt.plot([0,wt_mult.max()],[0,wt_mult.max()],'r-')\n", "plt.xlabel(\"Eigenvalue weight\")\n", "plt.ylabel(\"Mult weight\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These weights are automatically added as attributes to the nodes in `our_csn.graph`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'label': 0,\n", " 'count': 482,\n", " 'trim': 0.0,\n", " 'eig_weights': 0.002595528367725156,\n", " 'mult_weights': 0.0025955283677248217}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.graph.node[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4) Committor probabilities to an arbitrary set of basins\n", "\n", "We are often doing simulations in the presence of one or more high probability \"basins\" of attraction. When there more than one basin, it can be useful to find the probability that a simulation started in a given state will visit (or \"commit to\") a given basin before the others.\n", "\n", "`CSNAnalysis` calculates committor probabilities by creating a sink matrix ($S$), where each column in the transition matrix that corresponds to a sink state is replaced by an identity vector. This turns each state into a \"black hole\" where probability can get in, but not out. \n", "\n", "By iteratively multiplying this matrix by itself, we can approximate $S^\\infty$. The elements of this matrix reveal the probability of transitioning to any of the sink states, upon starting in any non-sink state, $i$.\n", "\n", "Let's see this in action. We'll start by reading in a set of three basins: $A$, $B$ and $U$." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "Astates = [2031,596,1923,3223,2715]\n", "Bstates = [1550,3168,476,1616,2590]\n", "Ustates = list(np.loadtxt('state_U.dat',dtype=int))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then use the `calc_committors` function to calculate committors for this set of three basins. This will calculate $p_A$, $p_B$, and $p_U$ for each state, which sum to one." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "basins = [Astates,Bstates,Ustates]\n", "labels = ['pA','pB','pU']\n", "comms = our_csn.calc_committors(basins,labels=labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The committors can be interpreted as follows:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "comms[0] = [0.26406217 0.29477873 0.44115911]\n", "\n", "In other words, if you start in state 0:\n", "You will reach basin A first with probability 0.26, basin B with probability 0.29 and basin U with probability 0.44\n" ] } ], "source": [ "i = our_csn.trim_indices[0]\n", "print('comms['+str(i)+'] = ',comms[i])\n", "print('\\nIn other words, if you start in state {0:d}:'.format(i))\n", "print('You will reach basin A first with probability {0:.2f}, basin B with probability {1:.2f} and basin U with probability {2:.2f}'.format(comms[i,0],comms[i,1],comms[i,2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5) Exporting graph for visualization in Gephi\n", "\n", "`NetworkX` is great for doing graph-based analyses, but not stellar at creating graph layouts for large(r) networks. However, it does have excellent built-in support for exporting graph objects in a variety of formats. 
\n", "\n", "Here we'll use the `.gexf` format to save our network, as well as all of the attributes we've calculated, to a file that can be read into [Gephi](https://gephi.org/), a powerful graph visualization program. While support for Gephi has been spotty in the recent past, it is still one of the best available options for graph visualization.\n", "\n", "Before exporting to `.gexf`, let's use the committors we've calculated to add colors to the nodes:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "rgb = our_csn.colors_from_committors(comms)\n", "our_csn.set_colors(rgb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have added some properties to our nodes under 'viz', which will be interpreted by Gephi:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'label': 0,\n", " 'count': 482,\n", " 'trim': 0.0,\n", " 'eig_weights': 0.002595528367725156,\n", " 'mult_weights': 0.0025955283677248217,\n", " 'pA': 0.26406216543613925,\n", " 'pB': 0.2947787254045238,\n", " 'pU': 0.4411591091593356,\n", " 'viz': {'color': {'r': 152, 'g': 170, 'b': 255, 'a': 0}}}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "our_csn.graph.nodes[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then use the `networkx` GEXF writer to write all of this to a `.gexf` file:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "nx.readwrite.gexf.write_gexf(our_csn.graph.to_undirected(),'test.gexf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After opening this file in Gephi, I recommend creating a layout using the \"Force Atlas 2\" algorithm in the layout panel. 
I set the node sizes to the \"eig_weights\" variable, and after exporting to PDF and adding some labels, I get the following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Gephi graph export](committor_net_3state.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**That's the end of our tutorial!** I hope you enjoyed it and find `CSNAnalysis` useful in your research. If you have difficulties installing or running the software, feel free to create an [issue on the GitHub page](https://github.com/ADicksonLab/CSNAnalysis)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }