{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Source of the materials**: Biopython cookbook (adapted)\n", "Status: Draft" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cluster analysis\n", "================\n", "\n", "Cluster analysis is the grouping of items into clusters based on the\n", "similarity of the items to each other. In bioinformatics, clustering is\n", "widely used in gene expression data analysis to find groups of genes\n", "with similar gene expression profiles. This may identify functionally\n", "related genes, as well as suggest the function of presently unknown\n", "genes.\n", "\n", "The Biopython module `Bio.Cluster` provides commonly used clustering\n", "algorithms and was designed with the application to gene expression data\n", "in mind. However, this module can also be used for cluster analysis of\n", "other types of data. `Bio.Cluster` and the underlying C Clustering\n", "Library is described by De Hoon *et al.* @dehoon2004.\n", "\n", "The following four clustering approaches are implemented in\n", "`Bio.Cluster`:\n", "\n", "- Hierarchical clustering (pairwise centroid-, single-, complete-, and\n", " average-linkage);\n", "\n", "- $k$-means, $k$-medians, and $k$-medoids clustering;\n", "\n", "- Self-Organizing Maps;\n", "\n", "- Principal Component Analysis.\n", "\n", "### Data representation {#data-representation .unnumbered}\n", "\n", "The data to be clustered are represented by a $n \\times m$ Numerical\n", "Python array `data`. Within the context of gene expression data\n", "clustering, typically the rows correspond to different genes whereas the\n", "columns correspond to different experimental conditions. The clustering\n", "algorithms in `Bio.Cluster` can be applied both to rows (genes) and to\n", "columns (experiments).\n", "\n", "### Missing values {#missing-values .unnumbered}\n", "\n", "Often in microarray experiments, some of the data values are missing,\n", "which is indicated by an additional $n \\times m$ Numerical Python\n", "integer array `mask`. If `mask[i,j]==0`, then `data[i,j]` is missing and\n", "is ignored in the analysis.\n", "\n", "### Random number generator {#random-number-generator .unnumbered}\n", "\n", "The $k$-means/medians/medoids clustering algorithms and Self-Organizing\n", "Maps (SOMs) include the use of a random number generator. The uniform\n", "random number generator in `Bio.Cluster` is based on the algorithm by\n", "L’Ecuyer @lecuyer1988, while random numbers following the binomial\n", "distribution are generated using the BTPE algorithm by Kachitvichyanukul\n", "and Schmeiser @kachitvichyanukul1988. The random number generator is\n", "initialized automatically during its first call. As this random number\n", "generator uses a combination of two multiplicative linear congruential\n", "generators, two (integer) seeds are needed for initialization, for which\n", "we use the system-supplied random number generator `rand` (in the C\n", "standard library). We initialize this generator by calling `srand` with\n", "the epoch time in seconds, and use the first two random numbers\n", "generated by `rand` as seeds for the uniform random number generator in\n", "`Bio.Cluster`.\n", "\n", "Distance functions {#sec:distancefunctions}\n", "------------------\n", "\n", "In order to cluster items into groups based on their similarity, we\n", "should first define what exactly we mean by *similar*. `Bio.Cluster`\n", "provides eight distance functions, indicated by a single character, to\n", "measure similarity, or conversely, distance:\n", "\n", "- `'e'`: Euclidean distance;\n", "\n", "- `'b'`: City-block distance.\n", "\n", "- `'c'`: Pearson correlation coefficient;\n", "\n", "- `'a'`: Absolute value of the Pearson correlation coefficient;\n", "\n", "- `'u'`: Uncentered Pearson correlation (equivalent to the cosine of\n", " the angle between two data vectors);\n", "\n", "- `'x'`: Absolute uncentered Pearson correlation;\n", "\n", "- `'s'`: Spearman’s rank correlation;\n", "\n", "- `'k'`: Kendall’s $\\tau$.\n", "\n", "The first two are true distance functions that satisfy the triangle\n", "inequality:\n", "$$d\\left(\\underline{u},\\underline{v}\\right) \\leq d\\left(\\underline{u},\\underline{w}\\right) + d\\left(\\underline{w},\\underline{v}\\right) \\textrm{ for all } \\underline{u}, \\underline{v}, \\underline{w},$$\n", "and are therefore refered to as *metrics*. In everyday language, this\n", "means that the shortest distance between two points is a straight line.\n", "\n", "The remaining six distance measures are related to the correlation\n", "coefficient, where the distance $d$ is defined in terms of the\n", "correlation $r$ by $d=1-r$. Note that these distance functions are\n", "*semi-metrics* that do not satisfy the triangle inequality. For example,\n", "for $$\\underline{u}=\\left(1,0,-1\\right);$$\n", "$$\\underline{v}=\\left(1,1,0\\right);$$\n", "$$\\underline{w}=\\left(0,1,1\\right);$$ we find a Pearson distance\n", "$d\\left(\\underline{u},\\underline{w}\\right) = 1.8660$, while\n", "$d\\left(\\underline{u},\\underline{v}\\right)+d\\left(\\underline{v},\\underline{w}\\right) = 1.6340$.\n", "\n", "### Euclidean distance {#euclidean-distance .unnumbered}\n", "\n", "In `Bio.Cluster`, we define the Euclidean distance as\n", "$$d = {1 \\over n} \\sum_{i=1}^{n} \\left(x_i-y_i\\right)^{2}.$$ Only those\n", "terms are included in the summation for which both $x_i$ and $y_i$ are\n", "present, and the denominator $n$ is chosen accordingly. As the\n", "expression data $x_i$ and $y_i$ are subtracted directly from each other,\n", "we should make sure that the expression data are properly normalized\n", "when using the Euclidean distance.\n", "\n", "### City-block distance {#city-block-distance .unnumbered}\n", "\n", "The city-block distance, alternatively known as the Manhattan distance,\n", "is related to the Euclidean distance. Whereas the Euclidean distance\n", "corresponds to the length of the shortest path between two points, the\n", "city-block distance is the sum of distances along each dimension. As\n", "gene expression data tend to have missing values, in `Bio.Cluster` we\n", "define the city-block distance as the sum of distances divided by the\n", "number of dimensions:\n", "$$d = {1 \\over n} \\sum_{i=1}^n \\left|x_i-y_i\\right|.$$ This is equal to\n", "the distance you would have to walk between two points in a city, where\n", "you have to walk along city blocks. As for the Euclidean distance, the\n", "expression data are subtracted directly from each other, and we should\n", "therefore make sure that they are properly normalized.\n", "\n", "### The Pearson correlation coefficient {#the-pearson-correlation-coefficient .unnumbered}\n", "\n", "The Pearson correlation coefficient is defined as\n", "$$r = \\frac{1}{n} \\sum_{i=1}^n \\left( \\frac{x_i -\\bar{x}}{\\sigma_x} \\right) \\left(\\frac{y_i -\\bar{y}}{\\sigma_y} \\right),$$\n", "in which $\\bar{x}, \\bar{y}$ are the sample mean of $x$ and $y$\n", "respectively, and $\\sigma_x, \\sigma_y$ are the sample standard deviation\n", "of $x$ and $y$. The Pearson correlation coefficient is a measure for how\n", "well a straight line can be fitted to a scatterplot of $x$ and $y$. If\n", "all the points in the scatterplot lie on a straight line, the Pearson\n", "correlation coefficient is either +1 or -1, depending on whether the\n", "slope of line is positive or negative. If the Pearson correlation\n", "coefficient is equal to zero, there is no correlation between $x$ and\n", "$y$.\n", "\n", "The *Pearson distance* is then defined as\n", "$$d_{\\textrm{P}} \\equiv 1 - r.$$ As the Pearson correlation coefficient\n", "lies between -1 and 1, the Pearson distance lies between 0 and 2.\n", "\n", "### Absolute Pearson correlation {#absolute-pearson-correlation .unnumbered}\n", "\n", "By taking the absolute value of the Pearson correlation, we find a\n", "number between 0 and 1. If the absolute value is 1, all the points in\n", "the scatter plot lie on a straight line with either a positive or a\n", "negative slope. If the absolute value is equal to zero, there is no\n", "correlation between $x$ and $y$.\n", "\n", "The corresponding distance is defined as\n", "$$d_{\\textrm A} \\equiv 1 - \\left|r\\right|,$$ where $r$ is the Pearson\n", "correlation coefficient. As the absolute value of the Pearson\n", "correlation coefficient lies between 0 and 1, the corresponding distance\n", "lies between 0 and 1 as well.\n", "\n", "In the context of gene expression experiments, the absolute correlation\n", "is equal to 1 if the gene expression profiles of two genes are either\n", "exactly the same or exactly opposite. The absolute correlation\n", "coefficient should therefore be used with care.\n", "\n", "### Uncentered correlation (cosine of the angle) {#uncentered-correlation-cosine-of-the-angle .unnumbered}\n", "\n", "In some cases, it may be preferable to use the *uncentered correlation*\n", "instead of the regular Pearson correlation coefficient. The uncentered\n", "correlation is defined as\n", "$$r_{\\textrm U} = \\frac{1}{n} \\sum_{i=1}^{n} \\left(\\frac{x_i}{\\sigma_x^{(0)}} \\right) \\left(\\frac{y_i}{\\sigma_y^{(0)}} \\right),$$\n", "where $$\\begin{aligned}\n", "\\sigma_x^{(0)} & = & \\sqrt{{\\frac{1}{n}} \\sum_{i=1}^{n}x_i^2}; \\nonumber \\\\\n", "\\sigma_y^{(0)} & = & \\sqrt{{\\frac{1}{n}} \\sum_{i=1}^{n}y_i^2}. \\nonumber\\end{aligned}$$\n", "This is the same expression as for the regular Pearson correlation\n", "coefficient, except that the sample means $\\bar{x}, \\bar{y}$ are set\n", "equal to zero. The uncentered correlation may be appropriate if there is\n", "a zero reference state. For instance, in the case of gene expression\n", "data given in terms of log-ratios, a log-ratio equal to zero corresponds\n", "to the green and red signal being equal, which means that the\n", "experimental manipulation did not affect the gene expression.\n", "\n", "The distance corresponding to the uncentered correlation coefficient is\n", "defined as $$d_{\\mbox{U}} \\equiv 1 - r_{\\mbox{U}},$$ where\n", "$r_{\\mbox{U}}$ is the uncentered correlation. As the uncentered\n", "correlation coefficient lies between -1 and 1, the corresponding\n", "distance lies between 0 and 2.\n", "\n", "The uncentered correlation is equal to the cosine of the angle of the\n", "two data vectors in $n$-dimensional space, and is often referred to as\n", "such.\n", "\n", "### Absolute uncentered correlation {#absolute-uncentered-correlation .unnumbered}\n", "\n", "As for the regular Pearson correlation, we can define a distance measure\n", "using the absolute value of the uncentered correlation:\n", "$$d_{\\mbox{AU}} \\equiv 1 - \\left|r_{\\mbox{U}}\\right|,$$ where\n", "$r_{\\mbox{U}}$ is the uncentered correlation coefficient. As the\n", "absolute value of the uncentered correlation coefficient lies between 0\n", "and 1, the corresponding distance lies between 0 and 1 as well.\n", "\n", "Geometrically, the absolute value of the uncentered correlation is equal\n", "to the cosine between the supporting lines of the two data vectors\n", "(i.e., the angle without taking the direction of the vectors into\n", "consideration).\n", "\n", "### Spearman rank correlation {#spearman-rank-correlation .unnumbered}\n", "\n", "The Spearman rank correlation is an example of a non-parametric\n", "similarity measure, and tends to be more robust against outliers than\n", "the Pearson correlation.\n", "\n", "To calculate the Spearman rank correlation, we replace each data value\n", "by their rank if we would order the data in each vector by their value.\n", "We then calculate the Pearson correlation between the two rank vectors\n", "instead of the data vectors.\n", "\n", "As in the case of the Pearson correlation, we can define a distance\n", "measure corresponding to the Spearman rank correlation as\n", "$$d_{\\mbox{S}} \\equiv 1 - r_{\\mbox{S}},$$ where $r_{\\mbox{S}}$ is the\n", "Spearman rank correlation.\n", "\n", "### Kendall’s $\\tau$ {#kendalls-tau .unnumbered}\n", "\n", "Kendall’s $\\tau$ is another example of a non-parametric similarity\n", "measure. It is similar to the Spearman rank correlation, but instead of\n", "the ranks themselves only the relative ranks are used to calculate\n", "$\\tau$ (see Snedecor & Cochran @snedecor1989).\n", "\n", "We can define a distance measure corresponding to Kendall’s $\\tau$ as\n", "$$d_{\\mbox{K}} \\equiv 1 - \\tau.$$ As Kendall’s $\\tau$ is always between\n", "-1 and 1, the corresponding distance will be between 0 and 2.\n", "\n", "### Weighting {#weighting .unnumbered}\n", "\n", "For most of the distance functions available in `Bio.Cluster`, a weight\n", "vector can be applied. The weight vector contains weights for the items\n", "in the data vector. If the weight for item $i$ is $w_i$, then that item\n", "is treated as if it occurred $w_i$ times in the data. The weight do not\n", "have to be integers. For the Spearman rank correlation and Kendall’s\n", "$\\tau$, weights do not have a well-defined meaning and are therefore not\n", "implemented.\n", "\n", "### Calculating the distance matrix {#subsec:distancematrix .unnumbered}\n", "\n", "The distance matrix is a square matrix with all pairwise distances\n", "between the items in `data`, and can be calculated by the function\n", "`distancematrix` in the `Bio.Cluster` module:\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mdistancematrix\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mmatrix\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdistancematrix\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import distancematrix\n", "matrix = distancematrix(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data` (required)\\\n", " Array containing the data for the items.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `weight` (default: `None`)\\\n", " The weights to be used when calculating distances. If\n", " `weight==None`, then equal weights are assumed.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if the distances between the rows of `data` are to be\n", " calculated (`transpose==0`), or between the columns of `data`\n", " (`transpose==1`).\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "To save memory, the distance matrix is returned as a list of 1D arrays.\n", "The number of columns in each row is equal to the row number. Hence, the\n", "first row has zero elements. An example of the return value is\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "[array([]),\n", " array([1.]),\n", " array([7., 3.]),\n", " array([4., 2., 6.])]\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This corresponds to the distance matrix $$\\left(\n", "\\begin{array}{cccc}\n", "0 & 1 & 7 & 4 \\\\\n", "1 & 0 & 3 & 2 \\\\\n", "7 & 3 & 0 & 6 \\\\\n", "4 & 2 & 6 & 0\n", "\\end{array}\n", "\\right).$$\n", "\n", "Calculating cluster properties\n", "------------------------------\n", "\n", "### Calculating the cluster centroids {#subsec:clustercentroids .unnumbered}\n", "\n", "The centroid of a cluster can be defined either as the mean or as the\n", "median of each dimension over all cluster items. The function\n", "`clustercentroids` in `Bio.Cluster` can be used to calculate either:\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mclustercentroids\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mcdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcmask\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mclustercentroids\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import clustercentroids\n", "cdata, cmask = clustercentroids(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data` (required)\\\n", " Array containing the data for the items.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `clusterid` (default: `None`)\\\n", " Vector of integers showing to which cluster each item belongs. If\n", " `clusterid` is `None`, then all items are assumed to belong to the\n", " same cluster.\n", "\n", "- `method` (default: `'a'`)\\\n", " Specifies whether the arithmetic mean (`method=='a'`) or the median\n", " (`method=='m'`) is used to calculate the cluster center.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if the centroids of the rows of `data` are to be\n", " calculated (`transpose==0`), or the centroids of the columns of\n", " `data` (`transpose==1`).\n", "\n", "This function returns the tuple `(cdata, cmask)`. The centroid data are\n", "stored in the 2D Numerical Python array `cdata`, with missing data\n", "indicated by the 2D Numerical Python integer array `cmask`. The\n", "dimensions of these arrays are\n", "$\\left(\\textrm{number of clusters}, \\textrm{number of columns}\\right)$\n", "if `transpose` is `0`, or\n", "$\\left(\\textrm{number of rows}, \\textrm{number of clusters}\\right)$ if\n", "`transpose` is `1`. Each row (if `transpose` is `0`) or column (if\n", "`transpose` is `1`) contains the averaged data corresponding to the\n", "centroid of each cluster.\n", "\n", "### Calculating the distance between clusters {#calculating-the-distance-between-clusters .unnumbered}\n", "\n", "Given a distance function between *items*, we can define the distance\n", "between two *clusters* in several ways. The distance between the\n", "arithmetic means of the two clusters is used in pairwise\n", "centroid-linkage clustering and in $k$-means clustering. In $k$-medoids\n", "clustering, the distance between the medians of the two clusters is used\n", "instead. The shortest pairwise distance between items of the two\n", "clusters is used in pairwise single-linkage clustering, while the\n", "longest pairwise distance is used in pairwise maximum-linkage\n", "clustering. In pairwise average-linkage clustering, the distance between\n", "two clusters is defined as the average over the pairwise distances.\n", "\n", "To calculate the distance between two clusters, use\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mclusterdistance\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mdistance\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mclusterdistance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import clusterdistance\n", "distance = clusterdistance(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data` (required)\\\n", " Array containing the data for the items.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `weight` (default: `None`)\\\n", " The weights to be used when calculating distances. If\n", " `weight==None`, then equal weights are assumed.\n", "\n", "- `index1` (default: `0`)\\\n", " A list containing the indices of the items belonging to the\n", " first cluster. A cluster containing only one item $i$ can be\n", " represented either as a list `[i]`, or as an integer `i`.\n", "\n", "- `index2` (default: `0`)\\\n", " A list containing the indices of the items belonging to the\n", " second cluster. A cluster containing only one items $i$ can be\n", " represented either as a list `[i]`, or as an integer `i`.\n", "\n", "- `method` (default: `'a'`)\\\n", " Specifies how the distance between clusters is defined:\n", "\n", " - `'a'`: Distance between the two cluster centroids (arithmetic\n", " mean);\n", "\n", " - `'m'`: Distance between the two cluster centroids (median);\n", "\n", " - `'s'`: Shortest pairwise distance between items in the two\n", " clusters;\n", "\n", " - `'x'`: Longest pairwise distance between items in the two\n", " clusters;\n", "\n", " - `'v'`: Average over the pairwise distances between items in the\n", " two clusters.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "- `transpose` (default: `0`)\\\n", " If `transpose==0`, calculate the distance between the rows of\n", " `data`. If `transpose==1`, calculate the distance between the\n", " columns of `data`.\n", "\n", "Partitioning algorithms\n", "-----------------------\n", "\n", "Partitioning algorithms divide items into $k$ clusters such that the sum\n", "of distances over the items to their cluster centers is minimal. The\n", "number of clusters $k$ is specified by the user. Three partitioning\n", "algorithms are available in `Bio.Cluster`:\n", "\n", "- $k$-means clustering\n", "\n", "- $k$-medians clustering\n", "\n", "- $k$-medoids clustering\n", "\n", "These algorithms differ in how the cluster center is defined. In\n", "$k$-means clustering, the cluster center is defined as the mean data\n", "vector averaged over all items in the cluster. Instead of the mean, in\n", "$k$-medians clustering the median is calculated for each dimension in\n", "the data vector. Finally, in $k$-medoids clustering the cluster center\n", "is defined as the item which has the smallest sum of distances to the\n", "other items in the cluster. This clustering algorithm is suitable for\n", "cases in which the distance matrix is known but the original data matrix\n", "is not available, for example when clustering proteins based on their\n", "structural similarity.\n", "\n", "The expectation-maximization (EM) algorithm is used to find this\n", "partitioning into $k$ groups. In the initialization of the EM algorithm,\n", "we randomly assign items to clusters. To ensure that no empty clusters\n", "are produced, we use the binomial distribution to randomly choose the\n", "number of items in each cluster to be one or more. We then randomly\n", "permute the cluster assignments to items such that each item has an\n", "equal probability to be in any cluster. Each cluster is thus guaranteed\n", "to contain at least one item.\n", "\n", "We then iterate:\n", "\n", "- Calculate the centroid of each cluster, defined as either the mean,\n", " the median, or the medoid of the cluster;\n", "\n", "- Calculate the distances of each item to the cluster centers;\n", "\n", "- For each item, determine which cluster centroid is closest;\n", "\n", "- Reassign each item to its closest cluster, or stop the iteration if\n", " no further item reassignments take place.\n", "\n", "To avoid clusters becoming empty during the iteration, in $k$-means and\n", "$k$-medians clustering the algorithm keeps track of the number of items\n", "in each cluster, and prohibits the last remaining item in a cluster from\n", "being reassigned to a different cluster. For $k$-medoids clustering,\n", "such a check is not needed, as the item that functions as the cluster\n", "centroid has a zero distance to itself, and will therefore never be\n", "closer to a different cluster.\n", "\n", "As the initial assignment of items to clusters is done randomly, usually\n", "a different clustering solution is found each time the EM algorithm is\n", "executed. To find the optimal clustering solution, the $k$-means\n", "algorithm is repeated many times, each time starting from a different\n", "initial random clustering. The sum of distances of the items to their\n", "cluster center is saved for each run, and the solution with the smallest\n", "value of this sum will be returned as the overall clustering solution.\n", "\n", "How often the EM algorithm should be run depends on the number of items\n", "being clustered. As a rule of thumb, we can consider how often the\n", "optimal solution was found; this number is returned by the partitioning\n", "algorithms as implemented in this library. If the optimal solution was\n", "found many times, it is unlikely that better solutions exist than the\n", "one that was found. However, if the optimal solution was found only\n", "once, there may well be other solutions with a smaller within-cluster\n", "sum of distances. If the number of items is large (more than several\n", "hundreds), it may be difficult to find the globally optimal solution.\n", "\n", "The EM algorithm terminates when no further reassignments take place. We\n", "noticed that for some sets of initial cluster assignments, the EM\n", "algorithm fails to converge due to the same clustering solution\n", "reappearing periodically after a small number of iteration steps. We\n", "therefore check for the occurrence of such periodic solutions during the\n", "iteration. After a given number of iteration steps, the current\n", "clustering result is saved as a reference. By comparing the clustering\n", "result after each subsequent iteration step to the reference state, we\n", "can determine if a previously encountered clustering result is found. In\n", "such a case, the iteration is halted. If after a given number of\n", "iterations the reference state has not yet been encountered, the current\n", "clustering solution is saved to be used as the new reference state.\n", "Initially, ten iteration steps are executed before resaving the\n", "reference state. This number of iteration steps is doubled each time, to\n", "ensure that periodic behavior with longer periods can also be detected.\n", "\n", "### $k$-means and $k$-medians {#k-means-and-k-medians .unnumbered}\n", "\n", "The $k$-means and $k$-medians algorithms are implemented as the function\n", "`kcluster` in `Bio.Cluster`:\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mkcluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclusterid\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnfound\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mkcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import kcluster\n", "clusterid, error, nfound = kcluster(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data` (required)\\\n", " Array containing the data for the items.\n", "\n", "- `nclusters` (default: `2`)\\\n", " The number of clusters $k$.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `weight` (default: `None`)\\\n", " The weights to be used when calculating distances. If\n", " `weight==None`, then equal weights are assumed.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose` is `0`) or columns (`transpose` is\n", " `1`) are to be clustered.\n", "\n", "- `npass` (default: `1`)\\\n", " The number of times the $k$-means/-medians clustering algorithm is\n", " performed, each time with a different (random) initial condition. If\n", " `initialid` is given, the value of `npass` is ignored and the\n", " clustering algorithm is run only once, as it behaves\n", " deterministically in that case.\n", "\n", "- `method` (default: `a`)\\\n", " describes how the center of a cluster is found:\n", "\n", " - `method=='a'`: arithmetic mean ($k$-means clustering);\n", "\n", " - `method=='m'`: median ($k$-medians clustering).\n", "\n", " For other values of `method`, the arithmetic mean is used.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]). Whereas all eight distance measures\n", " are accepted by `kcluster`, from a theoretical viewpoint it is best\n", " to use the Euclidean distance for the $k$-means algorithm, and the\n", " city-block distance for $k$-medians.\n", "\n", "- `initialid` (default: `None`)\\\n", " Specifies the initial clustering to be used for the EM algorithm. If\n", " `initialid==None`, then a different random initial clustering is\n", " used for each of the `npass` runs of the EM algorithm. If\n", " `initialid` is not `None`, then it should be equal to a 1D array\n", " containing the cluster number (between `0` and `nclusters-1`) for\n", " each item. Each cluster should contain at least one item. With the\n", " initial clustering specified, the EM algorithm is deterministic.\n", "\n", "This function returns a tuple `(clusterid, error, nfound)`, where\n", "`clusterid` is an integer array containing the number of the cluster to\n", "which each row or cluster was assigned, `error` is the within-cluster\n", "sum of distances for the optimal clustering solution, and `nfound` is\n", "the number of times this optimal solution was found.\n", "\n", "### $k$-medoids clustering {#k-medoids-clustering .unnumbered}\n", "\n", "The `kmedoids` routine performs $k$-medoids clustering on a given set of\n", "items, using the distance matrix and the number of clusters passed by\n", "the user:\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'distance' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mkmedoids\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclusterid\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnfound\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mkmedoids\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdistance\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'distance' is not defined" ] } ], "source": [ "from Bio.Cluster import kmedoids\n", "clusterid, error, nfound = kmedoids(distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined: , nclusters=2, npass=1,\n", "initialid=None)|\n", "\n", "- `distance` (required)\\\n", " The matrix containing the distances between the items; this matrix\n", " can be specified in three ways:\n", "\n", " - as a 2D Numerical Python array (in which only the left-lower\n", " part of the array will be accessed):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = array([[0.0, 1.1, 2.3],\n", " [1.1, 0.0, 4.5],\n", " [2.3, 4.5, 0.0]])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " - as a 1D Numerical Python array containing consecutively the\n", " distances in the left-lower part of the distance matrix:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = array([1.1, 2.3, 4.5])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " - as a list containing the rows of the left-lower part of the\n", " distance matrix:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = [array([]|,\n", " array([1.1]),\n", " array([2.3, 4.5])\n", " ]\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " These three expressions correspond to the same distance matrix.\n", "\n", "- `nclusters` (default: `2`)\\\n", " The number of clusters $k$.\n", "\n", "- `npass` (default: `1`)\\\n", " The number of times the $k$-medoids clustering algorithm is\n", " performed, each time with a different (random) initial condition. If\n", " `initialid` is given, the value of `npass` is ignored, as the\n", " clustering algorithm behaves deterministically in that case.\n", "\n", "- `initialid` (default: `None`)\\\n", " Specifies the initial clustering to be used for the EM algorithm. If\n", " `initialid==None`, then a different random initial clustering is\n", " used for each of the `npass` runs of the EM algorithm. If\n", " `initialid` is not `None`, then it should be equal to a 1D array\n", " containing the cluster number (between `0` and `nclusters-1`) for\n", " each item. Each cluster should contain at least one item. With the\n", " initial clustering specified, the EM algorithm is deterministic.\n", "\n", "This function returns a tuple `(clusterid, error, nfound)`, where\n", "`clusterid` is an array containing the number of the cluster to which\n", "each item was assigned, `error` is the within-cluster sum of distances\n", "for the optimal $k$-medoids clustering solution, and `nfound` is the\n", "number of times the optimal solution was found. Note that the cluster\n", "number in `clusterid` is defined as the item number of the item\n", "representing the cluster centroid.\n", "\n", "Hierarchical clustering\n", "-----------------------\n", "\n", "Hierarchical clustering methods are inherently different from the\n", "$k$-means clustering method. In hierarchical clustering, the similarity\n", "in the expression profile between genes or experimental conditions are\n", "represented in the form of a tree structure. This tree structure can be\n", "shown graphically by programs such as Treeview and Java Treeview, which\n", "has contributed to the popularity of hierarchical clustering in the\n", "analysis of gene expression data.\n", "\n", "The first step in hierarchical clustering is to calculate the distance\n", "matrix, specifying all the distances between the items to be clustered.\n", "Next, we create a node by joining the two closest items. Subsequent\n", "nodes are created by pairwise joining of items or nodes based on the\n", "distance between them, until all items belong to the same node. A tree\n", "structure can then be created by retracing which items and nodes were\n", "merged. Unlike the EM algorithm, which is used in $k$-means clustering,\n", "the complete process of hierarchical clustering is deterministic.\n", "\n", "Several flavors of hierarchical clustering exist, which differ in how\n", "the distance between subnodes is defined in terms of their members. In\n", "`Bio.Cluster`, pairwise single, maximum, average, and centroid linkage\n", "are available.\n", "\n", "- In pairwise single-linkage clustering, the distance between two\n", " nodes is defined as the shortest distance among the pairwise\n", " distances between the members of the two nodes.\n", "\n", "- In pairwise maximum-linkage clustering, alternatively known as\n", " pairwise complete-linkage clustering, the distance between two nodes\n", " is defined as the longest distance among the pairwise distances\n", " between the members of the two nodes.\n", "\n", "- In pairwise average-linkage clustering, the distance between two\n", " nodes is defined as the average over all pairwise distances between\n", " the items of the two nodes.\n", "\n", "- In pairwise centroid-linkage clustering, the distance between two\n", " nodes is defined as the distance between their centroids. The\n", " centroids are calculated by taking the mean over all the items in\n", " a cluster. As the distance from each newly formed node to existing\n", " nodes and items need to be calculated at each step, the computing\n", " time of pairwise centroid-linkage clustering may be significantly\n", " longer than for the other hierarchical clustering methods. Another\n", " peculiarity is that (for a distance measure based on the Pearson\n", " correlation), the distances do not necessarily increase when going\n", " up in the clustering tree, and may even decrease. This is caused by\n", " an inconsistency between the centroid calculation and the distance\n", " calculation when using the Pearson correlation: Whereas the Pearson\n", " correlation effectively normalizes the data for the distance\n", " calculation, no such normalization occurs for the\n", " centroid calculation.\n", "\n", "For pairwise single-, complete-, and average-linkage clustering, the\n", "distance between two nodes can be found directly from the distances\n", "between the individual items. Therefore, the clustering algorithm does\n", "not need access to the original gene expression data, once the distance\n", "matrix is known. For pairwise centroid-linkage clustering, however, the\n", "centroids of newly formed subnodes can only be calculated from the\n", "original data and not from the distance matrix.\n", "\n", "The implementation of pairwise single-linkage hierarchical clustering is\n", "based on the SLINK algorithm (R. Sibson, 1973), which is much faster and\n", "more memory-efficient than a straightforward implementation of pairwise\n", "single-linkage clustering. The clustering result produced by this\n", "algorithm is identical to the clustering solution found by the\n", "conventional single-linkage algorithm. The single-linkage hierarchical\n", "clustering algorithm implemented in this library can be used to cluster\n", "large gene expression data sets, for which conventional hierarchical\n", "clustering algorithms fail due to excessive memory requirements and\n", "running time.\n", "\n", "### Representing a hierarchical clustering solution {#representing-a-hierarchical-clustering-solution .unnumbered}\n", "\n", "The result of hierarchical clustering consists of a tree of nodes, in\n", "which each node joins two items or subnodes. Usually, we are not only\n", "interested in which items or subnodes are joined at each node, but also\n", "in their similarity (or distance) as they are joined. To store one node\n", "in the hierarchical clustering tree, we make use of the class `Node`,\n", "which defined in `Bio.Cluster`. An instance of `Node` has three\n", "attributes:\n", "\n", "- `left`\n", "\n", "- `right`\n", "\n", "- `distance`\n", "\n", "Here, `left` and `right` are integers referring to the two items or\n", "subnodes that are joined at this node, and `distance` is the distance\n", "between them. The items being clustered are numbered from 0 to\n", "$\\left(\\textrm{number of items} - 1\\right)$, while clusters are numbered\n", "from -1 to $-\\left(\\textrm{number of items}-1\\right)$. Note that the\n", "number of nodes is one less than the number of items.\n", "\n", "To create a new `Node` object, we need to specify `left` and `right`;\n", "`distance` is optional.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2, 3): 0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio.Cluster import Node\n", "Node(2, 3)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2, 3): 0.91" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Node(2, 3, 0.91)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The attributes `left`, `right`, and `distance` of an existing `Node`\n", "object can be modified directly:\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(6, 2): 0.73" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "node = Node(4, 5)\n", "node.left = 6\n", "node.right = 2\n", "node.distance = 0.73\n", "node" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "An error is raised if `left` and `right` are not integers, or if\n", "`distance` cannot be converted to a floating-point value.\n", "\n", "The Python class `Tree` represents a full hierarchical clustering\n", "solution. A `Tree` object can be created from a list of `Node` objects:\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 2): 0.2\n", "(0, 3): 0.5\n", "(-2, 4): 0.6\n", "(-1, -3): 0.9\n" ] } ], "source": [ "from Bio.Cluster import Node, Tree\n", "nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]\n", "tree = Tree(nodes)\n", "print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The `Tree` initializer checks if the list of nodes is a valid\n", "hierarchical clustering result:\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This tree is problematic\n" ] } ], "source": [ "nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]\n", "try:\n", " Tree(nodes)\n", " raise Exception(\"Should not arrive here\")\n", "except ValueError:\n", " print(\"This tree is problematic\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Individual nodes in a `Tree` object can be accessed using square\n", "brackets:\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(1, 2): 0.2" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]\n", "tree = Tree(nodes)\n", "tree[0]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0, -1): 0.5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree[1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0, -1): 0.5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As a `Tree` object is read-only, we cannot change individual nodes in a\n", "`Tree` object. However, we can convert the tree to a list of nodes,\n", "modify this list, and create a new tree from this list:\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 2): 0.1\n", "(0, -1): 0.5\n", "(-2, 3): 0.9\n" ] } ], "source": [ "tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])\n", "print(tree)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "ename": "TypeError", "evalue": "sequence index must be integer, not 'slice'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnodes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtree\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mnodes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mNode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m0.2\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[0mnodes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mleft\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m2\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mtree\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mTree\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnodes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtree\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: sequence index must be integer, not 'slice'" ] } ], "source": [ "nodes = tree[:]\n", "nodes[0] = Node(0, 1, 0.2)\n", "nodes[1].left = 2\n", "tree = Tree(nodes)\n", "print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This guarantees that any `Tree` object is always well-formed.\n", "\n", "To display a hierarchical clustering solution with visualization\n", "programs such as Java Treeview, it is better to scale all node distances\n", "such that they are between zero and one. This can be accomplished by\n", "calling the `scale` method on an existing `Tree` object:\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tree.scale()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This method takes no arguments, and returns `None`.\n", "\n", "After hierarchical clustering, the items can be grouped into $k$\n", "clusters based on the tree structure stored in the `Tree` object by\n", "cutting the tree:\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "ename": "TypeError", "evalue": "cut() takes no keyword arguments", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mclusterid\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtree\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcut\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnclusters\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mTypeError\u001b[0m: cut() takes no keyword arguments" ] } ], "source": [ "clusterid = tree.cut(nclusters=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where `nclusters` (defaulting to `1`) is the desired number of clusters\n", "$k$. This method ignores the top $k-1$ linking events in the tree\n", "structure, resulting in $k$ separated clusters of items. The number of\n", "clusters $k$ should be positive, and less than or equal to the number of\n", "items. This method returns an array `clusterid` containing the number of\n", "the cluster to which each item is assigned.\n", "\n", "### Performing hierarchical clustering {#performing-hierarchical-clustering .unnumbered}\n", "\n", "To perform hierarchical clustering, use the `treecluster` function in\n", "`Bio.Cluster`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mtreecluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mtree\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtreecluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import treecluster\n", "tree = treecluster(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data`\\\n", " Array containing the data for the items.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `weight` (default: `None`)\\\n", " The weights to be used when calculating distances. If\n", " `weight==None`, then equal weights are assumed.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose==0`) or columns (`transpose==1`) are\n", " to be clustered.\n", "\n", "- `method` (default: `'m'`)\\\n", " defines the linkage method to be used:\n", "\n", " - `method=='s'`: pairwise single-linkage clustering\n", "\n", " - `method=='m'`: pairwise maximum- (or complete-) linkage\n", " clustering\n", "\n", " - `method=='c'`: pairwise centroid-linkage clustering\n", "\n", " - `method=='a'`: pairwise average-linkage clustering\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "To apply hierarchical clustering on a precalculated distance matrix,\n", "specify the `distancematrix` argument when calling `treecluster`\n", "function instead of the `data` argument:\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'distance' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mtreecluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mtree\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtreecluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdistancematrix\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdistance\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'distance' is not defined" ] } ], "source": [ "from Bio.Cluster import treecluster\n", "tree = treecluster(distancematrix=distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In this case, the following arguments are defined:\n", "\n", "- `distancematrix`\\\n", " The distance matrix, which can be specified in three ways:\n", "\n", " - as a 2D Numerical Python array (in which only the left-lower\n", " part of the array will be accessed):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = array([[0.0, 1.1, 2.3],\n", " [1.1, 0.0, 4.5],\n", " [2.3, 4.5, 0.0]])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " - as a 1D Numerical Python array containing consecutively the\n", " distances in the left-lower part of the distance matrix:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = array([1.1, 2.3, 4.5])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " - as a list containing the rows of the left-lower part of the\n", " distance matrix:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "distance = [array([]),\n", " array([1.1]),\n", " array([2.3, 4.5])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " These three expressions correspond to the same distance matrix. As\n", " `treecluster` may shuffle the values in the distance matrix as part\n", " of the clustering algorithm, be sure to save this array in a\n", " different variable before calling `treecluster` if you need\n", " it later.\n", "\n", "- `method`\\\n", " The linkage method to be used:\n", "\n", " - `method=='s'`: pairwise single-linkage clustering\n", "\n", " - `method=='m'`: pairwise maximum- (or complete-) linkage\n", " clustering\n", "\n", " - `method=='a'`: pairwise average-linkage clustering\n", "\n", " While pairwise single-, maximum-, and average-linkage clustering can\n", " be calculated from the distance matrix alone, pairwise\n", " centroid-linkage cannot.\n", "\n", "When calling `treecluster`, either `data` or `distancematrix` should be\n", "`None`.\n", "\n", "This function returns a `Tree` object. This object contains\n", "$\\left(\\textrm{number of items} - 1\\right)$ nodes, where the number of\n", "items is the number of rows if rows were clustered, or the number of\n", "columns if columns were clustered. Each node describes a pairwise\n", "linking event, where the node attributes `left` and `right` each contain\n", "the number of one item or subnode, and `distance` the distance between\n", "them. Items are numbered from 0 to\n", "$\\left(\\textrm{number of items} - 1\\right)$, while clusters are numbered\n", "-1 to $-\\left(\\textrm{number of items}-1\\right)$.\n", "\n", "Self-Organizing Maps\n", "--------------------\n", "\n", "Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural\n", "networks (see for instance Kohonen, 1997 @kohonen1997). Tamayo (1999)\n", "first applied Self-Organizing Maps to gene expression data @tamayo1999.\n", "\n", "SOMs organize items into clusters that are situated in some topology.\n", "Usually a rectangular topology is chosen. The clusters generated by SOMs\n", "are such that neighboring clusters in the topology are more similar to\n", "each other than clusters far from each other in the topology.\n", "\n", "The first step to calculate a SOM is to randomly assign a data vector to\n", "each cluster in the topology. If rows are being clustered, then the\n", "number of elements in each data vector is equal to the number of\n", "columns.\n", "\n", "An SOM is then generated by taking rows one at a time, and finding which\n", "cluster in the topology has the closest data vector. The data vector of\n", "that cluster, as well as those of the neighboring clusters, are adjusted\n", "using the data vector of the row under consideration. The adjustment is\n", "given by\n", "$$\\Delta \\underline{x}_{\\textrm{cell}} = \\tau \\cdot \\left(\\underline{x}_{\\textrm{row}} - \\underline{x}_{\\textrm{cell}} \\right).$$\n", "The parameter $\\tau$ is a parameter that decreases at each iteration\n", "step. We have used a simple linear function of the iteration step:\n", "$$\\tau = \\tau_{\\textrm{init}} \\cdot \\left(1 - {i \\over n}\\right),$$\n", "$\\tau_{\\textrm{init}}$ is the initial value of $\\tau$ as specified by\n", "the user, $i$ is the number of the current iteration step, and $n$ is\n", "the total number of iteration steps to be performed. While changes are\n", "made rapidly in the beginning of the iteration, at the end of iteration\n", "only small changes are made.\n", "\n", "All clusters within a radius $R$ are adjusted to the gene under\n", "consideration. This radius decreases as the calculation progresses as\n", "$$R = R_{\\textrm{max}} \\cdot \\left(1 - {i \\over n}\\right),$$ in which\n", "the maximum radius is defined as\n", "$$R_{\\textrm{max}} = \\sqrt{N_x^2 + N_y^2},$$ where\n", "$\\left(N_x, N_y\\right)$ are the dimensions of the rectangle defining the\n", "topology.\n", "\n", "The function `somcluster` implements the complete algorithm to calculate\n", "a Self-Organizing Map on a rectangular grid. First it initializes the\n", "random number generator. The node data are then initialized using the\n", "random number generator. The order in which genes or microarrays are\n", "used to modify the SOM is also randomized. The total number of\n", "iterations in the SOM algorithm is specified by the user.\n", "\n", "To run `somcluster`, use\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0msomcluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclusterid\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcelldata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msomcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import somcluster\n", "clusterid, celldata = somcluster(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `data` (required)\\\n", " Array containing the data for the items.\n", "\n", "- `mask` (default: `None`)\\\n", " Array of integers showing which data are missing. If `mask[i,j]==0`,\n", " then `data[i,j]` is missing. If `mask==None`, then all data\n", " are present.\n", "\n", "- `weight` (default: `None`)\\\n", " contains the weights to be used when calculating distances. If\n", " `weight==None`, then equal weights are assumed.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose` is `0`) or columns (`transpose` is\n", " `1`) are to be clustered.\n", "\n", "- `nxgrid, nygrid` (default: `2, 1`)\\\n", " The number of cells horizontally and vertically in the rectangular\n", " grid on which the Self-Organizing Map is calculated.\n", "\n", "- `inittau` (default: `0.02`)\\\n", " The initial value for the parameter $\\tau$ that is used in the\n", " SOM algorithm. The default value for `inittau` is 0.02, which was\n", " used in Michael Eisen’s Cluster/TreeView program.\n", "\n", "- `niter` (default: `1`)\\\n", " The number of iterations to be performed.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "This function returns the tuple `(clusterid, celldata)`:\n", "\n", "- `clusterid`:\\\n", " An array with two columns, where the number of rows is equal to the\n", " number of items that were clustered. Each row contains the $x$ and\n", " $y$ coordinates of the cell in the rectangular SOM grid to which the\n", " item was assigned.\n", "\n", "- `celldata`:\\\n", " An array with dimensions\n", " $\\left(\\verb|nxgrid|, \\verb|nygrid|, \\textrm{number of columns}\\right)$\n", " if rows are being clustered, or\n", " $\\left(\\verb|nxgrid|, \\verb|nygrid|, \\textrm{number of rows}\\right)$\n", " if columns are being clustered. Each element `[ix][iy]` of this\n", " array is a 1D vector containing the gene expression data for the\n", " centroid of the cluster in the grid cell with coordinates\n", " `[ix][iy]`.\n", "\n", "Principal Component Analysis\n", "----------------------------\n", "\n", "Principal Component Analysis (PCA) is a widely used technique for\n", "analyzing multivariate data. A practical example of applying Principal\n", "Component Analysis to gene expression data is presented by Yeung and\n", "Ruzzo (2001) @yeung2001.\n", "\n", "In essence, PCA is a coordinate transformation in which each row in the\n", "data matrix is written as a linear sum over basis vectors called\n", "principal components, which are ordered and chosen such that each\n", "maximally explains the remaining variance in the data vectors. For\n", "example, an $n \\times 3$ data matrix can be represented as an\n", "ellipsoidal cloud of $n$ points in three dimensional space. The first\n", "principal component is the longest axis of the ellipsoid, the second\n", "principal component the second longest axis of the ellipsoid, and the\n", "third principal component is the shortest axis. Each row in the data\n", "matrix can be reconstructed as a suitable linear combination of the\n", "principal components. However, in order to reduce the dimensionality of\n", "the data, usually only the most important principal components are\n", "retained. The remaining variance present in the data is then regarded as\n", "unexplained variance.\n", "\n", "The principal components can be found by calculating the eigenvectors of\n", "the covariance matrix of the data. The corresponding eigenvalues\n", "determine how much of the variance present in the data is explained by\n", "each principal component.\n", "\n", "Before applying principal component analysis, typically the mean is\n", "subtracted from each column in the data matrix. In the example above,\n", "this effectively centers the ellipsoidal cloud around its centroid in 3D\n", "space, with the principal components describing the variation of points\n", "in the ellipsoidal cloud with respect to their centroid.\n", "\n", "The function `pca` below first uses the singular value decomposition to\n", "calculate the eigenvalues and eigenvectors of the data matrix. The\n", "singular value decomposition is implemented as a translation in C of the\n", "Algol procedure `svd` @golub1971, which uses Householder\n", "bidiagonalization and a variant of the QR algorithm. The principal\n", "components, the coordinates of each data vector along the principal\n", "components, and the eigenvalues corresponding to the principal\n", "components are then evaluated and returned in decreasing order of the\n", "magnitude of the eigenvalue. If data centering is desired, the mean\n", "should be subtracted from each column in the data matrix before calling\n", "the `pca` routine.\n", "\n", "To apply Principal Component Analysis to a rectangular matrix `data`,\n", "use\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCluster\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mpca\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mcolumnmean\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcoordinates\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcomponents\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0meigenvalues\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpca\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" ] } ], "source": [ "from Bio.Cluster import pca\n", "columnmean, coordinates, components, eigenvalues = pca(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This function returns a tuple\n", "`columnmean, coordinates, components, eigenvalues`:\n", "\n", "- `columnmean`\\\n", " Array containing the mean over each column in `data`.\n", "\n", "- `coordinates`\\\n", " The coordinates of each row in `data` with respect to the\n", " principal components.\n", "\n", "- `components`\\\n", " The principal components.\n", "\n", "- `eigenvalues`\\\n", " The eigenvalues corresponding to each of the principal components.\n", "\n", "The original matrix `data` can be recreated by calculating\n", "`columnmean + dot(coordinates, components)`.\n", "\n", "Handling Cluster/TreeView-type files\n", "------------------------------------\n", "\n", "Cluster/TreeView are GUI-based codes for clustering gene expression\n", "data. They were originally written by [Michael\n", "Eisen](http://rana.lbl.gov) while at Stanford University. `Bio.Cluster`\n", "contains functions for reading and writing data files that correspond to\n", "the format specified for Cluster/TreeView. In particular, by saving a\n", "clustering result in that format, TreeView can be used to visualize the\n", "clustering results. We recommend using Alok Saldanha’s\n", "Java TreeView program,\n", "which can display hierarchical as well as $k$-means clustering results.\n", "\n", "An object of the class `Record` contains all information stored in a\n", "Cluster/TreeView-type data file. To store the information contained in\n", "the data file in a `Record` object, we first open the file and then read\n", "it:\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'mydatafile.txt'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mhandle\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"mydatafile.txt\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mrecord\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mhandle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'mydatafile.txt'" ] } ], "source": [ "from Bio import Cluster\n", "handle = open(\"mydatafile.txt\")\n", "record = Cluster.read(handle)\n", "handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This two-step process gives you some flexibility in the source of the\n", "data. For example, you can use\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'mydatafile.txt.gz'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mgzip\u001b[0m \u001b[1;31m# Python standard library\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mhandle\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgzip\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"mydatafile.txt.gz\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/home/tiago_antao/miniconda/lib/python3.5/gzip.py\u001b[0m in \u001b[0;36mopen\u001b[1;34m(filename, mode, compresslevel, encoding, errors, newline)\u001b[0m\n\u001b[0;32m 51\u001b[0m \u001b[0mgz_mode\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmode\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreplace\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"t\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 52\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mstr\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbytes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 53\u001b[1;33m \u001b[0mbinary_file\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mGzipFile\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mgz_mode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcompresslevel\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 54\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"read\"\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"write\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 55\u001b[0m \u001b[0mbinary_file\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mGzipFile\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mgz_mode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcompresslevel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfilename\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/home/tiago_antao/miniconda/lib/python3.5/gzip.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, filename, mode, compresslevel, fileobj, mtime)\u001b[0m\n\u001b[0;32m 161\u001b[0m \u001b[0mmode\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;34m'b'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 162\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfileobj\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 163\u001b[1;33m \u001b[0mfileobj\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmyfileobj\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbuiltins\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmode\u001b[0m \u001b[1;32mor\u001b[0m \u001b[1;34m'rb'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 164\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfilename\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 165\u001b[0m \u001b[0mfilename\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfileobj\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'name'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m''\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'mydatafile.txt.gz'" ] } ], "source": [ "import gzip # Python standard library\n", "handle = gzip.open(\"mydatafile.txt.gz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "to open a gzipped file, or\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "ename": "AttributeError", "evalue": "module 'urllib' has no attribute 'urlopen'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0murllib\u001b[0m \u001b[1;31m# Python standard library\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mhandle\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0murllib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"http://somewhere.org/mydatafile.txt\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mAttributeError\u001b[0m: module 'urllib' has no attribute 'urlopen'" ] } ], "source": [ "import urllib # Python standard library\n", "handle = urllib.urlopen(\"http://somewhere.org/mydatafile.txt\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "to open a file stored on the Internet before calling `read`.\n", "\n", "The `read` command reads the tab-delimited text file `mydatafile.txt`\n", "containing gene expression data in the format specified for Michael\n", "Eisen’s Cluster/TreeView program. For a description of this file format,\n", "see the manual to Cluster/TreeView. It is available at [Michael Eisen’s\n", "lab website](http://rana.lbl.gov/manuals/ClusterTreeView.pdf) and at\n", "[our\n", "website](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster3.pdf).\n", "\n", "A `Record` object has the following attributes:\n", "\n", "- `data`\\\n", " The data array containing the gene expression data. Genes are stored\n", " row-wise, while microarrays are stored column-wise.\n", "\n", "- `mask`\\\n", " This array shows which elements in the `data` array, if any,\n", " are missing. If `mask[i,j]==0`, then `data[i,j]` is missing. If no\n", " data were found to be missing, `mask` is set to `None`.\n", "\n", "- `geneid`\\\n", " This is a list containing a unique description for each gene (i.e.,\n", " ORF numbers).\n", "\n", "- `genename`\\\n", " This is a list containing a description for each gene (i.e.,\n", " gene name). If not present in the data file, `genename` is set to\n", " `None`.\n", "\n", "- `gweight`\\\n", " The weights that are to be used to calculate the distance in\n", " expression profile between genes. If not present in the data file,\n", " `gweight` is set to `None`.\n", "\n", "- `gorder`\\\n", " The preferred order in which genes should be stored in an\n", " output file. If not present in the data file, `gorder` is set to\n", " `None`.\n", "\n", "- `expid`\\\n", " This is a list containing a description of each microarray, e.g.\n", " experimental condition.\n", "\n", "- `eweight`\\\n", " The weights that are to be used to calculate the distance in\n", " expression profile between microarrays. If not present in the data\n", " file, `eweight` is set to `None`.\n", "\n", "- `eorder`\\\n", " The preferred order in which microarrays should be stored in an\n", " output file. If not present in the data file, `eorder` is set to\n", " `None`.\n", "\n", "- `uniqid`\\\n", " The string that was used instead of UNIQID in the data file.\n", "\n", "After loading a `Record` object, each of these attributes can be\n", "accessed and modified directly. For example, the data can be\n", "log-transformed by taking the logarithm of `record.data`.\n", "\n", "### Calculating the distance matrix {#calculating-the-distance-matrix .unnumbered}\n", "\n", "To calculate the distance matrix between the items stored in the record,\n", "use\n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mmatrix\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdistancematrix\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "matrix = record.distancematrix()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if the distances between the rows of `data` are to be\n", " calculated (`transpose==0`), or between the columns of `data`\n", " (`transpose==1`).\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "This function returns the distance matrix as a list of rows, where the\n", "number of columns of each row is equal to the row number (see section\n", "\\[subsec:distancematrix\\]).\n", "\n", "### Calculating the cluster centroids {#calculating-the-cluster-centroids .unnumbered}\n", "\n", "To calculate the centroids of clusters of items stored in the record,\n", "use\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mcdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcmask\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclustercentroids\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "cdata, cmask = record.clustercentroids()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- `clusterid` (default: `None`)\\\n", " Vector of integers showing to which cluster each item belongs. If\n", " `clusterid` is not given, then all items are assumed to belong to\n", " the same cluster.\n", "\n", "- `method` (default: `'a'`)\\\n", " Specifies whether the arithmetic mean (`method=='a'`) or the median\n", " (`method=='m'`) is used to calculate the cluster center.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if the centroids of the rows of `data` are to be\n", " calculated (`transpose==0`), or the centroids of the columns of\n", " `data` (`transpose==1`).\n", "\n", "This function returns the tuple `cdata, cmask`; see section\n", "\\[subsec:clustercentroids\\] for a description.\n", "\n", "### Calculating the distance between clusters {#calculating-the-distance-between-clusters-1 .unnumbered}\n", "\n", "To calculate the distance between clusters of items stored in the\n", "record, use\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdistance\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclusterdistance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "distance = record.clusterdistance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `index1` (default: `0`)\\\n", " A list containing the indices of the items belonging to the\n", " first cluster. A cluster containing only one item $i$ can be\n", " represented either as a list `[i]`, or as an integer `i`.\n", "\n", "- `index2` (default: `0`)\\\n", " A list containing the indices of the items belonging to the\n", " second cluster. A cluster containing only one item $i$ can be\n", " represented either as a list `[i]`, or as an integer `i`.\n", "\n", "- `method` (default: `'a'`)\\\n", " Specifies how the distance between clusters is defined:\n", "\n", " - `'a'`: Distance between the two cluster centroids (arithmetic\n", " mean);\n", "\n", " - `'m'`: Distance between the two cluster centroids (median);\n", "\n", " - `'s'`: Shortest pairwise distance between items in the two\n", " clusters;\n", "\n", " - `'x'`: Longest pairwise distance between items in the two\n", " clusters;\n", "\n", " - `'v'`: Average over the pairwise distances between items in the\n", " two clusters.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "- `transpose` (default: `0`)\\\n", " If `transpose==0`, calculate the distance between the rows of\n", " `data`. If `transpose==1`, calculate the distance between the\n", " columns of `data`.\n", "\n", "### Performing hierarchical clustering {#performing-hierarchical-clustering-1 .unnumbered}\n", "\n", "To perform hierarchical clustering on the items stored in the record,\n", "use\n", "\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mtree\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtreecluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "tree = record.treecluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose==0`) or columns (`transpose==1`) are\n", " to be clustered.\n", "\n", "- `method` (default: `'m'`)\\\n", " defines the linkage method to be used:\n", "\n", " - `method=='s'`: pairwise single-linkage clustering\n", "\n", " - `method=='m'`: pairwise maximum- (or complete-) linkage\n", " clustering\n", "\n", " - `method=='c'`: pairwise centroid-linkage clustering\n", "\n", " - `method=='a'`: pairwise average-linkage clustering\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "- `transpose`\\\n", " Determines if genes or microarrays are being clustered. If\n", " `transpose==0`, genes (rows) are being clustered. If `transpose==1`,\n", " microarrays (columns) are clustered.\n", "\n", "This function returns a `Tree` object. This object contains\n", "$\\left(\\textrm{number of items} - 1\\right)$ nodes, where the number of\n", "items is the number of rows if rows were clustered, or the number of\n", "columns if columns were clustered. Each node describes a pairwise\n", "linking event, where the node attributes `left` and `right` each contain\n", "the number of one item or subnode, and `distance` the distance between\n", "them. Items are numbered from 0 to\n", "$\\left(\\textrm{number of items} - 1\\right)$, while clusters are numbered\n", "-1 to $-\\left(\\textrm{number of items}-1\\right)$.\n", "\n", "### Performing $k$-means or $k$-medians clustering {#performing-k-means-or-k-medians-clustering .unnumbered}\n", "\n", "To perform $k$-means or $k$-medians clustering on the items stored in\n", "the record, use\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mclusterid\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnfound\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mkcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "clusterid, error, nfound = record.kcluster()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `nclusters` (default: `2`)\\\n", " The number of clusters $k$.\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose` is `0`) or columns (`transpose` is\n", " `1`) are to be clustered.\n", "\n", "- `npass` (default: `1`)\\\n", " The number of times the $k$-means/-medians clustering algorithm is\n", " performed, each time with a different (random) initial condition. If\n", " `initialid` is given, the value of `npass` is ignored and the\n", " clustering algorithm is run only once, as it behaves\n", " deterministically in that case.\n", "\n", "- `method` (default: `a`)\\\n", " describes how the center of a cluster is found:\n", "\n", " - `method=='a'`: arithmetic mean ($k$-means clustering);\n", "\n", " - `method=='m'`: median ($k$-medians clustering).\n", "\n", " For other values of `method`, the arithmetic mean is used.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "This function returns a tuple `(clusterid, error, nfound)`, where\n", "`clusterid` is an integer array containing the number of the cluster to\n", "which each row or cluster was assigned, `error` is the within-cluster\n", "sum of distances for the optimal clustering solution, and `nfound` is\n", "the number of times this optimal solution was found.\n", "\n", "### Calculating a Self-Organizing Map {#calculating-a-self-organizing-map .unnumbered}\n", "\n", "To calculate a Self-Organizing Map of the items stored in the record,\n", "use\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mclusterid\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcelldata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msomcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "clusterid, celldata = record.somcluster()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `transpose` (default: `0`)\\\n", " Determines if rows (`transpose` is `0`) or columns (`transpose` is\n", " `1`) are to be clustered.\n", "\n", "- `nxgrid, nygrid` (default: `2, 1`)\\\n", " The number of cells horizontally and vertically in the rectangular\n", " grid on which the Self-Organizing Map is calculated.\n", "\n", "- `inittau` (default: `0.02`)\\\n", " The initial value for the parameter $\\tau$ that is used in the\n", " SOM algorithm. The default value for `inittau` is 0.02, which was\n", " used in Michael Eisen’s Cluster/TreeView program.\n", "\n", "- `niter` (default: `1`)\\\n", " The number of iterations to be performed.\n", "\n", "- `dist` (default: `'e'`, Euclidean distance)\\\n", " Defines the distance function to be used\n", " (see \\[sec:distancefunctions\\]).\n", "\n", "This function returns the tuple `(clusterid, celldata)`:\n", "\n", "- `clusterid`:\\\n", " An array with two columns, where the number of rows is equal to the\n", " number of items that were clustered. Each row contains the $x$ and\n", " $y$ coordinates of the cell in the rectangular SOM grid to which the\n", " item was assigned.\n", "\n", "- `celldata`:\\\n", " An array with dimensions\n", " $\\left(\\verb|nxgrid|, \\verb|nygrid|, \\textrm{number of columns}\\right)$\n", " if rows are being clustered, or\n", " $\\left(\\verb|nxgrid|, \\verb|nygrid|, \\textrm{number of rows}\\right)$\n", " if columns are being clustered. Each element `[ix][iy]` of this\n", " array is a 1D vector containing the gene expression data for the\n", " centroid of the cluster in the grid cell with coordinates\n", " `[ix][iy]`.\n", "\n", "### Saving the clustering result {#saving-the-clustering-result .unnumbered}\n", "\n", "To save the clustering result, use\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'record' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msave\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mjobname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mgeneclusters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mexpclusters\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'record' is not defined" ] } ], "source": [ "record.save(jobname, geneclusters, expclusters)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where the following arguments are defined:\n", "\n", "- `jobname`\\\n", " The string `jobname` is used as the base name for names of the files\n", " that are to be saved.\n", "\n", "- `geneclusters`\\\n", " This argument describes the gene (row-wise) clustering result. In\n", " case of $k$-means clustering, this is a 1D array containing the\n", " number of the cluster each gene belongs to. It can be calculated\n", " using `kcluster`. In case of hierarchical clustering, `geneclusters`\n", " is a `Tree` object.\n", "\n", "- `expclusters`\\\n", " This argument describes the (column-wise) clustering result for the\n", " experimental conditions. In case of $k$-means clustering, this is a\n", " 1D array containing the number of the cluster each experimental\n", " condition belongs to. It can be calculated using `kcluster`. In case\n", " of hierarchical clustering, `expclusters` is a `Tree` object.\n", "\n", "This method writes the text file `jobname.cdt`, `jobname.gtr`,\n", "`jobname.atr`, `jobname*.kgg`, and/or `jobname*.kag` for subsequent\n", "reading by the Java TreeView program. If `geneclusters` and\n", "`expclusters` are both `None`, this method only writes the text file\n", "`jobname.cdt`; this file can subsequently be read into a new `Record`\n", "object.\n", "\n", "Example calculation\n", "-------------------\n", "\n", "This is an example of a hierarchical clustering calculation, using\n", "single linkage clustering for genes and maximum linkage clustering for\n", "experimental conditions. As the Euclidean distance is being used for\n", "gene clustering, it is necessary to scale the node distances `genetree`\n", "such that they are all between zero and one. This is needed for the Java\n", "TreeView code to display the tree diagram correctly. To cluster the\n", "experimental conditions, the uncentered correlation is being used. No\n", "scaling is needed in this case, as the distances in `exptree` are\n", "already between zero and two. The example data `cyano.txt` can be found\n", "in the `data` subdirectory.\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'cyano.txt'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mhandle\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"cyano.txt\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mrecord\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mhandle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mgenetree\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtreecluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m's'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'cyano.txt'" ] } ], "source": [ "from Bio import Cluster\n", "handle = open(\"cyano.txt\")\n", "record = Cluster.read(handle)\n", "handle.close()\n", "genetree = record.treecluster(method='s')\n", "genetree.scale()\n", "exptree = record.treecluster(dist='u', transpose=1)\n", "record.save(\"cyano_result\", genetree, exptree)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This will create the files `cyano_result.cdt`, `cyano_result.gtr`, and\n", "`cyano_result.atr`.\n", "\n", "Similarly, we can save a $k$-means clustering solution:\n", "\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'cyano.txt'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mBio\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mhandle\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"cyano.txt\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mrecord\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mCluster\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mhandle\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mhandle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mgeneclusters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mifound\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrecord\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mkcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnclusters\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnpass\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m1000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'cyano.txt'" ] } ], "source": [ "from Bio import Cluster\n", "handle = open(\"cyano.txt\")\n", "record = Cluster.read(handle)\n", "handle.close()\n", "(geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)\n", "(expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)\n", "record.save(\"cyano_result\", geneclusters, expclusters)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }