{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
ipyrad-analysis toolkit: distance
\n",
"\n",
"Genetic distance matrices are used in many contexts to study the evolutionary divergence of samples or populations. The `ipa.distance` module provides a simple and convenient framework to implement several distance based metrics. \n",
"\n",
"Key features:\n",
"\n",
"1. Filter SNPs to reduce missing data. \n",
"2. Impute missing data using population allele frequencies.\n",
"3. Calculate pairwise genetic distances between samples (e.g., p-dist, JC, HKY, Fst)\n",
"4. (coming soon) sliding window measurements along chromosomes. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### required software"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# conda install ipyrad -c conda-forge -c bioconda\n",
"# conda install ipcoal -c conda-forge"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"import ipyrad.analysis as ipa\n",
"import ipcoal\n",
"import toyplot\n",
"import toytree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Species tree model"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# generate and draw an imbalanced 5 tip tree\n",
"tree = toytree.rtree.imbtree(ntips=5, treeheight=500000)\n",
"tree.draw(ts='p');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Coalescent simulations \n",
"The SNPs output is saved to an HDF5 database file."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"wrote 1044 SNPs to /tmp/test-dist.snps.hdf5\n"
]
}
],
"source": [
"# setup a model to simulate 8 haploid samples per species\n",
"model = ipcoal.Model(tree=tree, Ne=1e4, nsamples=8)\n",
"model.sim_loci(1000, 50)\n",
"model.write_snps_to_hdf5(name=\"test-dist\", outdir=\"/tmp\", diploid=True)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# the path to the HDF5 formatted snps file\n",
"SNPS = \"/tmp/test-dist.snps.hdf5\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [optional] Build an IMAP dictionary\n",
"A dictionary mapping of population names to sample names. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'r0': ['r0-0', 'r0-1', 'r0-2', 'r0-3'],\n",
" 'r1': ['r1-0', 'r1-1', 'r1-2', 'r1-3'],\n",
" 'r2': ['r2-0', 'r2-1', 'r2-2', 'r2-3'],\n",
" 'r3': ['r3-0', 'r3-1', 'r3-2', 'r3-3'],\n",
" 'r4': ['r4-0', 'r4-1', 'r4-2', 'r4-3']}"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from itertools import groupby\n",
"\n",
"# load sample names from SNPs file\n",
"tool = ipa.snps_extracter(SNPS)\n",
"\n",
"# group names by prefix before '-'\n",
"groups = groupby(tool.names, key=lambda x: x.split(\"-\")[0])\n",
"\n",
"# arrange into a dictionary\n",
"IMAP = {i[0]: list(i[1]) for i in groups}\n",
"\n",
"# show the dict\n",
"IMAP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### calculate distances with missing values filtered and/or imputed, and *corrected*\n",
"The correction applies a model of sequence substitution where more complex models can apply a greater penalty for unobserved changes (e.g., HKY or GTR). This allows you to use either SNPs or SEQUENCES as input. Here we are using SNPs. More on this later... (TODO). "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"dist = ipa.distance(\n",
" data=SNPS, \n",
" imap=IMAP, \n",
" minmap={i: 1 for i in IMAP},\n",
" mincov=0.5,\n",
" impute_method=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Samples: 20\n",
"Sites before filtering: 1044\n",
"Filtered (indels): 0\n",
"Filtered (bi-allel): 9\n",
"Filtered (mincov): 0\n",
"Filtered (minmap): 0\n",
"Filtered (subsample invariant): 0\n",
"Filtered (combined): 9\n",
"Sites after filtering: 1035\n",
"Sites containing missing values: 0 (0.00%)\n",
"Missing values in SNP matrix: 0 (0.00%)\n",
"Imputation: 'None'; (0, 1, 2) = 100.0%, 0.0%, 0.0%\n"
]
}
],
"source": [
"# infer the distance matrix from sequence data\n",
"dist.run()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" r0-0 | \n",
" r0-1 | \n",
" r0-2 | \n",
" r0-3 | \n",
" r1-0 | \n",
" r1-1 | \n",
" r1-2 | \n",
" r1-3 | \n",
" r2-0 | \n",
" r2-1 | \n",
" r2-2 | \n",
" r2-3 | \n",
"
\n",
" \n",
" \n",
" \n",
" r0-0 | \n",
" 0.000 | \n",
" 0.033 | \n",
" 0.045 | \n",
" 0.036 | \n",
" 0.510 | \n",
" 0.499 | \n",
" 0.510 | \n",
" 0.518 | \n",
" 0.911 | \n",
" 0.915 | \n",
" 0.892 | \n",
" 0.908 | \n",
"
\n",
" \n",
" r0-1 | \n",
" 0.033 | \n",
" 0.000 | \n",
" 0.032 | \n",
" 0.024 | \n",
" 0.508 | \n",
" 0.497 | \n",
" 0.508 | \n",
" 0.516 | \n",
" 0.909 | \n",
" 0.913 | \n",
" 0.890 | \n",
" 0.906 | \n",
"
\n",
" \n",
" r0-2 | \n",
" 0.045 | \n",
" 0.032 | \n",
" 0.000 | \n",
" 0.031 | \n",
" 0.509 | \n",
" 0.498 | \n",
" 0.509 | \n",
" 0.517 | \n",
" 0.910 | \n",
" 0.914 | \n",
" 0.891 | \n",
" 0.907 | \n",
"
\n",
" \n",
" r0-3 | \n",
" 0.036 | \n",
" 0.024 | \n",
" 0.031 | \n",
" 0.000 | \n",
" 0.505 | \n",
" 0.494 | \n",
" 0.505 | \n",
" 0.513 | \n",
" 0.906 | \n",
" 0.910 | \n",
" 0.887 | \n",
" 0.903 | \n",
"
\n",
" \n",
" r1-0 | \n",
" 0.510 | \n",
" 0.508 | \n",
" 0.509 | \n",
" 0.505 | \n",
" 0.000 | \n",
" 0.043 | \n",
" 0.050 | \n",
" 0.046 | \n",
" 0.990 | \n",
" 0.994 | \n",
" 0.969 | \n",
" 0.984 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" r0-0 r0-1 r0-2 r0-3 r1-0 r1-1 r1-2 r1-3 r2-0 r2-1 \\\n",
"r0-0 0.000 0.033 0.045 0.036 0.510 0.499 0.510 0.518 0.911 0.915 \n",
"r0-1 0.033 0.000 0.032 0.024 0.508 0.497 0.508 0.516 0.909 0.913 \n",
"r0-2 0.045 0.032 0.000 0.031 0.509 0.498 0.509 0.517 0.910 0.914 \n",
"r0-3 0.036 0.024 0.031 0.000 0.505 0.494 0.505 0.513 0.906 0.910 \n",
"r1-0 0.510 0.508 0.509 0.505 0.000 0.043 0.050 0.046 0.990 0.994 \n",
"\n",
" r2-2 r2-3 \n",
"r0-0 0.892 0.908 \n",
"r0-1 0.890 0.906 \n",
"r0-2 0.891 0.907 \n",
"r0-3 0.887 0.903 \n",
"r1-0 0.969 0.984 "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show the first few data cells\n",
"dist.dists.iloc[:5, :12]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Infer a tree from distance matrix"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"ename": "ToytreeError",
"evalue": "node idx or name 6 not in tree",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mToytreeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mipa\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mneighbor_joining\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmatrix\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdist\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdists\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/home/deren/Documents/ipyrad/ipyrad/analysis/neighbor_joining.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, matrix, steps)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;31m# relabel tips\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmatrix\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 71\u001b[0;31m \t\t tree = tree.set_node_values(\n\u001b[0m\u001b[1;32m 72\u001b[0m \u001b[0mfeature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"name\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mj\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mj\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmatrix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/home/deren/miniconda3/envs/py38/lib/python3.7/site-packages/toytree/Toytree.py\u001b[0m in \u001b[0;36mset_node_values\u001b[0;34m(self, feature, values, default)\u001b[0m\n\u001b[1;32m 613\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mnidx\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mndict\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 614\u001b[0m raise ToytreeError(\n\u001b[0;32m--> 615\u001b[0;31m \"node idx or name {} not in tree\".format(nidx))\n\u001b[0m\u001b[1;32m 616\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 617\u001b[0m \u001b[0;31m# or, set everyone to a null value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mToytreeError\u001b[0m: node idx or name 6 not in tree"
]
}
],
"source": [
"tool = ipa.neighbor_joining(matrix=dist.dists)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Draw tree and distance matrix"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# create a canvas\n",
"canvas = toyplot.Canvas(width=500, height=450);\n",
"\n",
"# add tree\n",
"axes = canvas.cartesian(bounds=(\"10%\", \"35%\", \"10%\", \"90%\"))\n",
"gtree.draw(axes=axes, tip_labels=True, tip_labels_align=True)\n",
"\n",
"# add matrix\n",
"table = canvas.table(\n",
" rows=matrix.shape[0],\n",
" columns=matrix.shape[1],\n",
" margin=0,\n",
" bounds=(\"40%\", \"95%\", \"9%\", \"91%\"),\n",
")\n",
"\n",
"colormap = toyplot.color.brewer.map(\"BlueRed\")\n",
"\n",
"# apply a color to each cell in the table\n",
"for ridx in range(matrix.shape[0]):\n",
" for cidx in range(matrix.shape[1]):\n",
" cell = table.cells.cell[ridx, cidx]\n",
" cell.style = {\n",
" \"fill\": colormap.colors(matrix.iloc[ridx, cidx], 0, 1),\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" r0-0 | \n",
" r0-1 | \n",
" r1-0 | \n",
" r1-1 | \n",
" r2-0 | \n",
" r2-1 | \n",
" r3-0 | \n",
" r3-1 | \n",
" r4-0 | \n",
" r4-1 | \n",
" r5-0 | \n",
" r5-1 | \n",
" r6-0 | \n",
" r6-1 | \n",
" r7-0 | \n",
" r7-1 | \n",
"
\n",
" \n",
" \n",
" \n",
" r0-0 | \n",
" 0.000 | \n",
" 0.079 | \n",
" 0.556 | \n",
" 0.587 | \n",
" 0.873 | \n",
" 0.778 | \n",
" 0.714 | \n",
" 0.794 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.825 | \n",
" 0.778 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.746 | \n",
" 0.841 | \n",
"
\n",
" \n",
" r0-1 | \n",
" 0.079 | \n",
" 0.000 | \n",
" 0.508 | \n",
" 0.540 | \n",
" 0.825 | \n",
" 0.730 | \n",
" 0.667 | \n",
" 0.746 | \n",
" 0.683 | \n",
" 0.683 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.683 | \n",
" 0.698 | \n",
" 0.794 | \n",
"
\n",
" \n",
" r1-0 | \n",
" 0.556 | \n",
" 0.508 | \n",
" 0.000 | \n",
" 0.063 | \n",
" 0.825 | \n",
" 0.730 | \n",
" 0.667 | \n",
" 0.746 | \n",
" 0.683 | \n",
" 0.683 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.683 | \n",
" 0.698 | \n",
" 0.794 | \n",
"
\n",
" \n",
" r1-1 | \n",
" 0.587 | \n",
" 0.540 | \n",
" 0.063 | \n",
" 0.000 | \n",
" 0.857 | \n",
" 0.762 | \n",
" 0.698 | \n",
" 0.778 | \n",
" 0.714 | \n",
" 0.714 | \n",
" 0.810 | \n",
" 0.762 | \n",
" 0.762 | \n",
" 0.714 | \n",
" 0.730 | \n",
" 0.825 | \n",
"
\n",
" \n",
" r2-0 | \n",
" 0.873 | \n",
" 0.825 | \n",
" 0.825 | \n",
" 0.857 | \n",
" 0.000 | \n",
" 0.063 | \n",
" 0.476 | \n",
" 0.556 | \n",
" 0.746 | \n",
" 0.746 | \n",
" 0.841 | \n",
" 0.794 | \n",
" 0.794 | \n",
" 0.746 | \n",
" 0.762 | \n",
" 0.857 | \n",
"
\n",
" \n",
" r2-1 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.762 | \n",
" 0.063 | \n",
" 0.000 | \n",
" 0.381 | \n",
" 0.460 | \n",
" 0.651 | \n",
" 0.651 | \n",
" 0.746 | \n",
" 0.698 | \n",
" 0.698 | \n",
" 0.651 | \n",
" 0.667 | \n",
" 0.762 | \n",
"
\n",
" \n",
" r3-0 | \n",
" 0.714 | \n",
" 0.667 | \n",
" 0.667 | \n",
" 0.698 | \n",
" 0.476 | \n",
" 0.381 | \n",
" 0.000 | \n",
" 0.079 | \n",
" 0.587 | \n",
" 0.587 | \n",
" 0.683 | \n",
" 0.635 | \n",
" 0.635 | \n",
" 0.587 | \n",
" 0.603 | \n",
" 0.698 | \n",
"
\n",
" \n",
" r3-1 | \n",
" 0.794 | \n",
" 0.746 | \n",
" 0.746 | \n",
" 0.778 | \n",
" 0.556 | \n",
" 0.460 | \n",
" 0.079 | \n",
" 0.000 | \n",
" 0.667 | \n",
" 0.667 | \n",
" 0.762 | \n",
" 0.714 | \n",
" 0.714 | \n",
" 0.667 | \n",
" 0.683 | \n",
" 0.778 | \n",
"
\n",
" \n",
" r4-0 | \n",
" 0.730 | \n",
" 0.683 | \n",
" 0.683 | \n",
" 0.714 | \n",
" 0.746 | \n",
" 0.651 | \n",
" 0.587 | \n",
" 0.667 | \n",
" 0.000 | \n",
" 0.095 | \n",
" 0.317 | \n",
" 0.270 | \n",
" 0.397 | \n",
" 0.286 | \n",
" 0.492 | \n",
" 0.587 | \n",
"
\n",
" \n",
" r4-1 | \n",
" 0.730 | \n",
" 0.683 | \n",
" 0.683 | \n",
" 0.714 | \n",
" 0.746 | \n",
" 0.651 | \n",
" 0.587 | \n",
" 0.667 | \n",
" 0.095 | \n",
" 0.000 | \n",
" 0.254 | \n",
" 0.206 | \n",
" 0.397 | \n",
" 0.286 | \n",
" 0.492 | \n",
" 0.587 | \n",
"
\n",
" \n",
" r5-0 | \n",
" 0.825 | \n",
" 0.778 | \n",
" 0.778 | \n",
" 0.810 | \n",
" 0.841 | \n",
" 0.746 | \n",
" 0.683 | \n",
" 0.762 | \n",
" 0.317 | \n",
" 0.254 | \n",
" 0.000 | \n",
" 0.016 | \n",
" 0.492 | \n",
" 0.381 | \n",
" 0.587 | \n",
" 0.683 | \n",
"
\n",
" \n",
" r5-1 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.762 | \n",
" 0.794 | \n",
" 0.698 | \n",
" 0.635 | \n",
" 0.714 | \n",
" 0.270 | \n",
" 0.206 | \n",
" 0.016 | \n",
" 0.000 | \n",
" 0.444 | \n",
" 0.333 | \n",
" 0.540 | \n",
" 0.635 | \n",
"
\n",
" \n",
" r6-0 | \n",
" 0.778 | \n",
" 0.730 | \n",
" 0.730 | \n",
" 0.762 | \n",
" 0.794 | \n",
" 0.698 | \n",
" 0.635 | \n",
" 0.714 | \n",
" 0.397 | \n",
" 0.397 | \n",
" 0.492 | \n",
" 0.444 | \n",
" 0.000 | \n",
" 0.111 | \n",
" 0.667 | \n",
" 0.762 | \n",
"
\n",
" \n",
" r6-1 | \n",
" 0.730 | \n",
" 0.683 | \n",
" 0.683 | \n",
" 0.714 | \n",
" 0.746 | \n",
" 0.651 | \n",
" 0.587 | \n",
" 0.667 | \n",
" 0.286 | \n",
" 0.286 | \n",
" 0.381 | \n",
" 0.333 | \n",
" 0.111 | \n",
" 0.000 | \n",
" 0.556 | \n",
" 0.651 | \n",
"
\n",
" \n",
" r7-0 | \n",
" 0.746 | \n",
" 0.698 | \n",
" 0.698 | \n",
" 0.730 | \n",
" 0.762 | \n",
" 0.667 | \n",
" 0.603 | \n",
" 0.683 | \n",
" 0.492 | \n",
" 0.492 | \n",
" 0.587 | \n",
" 0.540 | \n",
" 0.667 | \n",
" 0.556 | \n",
" 0.000 | \n",
" 0.063 | \n",
"
\n",
" \n",
" r7-1 | \n",
" 0.841 | \n",
" 0.794 | \n",
" 0.794 | \n",
" 0.825 | \n",
" 0.857 | \n",
" 0.762 | \n",
" 0.698 | \n",
" 0.778 | \n",
" 0.587 | \n",
" 0.587 | \n",
" 0.683 | \n",
" 0.635 | \n",
" 0.762 | \n",
" 0.651 | \n",
" 0.063 | \n",
" 0.000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" r0-0 r0-1 r1-0 r1-1 r2-0 r2-1 r3-0 r3-1 r4-0 r4-1 \\\n",
"r0-0 0.000 0.079 0.556 0.587 0.873 0.778 0.714 0.794 0.730 0.730 \n",
"r0-1 0.079 0.000 0.508 0.540 0.825 0.730 0.667 0.746 0.683 0.683 \n",
"r1-0 0.556 0.508 0.000 0.063 0.825 0.730 0.667 0.746 0.683 0.683 \n",
"r1-1 0.587 0.540 0.063 0.000 0.857 0.762 0.698 0.778 0.714 0.714 \n",
"r2-0 0.873 0.825 0.825 0.857 0.000 0.063 0.476 0.556 0.746 0.746 \n",
"r2-1 0.778 0.730 0.730 0.762 0.063 0.000 0.381 0.460 0.651 0.651 \n",
"r3-0 0.714 0.667 0.667 0.698 0.476 0.381 0.000 0.079 0.587 0.587 \n",
"r3-1 0.794 0.746 0.746 0.778 0.556 0.460 0.079 0.000 0.667 0.667 \n",
"r4-0 0.730 0.683 0.683 0.714 0.746 0.651 0.587 0.667 0.000 0.095 \n",
"r4-1 0.730 0.683 0.683 0.714 0.746 0.651 0.587 0.667 0.095 0.000 \n",
"r5-0 0.825 0.778 0.778 0.810 0.841 0.746 0.683 0.762 0.317 0.254 \n",
"r5-1 0.778 0.730 0.730 0.762 0.794 0.698 0.635 0.714 0.270 0.206 \n",
"r6-0 0.778 0.730 0.730 0.762 0.794 0.698 0.635 0.714 0.397 0.397 \n",
"r6-1 0.730 0.683 0.683 0.714 0.746 0.651 0.587 0.667 0.286 0.286 \n",
"r7-0 0.746 0.698 0.698 0.730 0.762 0.667 0.603 0.683 0.492 0.492 \n",
"r7-1 0.841 0.794 0.794 0.825 0.857 0.762 0.698 0.778 0.587 0.587 \n",
"\n",
" r5-0 r5-1 r6-0 r6-1 r7-0 r7-1 \n",
"r0-0 0.825 0.778 0.778 0.730 0.746 0.841 \n",
"r0-1 0.778 0.730 0.730 0.683 0.698 0.794 \n",
"r1-0 0.778 0.730 0.730 0.683 0.698 0.794 \n",
"r1-1 0.810 0.762 0.762 0.714 0.730 0.825 \n",
"r2-0 0.841 0.794 0.794 0.746 0.762 0.857 \n",
"r2-1 0.746 0.698 0.698 0.651 0.667 0.762 \n",
"r3-0 0.683 0.635 0.635 0.587 0.603 0.698 \n",
"r3-1 0.762 0.714 0.714 0.667 0.683 0.778 \n",
"r4-0 0.317 0.270 0.397 0.286 0.492 0.587 \n",
"r4-1 0.254 0.206 0.397 0.286 0.492 0.587 \n",
"r5-0 0.000 0.016 0.492 0.381 0.587 0.683 \n",
"r5-1 0.016 0.000 0.444 0.333 0.540 0.635 \n",
"r6-0 0.492 0.444 0.000 0.111 0.667 0.762 \n",
"r6-1 0.381 0.333 0.111 0.000 0.556 0.651 \n",
"r7-0 0.587 0.540 0.667 0.556 0.000 0.063 \n",
"r7-1 0.683 0.635 0.762 0.651 0.063 0.000 "
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dist.dists"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "(0, 0)",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/home/deren/miniconda3/envs/py38/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2888\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2889\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2890\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: (0, 0)",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0mcell\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtable\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcells\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcell\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mridx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcidx\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 22\u001b[0m cell.style = {\n\u001b[0;32m---> 23\u001b[0;31m \u001b[0;34m\"fill\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mcolormap\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmatrix\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mridx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcidx\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 24\u001b[0m }\n\u001b[1;32m 25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/home/deren/miniconda3/envs/py38/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2897\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2898\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2899\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/home/deren/miniconda3/envs/py38/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2889\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2890\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2891\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2892\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2893\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: (0, 0)"
]
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"\n",
"# style the gaps between cells\n",
"table.body.gaps.columns[:] = 3\n",
"table.body.gaps.rows[:] = 3\n",
"\n",
"# hide axes coordinates\n",
"axes.show = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Samples: 29\n",
"Sites before filtering: 349914\n",
"Filtered (indels): 0\n",
"Filtered (bi-allel): 13379\n",
"Filtered (mincov): 30459\n",
"Filtered (minmap): 111825\n",
"Filtered (combined): 120177\n",
"Sites after filtering: 229737\n",
"Sites containing missing values: 219551 (95.57%)\n",
"Missing values in SNP matrix: 814369 (12.22%)\n",
"Imputation: 'sampled'; (0, 1, 2) = 77.3%, 10.7%, 12.0%\n"
]
}
],
"source": [
"# load the snp data into distance tool with arguments\n",
"dist = Distance(\n",
" data=data, \n",
" imap=imap,\n",
" minmap=minmap,\n",
" mincov=0.5,\n",
" impute_method=\"sample\",\n",
" subsample_snps=False,\n",
")\n",
"dist.run()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### save results"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# save to a CSV file\n",
"dist.dists.to_csv(\"distances.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" BJSB3 | \n",
" BJSL25 | \n",
" BJVL19 | \n",
" BZBB1 | \n",
" CRL0001 | \n",
" CRL0030 | \n",
" CUCA4 | \n",
" CUMM5 | \n",
" CUSV6 | \n",
" CUVN10 | \n",
" ... | \n",
" FLWO6 | \n",
" HNDA09 | \n",
" LALC2 | \n",
" MXED8 | \n",
" MXGT4 | \n",
" MXSA3017 | \n",
" SCCU3 | \n",
" TXGR3 | \n",
" TXMD3 | \n",
" TXWV2 | \n",
"
\n",
" \n",
" \n",
" \n",
" BJSB3 | \n",
" 0.000000 | \n",
" 0.250447 | \n",
" 0.253472 | \n",
" 0.592255 | \n",
" 0.530145 | \n",
" 0.572576 | \n",
" 0.601853 | \n",
" 0.597044 | \n",
" 0.591990 | \n",
" 0.579937 | \n",
" ... | \n",
" 0.594005 | \n",
" 0.582000 | \n",
" 0.568137 | \n",
" 0.464618 | \n",
" 0.443942 | \n",
" 0.579789 | \n",
" 0.603638 | \n",
" 0.487945 | \n",
" 0.487936 | \n",
" 0.590440 | \n",
"
\n",
" \n",
" BJSL25 | \n",
" 0.250447 | \n",
" 0.000000 | \n",
" 0.235900 | \n",
" 0.558630 | \n",
" 0.494291 | \n",
" 0.537193 | \n",
" 0.566156 | \n",
" 0.559675 | \n",
" 0.554665 | \n",
" 0.542768 | \n",
" ... | \n",
" 0.558769 | \n",
" 0.548897 | \n",
" 0.532239 | \n",
" 0.435050 | \n",
" 0.412694 | \n",
" 0.547182 | \n",
" 0.567323 | \n",
" 0.453606 | \n",
" 0.457105 | \n",
" 0.554882 | \n",
"
\n",
" \n",
" BJVL19 | \n",
" 0.253472 | \n",
" 0.235900 | \n",
" 0.000000 | \n",
" 0.567897 | \n",
" 0.502775 | \n",
" 0.547391 | \n",
" 0.576355 | \n",
" 0.569360 | \n",
" 0.563000 | \n",
" 0.554621 | \n",
" ... | \n",
" 0.564728 | \n",
" 0.555927 | \n",
" 0.539913 | \n",
" 0.441844 | \n",
" 0.417060 | \n",
" 0.556449 | \n",
" 0.575336 | \n",
" 0.464118 | \n",
" 0.465476 | \n",
" 0.562278 | \n",
"
\n",
" \n",
" BZBB1 | \n",
" 0.592255 | \n",
" 0.558630 | \n",
" 0.567897 | \n",
" 0.000000 | \n",
" 0.280691 | \n",
" 0.280569 | \n",
" 0.422670 | \n",
" 0.422962 | \n",
" 0.426266 | \n",
" 0.394242 | \n",
" ... | \n",
" 0.559152 | \n",
" 0.285883 | \n",
" 0.532918 | \n",
" 0.525701 | \n",
" 0.542381 | \n",
" 0.317450 | \n",
" 0.576455 | \n",
" 0.552554 | \n",
" 0.551579 | \n",
" 0.563571 | \n",
"
\n",
" \n",
" CRL0001 | \n",
" 0.530145 | \n",
" 0.494291 | \n",
" 0.502775 | \n",
" 0.280691 | \n",
" 0.000000 | \n",
" 0.239596 | \n",
" 0.347859 | \n",
" 0.322277 | \n",
" 0.342836 | \n",
" 0.213266 | \n",
" ... | \n",
" 0.470064 | \n",
" 0.262217 | \n",
" 0.451717 | \n",
" 0.466429 | \n",
" 0.477764 | \n",
" 0.299538 | \n",
" 0.492224 | \n",
" 0.487327 | \n",
" 0.484123 | \n",
" 0.482726 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 29 columns
\n",
"
"
],
"text/plain": [
" BJSB3 BJSL25 BJVL19 BZBB1 CRL0001 CRL0030 CUCA4 \\\n",
"BJSB3 0.000000 0.250447 0.253472 0.592255 0.530145 0.572576 0.601853 \n",
"BJSL25 0.250447 0.000000 0.235900 0.558630 0.494291 0.537193 0.566156 \n",
"BJVL19 0.253472 0.235900 0.000000 0.567897 0.502775 0.547391 0.576355 \n",
"BZBB1 0.592255 0.558630 0.567897 0.000000 0.280691 0.280569 0.422670 \n",
"CRL0001 0.530145 0.494291 0.502775 0.280691 0.000000 0.239596 0.347859 \n",
"\n",
" CUMM5 CUSV6 CUVN10 ... FLWO6 HNDA09 LALC2 \\\n",
"BJSB3 0.597044 0.591990 0.579937 ... 0.594005 0.582000 0.568137 \n",
"BJSL25 0.559675 0.554665 0.542768 ... 0.558769 0.548897 0.532239 \n",
"BJVL19 0.569360 0.563000 0.554621 ... 0.564728 0.555927 0.539913 \n",
"BZBB1 0.422962 0.426266 0.394242 ... 0.559152 0.285883 0.532918 \n",
"CRL0001 0.322277 0.342836 0.213266 ... 0.470064 0.262217 0.451717 \n",
"\n",
" MXED8 MXGT4 MXSA3017 SCCU3 TXGR3 TXMD3 TXWV2 \n",
"BJSB3 0.464618 0.443942 0.579789 0.603638 0.487945 0.487936 0.590440 \n",
"BJSL25 0.435050 0.412694 0.547182 0.567323 0.453606 0.457105 0.554882 \n",
"BJVL19 0.441844 0.417060 0.556449 0.575336 0.464118 0.465476 0.562278 \n",
"BZBB1 0.525701 0.542381 0.317450 0.576455 0.552554 0.551579 0.563571 \n",
"CRL0001 0.466429 0.477764 0.299538 0.492224 0.487327 0.484123 0.482726 \n",
"\n",
"[5 rows x 29 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show the upper corner \n",
"dist.dists.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Draw the matrix"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toyplot.matrix(\n",
" dist.dists, \n",
" bshow=False,\n",
" tshow=False,\n",
" rlocator=toyplot.locator.Explicit(\n",
" range(len(dist.names)),\n",
" sorted(dist.names),\n",
"));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Draw matrix reordered to match groups in imap"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# get list of concatenated names from each group\n",
"ordered_names = []\n",
"for group in dist.imap.values():\n",
" ordered_names += group\n",
"\n",
"# reorder matrix to match name order \n",
"ordered_matrix = dist.dists[ordered_names].T[ordered_names]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toyplot.matrix(\n",
" ordered_matrix,\n",
" bshow=False,\n",
" tshow=False,\n",
" rlocator=toyplot.locator.Explicit(\n",
" range(len(ordered_names)),\n",
" ordered_names,\n",
"));"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}