{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# sourmash: working with private collections of signatures\n", "\n", "### Running this notebook\n", "\n", "You can run this notebook interactively via mybinder; click on this button:\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dib-lab/sourmash/latest?filepath=doc%2Fsourmash-collections.ipynb)\n", "\n", "A rendered version of this notebook is available at [sourmash.readthedocs.io](https://sourmash.readthedocs.io) under \"Tutorials and notebooks\".\n", "\n", "You can also get this notebook from the [doc/ subdirectory of the sourmash GitHub repository](https://github.com/dib-lab/sourmash/tree/latest/doc). See [binder/environment.yml](https://github.com/dib-lab/sourmash/blob/latest/binder/environment.yml) for installation dependencies.\n", "\n", "### What is this?\n", "\n", "This is a Jupyter Notebook using Python 3. If you are running this via [binder](https://mybinder.org), you can use Shift-ENTER to run cells, and double-click on code cells to edit them.\n", "\n", "Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please [file issues on GitHub](https://github.com/dib-lab/sourmash/issues/) if you have any questions or comments!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## download a bunch of genomes" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/t/dev/sourmash/doc/big_genomes\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 459 100 459 0 0 750 0 --:--:-- --:--:-- --:--:-- 750\n", "100 61.1M 100 61.1M 0 0 2966k 0 0:00:21 0:00:21 --:--:-- 3496k\n" ] } ], "source": [ "!mkdir -p big_genomes\n", "!curl -L https://osf.io/8uxj9/?action=download | (cd big_genomes && tar xzf -)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## compute signatures for each file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/t/dev/sourmash/doc/big_genomes\n", "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Ksetting num_hashes to 0 because --scaled is set\n", "\u001b[Kcomputing signatures for files: 0.fa, 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 23.fa, 24.fa, 25.fa, 26.fa, 27.fa, 28.fa, 29.fa, 3.fa, 30.fa, 31.fa, 32.fa, 33.fa, 34.fa, 35.fa, 36.fa, 37.fa, 38.fa, 39.fa, 4.fa, 40.fa, 41.fa, 42.fa, 43.fa, 44.fa, 45.fa, 46.fa, 47.fa, 48.fa, 49.fa, 5.fa, 50.fa, 51.fa, 52.fa, 53.fa, 54.fa, 55.fa, 56.fa, 57.fa, 58.fa, 59.fa, 6.fa, 60.fa, 61.fa, 62.fa, 63.fa, 7.fa, 8.fa, 9.fa\n", "\u001b[KComputing signature for ksizes: [31]\n", "\u001b[KComputing only nucleotide (and not protein) signatures.\n", "\u001b[KComputing a total of 1 signature(s).\n", "\u001b[K... reading sequences from 0.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 0.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 1.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 1.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 10.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 10.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 11.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 11.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 12.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 12.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 13.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 13.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 14.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 14.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 15.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 15.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 16.fa\n", "\u001b[Kcalculated 1 signatures for 4 sequences in 16.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 17.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 17.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 18.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 18.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 19.fa\n", "\u001b[Kcalculated 1 signatures for 9 sequences in 19.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 2.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 2.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 20.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 20.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 21.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 21.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 22.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 22.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 23.fa\n", "\u001b[Kcalculated 1 signatures for 5 sequences in 23.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 24.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 24.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 25.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 25.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 26.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 26.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 27.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 27.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 28.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 28.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 29.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 29.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 3.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 3.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 30.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 30.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 31.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 31.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 32.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 32.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 33.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 33.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 34.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 34.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 35.fa\n", "\u001b[Kcalculated 1 signatures for 7 sequences in 35.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 36.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 36.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 37.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 37.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 38.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 38.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 39.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 39.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 4.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 4.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 40.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 40.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 41.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 41.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 42.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 42.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 43.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 43.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 44.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 44.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 45.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 45.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 46.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 46.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 47.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 47.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 48.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 48.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 49.fa\n", "\u001b[Kcalculated 1 signatures for 228 sequences in 49.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 5.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 5.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 50.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 50.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 51.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 51.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 52.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 52.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 53.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 53.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 54.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 54.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 55.fa\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[Kcalculated 1 signatures for 1 sequences in 55.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 56.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 56.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 57.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 57.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 58.fa\n", "\u001b[Kcalculated 1 signatures for 30 sequences in 58.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 59.fa\n", "\u001b[Kcalculated 1 signatures for 5 sequences in 59.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 6.fa\n", "\u001b[Kcalculated 1 signatures for 76 sequences in 6.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 60.fa\n", "\u001b[Kcalculated 1 signatures for 11 sequences in 60.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 61.fa\n", "\u001b[Kcalculated 1 signatures for 47 sequences in 61.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 62.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 62.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 63.fa\n", "\u001b[Kcalculated 1 signatures for 4 sequences in 63.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 7.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 7.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 8.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 8.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", "\u001b[K... reading sequences from 9.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 9.fa\n", "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n" ] } ], "source": [ "!cd big_genomes/ && sourmash compute -k 31 --scaled=1000 --name-from-first *.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare them all" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kloaded 64 signatures total. 
\n", "\u001b[Kdownsampling to scaled value of 1000\n", "\u001b[K\n", "min similarity in matrix: 0.000\n", "\u001b[Ksaving labels to: compare_all.mat.labels.txt\n", "\u001b[Ksaving distance matrix to: compare_all.mat\n", "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kloading comparison matrix from compare_all.mat...\n", "\u001b[K...got 64 x 64 matrix.\n", "\u001b[Kloading labels from compare_all.mat.labels.txt\n", "\u001b[Ksaving histogram of matrix values => compare_all.mat.hist.png\n", "\u001b[Kwrote dendrogram to: compare_all.mat.dendro.png\n", "\u001b[Kwrote numpy distance matrix to: compare_all.mat.matrix.png\n" ] } ], "source": [ "!sourmash compare big_genomes/*.sig -o compare_all.mat\n", "!sourmash plot compare_all.mat" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(filename='compare_all.mat.matrix.png') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## make a fast(er) search database for all of them" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\n", "\n", "\u001b[Kloading 64 files into SBT\n", "\u001b[Kreading from big_genomes/9.fa.sig (63 signatures so far))\n", "\u001b[Kloaded 64 sigs; saving SBT under \"all-genomes\"\n", "\u001b[K127 of 127 nodes saved\n", "Finished saving nodes, now saving SBT json file.\n" ] } ], "source": [ "!sourmash index -k 31 all-genomes big_genomes/*.sig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now use this database with `sourmash search` and `sourmash gather`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kselecting default query k=31.\n", "\u001b[Kloaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)\n", "\u001b[Kloaded 1 databases. \n", "\n", "2 matches:\n", "similarity match\n", "---------- -----\n", " 9.5% NC_009665.1 Shewanella baltica OS185, complete genome\n", " 4.4% NC_011663.1 Shewanella baltica OS223, complete genome\n" ] } ], "source": [ "!sourmash search shew_os185.fa.sig all-genomes --threshold=0.001" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\r\n", "\r", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\r\n", "\r\n", "\r", "\u001b[Ksetting num_hashes to 0 because --scaled is set\r\n", "\r", "\u001b[Kcomputing signatures for files: fake-metagenome.fa\r\n", "\r", "\u001b[KComputing signature for ksizes: [31]\r\n", "\r", "\u001b[KComputing only nucleotide (and not protein) signatures.\r\n", "\r", "\u001b[KComputing a total of 1 signature(s).\r\n", "\r", "\u001b[Kskipping fake-metagenome.fa - already done\r\n" ] } ], "source": [ "# (make fake metagenome again, just in case)\n", "!cat genomes/*.fa > fake-metagenome.fa\n", "!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kselect query k=31 automatically.\n", "\u001b[Kloaded query: fake-metagenome.fa... (k=31, DNA)\n", "\u001b[Kloaded 1 databases. \n", "\n", "\n", "overlap p_query p_match\n", "--------- ------- -------\n", "0.5 Mbp 42.2% 10.5% NC_011663.1 Shewanella baltica OS223,...\n", "499.0 kbp 38.4% 18.5% CP001071.1 Akkermansia muciniphila AT...\n", "0.5 Mbp 19.4% 4.9% NC_009665.1 Shewanella baltica OS185,...\n", "\n", "found 3 matches total;\n", "the recovered matches hit 100.0% of the query\n", "\n" ] } ], "source": [ "!sourmash gather fake-metagenome.fa.sig all-genomes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## build a database with taxonomic information\n", "\n", "For this, we need to provide a metadata file that maps each accession to its taxonomic information." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>accession</th><th>taxid</th><th>superkingdom</th><th>phylum</th><th>class</th><th>order</th><th>family</th><th>genus</th><th>species</th><th>strain</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AE000782</td><td>224325</td><td>Archaea</td><td>Euryarchaeota</td><td>Archaeoglobi</td><td>Archaeoglobales</td><td>Archaeoglobaceae</td><td>Archaeoglobus</td><td>Archaeoglobus fulgidus</td><td>Archaeoglobus fulgidus DSM 4304</td></tr>\n",
"    <tr><th>1</th><td>NC_000909</td><td>243232</td><td>Archaea</td><td>Euryarchaeota</td><td>Methanococci</td><td>Methanococcales</td><td>Methanocaldococcaceae</td><td>Methanocaldococcus</td><td>Methanocaldococcus jannaschii</td><td>Methanocaldococcus jannaschii DSM 2661</td></tr>\n",
"    <tr><th>2</th><td>NC_003272</td><td>103690</td><td>Bacteria</td><td>Cyanobacteria</td><td>NaN</td><td>Nostocales</td><td>Nostocaceae</td><td>Nostoc</td><td>Nostoc sp. PCC 7120</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>AE009441</td><td>178306</td><td>Archaea</td><td>Crenarchaeota</td><td>Thermoprotei</td><td>Thermoproteales</td><td>Thermoproteaceae</td><td>Pyrobaculum</td><td>Pyrobaculum aerophilum</td><td>Pyrobaculum aerophilum str. IM2</td></tr>\n",
"    <tr><th>4</th><td>AE009950</td><td>186497</td><td>Archaea</td><td>Euryarchaeota</td><td>Thermococci</td><td>Thermococcales</td><td>Thermococcaceae</td><td>Pyrococcus</td><td>Pyrococcus furiosus</td><td>Pyrococcus furiosus DSM 3638</td></tr>\n",
"    <tr><th>5</th><td>AE009951</td><td>190304</td><td>Bacteria</td><td>Fusobacteria</td><td>Fusobacteriia</td><td>Fusobacteriales</td><td>Fusobacteriaceae</td><td>Fusobacterium</td><td>Fusobacterium nucleatum</td><td>NaN</td></tr>\n",
"    <tr><th>6</th><td>AE010299</td><td>188937</td><td>Archaea</td><td>Euryarchaeota</td><td>Methanomicrobia</td><td>Methanosarcinales</td><td>Methanosarcinaceae</td><td>Methanosarcina</td><td>Methanosarcina acetivorans</td><td>Methanosarcina acetivorans C2A</td></tr>\n",
"    <tr><th>7</th><td>AE009439</td><td>190192</td><td>Archaea</td><td>Euryarchaeota</td><td>Methanopyri</td><td>Methanopyrales</td><td>Methanopyraceae</td><td>Methanopyrus</td><td>Methanopyrus kandleri</td><td>Methanopyrus kandleri AV19</td></tr>\n",
"    <tr><th>8</th><td>NC_003911</td><td>246200</td><td>Bacteria</td><td>Proteobacteria</td><td>Alphaproteobacteria</td><td>Rhodobacterales</td><td>Rhodobacteraceae</td><td>Ruegeria</td><td>Ruegeria pomeroyi</td><td>Ruegeria pomeroyi DSS-3</td></tr>\n",
"    <tr><th>9</th><td>AE006470</td><td>194439</td><td>Bacteria</td><td>Chlorobi</td><td>Chlorobia</td><td>Chlorobiales</td><td>Chlorobiaceae</td><td>Chlorobaculum</td><td>Chlorobaculum tepidum</td><td>Chlorobaculum tepidum TLS</td></tr>\n",
"    <tr><th>10</th><td>AE015928</td><td>226186</td><td>Bacteria</td><td>Bacteroidetes</td><td>Bacteroidia</td><td>Bacteroidales</td><td>Bacteroidaceae</td><td>Bacteroides</td><td>Bacteroides thetaiotaomicron</td><td>Bacteroides thetaiotaomicron VPI-5482</td></tr>\n",
"    <tr><th>11</th><td>AL954747</td><td>228410</td><td>Bacteria</td><td>Proteobacteria</td><td>Betaproteobacteria</td><td>Nitrosomonadales</td><td>Nitrosomonadaceae</td><td>Nitrosomonas</td><td>Nitrosomonas europaea</td><td>Nitrosomonas europaea ATCC 19718</td></tr>\n",
"    <tr><th>12</th><td>BX119912</td><td>243090</td><td>Bacteria</td><td>Planctomycetes</td><td>Planctomycetia</td><td>Planctomycetales</td><td>Planctomycetaceae</td><td>Rhodopirellula</td><td>Rhodopirellula baltica</td><td>Rhodopirellula baltica SH 1</td></tr>\n",
"    <tr><th>13</th><td>BX571656</td><td>273121</td><td>Bacteria</td><td>Proteobacteria</td><td>Epsilonproteobacteria</td><td>Campylobacterales</td><td>Helicobacteraceae</td><td>Wolinella</td><td>Wolinella succinogenes</td><td>Wolinella succinogenes DSM 1740</td></tr>\n",
"    <tr><th>14</th><td>AE017180</td><td>243231</td><td>Bacteria</td><td>Proteobacteria</td><td>Deltaproteobacteria</td><td>Desulfuromonadales</td><td>Geobacteraceae</td><td>Geobacter</td><td>Geobacter sulfurreducens</td><td>Geobacter sulfurreducens PCA</td></tr>\n",
"    <tr><th>15</th><td>AE017226</td><td>243275</td><td>Bacteria</td><td>Spirochaetes</td><td>Spirochaetia</td><td>Spirochaetales</td><td>Spirochaetaceae</td><td>Treponema</td><td>Treponema denticola</td><td>Treponema denticola ATCC 35405</td></tr>\n",
"    <tr><th>16</th><td>BX950229</td><td>267377</td><td>Archaea</td><td>Euryarchaeota</td><td>Methanococci</td><td>Methanococcales</td><td>Methanococcaceae</td><td>Methanococcus</td><td>Methanococcus maripaludis</td><td>Methanococcus maripaludis S2</td></tr>\n",
"    <tr><th>17</th><td>AE017221</td><td>262724</td><td>Bacteria</td><td>Deinococcus-Thermus</td><td>Deinococci</td><td>Thermales</td><td>Thermaceae</td><td>Thermus</td><td>Thermus thermophilus</td><td>Thermus thermophilus HB27</td></tr>\n",
"    <tr><th>18</th><td>BA000001</td><td>70601</td><td>Archaea</td><td>Euryarchaeota</td><td>Thermococci</td><td>Thermococcales</td><td>Thermococcaceae</td><td>Pyrococcus</td><td>Pyrococcus horikoshii</td><td>Pyrococcus horikoshii OT3</td></tr>\n",
"    <tr><th>19</th><td>BA000023</td><td>273063</td><td>Archaea</td><td>Crenarchaeota</td><td>Thermoprotei</td><td>Sulfolobales</td><td>Sulfolobaceae</td><td>Sulfolobus</td><td>Sulfolobus tokodaii</td><td>Sulfolobus tokodaii str. 7</td></tr>\n",
"    <tr><th>20</th><td>NC_007951</td><td>266265</td><td>Bacteria</td><td>Proteobacteria</td><td>Betaproteobacteria</td><td>Burkholderiales</td><td>Burkholderiaceae</td><td>Paraburkholderia</td><td>Paraburkholderia xenovorans</td><td>Paraburkholderia xenovorans LB400</td></tr>\n",
"    <tr><th>21</th><td>CP000492</td><td>290317</td><td>Bacteria</td><td>Chlorobi</td><td>Chlorobia</td><td>Chlorobiales</td><td>Chlorobiaceae</td><td>Chlorobium</td><td>Chlorobium phaeobacteroides</td><td>Chlorobium phaeobacteroides DSM 266</td></tr>\n",
"    <tr><th>22</th><td>NC_008751</td><td>391774</td><td>Bacteria</td><td>Proteobacteria</td><td>Deltaproteobacteria</td><td>Desulfovibrionales</td><td>Desulfovibrionaceae</td><td>Desulfovibrio</td><td>Desulfovibrio vulgaris</td><td>Desulfovibrio vulgaris DP4</td></tr>\n",
"    <tr><th>23</th><td>CP000568</td><td>203119</td><td>Bacteria</td><td>Firmicutes</td><td>Clostridia</td><td>Clostridiales</td><td>Ruminococcaceae</td><td>Ruminiclostridium</td><td>Ruminiclostridium thermocellum</td><td>Ruminiclostridium thermocellum ATCC 27405</td></tr>\n",
"    <tr><th>24</th><td>CP000561</td><td>410359</td><td>Archaea</td><td>Crenarchaeota</td><td>Thermoprotei</td><td>Thermoproteales</td><td>Thermoproteaceae</td><td>Pyrobaculum</td><td>Pyrobaculum calidifontis</td><td>Pyrobaculum calidifontis JCM 11548</td></tr>\n",
"    <tr><th>25</th><td>CP000609</td><td>402880</td><td>Archaea</td><td>Euryarchaeota</td><td>Methanococci</td><td>Methanococcales</td><td>Methanococcaceae</td><td>Methanococcus</td><td>Methanococcus maripaludis</td><td>Methanococcus maripaludis C5</td></tr>\n",
"    <tr><th>26</th><td>CP000607</td><td>290318</td><td>Bacteria</td><td>Chlorobi</td><td>Chlorobia</td><td>Chlorobiales</td><td>Chlorobiaceae</td><td>Chlorobium</td><td>Chlorobium phaeovibrioides</td><td>Chlorobium phaeovibrioides DSM 265</td></tr>\n",
"    <tr><th>27</th><td>CP000660</td><td>340102</td><td>Archaea</td><td>Crenarchaeota</td><td>Thermoprotei</td><td>Thermoproteales</td><td>Thermoproteaceae</td><td>Pyrobaculum</td><td>Pyrobaculum arsenaticum</td><td>Pyrobaculum arsenaticum DSM 13514</td></tr>\n",
"    <tr><th>28</th><td>CP000667</td><td>369723</td><td>Bacteria</td><td>Actinobacteria</td><td>Actinobacteria</td><td>Micromonosporales</td><td>Micromonosporaceae</td><td>Salinispora</td><td>Salinispora tropica</td><td>Salinispora tropica CNB-440</td></tr>\n",
"    <tr><th>29</th><td>CP000679</td><td>351627</td><td>Bacteria</td><td>Firmicutes</td><td>Clostridia</td><td>Thermoanaerobacterales</td><td>Thermoanaerobacterales Family III. Incertae Sedis</td><td>Caldicellulosiruptor</td><td>Caldicellulosiruptor saccharolyticus</td><td>Caldicellulosiruptor saccharolyticus DSM 8903</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>34</th><td>CP000850</td><td>391037</td><td>Bacteria</td><td>Actinobacteria</td><td>Actinobacteria</td><td>Micromonosporales</td><td>Micromonosporaceae</td><td>Salinispora</td><td>Salinispora arenicola</td><td>Salinispora arenicola CNS-205</td></tr>\n",
"    <tr><th>35</th><td>CP000909</td><td>324602</td><td>Bacteria</td><td>Chloroflexi</td><td>Chloroflexia</td><td>Chloroflexales</td><td>Chloroflexaceae</td><td>Chloroflexus</td><td>Chloroflexus aurantiacus</td><td>Chloroflexus aurantiacus J-10-fl</td></tr>\n",
"    <tr><th>36</th><td>CP000924</td><td>340099</td><td>Bacteria</td><td>Firmicutes</td><td>Clostridia</td><td>Thermoanaerobacterales</td><td>Thermoanaerobacteraceae</td><td>Thermoanaerobacter</td><td>Thermoanaerobacter pseudethanolicus</td><td>Thermoanaerobacter pseudethanolicus ATCC 33223</td></tr>\n",
"    <tr><th>37</th><td>CP000969</td><td>126740</td><td>Bacteria</td><td>Thermotogae</td><td>Thermotogae</td><td>Thermotogales</td><td>Thermotogaceae</td><td>Thermotoga</td><td>Thermotoga sp. RQ2</td><td>NaN</td></tr>\n",
"    <tr><th>38</th><td>CP001013</td><td>395495</td><td>Bacteria</td><td>Proteobacteria</td><td>Betaproteobacteria</td><td>Burkholderiales</td><td>NaN</td><td>Leptothrix</td><td>Leptothrix cholodnii</td><td>Leptothrix cholodnii SP-6</td></tr>\n",
"    <tr><th>39</th><td>CP001071</td><td>349741</td><td>Bacteria</td><td>Verrucomicrobia</td><td>Verrucomicrobiae</td><td>Verrucomicrobiales</td><td>Akkermansiaceae</td><td>Akkermansia</td><td>Akkermansia muciniphila</td><td>Akkermansia muciniphila ATCC BAA-835</td></tr>\n",
"    <tr><th>40</th><td>AP009380</td><td>431947</td><td>Bacteria</td><td>Bacteroidetes</td><td>Bacteroidia</td><td>Bacteroidales</td><td>Porphyromonadaceae</td><td>Porphyromonas</td><td>Porphyromonas gingivalis</td><td>Porphyromonas gingivalis ATCC 33277</td></tr>\n",
"    <tr><th>41</th><td>NC_010730</td><td>436114</td><td>Bacteria</td><td>Aquificae</td><td>Aquificae</td><td>Aquificales</td><td>Hydrogenothermaceae</td><td>Sulfurihydrogenibium</td><td>Sulfurihydrogenibium sp. YO3AOP1</td><td>NaN</td></tr>\n",
"    <tr><th>42</th><td>CP001097</td><td>290315</td><td>Bacteria</td><td>Chlorobi</td><td>Chlorobia</td><td>Chlorobiales</td><td>Chlorobiaceae</td><td>Chlorobium</td><td>Chlorobium limicola</td><td>Chlorobium limicola DSM 245</td></tr>\n",
"    <tr><th>43</th><td>CP001110</td><td>324925</td><td>Bacteria</td><td>Chlorobi</td><td>Chlorobia</td><td>Chlorobiales</td><td>Chlorobiaceae</td><td>Pelodictyon</td><td>Pelodictyon phaeoclathratiforme</td><td>Pelodictyon phaeoclathratiforme BU-1</td></tr>\n",
"    <tr><th>44</th><td>CP001130</td><td>380749</td><td>Bacteria</td><td>Aquificae</td><td>Aquificae</td><td>Aquificales</td><td>Aquificaceae</td><td>Hydrogenobaculum</td><td>Hydrogenobaculum sp. Y04AAS1</td><td>NaN</td></tr>\n",
"    <tr><th>45</th><td>NZ_CH959311</td><td>52598</td><td>Bacteria</td><td>Proteobacteria</td><td>Alphaproteobacteria</td><td>Rhodobacterales</td><td>Rhodobacteraceae</td><td>Sulfitobacter</td><td>Sulfitobacter sp. EE-36</td><td>NaN</td></tr>\n",
"    <tr><th>46</th><td>NZ_CH959317</td><td>314267</td><td>Bacteria</td><td>Proteobacteria</td><td>Alphaproteobacteria</td><td>Rhodobacterales</td><td>Rhodobacteraceae</td><td>Sulfitobacter</td><td>Sulfitobacter sp. NAS-14.1</td><td>NaN</td></tr>\n",
"    <tr><th>47</th><td>CP001251</td><td>515635</td><td>Bacteria</td><td>Dictyoglomi</td><td>Dictyoglomia</td><td>Dictyoglomales</td><td>Dictyoglomaceae</td><td>Dictyoglomus</td><td>Dictyoglomus turgidum</td><td>Dictyoglomus turgidum DSM 6724</td></tr>\n",
"    <tr><th>48</th><td>NC_011663</td><td>407976</td><td>Bacteria</td><td>Proteobacteria</td><td>Gammaproteobacteria</td><td>Alteromonadales</td><td>Shewanellaceae</td><td>Shewanella</td><td>Shewanella baltica</td><td>Shewanella baltica OS223</td></tr>\n",
"    <tr><th>49</th><td>CP000916</td><td>309803</td><td>Bacteria</td><td>Thermotogae</td><td>Thermotogae</td><td>Thermotogales</td><td>Thermotogaceae</td><td>Thermotoga</td><td>Thermotoga neapolitana</td><td>Thermotoga neapolitana DSM 4359</td></tr>\n",
"    <tr><th>50</th><td>NZ_DS996397</td><td>411464</td><td>Bacteria</td><td>Proteobacteria</td><td>Deltaproteobacteria</td><td>Desulfovibrionales</td><td>Desulfovibrionaceae</td><td>Desulfovibrio</td><td>Desulfovibrio piger</td><td>Desulfovibrio piger ATCC 29098</td></tr>\n",
"    <tr><th>51</th><td>CP001230</td><td>123214</td><td>Bacteria</td><td>Aquificae</td><td>Aquificae</td><td>Aquificales</td><td>Hydrogenothermaceae</td><td>Persephonella</td><td>Persephonella marina</td><td>Persephonella marina EX-H1</td></tr>\n",
"    <tr><th>52</th><td>CP001472</td><td>240015</td><td>Bacteria</td><td>Acidobacteria</td><td>Acidobacteriia</td><td>Acidobacteriales</td><td>Acidobacteriaceae</td><td>Acidobacterium</td><td>Acidobacterium capsulatum</td><td>Acidobacterium capsulatum ATCC 51196</td></tr>\n",
"    <tr><th>53</th><td>AP009153</td><td>379066</td><td>Bacteria</td><td>Gemmatimonadetes</td><td>Gemmatimonadetes</td><td>Gemmatimonadales</td><td>Gemmatimonadaceae</td><td>Gemmatimonas</td><td>Gemmatimonas aurantiaca</td><td>Gemmatimonas aurantiaca T-27</td></tr>\n",
"    <tr><th>54</th><td>CP001941</td><td>439481</td><td>Archaea</td><td>Euryarchaeota</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Aciduliprofundum</td><td>Aciduliprofundum boonei</td><td>Aciduliprofundum boonei T469</td></tr>\n",
"    <tr><th>55</th><td>NC_013968</td><td>309800</td><td>Archaea</td><td>Euryarchaeota</td><td>Halobacteria</td><td>Haloferacales</td><td>Haloferacaceae</td><td>Haloferax</td><td>Haloferax volcanii</td><td>Haloferax volcanii DS2</td></tr>\n",
"    <tr><th>56</th><td>NZ_KE136524</td><td>226185</td><td>Bacteria</td><td>Firmicutes</td><td>Bacilli</td><td>Lactobacillales</td><td>Enterococcaceae</td><td>Enterococcus</td><td>Enterococcus faecalis</td><td>Enterococcus faecalis V583</td></tr>\n",
"    <tr><th>57</th><td>NZ_KQ961402</td><td>542</td><td>Bacteria</td><td>Proteobacteria</td><td>Alphaproteobacteria</td><td>Sphingomonadales</td><td>Sphingomonadaceae</td><td>Zymomonas</td><td>Zymomonas mobilis</td><td>NaN</td></tr>\n",
"    <tr><th>58</th><td>NZ_CP015081</td><td>243230</td><td>Bacteria</td><td>Deinococcus-Thermus</td><td>Deinococci</td><td>Deinococcales</td><td>Deinococcaceae</td><td>Deinococcus</td><td>Deinococcus radiodurans</td><td>Deinococcus radiodurans R1</td></tr>\n",
"    <tr><th>59</th><td>NZ_ABZS01000228</td><td>432331</td><td>Bacteria</td><td>Aquificae</td><td>Aquificae</td><td>Aquificales</td><td>Hydrogenothermaceae</td><td>Sulfurihydrogenibium</td><td>Sulfurihydrogenibium yellowstonense</td><td>Sulfurihydrogenibium yellowstonense SS-5</td></tr>\n",
"    <tr><th>60</th><td>NZ_JGWU01000001</td><td>1458259</td><td>Bacteria</td><td>Proteobacteria</td><td>Betaproteobacteria</td><td>Burkholderiales</td><td>Alcaligenaceae</td><td>Bordetella</td><td>Bordetella bronchiseptica</td><td>Bordetella bronchiseptica D989</td></tr>\n",
"    <tr><th>61</th><td>NZ_FWDH01000003</td><td>31899</td><td>Bacteria</td><td>Firmicutes</td><td>Clostridia</td><td>Thermoanaerobacterales</td><td>Thermoanaerobacterales Family III. Incertae Sedis</td><td>Caldicellulosiruptor</td><td>Caldicellulosiruptor bescii</td><td>NaN</td></tr>\n",
"    <tr><th>62</th><td>NC_009972</td><td>316274</td><td>Bacteria</td><td>Chloroflexi</td><td>Chloroflexia</td><td>Herpetosiphonales</td><td>Herpetosiphonaceae</td><td>Herpetosiphon</td><td>Herpetosiphon aurantiacus</td><td>Herpetosiphon aurantiacus DSM 785</td></tr>\n",
"    <tr><th>63</th><td>NC_005213</td><td>228908</td><td>Archaea</td><td>Nanoarchaeota</td><td>NaN</td><td>Nanoarchaeales</td><td>Nanoarchaeaceae</td><td>Nanoarchaeum</td><td>Nanoarchaeum equitans</td><td>Nanoarchaeum equitans Kin4-M</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>64 rows × 10 columns</p>\n",
"</div>
" ], "text/plain": [ " accession taxid superkingdom phylum \\\n", "0 AE000782 224325 Archaea Euryarchaeota \n", "1 NC_000909 243232 Archaea Euryarchaeota \n", "2 NC_003272 103690 Bacteria Cyanobacteria \n", "3 AE009441 178306 Archaea Crenarchaeota \n", "4 AE009950 186497 Archaea Euryarchaeota \n", "5 AE009951 190304 Bacteria Fusobacteria \n", "6 AE010299 188937 Archaea Euryarchaeota \n", "7 AE009439 190192 Archaea Euryarchaeota \n", "8 NC_003911 246200 Bacteria Proteobacteria \n", "9 AE006470 194439 Bacteria Chlorobi \n", "10 AE015928 226186 Bacteria Bacteroidetes \n", "11 AL954747 228410 Bacteria Proteobacteria \n", "12 BX119912 243090 Bacteria Planctomycetes \n", "13 BX571656 273121 Bacteria Proteobacteria \n", "14 AE017180 243231 Bacteria Proteobacteria \n", "15 AE017226 243275 Bacteria Spirochaetes \n", "16 BX950229 267377 Archaea Euryarchaeota \n", "17 AE017221 262724 Bacteria Deinococcus-Thermus \n", "18 BA000001 70601 Archaea Euryarchaeota \n", "19 BA000023 273063 Archaea Crenarchaeota \n", "20 NC_007951 266265 Bacteria Proteobacteria \n", "21 CP000492 290317 Bacteria Chlorobi \n", "22 NC_008751 391774 Bacteria Proteobacteria \n", "23 CP000568 203119 Bacteria Firmicutes \n", "24 CP000561 410359 Archaea Crenarchaeota \n", "25 CP000609 402880 Archaea Euryarchaeota \n", "26 CP000607 290318 Bacteria Chlorobi \n", "27 CP000660 340102 Archaea Crenarchaeota \n", "28 CP000667 369723 Bacteria Actinobacteria \n", "29 CP000679 351627 Bacteria Firmicutes \n", ".. ... ... ... ... 
\n", "34 CP000850 391037 Bacteria Actinobacteria \n", "35 CP000909 324602 Bacteria Chloroflexi \n", "36 CP000924 340099 Bacteria Firmicutes \n", "37 CP000969 126740 Bacteria Thermotogae \n", "38 CP001013 395495 Bacteria Proteobacteria \n", "39 CP001071 349741 Bacteria Verrucomicrobia \n", "40 AP009380 431947 Bacteria Bacteroidetes \n", "41 NC_010730 436114 Bacteria Aquificae \n", "42 CP001097 290315 Bacteria Chlorobi \n", "43 CP001110 324925 Bacteria Chlorobi \n", "44 CP001130 380749 Bacteria Aquificae \n", "45 NZ_CH959311 52598 Bacteria Proteobacteria \n", "46 NZ_CH959317 314267 Bacteria Proteobacteria \n", "47 CP001251 515635 Bacteria Dictyoglomi \n", "48 NC_011663 407976 Bacteria Proteobacteria \n", "49 CP000916 309803 Bacteria Thermotogae \n", "50 NZ_DS996397 411464 Bacteria Proteobacteria \n", "51 CP001230 123214 Bacteria Aquificae \n", "52 CP001472 240015 Bacteria Acidobacteria \n", "53 AP009153 379066 Bacteria Gemmatimonadetes \n", "54 CP001941 439481 Archaea Euryarchaeota \n", "55 NC_013968 309800 Archaea Euryarchaeota \n", "56 NZ_KE136524 226185 Bacteria Firmicutes \n", "57 NZ_KQ961402 542 Bacteria Proteobacteria \n", "58 NZ_CP015081 243230 Bacteria Deinococcus-Thermus \n", "59 NZ_ABZS01000228 432331 Bacteria Aquificae \n", "60 NZ_JGWU01000001 1458259 Bacteria Proteobacteria \n", "61 NZ_FWDH01000003 31899 Bacteria Firmicutes \n", "62 NC_009972 316274 Bacteria Chloroflexi \n", "63 NC_005213 228908 Archaea Nanoarchaeota \n", "\n", " class order \\\n", "0 Archaeoglobi Archaeoglobales \n", "1 Methanococci Methanococcales \n", "2 NaN Nostocales \n", "3 Thermoprotei Thermoproteales \n", "4 Thermococci Thermococcales \n", "5 Fusobacteriia Fusobacteriales \n", "6 Methanomicrobia Methanosarcinales \n", "7 Methanopyri Methanopyrales \n", "8 Alphaproteobacteria Rhodobacterales \n", "9 Chlorobia Chlorobiales \n", "10 Bacteroidia Bacteroidales \n", "11 Betaproteobacteria Nitrosomonadales \n", "12 Planctomycetia Planctomycetales \n", "13 Epsilonproteobacteria 
Campylobacterales \n", "14 Deltaproteobacteria Desulfuromonadales \n", "15 Spirochaetia Spirochaetales \n", "16 Methanococci Methanococcales \n", "17 Deinococci Thermales \n", "18 Thermococci Thermococcales \n", "19 Thermoprotei Sulfolobales \n", "20 Betaproteobacteria Burkholderiales \n", "21 Chlorobia Chlorobiales \n", "22 Deltaproteobacteria Desulfovibrionales \n", "23 Clostridia Clostridiales \n", "24 Thermoprotei Thermoproteales \n", "25 Methanococci Methanococcales \n", "26 Chlorobia Chlorobiales \n", "27 Thermoprotei Thermoproteales \n", "28 Actinobacteria Micromonosporales \n", "29 Clostridia Thermoanaerobacterales \n", ".. ... ... \n", "34 Actinobacteria Micromonosporales \n", "35 Chloroflexia Chloroflexales \n", "36 Clostridia Thermoanaerobacterales \n", "37 Thermotogae Thermotogales \n", "38 Betaproteobacteria Burkholderiales \n", "39 Verrucomicrobiae Verrucomicrobiales \n", "40 Bacteroidia Bacteroidales \n", "41 Aquificae Aquificales \n", "42 Chlorobia Chlorobiales \n", "43 Chlorobia Chlorobiales \n", "44 Aquificae Aquificales \n", "45 Alphaproteobacteria Rhodobacterales \n", "46 Alphaproteobacteria Rhodobacterales \n", "47 Dictyoglomia Dictyoglomales \n", "48 Gammaproteobacteria Alteromonadales \n", "49 Thermotogae Thermotogales \n", "50 Deltaproteobacteria Desulfovibrionales \n", "51 Aquificae Aquificales \n", "52 Acidobacteriia Acidobacteriales \n", "53 Gemmatimonadetes Gemmatimonadales \n", "54 NaN NaN \n", "55 Halobacteria Haloferacales \n", "56 Bacilli Lactobacillales \n", "57 Alphaproteobacteria Sphingomonadales \n", "58 Deinococci Deinococcales \n", "59 Aquificae Aquificales \n", "60 Betaproteobacteria Burkholderiales \n", "61 Clostridia Thermoanaerobacterales \n", "62 Chloroflexia Herpetosiphonales \n", "63 NaN Nanoarchaeales \n", "\n", " family genus \\\n", "0 Archaeoglobaceae Archaeoglobus \n", "1 Methanocaldococcaceae Methanocaldococcus \n", "2 Nostocaceae Nostoc \n", "3 Thermoproteaceae Pyrobaculum \n", "4 Thermococcaceae Pyrococcus \n", "5 
Fusobacteriaceae Fusobacterium \n", "6 Methanosarcinaceae Methanosarcina \n", "7 Methanopyraceae Methanopyrus \n", "8 Rhodobacteraceae Ruegeria \n", "9 Chlorobiaceae Chlorobaculum \n", "10 Bacteroidaceae Bacteroides \n", "11 Nitrosomonadaceae Nitrosomonas \n", "12 Planctomycetaceae Rhodopirellula \n", "13 Helicobacteraceae Wolinella \n", "14 Geobacteraceae Geobacter \n", "15 Spirochaetaceae Treponema \n", "16 Methanococcaceae Methanococcus \n", "17 Thermaceae Thermus \n", "18 Thermococcaceae Pyrococcus \n", "19 Sulfolobaceae Sulfolobus \n", "20 Burkholderiaceae Paraburkholderia \n", "21 Chlorobiaceae Chlorobium \n", "22 Desulfovibrionaceae Desulfovibrio \n", "23 Ruminococcaceae Ruminiclostridium \n", "24 Thermoproteaceae Pyrobaculum \n", "25 Methanococcaceae Methanococcus \n", "26 Chlorobiaceae Chlorobium \n", "27 Thermoproteaceae Pyrobaculum \n", "28 Micromonosporaceae Salinispora \n", "29 Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor \n", ".. ... ... \n", "34 Micromonosporaceae Salinispora \n", "35 Chloroflexaceae Chloroflexus \n", "36 Thermoanaerobacteraceae Thermoanaerobacter \n", "37 Thermotogaceae Thermotoga \n", "38 NaN Leptothrix \n", "39 Akkermansiaceae Akkermansia \n", "40 Porphyromonadaceae Porphyromonas \n", "41 Hydrogenothermaceae Sulfurihydrogenibium \n", "42 Chlorobiaceae Chlorobium \n", "43 Chlorobiaceae Pelodictyon \n", "44 Aquificaceae Hydrogenobaculum \n", "45 Rhodobacteraceae Sulfitobacter \n", "46 Rhodobacteraceae Sulfitobacter \n", "47 Dictyoglomaceae Dictyoglomus \n", "48 Shewanellaceae Shewanella \n", "49 Thermotogaceae Thermotoga \n", "50 Desulfovibrionaceae Desulfovibrio \n", "51 Hydrogenothermaceae Persephonella \n", "52 Acidobacteriaceae Acidobacterium \n", "53 Gemmatimonadaceae Gemmatimonas \n", "54 NaN Aciduliprofundum \n", "55 Haloferacaceae Haloferax \n", "56 Enterococcaceae Enterococcus \n", "57 Sphingomonadaceae Zymomonas \n", "58 Deinococcaceae Deinococcus \n", "59 Hydrogenothermaceae Sulfurihydrogenibium 
\n", "60 Alcaligenaceae Bordetella \n", "61 Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor \n", "62 Herpetosiphonaceae Herpetosiphon \n", "63 Nanoarchaeaceae Nanoarchaeum \n", "\n", " species \\\n", "0 Archaeoglobus fulgidus \n", "1 Methanocaldococcus jannaschii \n", "2 Nostoc sp. PCC 7120 \n", "3 Pyrobaculum aerophilum \n", "4 Pyrococcus furiosus \n", "5 Fusobacterium nucleatum \n", "6 Methanosarcina acetivorans \n", "7 Methanopyrus kandleri \n", "8 Ruegeria pomeroyi \n", "9 Chlorobaculum tepidum \n", "10 Bacteroides thetaiotaomicron \n", "11 Nitrosomonas europaea \n", "12 Rhodopirellula baltica \n", "13 Wolinella succinogenes \n", "14 Geobacter sulfurreducens \n", "15 Treponema denticola \n", "16 Methanococcus maripaludis \n", "17 Thermus thermophilus \n", "18 Pyrococcus horikoshii \n", "19 Sulfolobus tokodaii \n", "20 Paraburkholderia xenovorans \n", "21 Chlorobium phaeobacteroides \n", "22 Desulfovibrio vulgaris \n", "23 Ruminiclostridium thermocellum \n", "24 Pyrobaculum calidifontis \n", "25 Methanococcus maripaludis \n", "26 Chlorobium phaeovibrioides \n", "27 Pyrobaculum arsenaticum \n", "28 Salinispora tropica \n", "29 Caldicellulosiruptor saccharolyticus \n", ".. ... \n", "34 Salinispora arenicola \n", "35 Chloroflexus aurantiacus \n", "36 Thermoanaerobacter pseudethanolicus \n", "37 Thermotoga sp. RQ2 \n", "38 Leptothrix cholodnii \n", "39 Akkermansia muciniphila \n", "40 Porphyromonas gingivalis \n", "41 Sulfurihydrogenibium sp. YO3AOP1 \n", "42 Chlorobium limicola \n", "43 Pelodictyon phaeoclathratiforme \n", "44 Hydrogenobaculum sp. Y04AAS1 \n", "45 Sulfitobacter sp. EE-36 \n", "46 Sulfitobacter sp. 
NAS-14.1 \n", "47 Dictyoglomus turgidum \n", "48 Shewanella baltica \n", "49 Thermotoga neapolitana \n", "50 Desulfovibrio piger \n", "51 Persephonella marina \n", "52 Acidobacterium capsulatum \n", "53 Gemmatimonas aurantiaca \n", "54 Aciduliprofundum boonei \n", "55 Haloferax volcanii \n", "56 Enterococcus faecalis \n", "57 Zymomonas mobilis \n", "58 Deinococcus radiodurans \n", "59 Sulfurihydrogenibium yellowstonense \n", "60 Bordetella bronchiseptica \n", "61 Caldicellulosiruptor bescii \n", "62 Herpetosiphon aurantiacus \n", "63 Nanoarchaeum equitans \n", "\n", " strain \n", "0 Archaeoglobus fulgidus DSM 4304 \n", "1 Methanocaldococcus jannaschii DSM 2661 \n", "2 NaN \n", "3 Pyrobaculum aerophilum str. IM2 \n", "4 Pyrococcus furiosus DSM 3638 \n", "5 NaN \n", "6 Methanosarcina acetivorans C2A \n", "7 Methanopyrus kandleri AV19 \n", "8 Ruegeria pomeroyi DSS-3 \n", "9 Chlorobaculum tepidum TLS \n", "10 Bacteroides thetaiotaomicron VPI-5482 \n", "11 Nitrosomonas europaea ATCC 19718 \n", "12 Rhodopirellula baltica SH 1 \n", "13 Wolinella succinogenes DSM 1740 \n", "14 Geobacter sulfurreducens PCA \n", "15 Treponema denticola ATCC 35405 \n", "16 Methanococcus maripaludis S2 \n", "17 Thermus thermophilus HB27 \n", "18 Pyrococcus horikoshii OT3 \n", "19 Sulfolobus tokodaii str. 7 \n", "20 Paraburkholderia xenovorans LB400 \n", "21 Chlorobium phaeobacteroides DSM 266 \n", "22 Desulfovibrio vulgaris DP4 \n", "23 Ruminiclostridium thermocellum ATCC 27405 \n", "24 Pyrobaculum calidifontis JCM 11548 \n", "25 Methanococcus maripaludis C5 \n", "26 Chlorobium phaeovibrioides DSM 265 \n", "27 Pyrobaculum arsenaticum DSM 13514 \n", "28 Salinispora tropica CNB-440 \n", "29 Caldicellulosiruptor saccharolyticus DSM 8903 \n", ".. ... 
\n", "34 Salinispora arenicola CNS-205 \n", "35 Chloroflexus aurantiacus J-10-fl \n", "36 Thermoanaerobacter pseudethanolicus ATCC 33223 \n", "37 NaN \n", "38 Leptothrix cholodnii SP-6 \n", "39 Akkermansia muciniphila ATCC BAA-835 \n", "40 Porphyromonas gingivalis ATCC 33277 \n", "41 NaN \n", "42 Chlorobium limicola DSM 245 \n", "43 Pelodictyon phaeoclathratiforme BU-1 \n", "44 NaN \n", "45 NaN \n", "46 NaN \n", "47 Dictyoglomus turgidum DSM 6724 \n", "48 Shewanella baltica OS223 \n", "49 Thermotoga neapolitana DSM 4359 \n", "50 Desulfovibrio piger ATCC 29098 \n", "51 Persephonella marina EX-H1 \n", "52 Acidobacterium capsulatum ATCC 51196 \n", "53 Gemmatimonas aurantiaca T-27 \n", "54 Aciduliprofundum boonei T469 \n", "55 Haloferax volcanii DS2 \n", "56 Enterococcus faecalis V583 \n", "57 NaN \n", "58 Deinococcus radiodurans R1 \n", "59 Sulfurihydrogenibium yellowstonense SS-5 \n", "60 Bordetella bronchiseptica D989 \n", "61 NaN \n", "62 Herpetosiphon aurantiacus DSM 785 \n", "63 Nanoarchaeum equitans Kin4-M \n", "\n", "[64 rows x 10 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "df = pandas.read_csv('podar-lineage.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kexamining spreadsheet headers...\n", "\u001b[K** assuming column 'accession' is identifiers in spreadsheet\n", "\u001b[K64 distinct identities in spreadsheet out of 64 rows.\n", "\u001b[K64 distinct lineages in spreadsheet out of 64 rows.\n", "\u001b[K64 assigned lineages out of 64 distinct lineages in spreadsheet. 
64)\n", "\u001b[K64 identifiers used out of 64 distinct identifiers in spreadsheet.\n", "\u001b[Ksaving to LCA DB: taxdb.lca.json\n" ] } ], "source": [ "!sourmash lca index podar-lineage.csv taxdb big_genomes/*.sig -C 3 --split-identifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This database 'taxdb.lca.json' can be used for search and gather as above:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kselect query k=31 automatically.\n", "\u001b[Kloaded query: fake-metagenome.fa... (k=31, DNA)\n", "\u001b[Kloaded 1 databases. \n", "\n", "\n", "overlap p_query p_match\n", "--------- ------- -------\n", "0.6 Mbp 46.7% 11.6% NC_011663.1 Shewanella baltica OS223,...\n", "0.5 Mbp 38.7% 19.3% CP001071.1 Akkermansia muciniphila AT...\n", "0.5 Mbp 14.6% 3.9% NC_009665.1 Shewanella baltica OS185,...\n", "\n", "found 3 matches total;\n", "the recovered matches hit 100.0% of the query\n", "\n" ] } ], "source": [ "!sourmash gather fake-metagenome.fa.sig taxdb.lca.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...but can also be used for taxonomic summarization:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kloaded 1 LCA databases. 
ksize=31, scaled=10000\n", "\u001b[Kfinding query signatures...\n", "\u001b[Kloaded 1 signatures from 1 files total.of 1)\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales\n", "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae\n", "38.7% 53 Bacteria;Verrucomicrobia\n", "100.0% 137 Bacteria\n", "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica\n", "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella\n", "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae\n", "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales\n", "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria\n", "61.3% 84 Bacteria;Proteobacteria\n", "22.6% 31 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223\n", "14.6% 20 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185\n" ] } ], "source": [ "!sourmash lca summarize --query fake-metagenome.fa.sig --db taxdb.lca.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other pointers\n", "\n", "[Sourmash: a practical guide](https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html)\n", "\n", "[Classifying signatures taxonomically](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html)\n", "\n", 
"[Pre-built search databases](https://sourmash.readthedocs.io/en/latest/databases.html)\n", "\n", "## A full list of notebooks\n", "\n", "[An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)\n", "\n", "[Some sourmash command line examples!](sourmash-examples.ipynb)\n", "\n", "[Working with private collections of signatures.](sourmash-collections.ipynb)\n", "\n", "[Using the LCA_Database API.](using-LCA-database-API.ipynb)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }