{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the `LCA_Database` API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`LCA_Database` objects combine a fast in-memory storage of signatures\n", "indexed by their hash values, with taxonomic lineage storage. They are\n", "limited to storing scaled DNA signatures with a single ksize; the scaled\n", "and ksize values are specified at creation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running this notebook.\n", "\n", "You can run this notebook interactively via mybinder; click on this button:\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dib-lab/sourmash/latest?filepath=doc%2Fusing-LCA-database-API.ipynb)\n", "\n", "A rendered version of this notebook is available at [sourmash.readthedocs.io](https://sourmash.readthedocs.io) under \"Tutorials and notebooks\".\n", "\n", "You can also get this notebook from the [doc/ subdirectory of the sourmash github repository](https://github.com/dib-lab/sourmash/tree/latest/doc). See [binder/environment.yaml](https://github.com/dib-lab/sourmash/blob/latest/binder/environment.yml) for installation dependencies.\n", "\n", "### What is this?\n", "\n", "This is a Jupyter Notebook using Python 3. If you are running this via [binder](https://mybinder.org), you can use Shift-ENTER to run cells, and double click on code cells to edit them.\n", "\n", "Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please [file issues on GitHub](https://github.com/dib-lab/sourmash/issues/) if you have any questions or comments!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an `LCA_Database` object\n", "\n", "Create an `LCA_Database` like so:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sourmash\n", "db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create signatures for some genomes, load them, and add them:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[K\r\n", "== This is sourmash version 3.2.4.dev5+g6484e78f. ==\r\n", "\r", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", "\r\n", "\r", "\u001b[Ksetting num_hashes to 0 because --scaled is set\r\n", "\r", "\u001b[Kcomputing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa\r\n", "\r", "\u001b[KComputing signature for ksizes: [31]\r\n", "\r", "\u001b[KComputing only nucleotide (and not protein) signatures.\r\n", "\r", "\u001b[KComputing a total of 1 signature(s).\r\n", "\r", "\u001b[Kskipping genomes/akkermansia.fa - already done\r\n", "\r", "\u001b[Kskipping genomes/shew_os185.fa - already done\r\n", "\r", "\u001b[Kskipping genomes/shew_os223.fa - already done\r\n" ] } ], "source": [ "!sourmash compute --name-from-first -k 31 --scaled=1000 genomes/*" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sig1 = sourmash.load_one_signature('akkermansia.fa.sig', ksize=31)\n", "sig2 = sourmash.load_one_signature('shew_os185.fa.sig', ksize=31)\n", "sig3 = sourmash.load_one_signature('shew_os223.fa.sig', ksize=31)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "db.insert(sig1, ident='akkermansia')\n", "db.insert(sig2, ident='shew_os185')\n", "db.insert(sig3, ident='shew_os223')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run `search` and `gather` via the `Index` API" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1.0,\n", " SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7),\n", " None)]\n" ] } ], "source": [ "from pprint import pprint\n", "pprint(db.search(sig1, threshold=0.1))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1.0,\n", " SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef),\n", " None),\n", " (0.22846441947565543,\n", " SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),\n", " None)]\n" ] } ], "source": [ "pprint(db.search(sig2, threshold=0.1))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1.0,\n", " SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),\n", " None)]\n" ] } ], "source": [ "pprint(db.gather(sig3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieve all signatures with `signatures()`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7)\n", "SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef)\n", "SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6)\n" ] } ], "source": [ "for i in db.signatures():\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access identifiers and names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list of (unique) identifiers in the database can be accessed via the attribute `ident_to_idx`, which maps to integer identifiers; identifiers can also retrieve full names, which are taken from `sig.name()` upon insertion." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['akkermansia', 'shew_os185', 'shew_os223'])\n" ] } ], "source": [ "pprint(db.ident_to_idx.keys())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'akkermansia': 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete '\n", " 'genome',\n", " 'shew_os185': 'NC_009665.1 Shewanella baltica OS185, complete genome',\n", " 'shew_os223': 'NC_011663.1 Shewanella baltica OS223, complete genome'}\n" ] } ], "source": [ "pprint(db.ident_to_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access hash values directly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The attribute `hashval_to_idx` contains a mapping from individual hash values to sets of `idx` indices.\n", "\n", "See the method `_find_signatures()` for an example of how this is used in `search` and `gather`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1300 hash values total in this database\n" ] } ], "source": [ "print('{} hash values total in this database'.format(len(db.hashval_to_idx)))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "belonging to signatures with idx {0, 1, 2}\n" ] } ], "source": [ "all_idx = set()\n", "for idx_set in db.hashval_to_idx.values():\n", " all_idx.update(idx_set)\n", "print('belonging to signatures with idx {}'.format(all_idx))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "first_three_hashvals = list(db.hashval_to_idx)[:3]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hashval 17302105753387 belongs to idxs {0}\n", "hashval 95741036335406 belongs to idxs {0}\n", "hashval 165640715598232 belongs to idxs {0}\n" ] } ], "source": [ "for hashval in first_three_hashvals:\n", " print('hashval {} belongs to idxs {}'.format(hashval, db.hashval_to_idx[hashval]))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "490 hashvals belong to query idx 2\n", "query idx 2 matches to ident shew_os223\n", "query idx 2 matches to name NC_011663.1 Shewanella baltica OS223, complete genome\n" ] } ], "source": [ "query_idx = 2\n", "hashval_set = set()\n", "for hashval, idx_set in db.hashval_to_idx.items():\n", " if query_idx in idx_set:\n", " hashval_set.add(hashval)\n", " \n", "print('{} hashvals belong to query idx {}'.format(len(hashval_set), query_idx))\n", "\n", "ident = db.idx_to_ident[query_idx]\n", "print('query idx {} matches to ident {}'.format(query_idx, ident))\n", "\n", "name = db.ident_to_name[ident]\n", "print('query idx {} matches to name {}'.format(query_idx, name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lineage storage and retrieval" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from sourmash.lca import LineagePair" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "superkingdom = LineagePair('superkingdom', 'Bacteria')\n", "phylum = LineagePair('phylum', 'Verrucomicrobia')\n", "klass = LineagePair('class', 'Verrucomicrobiae')\n", "\n", "lineage = (superkingdom, phylum, klass)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)\n", "db.insert(sig1, lineage=lineage)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ident 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome' has idx 0\n", "lid for idx 0 is 0\n", "lineage for lid 0 is Bacteria;Verrucomicrobia;Verrucomicrobiae\n" ] } ], "source": [ "# by default, the identifier is the signature name --\n", "ident = sig1.name()\n", "idx = db.ident_to_idx[ident]\n", "print(\"ident '{}' has idx {}\".format(ident, idx))\n", "\n", "lid = db.idx_to_lid[idx]\n", "print(\"lid for idx {} is {}\".format(idx, lid))\n", "\n", "lineage = db.lid_to_lineage[lid]\n", "display = sourmash.lca.display_lineage(lineage)\n", "print(\"lineage for lid {} is {}\".format(lid, display))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lineage manipulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Default taxonomy ranks for lineages\n", "\n", "While sourmash lineage functions can work with any taxonomy ranks and any taxonomy names, both the NCBI and GTDB taxonomies use superkingdom/phylum/etc, so there is a hard coded list availalbe via `sourmash.lca.taxlist()`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'strain']\n" ] } ], "source": [ "print(list(sourmash.lca.taxlist()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a taxonomy as a list, you can then construct a lineage like so:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[LineagePair(rank='superkingdom', name='Bacteria'),\n", " LineagePair(rank='phylum', name='Verrucomicrobia'),\n", " LineagePair(rank='class', name='Verrucomicrobiae'),\n", " LineagePair(rank='order', name='Verrucomicrobiales'),\n", " LineagePair(rank='family', name='Akkermansiaceae'),\n", " LineagePair(rank='genus', name='Akkermansia'),\n", " LineagePair(rank='species', name='Akkermansia muciniphila'),\n", " LineagePair(rank='strain', name='Akkermansia muciniphila ATCC BAA-835')]\n" ] } ], "source": [ "linstr1 = [\"Bacteria\", \"Verrucomicrobia\", \"Verrucomicrobiae\",\n", " \"Verrucomicrobiales\", \"Akkermansiaceae\", \"Akkermansia\",\n", " \"Akkermansia muciniphila\", \"Akkermansia muciniphila ATCC BAA-835\"]\n", "\n", "lineage1 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr1) ]\n", "pprint(lineage1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display lineages as strings with 'sourmash.lca.display_lineage()'\n", "sourmash.lca.display_lineage(lineage1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## sourmash lowest-common-ancestor functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The LCA functionality available in sourmash is built around some simple lineage manipulation functions -- `build_tree` and `find_lca`.\n", "\n", "First, let's define some more lineages --" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lineage2 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185\n", "lineage3 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223\n" ] } ], "source": [ "linstr2 = [\"Bacteria\", \"Proteobacteria\", \"Gammaproteobacteria\",\n", " \"Alteromonadales\", \"Shewanellaceae\", \"Shewanella\",\n", " \"Shewanella baltica\", \"Shewanella baltica OS185\"]\n", "lineage2 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr2) ]\n", "\n", "linstr3 = [\"Bacteria\", \"Proteobacteria\", \"Gammaproteobacteria\",\n", " \"Alteromonadales\", \"Shewanellaceae\", \"Shewanella\",\n", " \"Shewanella baltica\", \"Shewanella baltica OS223\"]\n", "lineage3 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr3) ]\n", "\n", "print('lineage2 is', sourmash.lca.display_lineage(lineage2))\n", "print('lineage3 is', sourmash.lca.display_lineage(lineage3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, build a tree structure that collapses these lineages where it can, and run some LCA analyses. Lineages 1 and 2 collapse to superkingdom:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((LineagePair(rank='superkingdom', name='Bacteria'),), 2)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = sourmash.lca.build_tree([lineage1, lineage2])\n", "sourmash.lca.find_lca(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "while lineages 2 and 3 collapse to species:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((LineagePair(rank='superkingdom', name='Bacteria'),\n", " LineagePair(rank='phylum', name='Proteobacteria'),\n", " LineagePair(rank='class', name='Gammaproteobacteria'),\n", " LineagePair(rank='order', name='Alteromonadales'),\n", " LineagePair(rank='family', name='Shewanellaceae'),\n", " LineagePair(rank='genus', name='Shewanella'),\n", " LineagePair(rank='species', name='Shewanella baltica')),\n", " 2)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = sourmash.lca.build_tree([lineage2, lineage3])\n", "sourmash.lca.find_lca(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convenience functions let you make use of LCA_Database stored lineages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's create a database from 3 signatures, and this time we'll store lineages in there:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)\n", "db.insert(sig1, lineage=lineage1)\n", "db.insert(sig2, lineage=lineage2)\n", "db.insert(sig3, lineage=lineage3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for any collection of hashes, you can retrieve all the lineage assignments into a dictionary:\n", "`{ hashval: set of lineages }`" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "num hashvals: 494\n" ] } ], "source": [ "assignments = sourmash.lca.gather_assignments(sig2.minhash.get_mins(), [db])\n", "print('num hashvals:', len(assignments))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, this particular hashvalue belongs to two different lineages:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{(LineagePair(rank='superkingdom', name='Bacteria'),\n", " LineagePair(rank='phylum', name='Proteobacteria'),\n", " LineagePair(rank='class', name='Gammaproteobacteria'),\n", " LineagePair(rank='order', name='Alteromonadales'),\n", " LineagePair(rank='family', name='Shewanellaceae'),\n", " LineagePair(rank='genus', name='Shewanella'),\n", " LineagePair(rank='species', name='Shewanella baltica'),\n", " LineagePair(rank='strain', name='Shewanella baltica OS185')),\n", " (LineagePair(rank='superkingdom', name='Bacteria'),\n", " LineagePair(rank='phylum', name='Proteobacteria'),\n", " LineagePair(rank='class', name='Gammaproteobacteria'),\n", " LineagePair(rank='order', name='Alteromonadales'),\n", " LineagePair(rank='family', name='Shewanellaceae'),\n", " LineagePair(rank='genus', name='Shewanella'),\n", " LineagePair(rank='species', name='Shewanella baltica'),\n", " LineagePair(rank='strain', name='Shewanella baltica OS223'))}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "assignments[196037984804395]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sourmash also includes functions to summarize assignments by counting the number of\n", "times a particular lineage occurs; this is\n", "used by `sourmash lca classify` and `sourmash lca summarize`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "311 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185\n", "183 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica\n" ] } ], "source": [ "counter = sourmash.lca.count_lca_for_assignments(assignments)\n", "\n", "# count_lca_for_assignments returns a collections.Counter object\n", "for lineage, count in counter.most_common():\n", " print('{} hashes have LCA: {}'.format(count, sourmash.lca.display_lineage(lineage)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other pointers\n", "\n", "[Sourmash: a practical guide](https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html)\n", "\n", "[Classifying signatures taxonomically](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html)\n", "\n", "[Pre-built search databases](https://sourmash.readthedocs.io/en/latest/databases.html)\n", "\n", "## A full list of notebooks\n", "\n", "[An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)\n", "\n", "[Some sourmash command line examples!](sourmash-examples.ipynb)\n", "\n", "[Working with private collections of signatures.](sourmash-collections.ipynb)\n", "\n", "[Using the LCA_Database API.](using-LCA-database-API.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python (myenv)", "language": "python", "name": "myenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }