{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "   * [Introduction](#introduction)\n",
    "   * [Metagenomic Pathogen Detection](#metagenomic-pathogen-detection)\n",
    "   * [Writing large scale sequence analysis workflows](#writing-large-scale-sequence-analysis-workflows)\n",
    "<!--- TOC END -->\n",
    "\n",
    "# Introduction\n",
    "\n",
    "![MMseqs2, Linclust and Plass](images/ThreeTools.png)\n",
    "\n",
    "Your new friends! **MMseqs2** ultra fast and sensitive protein search,\n",
    "**Linclust** clustering huge protein sequence sets in linear time and\n",
    "**Plass** Protein-level assembly to increases protein sequence recovery\n",
    "from complex metagenomes.\n",
    "\n",
    "Here you will learn the basic usage of MMseqs2 as well as expert tricks\n",
    "to take advantage of the ability of chaining different MMseqs2 modules\n",
    "to produce custom workflows.\n",
    "\n",
    "## Required software for the tutorials\n",
    "\n",
    "We will use [Conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html)\n",
    "to setup all required software for this tutorial. If you haven't setup\n",
    "Conda yet, please do so first and then execute:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "    conda create -n tutorial mmseqs2 plass megahit prodigal hmmer sra-tools\n",
    "    conda activate tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generic syntax for `mmseqs` and `plass` calls is always the\n",
    "following:\n",
    "\n",
    "    mmseqs <command> <db1> [<db2> ...] --flag --parameter 42\n",
    "\n",
    "The help text of `mmseqs` shows, by default, only the most important\n",
    "parameters and modules. To see a full list of parameters and modules use\n",
    "the `-h` flag.\n",
    "\n",
    "If you are using Bash as your shell, you can activate tab-auto-completion of commands and parameters:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "    source $CONDA_PREFIX/util/bash-completion.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Metagenomic Pathogen Detection\n",
    "\n",
    "## The Patient\n",
    "\n",
    "A 61-year-old man was admitted in December 2016 with bilateral headache,\n",
    "gait instability, lethargy, and confusion. Because of multiple tick\n",
    "bites in the preceding 2 weeks, he was prescribed the antibiotic\n",
    "doxycycline for presumed Lyme disease. Over the next 48 hours, he\n",
    "developed worsening confusion, weakness, and ataxia. He returned to the\n",
    "referring hospital and was admitted. He lived in a heavily wooded area\n",
    "in New Hampshire, had frequent tick exposures, and worked as a\n",
    "construction contractor in basements with uncertain rodent and bat\n",
    "exposures. His symptoms were diagnosed as Encephalitis and the\n",
    "causative agent --- not known.\n",
    "\n",
    "* **Your task will be to identify the pathogenic root cause of the\n",
    "disease.**\n",
    "\n",
    "This pathogen is usually confirmed by a screening antibody test,\n",
    "followed by a plaque reduction neutralization test. However, this takes\n",
    "5 weeks, which was too slow to affect the patient's care. As traditional\n",
    "tests done in the first week of the patient's hospital stay did not\n",
    "reveal any conclusive disease cause, the doctors were running out of\n",
    "options. Therefore a novel metagenomic analysis was performed.\n",
    "\n",
    "## The Dataset\n",
    "\n",
    "Metagenomic sequencing from cerebrospinal fluid was performed on\n",
    "hospital day 8. It returned 14 million short nucleotide sequences\n",
    "(reads).\n",
    "\n",
    "The authors of the study removed all human reads using Kraken (Wood and\n",
    "Salzberg 2014) and released a much smaller set of 226,908 reads on the\n",
    "SRA (<https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi>). Kraken\n",
    "extracts short nucleotide subsequences of length k, also called\n",
    "k-mers, and compares them to a reference database where k-mers point\n",
    "to taxonomic labels. In case of exact matching it is able to assign\n",
    "taxonomy.\n",
    "\n",
    "* **Why didn't the authors release the complete dataset?**\n",
    "* **Demanding exact k-mer matching in Kraken has benefits for removing\n",
    "human reads. Why?**\n",
    "* **What is the SRA? How many samples are there in the SRA currently? How\n",
    "many bases are publicly available on the SRA in total?**\n",
    "\n",
    "## Metagenomic pathogen detection using MMseqs2\n",
    "\n",
    "We will use the sequence search tool MMseqs2 (Steinegger and Soeding\n",
    "2017) to find the cause of this patient's disease. MMseqs2 translates\n",
    "the nucleotide reads to putative protein fragments, searches against a\n",
    "protein reference database and assigns taxonomic labels based on the\n",
    "found reference database hits.\n",
    "\n",
    "* **Why might a protein-protein search be useful for finding bacterial or\n",
    "viral pathogens? How does this compare with Kraken's approach?**\n",
    "\n",
    "### Assigning taxonomic labels\n",
    "\n",
    "To not spoil the mystery to early, we prepared a FASTA file containing the reads.\n",
    "Download this file first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2019-12-02 21:24:11--  http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta\n",
      "Resolving wwwuser.gwdg.de (wwwuser.gwdg.de)... 134.76.10.111\n",
      "Connecting to wwwuser.gwdg.de (wwwuser.gwdg.de)|134.76.10.111|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 49930921 (48M)\n",
      "Saving to: ‘mystery_reads.fasta’\n",
      "\n",
      "mystery_reads.fasta 100%[===================>]  47.62M  1.15MB/s    in 16s     \n",
      "\n",
      "2019-12-02 21:24:28 (2.93 MB/s) - ‘mystery_reads.fasta’ saved [49930921/49930921]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sequencing machine returns paired-end reads where sequencing\n",
    "starts in opposite directions from two close-by points to cover the\n",
    "same genomic region. Some of these paired reads overlap enough to be\n",
    "merged into a single read with FLASH (Magoc and Salzberg 2011).\n",
    "\n",
    "We will also need a reference database of proteins. For this we will use the\n",
    "Swiss-Prot which is the manually curated, high-quality part of the\n",
    "Uniprot (Consortium 2014) protein reference database. We are using this\n",
    "smaller subset of about 500,000 proteins, since the full Uniprot with\n",
    "over 175,000,000 sequences requires too many computational resources.\n",
    "Each protein in Uniprot has a taxonomic label. Let us prepare this database:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03.fasta.gz\n",
    "!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03_mapping.tsv.gz\n",
    "!gunzip uniprot_sprot_2018_03_mapping.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a taxonomically annotated sequence database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33muniprot_sprot exists and will be overwritten.\n",
      "\u001b[39mcreatedb uniprot_sprot_2018_03.fasta.gz uniprot_sprot \n",
      "\n",
      "MMseqs Version:              \t10.6d92c\n",
      "Max sequence length          \t65535\n",
      "Split seq. by length         \ttrue\n",
      "Database type                \t0\n",
      "Do not shuffle input database\ttrue\n",
      "Offset of numeric ids        \t0\n",
      "Compressed                   \t0\n",
      "Verbosity                    \t3\n",
      "\n",
      "Converting sequences\n",
      "[556915] 3s 18mss\n",
      "Time for merging into uniprot_sprot_h by mergeResults: 0h 0m 0s 470ms\n",
      "Time for merging into uniprot_sprot by mergeResults: 0h 0m 0s 583ms\n",
      "Time for merging into uniprot_sprot.lookup by mergeResults: 0h 0m 0s 123ms\n",
      "Time for processing: 0h 0m 5s 311ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv \n",
      "\n",
      "MMseqs Version:         \t10.6d92c\n",
      "NCBI tax dump directory \t\n",
      "Taxonomical mapping file\tuniprot_sprot_2018_03_mapping.tsv\n",
      "Threads                 \t32\n",
      "Verbosity               \t3\n",
      "\n",
      "Download taxdump.tar.gz\n",
      "2019-12-02 21:32:36 URL: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [50529876] -> \"-\" [1]\n",
      "Database created\n"
     ]
    }
   ],
   "source": [
    "!mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Through a similarity search we will transfer the annotation of the\n",
    "reference protein to our unknown reads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mreads exists and will be overwritten.\n",
      "\u001b[39mcreatedb mystery_reads.fasta reads \n",
      "\n",
      "MMseqs Version:              \t10.6d92c\n",
      "Max sequence length          \t65535\n",
      "Split seq. by length         \ttrue\n",
      "Database type                \t0\n",
      "Do not shuffle input database\ttrue\n",
      "Offset of numeric ids        \t0\n",
      "Compressed                   \t0\n",
      "Verbosity                    \t3\n",
      "\n",
      "\u001b[33mAssuming DNA database, forcing parameter --dont-split-seq-by-len true\n",
      "\u001b[39mConverting sequences\n",
      "[410667] 0s 692ms\n",
      "Time for merging into reads_h by mergeResults: 0h 0m 0s 275ms\n",
      "Time for merging into reads by mergeResults: 0h 0m 0s 297ms\n",
      "Time for merging into reads.lookup by mergeResults: 0h 0m 0s 94ms\n",
      "Time for processing: 0h 0m 2s 95ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs createdb mystery_reads.fasta reads"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "taxonomy reads uniprot_sprot lca_result tmp -s 2 \n",
      "\n",
      "MMseqs Version:                       \t10.6d92c\n",
      "Substitution matrix                   \tnucl:nucleotide.out,aa:blosum62.out\n",
      "Add backtrace                         \tfalse\n",
      "Alignment mode                        \t2\n",
      "E-value threshold                     \t1\n",
      "Seq. id. threshold                    \t0\n",
      "Min. alignment length                 \t0\n",
      "Seq. id. mode                         \t0\n",
      "Alternative alignments                \t0\n",
      "Coverage threshold                    \t0\n",
      "Coverage mode                         \t0\n",
      "Max sequence length                   \t65535\n",
      "Compositional bias                    \t1\n",
      "Realign hits                          \tfalse\n",
      "Max reject                            \t2147483647\n",
      "Max accept                            \t2147483647\n",
      "Include identical seq. id.            \tfalse\n",
      "Preload mode                          \t0\n",
      "Pseudo count a                        \t1\n",
      "Pseudo count b                        \t1.5\n",
      "Score bias                            \t0\n",
      "Gap open cost                         \t11\n",
      "Gap extension cost                    \t1\n",
      "Threads                               \t32\n",
      "Compressed                            \t0\n",
      "Verbosity                             \t3\n",
      "Seed substitution matrix              \tnucl:nucleotide.out,aa:VTML80.out\n",
      "Sensitivity                           \t2\n",
      "K-mer size                            \t0\n",
      "K-score                               \t2147483647\n",
      "Alphabet size                         \t21\n",
      "Max results per query                 \t300\n",
      "Split database                        \t0\n",
      "Split mode                            \t2\n",
      "Split memory limit                    \t0\n",
      "Diagonal scoring                      \ttrue\n",
      "Exact k-mer matching                  \t0\n",
      "Mask residues                         \t1\n",
      "Mask lower case residues              \t0\n",
      "Minimum diagonal score                \t15\n",
      "Spaced k-mers                         \t1\n",
      "Spaced k-mer pattern                  \t\n",
      "Local temporary path                  \t\n",
      "Rescore mode                          \t0\n",
      "Remove hits by seq. id. and coverage  \tfalse\n",
      "Sort results                          \t0\n",
      "Mask profile                          \t1\n",
      "Profile e-value threshold             \t0.001\n",
      "Use global sequence weighting         \tfalse\n",
      "Filter MSA                            \t1\n",
      "Maximum seq. id. threshold            \t0.9\n",
      "Minimum seq. id.                      \t0\n",
      "Minimum score per column              \t-20\n",
      "Minimum coverage                      \t0\n",
      "Select N most diverse seqs            \t1000\n",
      "Omit consensus                        \tfalse\n",
      "Min codons in orf                     \t30\n",
      "Max codons in length                  \t32734\n",
      "Max orf gaps                          \t2147483647\n",
      "Contig start mode                     \t2\n",
      "Contig end mode                       \t2\n",
      "Orf start mode                        \t1\n",
      "Forward frames                        \t1,2,3\n",
      "Reverse frames                        \t1,2,3\n",
      "Translation table                     \t1\n",
      "Translate orf                         \t0\n",
      "Use all table starts                  \tfalse\n",
      "Offset of numeric ids                 \t0\n",
      "Add orf stop                          \tfalse\n",
      "Chain overlapping alignments          \t0\n",
      "Merge query                           \t1\n",
      "Search type                           \t0\n",
      "Number search iterations              \t1\n",
      "Start sensitivity                     \t4\n",
      "Search steps                          \t1\n",
      "Run a seq-profile search in slice mode\tfalse\n",
      "Strand selection                      \t1\n",
      "Disk space limit                      \t0\n",
      "MPI runner                            \t\n",
      "Force restart with latest tmp         \tfalse\n",
      "Remove temporary files                \tfalse\n",
      "LCA ranks                             \t\n",
      "Blacklisted taxa                      \t12908,28384\n",
      "Show taxon lineage                    \tfalse\n",
      "LCA mode                              \t4\n",
      "Taxonomy output mode                  \t0\n",
      "Match sequences by their id.          \tfalse\n",
      "\n",
      "lca_result.dbtype exists already!\n"
     ]
    }
   ],
   "source": [
    "!mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "MMseqs2 will create a result database in your current working directory.\n",
    "This database consists of files, whose names start with `lca_result`. We\n",
    "can convert this database into a human readable tab separated values\n",
    "file (TSV), a common format in bioinformatics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "createtsv reads lca_result lca.tsv \n",
      "\n",
      "MMseqs Version:                 \t10.6d92c\n",
      "First sequence as representative\tfalse\n",
      "Target column                   \t1\n",
      "Add full header                 \tfalse\n",
      "Sequence source                 \t0\n",
      "Database output                 \tfalse\n",
      "Threads                         \t32\n",
      "Compressed                      \t0\n",
      "Verbosity                       \t3\n",
      "\n",
      "Time for merging into lca.tsv by mergeResults: 0h 0m 0s 216ms\n",
      "Time for processing: 0h 0m 0s 527ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs createtsv reads lca_result lca.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this file you see for every read a numeric taxonomic identifier, a\n",
    "taxonomic rank and a taxonomic label. However, due to the large number\n",
    "of reads, it is hard to gain insight by skimming the file. MMseqs2\n",
    "offers a module to summarize the data into a single file `report.txt`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "taxonomyreport uniprot_sprot lca_result report.txt \n",
      "\n",
      "MMseqs Version:\t10.6d92c\n",
      "Threads   \t32\n",
      "Compressed\t0\n",
      "Verbosity \t3\n",
      "\n",
      "Loading NCBI taxonomy\n",
      "Loading nodes file ... Done, got 2192766 nodes\n",
      "Loading merged file ... Done, added 55744 merged nodes.\n",
      "Loading names file ... Done\n",
      "Making matrix ... Done\n",
      "Init RMQ ...Done\n",
      "Reading LCA results\n",
      "[=================================================================] 100.00% 410.70K 0s 51ms     \n",
      "\n",
      "Found 1297 different taxa for 410696 different reads.\n",
      "385417 reads are unclassified.\n",
      "Calculating clade counts ...  Done\n",
      "Time for processing: 0h 0m 7s 777ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs taxonomyreport uniprot_sprot lca_result report.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **What is the most common species in this dataset?**\n",
    "* **Why are there so many different eukaryotic sequences? Were they really\n",
    "in the spinal fluid sample?**\n",
    "\n",
    "### Visualizing taxonomic results\n",
    "\n",
    "MMseqs2 can also generate an interactive visualization of the data using\n",
    "Krona (Ondov, Bergman, and Phillippy 2011). Adapt the previous call to\n",
    "generate a Krona report:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Usage: mmseqs taxonomyreport <i:targetDB> <i:taxDB> <o:taxonomyReport> [options]\n",
      "\n",
      "Create Kraken-style taxonomy report.\n",
      " By Florian Breitwieser <florian.bw@gmail.com>\n",
      "\n",
      "Options: \n",
      " Common:         \n",
      "   --threads INT      number of cores used for the computation (uses all cores by default) [32]\n",
      "   --compressed INT   write results in compressed format [0]\n",
      "   -v INT             verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]\n",
      "Unrecognized parameter --report-mode\n",
      "\u001b[33mDid you mean \"--threads\"?\n",
      "\u001b[39m"
     ]
    }
   ],
   "source": [
    "!mmseqs taxonomyreport uniprot_sprot lca_result --report-mode 1 report.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This generates a `HTML` file that can be opened in a browser. This\n",
    "offers an interactive circular visualization where you can click on each\n",
    "label to zoom into different parts of the hierarchy.\n",
    "\n",
    "[Click to open report](report.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is the pathogen?\n",
    "\n",
    "Look up the following encephalitis causing agents in Wikipedia.\n",
    "\n",
    "1.  Borrelia bacterium\n",
    "\n",
    "2.  Herpes simplex virus\n",
    "\n",
    "3.  Powassan virus\n",
    "\n",
    "4.  West Nile virus\n",
    "\n",
    "5.  Mycoplasma\n",
    "\n",
    "6.  Angiostrongylus cantonensis\n",
    "\n",
    "* **Based on the literature, which one is the most likely pathogen?**\n",
    "* **For which species do you find evidence in the metagenomic reads?**\n",
    "* **Approximately how many reads belong to the pathogen? Based on this\n",
    "number, how would you determine if it is significant evidence for an\n",
    "actual presence of this agent?**\n",
    "\n",
    "## Investigating the pathogen\n",
    "\n",
    "We now want to take a closer look only at the reads of the pathogen. To\n",
    "filter the result database, we will need the pathogen's numeric\n",
    "taxonomic identifier. Use the [NCBI Taxonomy Browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) to find it, by searching for its name.\n",
    "\n",
    "* **What is the taxon identifier of the pathogen? Did you find one or\n",
    "more?**\n",
    "\n",
    "Now we can call a different MMseqs2 module to retrieve only the reads\n",
    "that belong to this pathogen. Replace **XXX** with the number(s) you\n",
    "just found. If you found multiple, concatenate them with a comma\n",
    "character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083 \n",
      "\n",
      "MMseqs Version:\t10.6d92c\n",
      "Compressed     \t0\n",
      "Selected taxons\t11083\n",
      "\n",
      "Loading NCBI taxonomy\n",
      "Loading nodes file ... Done, got 2192766 nodes\n",
      "Loading merged file ... Done, added 55744 merged nodes.\n",
      "Loading names file ... Done\n",
      "Making matrix ... Done\n",
      "Init RMQ ...Done\n",
      "Computing LCA\n",
      "[=================================================================] 100.00% 410.70K 0s 128ms    ===========================================>             ] 78.56% 322.66K eta 0s       \n",
      "\n",
      "Time for merging into lca_only_pathogen by mergeResults: 0h 0m 0s 206ms\n",
      "Time for processing: 0h 0m 7s 863ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now get a list of all queries that were **filtered out**, meaning\n",
    "they were annotated as pathogenic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "!grep -Pv '\\t1$' lca_only_pathogen.index > pathogenic_read_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a few more commands we can convert our taxonomic labels back into a\n",
    "FASTA file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mreads_pathogen exists and will be overwritten.\n",
      "\u001b[39mcreatesubdb pathogenic_read_ids reads reads_pathogen \n",
      "\n",
      "MMseqs Version:\t10.6d92c\n",
      "Subdb mode\t0\n",
      "Verbosity \t3\n",
      "\n",
      "Time for merging into reads_pathogen by mergeResults: 0h 0m 0s 0ms\n",
      "Time for processing: 0h 0m 0s 36ms\n"
     ]
    }
   ],
   "source": [
    "!mmseqs createsubdb pathogenic_read_ids reads reads_pathogen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mDatabase reads_pathogen need header information.\n",
      "The reads_pathogen_h is missing.\n",
      "\u001b[39m"
     ]
    }
   ],
   "source": [
    "!mmseqs convert2fasta reads_pathogen reads_pathogen.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **How many reads of the pathogen are in this resulting FASTA file?**\n",
    "\n",
    "### Assembling reads to proteins\n",
    "\n",
    "We want to try to recover the protein sequences of the pathogen.\n",
    "\n",
    "* **Which proteins do you expect to find in the pathogen you discovered?\n",
    "Search the internet.**\n",
    "\n",
    "We will use the protein assembly method Plass (Steinegger, Mirdita, and\n",
    "Söding 2019) to find overlapping reads and generate whole proteins out\n",
    "of the best matching ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Program call:\n",
      "assemble reads_pathogen.fasta pathogen_assembly.fasta tmp \n",
      "\n",
      "MMseqs Version:                                                           \t2.c7e35\n",
      "Sub Matrix                                                                \tblosum62.out\n",
      "Rescore mode                                                              \t0\n",
      "Remove hits by seq.id. and coverage                                       \tfalse\n",
      "E-value threshold                                                         \t1e-05\n",
      "Coverage threshold                                                        \t0\n",
      "Coverage Mode                                                             \t0\n",
      "Seq. Id Threshold                                                         \t0.9\n",
      "Seq. Id. Mode                                                             \t0\n",
      "Include identical Seq. Id.                                                \tfalse\n",
      "Sort results                                                              \t0\n",
      "In substitution scoring mode, performs global alignment along the diagonal\tfalse\n",
      "No preload                                                                \tfalse\n",
      "Threads                                                                   \t32\n",
      "Verbosity                                                                 \t3\n",
      "Alphabet size                                                             \t13\n",
      "Kmer per sequence                                                         \t60\n",
      "Mask Residues                                                             \t0\n",
      "K-mer size                                                                \t14\n",
      "Max. sequence length                                                      \t65535\n",
      "Shift hash                                                                \t5\n",
      "Split Memory Limit                                                        \t0\n",
      "Include only extendable                                                   \ttrue\n",
      "Skip sequence with n repeating k-mers                                     \t8\n",
      "Min codons in orf                                                         \t45\n",
      "Max codons in length                                                      \t2147483647\n",
      "Max orf gaps                                                              \t2147483647\n",
      "Contig start mode                                                         \t2\n",
      "Contig end mode                                                           \t2\n",
      "Orf start mode                                                            \t0\n",
      "Forward Frames                                                            \t1,2,3\n",
      "Reverse Frames                                                            \t1,2,3\n",
      "Translation Table                                                         \t1\n",
      "Use all table starts                                                      \tfalse\n",
      "Offset of numeric ids                                                     \t0\n",
      "Protein Filter Threshold                                                  \t0.2\n",
      "Filter Proteins                                                           \t1\n",
      "Number search iterations                                                  \t12\n",
      "Remove Temporary Files                                                    \tfalse\n",
      "Sets the MPI runner                                                       \t\n",
      "\n",
      "Program call:\n",
      "createdb reads_pathogen.fasta /home/chick/work/MMseqs2/tutorial/tmp/8597800890735853730/nucl_reads --max-seq-len 65535 --dont-split-seq-by-len 0 --dont-shuffle 1 --id-offset 0 -v 3 \n",
      "\n",
      "MMseqs Version:              \t2.c7e35\n",
      "Max. sequence length         \t65535\n",
      "Split Seq. by len            \tfalse\n",
      "Do not shuffle input database\ttrue\n",
      "Offset of numeric ids        \t0\n",
      "Verbosity                    \t3\n",
      "\n",
      "File reads_pathogen.fasta does not exist.\n",
      "Error: createdb failed\n"
     ]
    }
   ],
   "source": [
    "!plass assemble reads_pathogen.fasta pathogen_assembly.fasta tmp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the generated FASTA file `pathogen_assembly.fasta`.\n",
    "\n",
    "* **How many sequences were assembled?**\n",
    "* **Do some of the sequences look similar to each other?**\n",
    "\n",
    "### Clustering to find representative proteins\n",
    "\n",
    "Plass will uncover a lot of variation in the reads and output many\n",
    "similar proteins. We can use the sequence clustering module in MMseqs2\n",
    "to get only representative sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mInput pathogen_assembly.fasta does not exist.\n",
      "\u001b[39m"
     ]
    }
   ],
   "source": [
    "!mmseqs easy-cluster pathogen_assembly.fasta assembly_clustered tmp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will see three files starting with `assembly_clustered`:\n",
    "\n",
    "1.  `assembly_clustered_all_seqs.fasta`\n",
    "\n",
    "2.  `assembly_clustered_cluster.tsv`\n",
    "\n",
    "3.  `assembly_clustered_rep_seq.fasta`\n",
    "\n",
    "Take a look at the last one `assembly_clustered_rep_seq.fasta`. This\n",
    "file contains all representative sequences, meaning the sequence that\n",
    "the algorithm chose as the most representative within this cluster.\n",
    "\n",
    "* **How many sequences remain now?**\n",
    "* **How well does this agree with what you expected according to your internet search?**\n",
    "\n",
    "### Annotating the proteins\n",
    "\n",
    "We will look for known protein domains to identify the proteins we\n",
    "found. Instead of the MMseqs2 command line, we use the MMseqs2\n",
    "webserver, which will visualize the results. Paste the content of the\n",
    "FASTA file containing the representative sequences into the webserver\n",
    "and make sure to select all three target databases (PFAM, PDB,\n",
    "Uniclust): <https://search.mmseqs.com>\n",
    "\n",
    "* **Which of the expected proteins do you find?**\n",
    "\n",
    "## Aftermath\n",
    "\n",
    "Despite being able to identify the causative agent. The pathogen is very\n",
    "hard to treat. The patient had minimal neurological recovery and was\n",
    "discharged to an acute care facility on hospital day 30. Seven months\n",
    "after discharge, he was reportedly able to nod his head to questions and\n",
    "slightly move his upper extremities and toes.\n",
    "\n",
    "You can find the publication about this patient and dataset here\n",
    "(Piantadosi et al. 2017). Please look at it only after trying to answer the questions yourself.\n",
    "\n",
    "## References\n",
    "\n",
    "Consortium, UniProt. 2014. \"UniProt: A Hub for Protein Information.\"\n",
    "*Nucleic Acids Research* 43 (D1): D204--D212.\n",
    "\n",
    "Magoc, Tanja, and Steven L. Salzberg. 2011. \"FLASH: Fast Length\n",
    "Adjustment of Short Reads to Improve Genome Assemblies.\"\n",
    "*Bioinformatics* 27 (21): 2957--63.\n",
    "\n",
    "Ondov, Brian D, Nicholas H Bergman, and Adam M Phillippy. 2011.\n",
    "\"Interactive metagenomic visualization in a Web browser.\" *BMC\n",
    "Bioinformatics* 12 (1): 385.\n",
    "\n",
    "Piantadosi, Anne, Sanjat Kanjilal, Vijay Ganesh, Arjun Khanna, Emily P\n",
    "Hyle, Jonathan Rosand, Tyler Bold, et al. 2017. \"Rapid Detection of\n",
    "Powassan Virus in a Patient With Encephalitis by Metagenomic\n",
    "Sequencing.\" *Clinical Infectious Diseases* 66 (5): 789--92.\n",
    "\n",
    "Steinegger, Martin, Milot Mirdita, and Johannes Söding. 2019.\n",
    "\"Protein-level assembly increases protein sequence recovery from\n",
    "metagenomic samples manyfold.\" *Nature Methods* 16 (7): 603--6.\n",
    "\n",
    "Steinegger, Martin, and Johannes Soeding. 2017. \"MMseqs2: Sensitive\n",
    "Protein Sequence Searching for the Analysis of Massive Data Sets.\"\n",
    "*bioRxiv*, 079681.\n",
    "\n",
    "Wood, Derrick E, and Steven L Salzberg. 2014. \"Kraken: ultrafast\n",
    "metagenomic sequence classification using exact alignments.\" *Genome\n",
    "Biology* 15 (3): R46."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}