## Summarizing annotations to a term and descendants

This notebook demonstrates how to summarize annotation counts for a term and its descendants.

One example use: a GO annotator exploring whether to refactor a subtree of GO.

If this were a routine task we would build a command-line or even a web interface.
Keeping it as a notebook gives us some flexibility in the logic, and it is intended largely
as a demonstration anyway.

### boilerplate

 * import the relevant ontobio libraries
 * set up key objects

In [1]:
import pandas as pd

## Create an ontology factory in order to fetch GO
from ontobio.ontol_factory import OntologyFactory
ofactory = OntologyFactory()

## GOLR queries
from ontobio.golr.golr_query import GolrAssociationQuery

## rendering ontologies
from ontobio import GraphRenderer

In [2]:
## Load GO. Note the first time this runs Jupyter will show '*' - be patient
ont = ofactory.create("go")  

### Finding descendants

Here we use the in-memory ontology object; no external service calls are made.

Change the value of `term_id` to any term you like.

In [3]:
term_id = "GO:0009070" ## serine family amino acid biosynthetic process
descendants = ont.descendants(term_id, reflexive=True, relations=['subClassOf', 'BFO:0000050'])

In [4]:
descendants

['GO:0016260',
 'GO:0004124',
 'GO:0019343',
 'GO:0019265',
 'GO:0019345',
 'GO:0006564',
 'GO:0006545',
 'GO:0071269',
 'GO:0006535',
 'GO:0009090',
 'GO:0070179',
 'GO:0019264',
 'GO:0009070',
 'GO:0019344']
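
To make the arguments to `ont.descendants` concrete, here is a minimal sketch of a reflexive descendant lookup restricted to a set of relations. It is not ontobio's implementation: the `EDGES` list, the `X:` identifiers and the standalone `descendants` function are all made up for illustration. Only edges whose relation is in `relations` are followed, and `reflexive=True` adds the query term itself to the result.

```python
from collections import defaultdict

# Toy edges as (child, relation, parent). The X: identifiers are hypothetical.
EDGES = [
    ("X:2", "subClassOf", "X:1"),
    ("X:3", "subClassOf", "X:2"),
    ("X:4", "BFO:0000050", "X:2"),  # part_of
    ("X:5", "RO:0002211", "X:1"),   # regulates; not followed in the call below
]


def descendants(term, relations, reflexive=False):
    """Collect every node that reaches `term` through edges of the given relations."""
    children = defaultdict(list)
    for child, rel, parent in EDGES:
        if rel in relations:
            children[parent].append(child)

    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)

    if reflexive:
        seen.add(term)
    return seen


result = descendants("X:1", {"subClassOf", "BFO:0000050"}, reflexive=True)
print(sorted(result))
```

In this toy graph `X:5` is left out because its only link to `X:1` is a relation we did not ask for, which is the same reason the notebook passes an explicit `relations` list.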

### rendering subtrees

We use the good old-fashioned tree renderer.

(This doesn't scale well for lattice-like subontologies, where terms have several parents.)

In [5]:
renderer = GraphRenderer.create('tree')

In [6]:
print(renderer.render_subgraph(ont, nodes=descendants))

. GO:0009070 ! serine family amino acid biosynthetic process
 % GO:0006545 ! glycine biosynthetic process
  % GO:0019264 ! glycine biosynthetic process from serine
  % GO:0019265 ! glycine biosynthetic process, by transamination of glyoxylate
 % GO:0006564 ! L-serine biosynthetic process
 % GO:0019344 ! cysteine biosynthetic process
  % GO:0006535 ! cysteine biosynthetic process from serine
  % GO:0019343 ! cysteine biosynthetic process via cystathionine
  % GO:0019345 ! cysteine biosynthetic process via S-sulfo-L-cysteine
  < GO:0004124 ! cysteine synthase activity
 % GO:0009090 ! homoserine biosynthetic process
 % GO:0016260 ! selenocysteine biosynthetic process
 % GO:0070179 ! D-serine biosynthetic process
 % GO:0071269 ! L-homocysteine biosynthetic process
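
The scaling caveat can be illustrated with a small sketch. It assumes a renderer that prints a node once under each of its parents, which is how indented tree views typically behave; the `CHILDREN` mapping and `render_tree` function are hypothetical and are not part of ontobio.

```python
# Hypothetical toy hierarchy, parent -> children. "D" has two parents (B and C).
CHILDREN = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}


def render_tree(node, depth=0):
    """Return indented lines for `node` and everything below it."""
    lines = [" " * depth + node]
    for child in CHILDREN.get(node, []):
        lines.extend(render_tree(child, depth + 1))
    return lines


lines = render_tree("A")
print("\n".join(lines))
```

Five distinct nodes produce seven lines here, because `D` and its whole subtree are repeated under both parents. In a subontology with many multi-parent terms this repetition compounds at every level.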




### summarizing annotations

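We write a short function that wraps a call to GOlr and returns a summary dict.

The dict is keyed by facet value: taxon label, evidence label and assigning group, depending on `facet_fields`. It also includes an entry `ALL` holding the total number of matching annotations.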


In [58]:
DEFAULT_FACET_FIELDS = ['taxon_subset_closure_label', 'evidence_label', 'assigned_by']

def summarize(t: str,
              evidence_closure='ECO:0000269',  ## experimental; currently unused, the filter below goes by label
              facet_fields=None) -> dict:
    """
    Summarize annotations to a term: total count plus one count per facet value
    """
    if facet_fields is None:
        facet_fields = DEFAULT_FACET_FIELDS
    ## rows=0: we only want the counts, not the annotations themselves
    q = GolrAssociationQuery(object=t, rows=0, object_category='function',
                             fq={'evidence_closure_label': 'experimental evidence'},
                             facet_fields=facet_fields)
    #params = q.solr_params()
    #print(params)
    result = q.exec()
    fc = result['facet_counts']
    item = {'ALL': result['numFound']}  ## make sure this is the first entry
    for ff in facet_fields:
        if ff in fc:
            item.update(fc[ff])
    return item

In [59]:
print(summarize(term_id))

{'ALL': 144, 'Eukaryota': 92, 'Bacteria': 52, 'Metazoa': 33, 'Fungi': 32, 'Escherichia coli K-12': 27, 'Viridiplantae': 23, 'Mammalia': 22, 'Vertebrata <vertebrates>': 22, 'Arabidopsis thaliana': 21, 'Saccharomyces cerevisiae S288C': 17, 'Mycobacterium tuberculosis H37Rv': 11, 'Schizosaccharomyces pombe': 11, 'Homo sapiens': 8, 'Caenorhabditis elegans': 7, 'Mus musculus': 7, 'Rattus norvegicus': 7, 'Bacillus subtilis subsp. subtilis str. 168': 6, 'Pseudomonas aeruginosa PAO1': 4, 'Aspergillus nidulans FGSC A4': 3, 'Apis mellifera': 2, 'Leishmania major strain Friedlin': 2, 'Bombyx mori': 1, 'Candida albicans SC5314': 1, 'Dictyostelium discoideum': 1, 'Drosophila melanogaster': 1, 'direct assay evidence used in manual assertion': 81, 'mutant phenotype evidence used in manual assertion': 53, 'genetic interaction evidence used in manual assertion': 10, 'EcoCyc': 20, 'TAIR': 19, 'SGD': 17, 'UniProt': 16, 'PomBase': 11, 'MTBBASE': 10, 'EcoliWiki': 7, 'RGD': 7, 'WB': 7, 'MGI': 6, 'CAFA': 5, 
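
The flattening step can be tried without calling GOlr. The sketch below uses a mock result in the shape that `summarize` reads (`numFound` plus `facet_counts`); all counts are made up, and `flatten_facets` with its `prefix_keys` option is a hypothetical helper, not part of ontobio. It shows that values from every facet land in one flat dict, so a label occurring in two facets would be overwritten unless the keys are prefixed.

```python
# Mock response in the shape summarize() reads; every value here is made up.
mock_result = {
    "numFound": 10,
    "facet_counts": {
        "taxon_subset_closure_label": {"Eukaryota": 7, "Bacteria": 3},
        "assigned_by": {"SGD": 4, "EcoCyc": 3},
    },
}


def flatten_facets(result, facet_fields, prefix_keys=False):
    """Merge the requested facet counts into one dict, with 'ALL' first."""
    item = {"ALL": result["numFound"]}
    for ff in facet_fields:
        for label, count in result["facet_counts"].get(ff, {}).items():
            # Optional prefix keeps labels from different facets apart.
            key = f"{ff}:{label}" if prefix_keys else label
            item[key] = count
    return item


fields = ["taxon_subset_closure_label", "assigned_by"]
flat = flatten_facets(mock_result, fields)
prefixed = flatten_facets(mock_result, fields, prefix_keys=True)
print(flat)
print(prefixed)
```

The notebook uses the unprefixed form, which gives readable column names; the prefixed form is one possible alternative if taxon, evidence and group labels ever need to be told apart.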

In [63]:
def summarize_set(ids, facet_fields=None) -> pd.DataFrame:
    """
    Summarize annotations for a set of terms, return a dataframe with one row per term
    """
    items = []
    for term in ids:
        ## the 'name:' key (with its colon) becomes the column header seen in the output
        item = {'id': term, 'name:': ont.label(term)}
        item.update(summarize(term, facet_fields=facet_fields))
        items.append(item)
    ## facet values missing for a term come back as NaN; treat them as zero counts
    df = pd.DataFrame(items).fillna(0)
    # sort using total number
    df.sort_values('ALL', axis=0, ascending=False, inplace=True)
    return df
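
One side effect of building the frame from dicts with different keys is that the count columns come out as floats (`52.0` rather than `52`), because the gaps are first filled with NaN. The following sketch reproduces this on made-up rows and shows one possible way to cast the counts back to integers; the row values and the `count_cols` helper are assumptions for illustration only.

```python
import pandas as pd

# Two made-up rows with different facet keys, as summarize() might return them.
rows = [
    {"id": "X:1", "ALL": 5, "Bacteria": 5},
    {"id": "X:2", "ALL": 9, "Eukaryota": 9},
]

# Missing keys become NaN, which forces those columns to float before fillna.
filled = pd.DataFrame(rows).fillna(0)
print(filled.dtypes)

# Sort by total, then cast every count column back to int.
count_cols = [c for c in filled.columns if c != "id"]
df = filled.sort_values("ALL", ascending=False)
df = df.astype({c: int for c in count_cols})
print(df)
```

`ALL` stays an integer column throughout because every row has a value for it.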

## Summarize GO term and descendants

More advanced visualizations are easy with plotly etc. We leave these as an exercise for the reader.

As an example, the first query bundles all facets (species, evidence, assigned by) together.

In [64]:
pd.options.display.max_columns = None
df = summarize_set(descendants)
df

Unnamed: 0,id,name:,ALL,Bacteria,Escherichia coli K-12,Eukaryota,Mammalia,Metazoa,Mus musculus,Trypanosoma brucei brucei TREU927,Vertebrata <vertebrates>,mutant phenotype evidence used in manual assertion,direct assay evidence used in manual assertion,EcoCyc,GeneDB,MGI,Viridiplantae,Arabidopsis thaliana,Fungi,Caenorhabditis elegans,Schizosaccharomyces pombe,Bacillus subtilis subsp. subtilis str. 168,Mycobacterium tuberculosis H37Rv,Saccharomyces cerevisiae S288C,Solanum tuberosum,Spinacia oleracea,Streptomyces lavendulae,genetic interaction evidence used in manual assertion,TAIR,PomBase,UniProt,WB,CAFA,EcoliWiki,MTBBASE,SGD,Aspergillus nidulans FGSC A4,Apis mellifera,Homo sapiens,Leishmania major strain Friedlin,AspGD,BHF-UCL,Rattus norvegicus,RGD,Pseudomonas aeruginosa PAO1,Thermus thermophilus HB27,PseudoCAP,Bombyx mori,Drosophila melanogaster,FlyBase,Lactobacillus casei,Candida albicans SC5314,CGD,Dictyostelium discoideum,dictyBase,GOC,Pseudomonas aeruginosa
12,GO:0009070,serine family amino acid biosynthetic process,144,52.0,27.0,92.0,22.0,33.0,7.0,0.0,22.0,53.0,81.0,20.0,3.0,6.0,23.0,21.0,32.0,7.0,11.0,6.0,11.0,17.0,0.0,0.0,0.0,10.0,19.0,11.0,16.0,7.0,5.0,7.0,10.0,17.0,3.0,2.0,8.0,2.0,3.0,4.0,7.0,7.0,4.0,0.0,5.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0
13,GO:0019344,cysteine biosynthetic process,82,25.0,8.0,57.0,6.0,15.0,2.0,0.0,6.0,30.0,49.0,6.0,2.0,1.0,21.0,19.0,19.0,7.0,10.0,5.0,8.0,6.0,1.0,1.0,0.0,3.0,17.0,10.0,10.0,7.0,5.0,2.0,7.0,6.0,3.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,GO:0004124,cysteine synthase activity,29,7.0,3.0,22.0,0.0,4.0,0.0,0.0,0.0,6.0,21.0,2.0,0.0,0.0,13.0,11.0,5.0,4.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,10.0,4.0,4.0,4.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,GO:0006564,L-serine biosynthetic process,21,16.0,8.0,5.0,1.0,1.0,0.0,0.0,1.0,9.0,6.0,3.0,0.0,0.0,1.0,1.0,3.0,0.0,0.0,1.0,3.0,3.0,0.0,0.0,0.0,6.0,1.0,0.0,2.0,0.0,0.0,5.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,GO:0006535,cysteine biosynthetic process from serine,20,9.0,4.0,11.0,0.0,4.0,0.0,0.0,0.0,6.0,13.0,3.0,1.0,0.0,0.0,0.0,6.0,3.0,5.0,2.0,2.0,1.0,0.0,0.0,0.0,1.0,0.0,5.0,2.0,3.0,2.0,1.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,GO:0006545,glycine biosynthetic process,14,3.0,3.0,11.0,7.0,9.0,0.0,0.0,7.0,2.0,11.0,3.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,1.0,5.0,5.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,GO:0019343,cysteine biosynthetic process via cystathionine,9,1.0,0.0,8.0,1.0,2.0,0.0,0.0,1.0,6.0,3.0,0.0,1.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,2.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,GO:0009090,homoserine biosynthetic process,8,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
10,GO:0070179,D-serine biosynthetic process,8,0.0,0.0,8.0,6.0,6.0,4.0,0.0,6.0,0.0,8.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
0,GO:0016260,selenocysteine biosynthetic process,6,4.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,4.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Summarize by assigning group



In [67]:
summarize_set(descendants, facet_fields=['assigned_by'])

Unnamed: 0,id,name:,ALL,EcoCyc,GeneDB,MGI,TAIR,PomBase,UniProt,WB,CAFA,EcoliWiki,MTBBASE,SGD,AspGD,BHF-UCL,RGD,PseudoCAP,FlyBase,CGD,dictyBase,GOC
12,GO:0009070,serine family amino acid biosynthetic process,144,20.0,3.0,6.0,19.0,11.0,16.0,7.0,5.0,7.0,10.0,17.0,3.0,4.0,7.0,5.0,1.0,1.0,1.0,1.0
13,GO:0019344,cysteine biosynthetic process,82,6.0,2.0,1.0,17.0,10.0,10.0,7.0,5.0,2.0,7.0,6.0,3.0,2.0,1.0,2.0,0.0,0.0,0.0,1.0
1,GO:0004124,cysteine synthase activity,29,2.0,0.0,0.0,10.0,4.0,4.0,4.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,GO:0006564,L-serine biosynthetic process,21,3.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,5.0,3.0,3.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0
8,GO:0006535,cysteine biosynthetic process from serine,20,3.0,1.0,0.0,0.0,5.0,2.0,3.0,2.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,GO:0006545,glycine biosynthetic process,14,3.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,5.0,0.0,1.0,0.0,0.0,0.0
2,GO:0019343,cysteine biosynthetic process via cystathionine,9,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
9,GO:0009090,homoserine biosynthetic process,8,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10,GO:0070179,D-serine biosynthetic process,8,0.0,0.0,4.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
0,GO:0016260,selenocysteine biosynthetic process,6,4.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Summarize by species

Use the `taxon_subset_closure_label` facet.

In [69]:
summarize_set(descendants, facet_fields=['taxon_subset_closure_label'])

Unnamed: 0,id,name:,ALL,Bacteria,Escherichia coli K-12,Eukaryota,Mammalia,Metazoa,Mus musculus,Trypanosoma brucei brucei TREU927,Vertebrata <vertebrates>,Viridiplantae,Arabidopsis thaliana,Fungi,Caenorhabditis elegans,Schizosaccharomyces pombe,Bacillus subtilis subsp. subtilis str. 168,Mycobacterium tuberculosis H37Rv,Saccharomyces cerevisiae S288C,Solanum tuberosum,Spinacia oleracea,Streptomyces lavendulae,Aspergillus nidulans FGSC A4,Apis mellifera,Homo sapiens,Leishmania major strain Friedlin,Rattus norvegicus,Pseudomonas aeruginosa PAO1,Thermus thermophilus HB27,Bombyx mori,Drosophila melanogaster,Lactobacillus casei,Candida albicans SC5314,Dictyostelium discoideum,Pseudomonas aeruginosa
12,GO:0009070,serine family amino acid biosynthetic process,144,52.0,27.0,92.0,22.0,33.0,7.0,0.0,22.0,23.0,21.0,32.0,7.0,11.0,6.0,11.0,17.0,0.0,0.0,0.0,3.0,2.0,8.0,2.0,7.0,4.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0
13,GO:0019344,cysteine biosynthetic process,82,25.0,8.0,57.0,6.0,15.0,2.0,0.0,6.0,21.0,19.0,19.0,7.0,10.0,5.0,8.0,6.0,1.0,1.0,0.0,3.0,2.0,3.0,2.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,GO:0004124,cysteine synthase activity,29,7.0,3.0,22.0,0.0,4.0,0.0,0.0,0.0,13.0,11.0,5.0,4.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,GO:0006564,L-serine biosynthetic process,21,16.0,8.0,5.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,3.0,0.0,0.0,1.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
8,GO:0006535,cysteine biosynthetic process from serine,20,9.0,4.0,11.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,6.0,3.0,5.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,GO:0006545,glycine biosynthetic process,14,3.0,3.0,11.0,7.0,9.0,0.0,0.0,7.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,5.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
2,GO:0019343,cysteine biosynthetic process via cystathionine,9,1.0,0.0,8.0,1.0,2.0,0.0,0.0,1.0,0.0,0.0,5.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,GO:0009090,homoserine biosynthetic process,8,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10,GO:0070179,D-serine biosynthetic process,8,0.0,0.0,8.0,6.0,6.0,4.0,0.0,6.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
0,GO:0016260,selenocysteine biosynthetic process,6,4.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
