{ "cells": [ { "cell_type": "markdown", "id": "01c203ff", "metadata": {}, "source": [ "##### Week 8 - Lab Notebook - Part 1 - Working with Ontologies & Functional Enrichment Analysis\n", "\n", "- November 2023\n", "- https://https://github.com/tisimpson/bioinformatics1\n", "- ian.simpson@ed.ac.uk" ] }, { "cell_type": "markdown", "id": "e0d17ac3", "metadata": {}, "source": [ "##### Introduction\n", "In this notebook we’re going to be doing some manipulation and work with ontologies and then execute a functional enrichment analysis as we discussed in last week's lecture.\n", "\n", "##### Learning Outcomes\n", "After this notebook you should be comfortable with:\n", "- loading an ontology file and some basic searching\n", "- Automating processes using APIs and basic usage of the GSEAPy python package\n", "\n", "##### Useful References\n", "- [GSEAPy Documentation](https://gseapy.readthedocs.io/en/latest/index.html)\n", "- [KEGG API](https://www.kegg.jp/kegg/rest/keggapi.html)\n", "- [Pronto Library](https://pronto.readthedocs.io/en/stable/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Setting up the Programming Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You may need to install these packages first\n", "# %pip install gseapy\n", "# %pip install pronto\n", "\n", "\n", "# handling www based requests (like APIs)\n", "import urllib as ul\n", "\n", "# standard Python data handling modules\n", "import pandas as pd\n", "import numpy as np\n", "# working with geen set enrichment analysis (GSEA)\n", "import gseapy\n", "# working with nicer tables\n", "import prettytable as PrettyTable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Using Pronto to Parse and Explore Ontologies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fetch the Gene Onotlogy (GO) OBO file and parse it with pronto\n", "\n", "# download the GO ontology OBO file\n", "import urllib.request\n", "\n", "# download the file\n", "urllib.request.urlretrieve('http://purl.obolibrary.org/obo/go.obo','../data/ontology/go.obo')\n", "\n", "# use pronto to parse the file and search for matches in the column moi_curie\n", "import pronto\n", "\n", "# parse the file\n", "go = pronto.Ontology('../data/ontology/go.obo')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# if we know the GO term, we can search for it\n", "goTerm = go['GO:0000002']\n", "\n", "# it has features GOTerm Accession, Name (Description), and the clade in GO\n", "print(goTerm.id, goTerm.name, goTerm.namespace)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# given a string of GO terms, we can search for them by accession in a simple loop\n", "\n", "# create a list of GO terms\n", "goTerms = ['GO:0000002', 'GO:0000003', 'GO:0000005', 'GO:0000006', 'GO:0000007']\n", "\n", "# loop through the list and print the GO term name and namespace\n", "for goTerm in goTerms:\n", " print(f\"\",go[goTerm].name,\" clade :\",go[goTerm].namespace)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can also search for GO terms by name\n", "# create a list of GO term names\n", "goNames = ['mitochondrial genome maintenance', 'developmental process', 'apoptotic process', 'reproduction', 'cell death']\n", "\n", "# loop through the go object and compare the strings in 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can explore the tree structure of the ontology using pronto\n", "\n", "# picking a starting GO term we can find all of the child terms underneath it\n", "# here we start with GO:0032502 (developmental process)\n", "# we can use the subclasses() method to get all of its descendant terms\n", "# NB by default subclasses() includes the starting term itself\n", "\n", "# get the GO term children\n", "goTerms = go['GO:0032502'].subclasses().to_set()\n", "\n", "# how many terms are there?\n", "print(len(goTerms))\n", "\n", "# print the GO term children\n", "for goTerm in goTerms:\n", "    print(f\"{goTerm.id} {goTerm.name}\")" ] },
{ "cell_type": "markdown", "id": "ce2a96a2", "metadata": {}, "source": [ "##### Download KEGG Pathway Information" ] },
{ "cell_type": "code", "execution_count": null, "id": "ee180b07", "metadata": {}, "outputs": [], "source": [ "# First we fetch the list of human pathways available at KEGG.\n", "# NB the KEGG list output has no header row, so we pass header=None and supply our own column names\n", "human_pathways = pd.read_csv(ul.request.urlopen('http://rest.kegg.jp/list/pathway/hsa'),sep='\\t',header=None,names=['kegg_id','pathway_name'])\n", "human_pathways.head()\n", "\n", "# We specifically want the pathway data for the \"Dopaminergic Synapse\" pathway.\n", "pathway_info = human_pathways[human_pathways['pathway_name'].str.match('Dopaminergic synapse')]['kegg_id']\n", "\n", "# extract the exact pathway accession\n", "pathway_id = pathway_info.values[0]\n", "\n", "# make sure the target directory exists\n", "import os\n", "os.makedirs('../data/pathways', exist_ok=True)\n", "\n", "# pull the pathway directly from KEGG, note we are saving this to a file 'dop_synapse.txt' that we will use later\n", "ul.request.urlretrieve('http://rest.kegg.jp/get/'+pathway_id,'../data/pathways/dop_synapse.txt');" ] },
{ "cell_type": "code", "execution_count": null, "id": "05e89606", "metadata": {}, "outputs": [], "source": [ "# Now we will use this file which contains the full pathway details including the gene names.\n", "\n", "# open the file\n", "dop_file = open('../data/pathways/dop_synapse.txt','r')\n", "\n", "# I wanted to show you some basic Python parsing and a simple for loop with a conditional in it to demonstrate how you can quickly build simple parsers.\n", "# There are quicker ways to do this (see the sketch after this cell), but this is a good learning example.\n", "\n", "# create an empty dataframe with three columns\n", "dop_df = pd.DataFrame(columns=['gene_id','gene_symbol','description'])\n", "\n", "# set a flag for our parser\n", "flag=0\n", "\n", "# work through the text file one line at a time\n", "for line in dop_file:\n", "    # find the start of the gene entries\n", "    if 'GENE' in line:\n", "        # add the first gene to the dataframe\n", "        # strip the section label, then split on whitespace into the gene id and the remainder\n", "        gene_id,remain = line.replace('GENE','').strip().split(None,1)\n", "        gene_symbol,description = [s.strip() for s in remain.split(';',1)]\n", "        # add a new row to the dataframe containing the gene_id, gene_symbol and description\n", "        dop_df = pd.concat([dop_df,pd.DataFrame([[gene_id,gene_symbol,description]],columns=['gene_id','gene_symbol','description'])],ignore_index=True)\n", "        # set the flag to 1, we are in the gene section of the file\n", "        flag = 1\n", "    # stop when we reach the end of the section and escape the file\n", "    elif 'COMPOUND' in line:\n", "        break\n", "    # continue adding the genes to the dataframe\n", "    elif flag == 1:\n", "        gene_id,remain = line.strip().split(None,1)\n", "        gene_symbol,description = [s.strip() for s in remain.split(';',1)]\n", "        # add the gene to the dataframe\n", "        dop_df = pd.concat([dop_df,pd.DataFrame([[gene_id,gene_symbol,description]],columns=['gene_id','gene_symbol','description'])],ignore_index=True)\n", "\n", "# close the file\n", "dop_file.close()\n", "\n", "# preview the dataframe\n", "print(dop_df.head())\n", "\n", "# you now have the gene_ids (NCBI Entrez IDs) for the genes in the pathway\n", "print('The Dopaminergic Synapse pathway has '+str(dop_df.shape[0])+' genes in it.\\n')\n", "\n", "# store the gene_ids in a numpy array\n", "gene_ids = dop_df['gene_id'].to_numpy()\n", "\n", "gene_symbols = dop_df['gene_symbol'].to_numpy()\n", "\n", "# show the gene_ids\n", "print(gene_ids)\n", "\n", "# write the gene_ids to a file\n", "with open('../data/pathways/dop_geneids.txt','w') as f:\n", "    for gene_id in gene_ids:\n", "        f.write(gene_id+'\\n')\n", "\n", "# show the gene_symbols\n", "print(gene_symbols)\n", "\n", "# write the gene_symbols to a file\n", "with open('../data/pathways/dop_symbols.txt','w') as f:\n", "    for symbol in gene_symbols:\n", "        f.write(symbol+'\\n')" ] },
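{ "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, one of the \"quicker ways\" mentioned above is the KEGG REST `link` operation, which returns pathway-to-gene mappings directly. The sketch below (assuming the `pathway_id` and `gene_ids` variables from the previous cells) recovers the same Entrez gene IDs without a hand-rolled parser, although it doesn't give you the gene symbols or descriptions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch of the quicker route: the KEGG REST /link endpoint maps a pathway to its genes\n", "# the output is tab-separated with no header, e.g. 'path:hsa04728\\thsa:1812'\n", "links = pd.read_csv(ul.request.urlopen('http://rest.kegg.jp/link/hsa/'+pathway_id),sep='\\t',header=None,names=['pathway','kegg_gene'])\n", "\n", "# KEGG gene identifiers look like 'hsa:1812'; strip the organism prefix to get Entrez IDs\n", "link_gene_ids = links['kegg_gene'].str.replace('hsa:','',regex=False).to_numpy()\n", "\n", "# this should agree with the parser above\n", "print(len(link_gene_ids),'genes via /link vs',len(gene_ids),'from the parser')" ] },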
{ "cell_type": "markdown", "id": "e578b669", "metadata": {}, "source": [ "##### GSEAPy and Gene Set Enrichment Analysis" ] },
{ "cell_type": "markdown", "id": "9427fcf8", "metadata": {}, "source": [ "[GSEApy](https://gseapy.readthedocs.io/en/latest/index.html) is a library for performing gene set enrichment analysis (GSEA) in Python. There are two common approaches to enrichment analysis - over-representation analysis (ORA) and GSEA. The main difference between the two is that GSEA takes a ranked list of genes (for example, ordered by differential expression), whereas ORA works on an unordered gene set." ] },
{ "cell_type": "markdown", "id": "17199cfa", "metadata": {}, "source": [ "In this step we use BioMart, an excellent service run at EBI-Ensembl that allows you to query and retrieve linked genomic data.\n", "\n", "The help for BioMart can be found here - [https://www.ensembl.org/info/data/biomart/how_to_use_biomart.html](https://www.ensembl.org/info/data/biomart/how_to_use_biomart.html)\n", "\n", "BioMart API functionality is nicely delivered through GSEApy and its use is described here - [https://gseapy.readthedocs.io/en/latest/gseapy_example.html#1.-Biomart-API](https://gseapy.readthedocs.io/en/latest/gseapy_example.html#1.-Biomart-API)." ] },
{ "cell_type": "code", "execution_count": null, "id": "78c20df8", "metadata": {}, "outputs": [], "source": [ "# the GSEApy package also contains functions that allow you to use the EBI-Ensembl BioMart service\n", "# you can use this to directly query linked data for the genes, including Gene Ontology (GO) annotation data.\n", "from gseapy import Biomart\n", "\n", "# initiate a biomart connection\n", "bm = Biomart()\n", "\n", "# form a query for biomart from the Entrez Gene IDs, here we are using the gene_ids from the dopaminergic synapse pathway above\n", "queries = {'entrezgene_id' : gene_ids}\n", "\n", "# execute the biomart query\n", "# NB that the oddly named 'name_1006' attribute is in fact the 'GO Term name' attribute\n", "# NB a nice trick for finding the correct attribute names is to use the website to create the query and then\n", "# NB click on the XML link at the top of the page, this will show you the 'Attribute name'.\n", "results = bm.query(dataset='hsapiens_gene_ensembl',\n", "                   attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id','go_linkage_type','name_1006'],\n", "                   filters=queries)\n", "\n", "# change the name to something more useful\n", "results.rename({'name_1006' : 'go_name'},axis=1,inplace=True)\n", "\n", "results.head()" ] },
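{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check on the returned annotation table, the short sketch below (plain pandas, assuming the `results` frame from the previous cell) counts how many distinct GO terms are attached to each gene." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a small sketch: count the number of distinct GO terms annotated to each gene\n", "# and show the most heavily annotated genes in the pathway\n", "go_per_gene = (results.dropna(subset=['go_id'])\n", "                      .groupby('external_gene_name')['go_id']\n", "                      .nunique()\n", "                      .sort_values(ascending=False))\n", "\n", "print(go_per_gene.head(10))" ] },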
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the top 20 most common GO terms from the results table using seaborn\n", "# NB you need seaborn v0.13+ for the hue/legend arguments used here\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# plot the top 20 most common GO terms\n", "sns.countplot(y='go_name',data=results,order=results['go_name'].value_counts().iloc[:20].index,palette='bright',hue='go_name',legend=False)\n", "# add a title\n", "plt.title('Top 20 GO Terms for Dopaminergic Synapse Pathway')\n", "# add a label to the x axis\n", "plt.xlabel('Gene Count')\n", "# add a label to the y axis\n", "plt.ylabel('GO Term Name')\n", "# show the plot\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": null, "id": "bf88032c", "metadata": {}, "outputs": [], "source": [ "# just to take extra advantage of the data let's break down the above by evidence code type\n", "annotations_by_evidence = pd.DataFrame(results[['go_name','go_linkage_type']].dropna())\n", "\n", "# we use the pandas pivot_table function to do some powerful refactoring - count and sort the go_name & evidence code pairs\n", "annotations_by_evidence = annotations_by_evidence.pivot_table(index=['go_name','go_linkage_type'],aggfunc='size').reset_index(name='frequency').sort_values(by='frequency',ascending=False)\n", "\n", "# just take the top 30 rows\n", "top_annotations = annotations_by_evidence[:30]\n", "\n", "# pivot to create a nice matrix for a stacked plot\n", "top_annotations = top_annotations.pivot(index='go_name',columns='go_linkage_type',values='frequency').fillna(0)\n", "\n", "# create a stacked bar plot using the pandas plotting interface\n", "top_annotations.plot.barh(xlabel='Number of Entries',\n", "                          ylabel='GO Term Name',\n", "                          title='Top Frequent GO Annotations by Evidence Code in the Dopaminergic Synapse Pathway',\n", "                          stacked=True).legend(loc='best');" ] },
{ "cell_type": "markdown", "id": "f02b9b15", "metadata": {}, "source": [ "##### Enrichment Analysis Measures" ] },
{ "cell_type": "markdown", "id": "da2c1e60", "metadata": {}, "source": [ "In order to perform GSEA we need to impose an ordering on the lists of elements (such as genes) and compare the relative positions of those elements in lists separated by two classes, as we discussed in the lecture. In the next notebook we will do this with some network-derived data, but here we are going to perform the simpler variant (which is still useful), over-representation analysis (ORA), as (at the moment) we don't have suitable ranked lists for GSEA. ORA asks whether the overlap between our gene list and a reference gene set is larger than we would expect by chance - the sketch after this cell illustrates the idea.\n", "\n", "You can browse the available gene sets to perform enrichment analysis against using the GSEAPy package at the [Enrichr website](https://maayanlab.cloud/Enrichr/#libraries)." ] },
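{ "cell_type": "markdown", "metadata": {}, "source": [ "Before running the real thing, here is a toy illustration (with made-up numbers, not GSEApy's internals) of the hypergeometric test that underpins classic ORA: given a background of genes, how surprising is the overlap between our query list and a gene set? Enrichr's statistics are based on Fisher's exact test, which is equivalent for this one-sided question." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a toy ORA sketch using scipy's hypergeometric distribution - all numbers are made up\n", "from scipy.stats import hypergeom\n", "\n", "M = 20000  # assumed background size (roughly the number of protein-coding genes)\n", "n = 300    # genes annotated to a hypothetical gene set\n", "N = 130    # size of our query list (about the size of the pathway above)\n", "k = 25     # overlap between the query list and the gene set\n", "\n", "# P(overlap >= k) when drawing N genes at random without replacement\n", "p_value = hypergeom.sf(k - 1, M, n, N)\n", "print(f\"hypergeometric p-value for an overlap of {k}: {p_value:.2e}\")" ] },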
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this is incredibly easy to do with the GSEApy package\n", "# we provide a list of gene_symbols and select the pathway/list to compare against\n", "# note that by default this is comparing against the entire genome annotation\n", "# in a real experimental set-up we need to define the foreground and background gene sets as discussed in the lecture\n", "\n", "enr = gseapy.enrichr(gene_list=list(gene_symbols), # perform over-representation analysis using Enrichr\n", "                     gene_sets=['KEGG_2021_Human'],\n", "                     organism='human', # don't forget to set organism to the one you desire! e.g. Yeast\n", "                     outdir=None, # don't write to disk\n", "                     )\n", "\n", "enr.results.head(10) # return the top 10 hits" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can list the available gene sets\n", "available_libraries = gseapy.get_library_name()\n", "\n", "# how many are there?\n", "print(f'There are {len(available_libraries)} available gene sets to compare against!')\n", "\n", "# print the first 10 using prettytable\n", "from prettytable import PrettyTable\n", "table = PrettyTable(['Library Name'])\n", "for library in available_libraries[:10]:\n", "    table.add_row([library])\n", "print(table)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# perform ORA against the MSigDB Hallmark gene sets\n", "enr = gseapy.enrichr(gene_list=list(gene_symbols), # perform over-representation analysis using Enrichr\n", "                     gene_sets=['MSigDB_Hallmark_2020'],\n", "                     organism='human', # don't forget to set organism to the one you desire! e.g. Yeast\n", "                     outdir=None, # don't write to disk\n", "                     )\n", "\n", "enr.results.head(10) # return the top 10 hits" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "vscode": { "interpreter": { "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f" } } }, "nbformat": 4, "nbformat_minor": 5 }