{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "01c203ff",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Week 8 - Lab Notebook - Part 2 - Working with Networks</font>\n",
    "\n",
    "- November 2023\n",
    "- https://https://github.com/tisimpson/bioinformatics1\n",
    "- ian.simpson@ed.ac.uk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0d17ac3",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Introduction</font>\n",
    "In this computing lab we’re going to be putting together what we've learned about biological databases and ontologies to do some summary analysis of genes invovled in the \"Dopaminergnic Synapse\" pathway. These can all be done using the KEGG and String-DB websites directly but we will show here that there is much greater power and flexibility available when you start using programmatic methods to carefully control your analyses.\n",
    "\n",
    "##### <font color='darkblue'>Learning Outcomes</font>\n",
    "After this tutorial you should be comfortable with:\n",
    "- Retrieving pathway information from KEGG\n",
    "- converting between accession IDs of different databases\n",
    "- Retrieving protein-protein interaction and network data from String-DB\n",
    "- Automating these processes using APIs and the NetworkX, and GSEAPy python packages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9513cee1",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Setting up the Environment & Retrieving Data</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f012cb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting Up the Programming Environment\n",
    "# %pip install networkx\n",
    "# %pip install gseapy\n",
    "\n",
    "# import modules for use in the notebook\n",
    "\n",
    "# handling www based requests (like APIs)\n",
    "import urllib as ul\n",
    "\n",
    "# standard Python data handling modules\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import json\n",
    "\n",
    "# working with networks\n",
    "import networkx as nx\n",
    "\n",
    "# working with geen set enrichment analysis (GSEA)\n",
    "import gseapy as gp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50c4336f",
   "metadata": {},
   "source": [
    "For Step1 you can either use the fully automated approcah using Steps 1a, and 1b, or working from files you generate from the KEGG and String-DB websites as described in Step 1c."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce2a96a2",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Fetch KEGG Pathway Information</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we will use this file which contains the full pathway details including the gene names.\n",
    "\n",
    "# open the file\n",
    "dop_file = open('../data/pathways/dop_synapse.txt','r')\n",
    "\n",
    "# I wanted to show you some basic python parsing and a simple for loop with a conditional in to demonstrate how you can quickly build simple parsers.\n",
    "# There are quicker ways to do this, but this is a good learning example.\n",
    "\n",
    "# create an empty dataframe with two columns\n",
    "dop_df = pd.DataFrame(columns=['gene_id','gene_symbol','description'])\n",
    "\n",
    "# set a flag for our parser\n",
    "flag=0\n",
    "\n",
    "# work through the text file one line at a time\n",
    "for line in dop_file:\n",
    "    # find the start of the gene entries\n",
    "    if 'GENE' in line:\n",
    "        # add the first gene tp the dataframe\n",
    "        gene_id,remain = line.strip('GENE').strip().split('  ')\n",
    "        gene_symbol,description = remain.split(';')\n",
    "        # add a new row to the dataframe containing the gene_id and description\n",
    "        dop_df = pd.concat([dop_df,pd.DataFrame([[gene_id,gene_symbol,description]],columns=['gene_id','gene_symbol','description'])],ignore_index=True)\n",
    "        # set the flag to 1, we are in the gene section of the file\n",
    "        flag = 1\n",
    "    # stop when we reach the end of the section and escape the file\n",
    "    elif 'COMPOUND' in line:\n",
    "        break\n",
    "    # continue adding the genes to the dataframe\n",
    "    elif flag == 1:\n",
    "        gene_id,remain = line.strip('GENE').strip().split('  ')\n",
    "        gene_symbol,description = remain.split(';')\n",
    "        # add the gene to the dataframe\n",
    "        dop_df = pd.concat([dop_df,pd.DataFrame([[gene_id,gene_symbol,description]],columns=['gene_id','gene_symbol','description'])],ignore_index=True)\n",
    "\n",
    "# close the file\n",
    "dop_file.close()\n",
    "\n",
    "# view the file\n",
    "dop_df.head()\n",
    "\n",
    "# you now have the gene_ids (NCBI EntrezIDs for the genes in the pathway)\n",
    "print('The Dopaminergic Synapse pathway has '+str(dop_df.shape[0])+' genes in it.\\n')\n",
    "\n",
    "# store the gene_ids in a numpy array\n",
    "gene_ids = dop_df['gene_id'].to_numpy()\n",
    "\n",
    "gene_symbols = dop_df['gene_symbol'].to_numpy()\n",
    "\n",
    "# show the gene_ids\n",
    "print(gene_ids)\n",
    "\n",
    "# write the gene_ids to a file\n",
    "with open('../data/pathways/dop_geneids.txt','w') as f:\n",
    "    for gene_id in gene_ids:\n",
    "        f.write(gene_id+'\\n')\n",
    "\n",
    "# show the gene_symbols\n",
    "print(gene_symbols)\n",
    "\n",
    "# write the gene_symbols to a file\n",
    "with open('../data/pathways/dop_symbols.txt','w') as f:\n",
    "    for symbol in gene_symbols:\n",
    "        f.write(symbol+'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cede081",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Retrieval of Protein-Protein Interactions from STRING</font>\n",
    "\n",
    "The details of the String-DB API can be found here - [https://string-db.org/help/api/](https://string-db.org/help/api/)\n",
    "\n",
    "APIs have specific formats required for their query URLs and it getting these correct in your code can take a little time until you get used to them. In this case we need to concatenate (stitch together) our gene IDs using a '%0D' string. This is actually the encoding for a line-return which is in effect mimicking the one gene per line entry that you would paste into the web page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a concatenated list of entrezIDs as strings\n",
    "# note we are taking integer gene_ids from the 'gene_id' column of the dataframe we generated above then using\n",
    "# the map function to convert each one into a string. The join function then concatenates them using the '%0D' string\n",
    "# to stitch them all together. This string will be used to help us build the API query URL.\n",
    "entrezIDs = '%0D'.join(map(str,dop_df['gene_id']))\n",
    "\n",
    "# pass the list of EntrezIDs to the String-DB API return the String-IDs\n",
    "# we first form the query url using the 'get_string_ids' API function which takes a list of identifiers and\n",
    "# converts them into the internal String-DB accession IDs. This massively speeds up the search and allows us to\n",
    "# search for more than 10 at once which is an API restriction for other API functions if String-DB internal accessions \n",
    "# aren't used.\n",
    "query_url = 'https://string-db.org/api/tsv-no-header/get_string_ids?identifiers='+entrezIDs+'&species=9606&format=only-ids'\n",
    "\n",
    "# use the urllib library to retrieve the String-DB internal IDs\n",
    "result = ul.request.urlopen(query_url).read().decode('utf-8')\n",
    "\n",
    "# now we want to query String-DB to retrieve interactions from this list of String-DB IDs\n",
    "# we create a concatenated list of stringdbIDs in much the same way as above for the Entrez Gene IDs\n",
    "stringdbIDs = '%0D'.join(result.splitlines())\n",
    "\n",
    "# again we build the query for interactions using the String-DB IDs\n",
    "query_url = 'https://string-db.org/api/tsv/network?identifiers='+stringdbIDs+'&species=9606'\n",
    "\n",
    "# again using urllib to retrieve the interactions these are returned in a standard tab delimied text format\n",
    "interactions = ul.request.urlopen(query_url).read().decode('utf-8').splitlines()\n",
    "\n",
    "# we need to split the result by these 'tabs' (\\t - is used to identfy them)\n",
    "int_test = [interaction.split('\\t') for interaction in interactions]\n",
    "\n",
    "# we extract the field names from the first row\n",
    "column_names = int_test[:1][0]\n",
    "\n",
    "# create a Pandas dataframe of the interaction data we have just retrieved from String-DB\n",
    "interactions_df = pd.DataFrame(int_test,columns=column_names)\n",
    "\n",
    "# delete the first row that held the fieldnames but we no longer need\n",
    "interactions_df = interactions_df.drop(labels=0,axis=0)\n",
    "\n",
    "# remove any duplicate rows\n",
    "final_interactions = interactions_df.drop_duplicates()\n",
    "\n",
    "# show the top of the protein-protein interaction table\n",
    "final_interactions.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fbe8cf9",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Generating the Protein-Protein Interaction Network</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "479e26d5",
   "metadata": {},
   "source": [
    "Next we are going to use the NetworkX Python library to create the protein-protein interaction network.\n",
    "\n",
    "NetworkX - Network Analysis in Python - [https://networkx.org/documentation/stable/index.html](https://networkx.org/documentation/stable/index.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cac0457f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a simple network view of the interactions using the NetworkX library\n",
    "# https://networkx.org/documentation/stable/index.html\n",
    "\n",
    "import networkx as nx\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    " #Create an empty graph\n",
    "G = nx.Graph()\n",
    "\n",
    "# add all nodes\n",
    "G.add_nodes_from(set(final_interactions['preferredName_A']) | set(final_interactions['preferredName_B'])) \n",
    "\n",
    "# add the edges (connections) to the network\n",
    "edges = []\n",
    "for edge1 , edge2  in zip(final_interactions['preferredName_A'] , final_interactions['preferredName_B']) : #add all edge to the network\n",
    "    edges.append((edge1 , edge2 ))\n",
    "G.add_edges_from(edges)\n",
    "\n",
    "# draw the network with a force directed layout specify plot size and node size and thin light gray edges\n",
    "plt.figure(figsize=(15,15))\n",
    "nx.draw(G, pos=nx.spring_layout(G,k=2), with_labels=True,node_size=100,edge_color='gray',node_color='lightblue',width=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25b628d2",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Node Degree</font>\n",
    "\n",
    "Node degree is simply the number of connections a node has. Nodes with higher degree are often called `hub` nodes because they have many connections to other members of the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c854567e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#sort the genes (node names) by degree\n",
    "sorted_list = sorted(G.degree(), key=lambda item: item[1] , reverse=True)\n",
    "\n",
    "# print out the top10 using prettytable\n",
    "from prettytable import PrettyTable\n",
    "x = PrettyTable()\n",
    "x.field_names = [\"Gene\",\"Degree\"]\n",
    "for gene in sorted_list[:10]:\n",
    "    x.add_row(gene)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57304098",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Closeness Centrality</font>\n",
    "This is a measure of how close a node is to the centre of the network. The closer a node is to the centre the shorter its path to all other nodes and hence its more likely to be representative of the network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3335e553",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort the genes (node names) by proximity to center\n",
    "sorted_list = sorted(nx.closeness_centrality(G).items(), key=lambda item: item[1] , reverse=True) \n",
    "\n",
    "# print out the top10 using prettytable\n",
    "from prettytable import PrettyTable\n",
    "x = PrettyTable()\n",
    "x.field_names = [\"Gene\",\"Closeness\"]\n",
    "for gene in sorted_list[:10]:\n",
    "    x.add_row(gene)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55dfc741",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Clustering Coefficient</font>\n",
    "The clustering coefficient is a measure which combines centrality and degree. It measures the number of triangles a node can form ('the friend of my friend is my friend'). If a node has more common friends with other nodes it more likely to representative of the network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a91ab78",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort the genes (node names) by clustering coefficient\n",
    "sorted_list = sorted(nx.clustering(G).items(), key=lambda item: item[1] , reverse=True)\n",
    "\n",
    "# print out the top10 using prettytable\n",
    "from prettytable import PrettyTable\n",
    "x = PrettyTable()\n",
    "x.field_names = [\"Gene\",\"Clustering\"]\n",
    "for gene in sorted_list[:10]:\n",
    "    x.add_row(gene)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Clustering the Network</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "from prettytable import PrettyTable\n",
    "\n",
    "# we're going to cluster the networkx modularity clustering algorithm\n",
    "communities = nx.algorithms.community.modularity_max.greedy_modularity_communities(G)\n",
    "\n",
    "# print the number of communities\n",
    "print('The network has '+str(len(communities))+' communities.\\n')\n",
    "\n",
    "# create sub-grpahs for each community\n",
    "subgraphs = []\n",
    "for community in communities:\n",
    "    subgraphs.append(G.subgraph(community))\n",
    "    \n",
    "# print the number of nodes in each community\n",
    "for i, subgraph in enumerate(subgraphs):\n",
    "    print('Community '+str(i+1)+' has '+str(subgraph.number_of_nodes())+' nodes.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Plot the Graph with Clusters Coloured</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a dict with the gene_id as key and community membership list as value\n",
    "communityDict = dict()\n",
    "\n",
    "# loop through the communities\n",
    "for i, community in enumerate(communities):\n",
    "    # loop through the diseases in the community\n",
    "    for gene_id in community:\n",
    "        # add the disease and community to the dictionary\n",
    "        communityDict[gene_id] = i\n",
    "\n",
    "# plot the graph with the communities coloured\n",
    "# create a list of 18 colors\n",
    "communityColours = ['#1f77b4','#ff7f0e','#2ca02c']\n",
    "\n",
    "# create a list of the node colours\n",
    "nodeColours = [communityColours[communityDict[node]] for node in G.nodes()]\n",
    "\n",
    "# create a list of the node labels\n",
    "nodeLabels = {node:node for node in G.nodes()}\n",
    "\n",
    "# plot the graph\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# set the figure size\n",
    "plt.figure(figsize=(20,20))\n",
    "\n",
    "# draw the graph separating nodes by their community\n",
    "\n",
    "pos = nx.spring_layout(G, k=0.2, iterations=30, scale=1.5)\n",
    "nx.draw(G, pos, node_color=nodeColours, with_labels=True, node_size=1000, font_size=12, width=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use enrichr to perform gene set enrichment analysis on the communities\n",
    "\n",
    "#create a separate gene list from communityDict dictionary for each community\n",
    "community1_genes = []\n",
    "community2_genes = []\n",
    "community3_genes = []\n",
    "\n",
    "# loop through the dictionary\n",
    "for gene_id, community in communityDict.items():\n",
    "    # add the gene to the appropriate community list\n",
    "    if community == 0:\n",
    "        community1_genes.append(gene_id)\n",
    "    elif community == 1:\n",
    "        community2_genes.append(gene_id)\n",
    "    elif community == 2:\n",
    "        community3_genes.append(gene_id)\n",
    "\n",
    "def communityORA(genes):\n",
    "    # perform ORA against the Hallmark gene sets for each community\n",
    "    enr = gp.enrichr(gene_list=genes, # perform enrichment analysis using gsea\n",
    "                    gene_sets=['MSigDB_Hallmark_2020'],\n",
    "                    organism='human', # don't forget to set organism to the one you desired! e.g. Yeast\n",
    "                    outdir=None, # don't write to disk\n",
    "                    )\n",
    "    return enr\n",
    "\n",
    "# perform ORA for each community\n",
    "community1_enr = communityORA(community1_genes)\n",
    "community2_enr = communityORA(community2_genes)\n",
    "community3_enr = communityORA(community3_genes)\n",
    "\n",
    "# print the top 10 results for each community using prettytable\n",
    "from prettytable import PrettyTable\n",
    "x = PrettyTable()\n",
    "x.field_names = [\"Community 1\",\"Community 2\",\"Community 3\"]\n",
    "for i in range(10):\n",
    "    x.add_row([community1_enr.results['Term'][i],community2_enr.results['Term'][i],community3_enr.results['Term'][i]])\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>Gene Set Enrichement Analysis Example</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to perform [GSEApy](https://gseapy.readthedocs.io/en/latest/index.html) we are going to use some suitable data. In this case some gene expression measurements for Parkinson's Disease from:\n",
    "\n",
    "- Lewandowski NM et al.Polyamine pathway contributes to the pathogenesis of Parkinson disease.\n",
    "- Proc Natl Acad Sci U S A. 2010 Sep 28;107(39):16970-5. [doi: 10.1073/pnas.1011751107](https://doi.org/10.1073/pnas.1011751107)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note this is crude as we are simply using normalised gene expression values, not for example performing differential gene expression analysis\n",
    "# we will return to this data in the coming weeks when we look at how we analyse gene expression data to find potentially important genes\n",
    "\n",
    "# we first read in the gene expression data which was measured from people with Parkinson's disease and healthy controls\n",
    "genExp = pd.read_csv('../data/expression/PD_Expr.tsv',sep='\\t',header=0,index_col=False)\n",
    "\n",
    "# # select rows where 'IDENTIFIER' is in the gene_ids list\n",
    "# genExp = genExp.loc[genExp['IDENTIFIER'].isin(gene_symbols)]\n",
    "\n",
    "# change column names 'ID_Ref' to 'Name' and 'IDENTIFIER' to 'GENES'\n",
    "genExp = genExp.rename(columns={'ID_REF':'NAME','IDENTIFIER':'GENES'})\n",
    "\n",
    "# drop the 'NAME' column\n",
    "genExp = genExp.drop(labels='NAME',axis=1)\n",
    "\n",
    "#make the GENES column the index\n",
    "genExp = genExp.set_index('GENES')\n",
    "\n",
    "# how big is the dataframe\n",
    "print('The dataframe has '+str(genExp.shape[0])+' rows and '+str(genExp.shape[1])+' columns.\\n')\n",
    "\n",
    "# show the top of the dataframe\n",
    "genExp.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we next need to specify which columns (samples are from PD patients and which are from controls)\n",
    "\n",
    "control = ['GSM488132','GSM488118','GSM488116','GSM488114','GSM488112','GSM488131','GSM488117','GSM488115','GSM488113','GSM488111']\n",
    "parkinsons = ['GSM488130','GSM488128','GSM488126','GSM488124','GSM488122','GSM488120','GSM488129','GSM488127','GSM488125','GSM488123','GSM488121','GSM488119']\n",
    "\n",
    "# create a class list where the class is 'control' if the sample is in the control list and 'pd' if it is in the pd list\n",
    "classes = []\n",
    "for sample in genExp.columns:\n",
    "    if sample in control:\n",
    "        classes.append('control')\n",
    "    elif sample in parkinsons:\n",
    "        classes.append('parkinsons')\n",
    "    else:\n",
    "        pass\n",
    "\n",
    "# print the number of samples in each class\n",
    "print('There are '+str(classes.count('control'))+' control samples and '+str(classes.count('parkinsons'))+' PD samples.\\n')\n",
    "\n",
    "print(classes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# perform GSEA for each community against the union of the other commmunites\n",
    "\n",
    "# suppress warnings\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# #list the available MSigDB libraries\n",
    "# libraries = gp.get_library_name()\n",
    "\n",
    "# #pick a random library from libraries using random.choice\n",
    "# gene_sets = np.random.choice(libraries)\n",
    "\n",
    "# #print the library name\n",
    "# print('The library we will use is '+gene_sets)\n",
    "\n",
    "gene_sets = 'KEGG_2019_Human'\n",
    "\n",
    "#perform GSEA\n",
    "gs_res = gp.gsea(data=genExp,\n",
    "                 gene_sets=gene_sets,\n",
    "                 cls= classes,\n",
    "                 permutation_num=100,\n",
    "                 outdir=None,\n",
    "                 method='signal_to_noise',\n",
    "                 threads=4, seed= 7)\n",
    "\n",
    "gs_res.res2d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the GSEA curves\n",
    "terms = gs_res.res2d.Term\n",
    "axs = gs_res.plot(terms[:5], show_ranking=False, legend_kws={'loc': (1.05, 0)}, )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gseapy import heatmap\n",
    "# plotting heatmap\n",
    "i = 2\n",
    "genes = gs_res.res2d.Lead_genes[i].split(\";\")\n",
    "# Make sure that ``ofname`` is not None, if you want to save your figure to disk\n",
    "ax = heatmap(df = gs_res.heatmat.loc[genes], z_score=0, title=terms[i], figsize=(14,4))\n",
    "# label the x axis with the class label of the sample\n",
    "ax.set_xticklabels(classes,rotation=45,ha='right')\n",
    "ax.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from seaborn import clustermap\n",
    "# clustermap\n",
    "i = 2\n",
    "genes = gs_res.res2d.Lead_genes[i].split(\";\")\n",
    "# Make sure that ``ofname`` is not None, if you want to save your figure to disk\n",
    "# ax = clustermap(data = gs_res.heatmat.loc[genes], z_score=0, figsize=(14,4),metric='correlation',method='average')\n",
    "ax = clustermap(\n",
    "    data = gs_res.heatmat.loc[genes],\n",
    "    pivot_kws=None,\n",
    "    method='average',\n",
    "    metric='euclidean',\n",
    "    z_score=0,\n",
    "    figsize=(14, 4),\n",
    "    dendrogram_ratio=0.2,\n",
    "    colors_ratio=0.03,\n",
    "    cbar_pos=(0.02, 0.1, 0.05, 0.1),\n",
    "    tree_kws=None)\n",
    "\n",
    "# label the x axis with the class label of the sample\n",
    "ax.ax_heatmap.set_xticklabels(classes,rotation=45,ha='right');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gseapy import dotplot\n",
    "# to save your figure, make sure that ``ofname`` is not None\n",
    "ax = dotplot(gs_res.res2d,\n",
    "             column=\"FDR q-val\",\n",
    "             title='KEGG_2021_Human',\n",
    "             cmap=plt.cm.viridis,\n",
    "             size=5,\n",
    "             figsize=(4,5), cutoff=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55680ba1",
   "metadata": {},
   "source": [
    "##### <font color='darkblue'>(Optional) Using the PantherDB API for Enrichment Analysis</font>\n",
    "\n",
    "PantherDB API details - [http://pantherdb.org/services/details.jsp](http://pantherdb.org/services/details.jsp)\n",
    "\n",
    "Functionality and Parameter testing - [http://pantherdb.org/services/openAPISpec.jsp](http://pantherdb.org/services/openAPISpec.jsp)\n",
    "\n",
    "This is quite hard work so you might decide it's simplest (and quicker) at this stage to use the website functionality above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37750871",
   "metadata": {},
   "outputs": [],
   "source": [
    "# results are returned in JSON format so we need to load a Python module to handle this\n",
    "import json\n",
    "\n",
    "# the PantherDB API offers this function to find out what annotated resources it has available\n",
    "query_url = 'http://pantherdb.org/services/oai/pantherdb/supportedannotdatasets'\n",
    "\n",
    "# execute the query\n",
    "result = ul.request.urlopen(query_url)\n",
    "\n",
    "# load the results returning a Python dictionary\n",
    "annotationSets = json.load(result)\n",
    "\n",
    "annotations = annotationSets['search']['annotation_data_sets']['annotation_data_type']\n",
    "\n",
    "# we can just iterate through these to see the annotation sources available\n",
    "for i in annotations:\n",
    "    print('Annotation Set Label = '+i['label']+', annotDataSet string to use below = '+i['id'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6961befa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# using the list of Entrez Gene IDs generated above (entrezgene_id) create a query string for the API\n",
    "# these need to be comma separated\n",
    "genes = ','.join(map(str,gene_ids))\n",
    "\n",
    "# use the PantherDB API - NB that GO:0008150 is the accession for the \"Biological Process\" clade of the Gene Ontology from above\n",
    "query_url = \"http://pantherdb.org/services/oai/pantherdb/enrich/overrep?&geneInputList=\"+genes+\"&organism=9606&annotDataSet=GO:0008150&enrichmentTestType=FISHER&correction=FDR\"\n",
    "\n",
    "# capture the results (NB this returns in JSON format)\n",
    "result = ul.request.urlopen(query_url)\n",
    "\n",
    "# load the results from JSON to Python dictionary\n",
    "enrichment_result = json.load(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7568e80a",
   "metadata": {},
   "source": [
    "We're now going to format that into something human readable. There are many ways to do this, but this is a quick and (fairly) simple solution. Please do feel free to try your own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a60ce0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract the actual result component\n",
    "results = enrichment_result['results']['result']\n",
    "\n",
    "# how long is the background list (in this case it is the default, the whole genome)\n",
    "print(len(results), \"terms in reference list\")\n",
    "\n",
    "# we're going to print this in a nice looking ASCII table\n",
    "from prettytable import PrettyTable\n",
    "\n",
    "x = PrettyTable()\n",
    "\n",
    "x.field_names = [\"GO Term\", \"Expected\", \"Fold enrichment\", \"raw P value\", \"FDR\", \"Term label\"]\n",
    "\n",
    "# Sort in order of false discovery rate i.e. multiple testing correction\n",
    "results.sort(key=lambda x: x['fdr'], reverse=False)\n",
    "\n",
    "# show the top10 results\n",
    "for r in results[:10]:\n",
    "    fdr = r['fdr']\n",
    "    if fdr < 0.05:\n",
    "        # Print result line\n",
    "        term_id = r['term'].get(\"id\")\n",
    "        if term_id is None:\n",
    "            term_id = \"\"\n",
    "        else:\n",
    "            current_row = [\n",
    "                term_id,\n",
    "                str('{0:.3f}'.format(r['expected'])),  # Convert float to string for printing\n",
    "                str('{0:.3f}'.format(r['fold_enrichment'])),\n",
    "                str('{0:.3g}'.format(r['pValue'])),\n",
    "                str('{0:.3g}'.format(r['fdr'])),\n",
    "                r['term'][\"label\"]\n",
    "            ]\n",
    "    x.add_row(current_row)\n",
    "\n",
    "print(x)\n",
    "\n",
    "# that was quite painful"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.13 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}