{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Graph of gene expression similarity inside a tissue\n", "This notebook builds a similarity graph between gene expression profiles. The approach is the same as for the genotype graph; see the \"genotype graph\" notebook for more info.\n", "\n", "In this notebook, the nodes of the graph are proteins or genes. Each node is connected to its k nearest neighbors, and each connection is weighted by the similarity between the two expression profiles according to a chosen distance. Each protein is associated with a vector encoding its variation across the BXD mouse dataset. Two proteins are similar if their vectors are close in terms of Euclidean distance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from scipy.sparse import csr_matrix\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "import sklearn.metrics\n", "import sklearn.neighbors\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Config for accessing the data on the S3 storage (anonymous access)\n", "storage_options = {'anon':True, 'client_kwargs':{'endpoint_url':'https://os.unil.cloud.switch.ch'}}\n", "s3_path = 's3://lts2-graphnex/BXDmice/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "# Tissue\n", "tissue_name = 'LiverProt_CD'\n", "# Other examples:\n", "#tissue_name = 'Eye'\n", "#tissue_name = 'Muscle_CD'\n", "#tissue_name = 'Hippocampus'\n", "#tissue_name = 'Gastrointestinal'\n", "#tissue_name = 'Lung'\n", "tissue_path = os.path.join(s3_path, 'expression data', tissue_name + '.txt.gz')\n", "tissue = pd.read_csv(tissue_path, sep='\\t', 
storage_options=storage_options)\n", "print('File {} opened.'.format(tissue_path))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing the distances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove the columns (mouse strains) that contain missing measurements:\n", "tissue = tissue.dropna(axis=1)\n", "# Extract the data as a numpy array (skip the first two columns)\n", "tissue_values = tissue.iloc[:,2:].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing\n", "Without normalization, the graph may reflect only similar concentration levels rather than correlated expression patterns. Each row (one protein) is therefore scaled to unit L2 norm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import normalize\n", "# Scale each row (one protein) to unit L2 norm\n", "tissue_values = normalize(tissue_values, norm='l2', axis=1)\n", "tissue_values.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Default distance is Euclidean\n", "num_neighbors = 4\n", "tissue_knn = sklearn.neighbors.kneighbors_graph(tissue_values, num_neighbors, mode='distance')\n", "# Optionally, one can use the following function to compute all the pairwise distances:\n", "#tissue_distances = sklearn.metrics.pairwise_distances(tissue_values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribution of distances\n", "plt.hist(tissue_knn.data, bins=50)\n", "plt.title('Distribution of distances')\n", "plt.xlabel('Distance')\n", "plt.ylabel('Number of edges')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distance to weight\n", "# Modify the non-zero values to turn them into weights instead of distances\n", "def distance2weight(d):\n", "    # Exponential decay: small distances map to weights close to 1\n", "    sigma = 1\n", "    return np.exp(-sigma * d)\n", "\n", "M = tissue_knn.copy()\n", "M.data = 
distance2weight(tissue_knn.data)\n", "\n", "print('A distance of 1 becomes a weight of {}.'.format(distance2weight(1)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribution of weights\n", "plt.hist(M.data, bins=20)\n", "plt.title('Distribution of weights')\n", "plt.xlabel('Weight value')\n", "plt.ylabel('Number of edges')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from_scipy_sparse_matrix was removed in NetworkX 3.0; from_scipy_sparse_array is its replacement (NetworkX >= 2.7)\n", "G = nx.from_scipy_sparse_array(M)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adding info on the nodes of the graph\n", "tissueinfo_dic = tissue[['gene']].to_dict()\n", "nx.set_node_attributes(G, tissueinfo_dic['gene'], name='Gene') # gene name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving the graph as a GEXF file readable with Gephi.\n", "nx.write_gexf(G, tissue_name + 'graph.gexf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph below was plotted with Gephi and colored by community (communities detected automatically by Gephi). The gene expression profiles form two distinct clusters.\n", "\n", "![gene expression graph](liver_gene_expression.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applications of the graph\n", "This graph has several possible applications; see the \"genotype graph\" notebook for examples." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }