{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "About the author:\n", "Oxana is a data scientist based in Stockholm, Sweden. She is studying for a PhD in Bioinformatics, exploring molecular evolution patterns in eukaryotes. You can follow Oxana on Twitter [@Merenlin](http://twitter.com/Merenlin) or read [her blog](http://merenlin.com).\n", "\n", "#### Introduction\n", "This notebook gives you recipes for the most common data visualizations I encounter in my work as a bioinformatician. If you have always wondered what bioinformatics is all about, or would like to create interactive\n", "visualizations of your genomic data using [plot.ly](https://plotly.com/python/), this is the place to start. \n", "\n", "We will be working with real [gene expression](http://en.wikipedia.org/wiki/Gene_expression) data obtained by [Cap Analysis of Gene Expression (CAGE)](http://en.wikipedia.org/wiki/Cap_analysis_gene_expression) from human samples by the [FANTOM5](http://fantom.gsc.riken.jp/5/) consortium. We will follow the typical workflow of a bioinformatician exploring new data: looking for outliers (interesting genes or samples) and for general patterns in the data. By the end, you'll have an idea of the challenges and upsides of creating interactive visualizations of biological data using the plot.ly Python Open Source Graphing Library. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Obtaining the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FANTOM5 provides high-precision data for thousands of human and mouse samples. The vastness of this data can be overwhelming, and working with it locally is challenging. Luckily, there are many tools out there to make our lives easier. \n", "To create a small data subset that we can work with in this tutorial, I used [TET: Fantom 5 Table Extraction tool](http://fantom.gsc.riken.jp/5/tet). 
I picked a few human samples, mostly brain tissues with a few outliers such as uterus, and downloaded a tab-separated file from the website. For more advanced data extraction, it's worth having a look at [TET's API](https://github.com/Hypercubed/TET/blob/master/README.md). \n", "I picked normalized TPM (tags per million) values and annotated data, so we can focus only on processed data for protein-coding genes. All data files for this notebook are available on figshare: http://dx.doi.org/10.6084/m9.figshare.1430029" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the data from the .tsv file, skipping the first two columns (00Annotation and short_description)." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of genes: 201802\n" ] }, { "data": { "text/html": [ "
|  | uniprot_id | Astrocyte__cerebellum_donor1CNhs1132111500119F6 | Astrocyte__cerebral_cortex_donor1CNhs1086411235116D2 | brain_adult_donor1CNhs1179610084102B3 | brain_adult_pool1CNhs1061710012101C3 | brain_fetal_pool1CNhs1179710085102B4 | breast_adult_donor1CNhs1179210080102A8 | cerebellum__adult_donor10196CNhs1379910173103C2 | cerebellum_adult_donor10252CNhs1232310166103B4 | cerebellum_newborn_donor10223CNhs1407510357105E6 | ... | thalamus__adult_donor10196CNhs1379410168103B6 | thalamus_adult_donor10252CNhs1231410154103A1 | thalamus_adult_donor10258_tech_rep1CNhs1422310370105G1 | thalamus_adult_donor10258_tech_rep2CNhs1455110370105G1 | thalamus_newborn_donor10223CNhs1408410366105F6 | throat_fetal_donor1CNhs1177010061101H7 | thyroid_fetal_donor1CNhs1176910060101H6 | tongue_epidermis_fungiform_papillae_donor1CNhs1346010288104F9 | umbilical_cord_fetal_donor1CNhs1176510057101H3 | uterus_fetal_donor1CNhs1176310055101H1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | uniprot:Q96JB6 | 0.12 | 11.45 | 0 | 0 | 0 | 2.17 | 0 | 0.22 | 1.03 | ... | 5.8 | 0.31 | 5.65 | 2.99 | 0 | 1.19 | 1.01 | 2.58 | 7.04 | 4.48 |
| 2 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 71 columns
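
The loading step described above could be sketched roughly as follows. This is only an illustration, not the notebook's own code: it assumes pandas is available, that the file has a single header row, and that the two skipped columns are named exactly as in the text. The function name `load_expression_table` and the constant `SKIPPED_COLUMNS` are hypothetical.

```python
import pandas as pd

# Column names to leave out, as named in the text above (assumed to match
# the header of the downloaded file exactly).
SKIPPED_COLUMNS = ['00Annotation', 'short_description']


def load_expression_table(source, skipped=SKIPPED_COLUMNS):
    '''Read a tab-separated expression table, leaving out the skipped columns.

    source may be a file path or an open file-like object.
    '''
    # read_table splits on tabs by default; the callable passed to usecols
    # keeps every column whose name is not in the skipped list.
    return pd.read_table(source, usecols=lambda name: name not in skipped)
```

With a hypothetical file name, `table = load_expression_table('expression.tsv')` followed by `print('Number of genes:', len(table))` would report the row count in the same form as the output above. Note that pandas reads the literal string NA as a missing value by default, so the displayed values may differ slightly from the table shown here.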

|  | uniprot_id | Astrocyte__cerebellum_donor1CNhs1132111500119F6 | Astrocyte__cerebral_cortex_donor1CNhs1086411235116D2 | brain_adult_donor1CNhs1179610084102B3 | brain_adult_pool1CNhs1061710012101C3 | brain_fetal_pool1CNhs1179710085102B4 | breast_adult_donor1CNhs1179210080102A8 | cerebellum__adult_donor10196CNhs1379910173103C2 | cerebellum_adult_donor10252CNhs1232310166103B4 | cerebellum_newborn_donor10223CNhs1407510357105E6 | ... | thalamus__adult_donor10196CNhs1379410168103B6 | thalamus_adult_donor10252CNhs1231410154103A1 | thalamus_adult_donor10258_tech_rep1CNhs1422310370105G1 | thalamus_adult_donor10258_tech_rep2CNhs1455110370105G1 | thalamus_newborn_donor10223CNhs1408410366105F6 | throat_fetal_donor1CNhs1177010061101H7 | thyroid_fetal_donor1CNhs1176910060101H6 | tongue_epidermis_fungiform_papillae_donor1CNhs1346010288104F9 | umbilical_cord_fetal_donor1CNhs1176510057101H3 | uterus_fetal_donor1CNhs1176310055101H1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | uniprot:Q96JB6 | 0.12 | 11.45 | 0 | 0 | 0 | 2.17 | 0 | 0.22 | 1.03 | ... | 5.8 | 0.31 | 5.65 | 2.99 | 0 | 1.19 | 1.01 | 2.58 | 7.04 | 4.48 |
| 6 | uniprot:Q8N2H3 | 7.51 | 6.3 | 3.88 | 3.71 | 2 | 5.07 | 1.53 | 1.99 | 6.72 | ... | 4.15 | 7.34 | 4.23 | 3.84 | 4.28 | 15.24 | 11.92 | 7.74 | 7.04 | 15.87 |
| 7 | uniprot:Q8N2H3 | 3.69 | 3.43 | 1.94 | 0.65 | 0 | 0 | 1.53 | 0.55 | 1.55 | ... | 0.83 | 2.69 | 2.82 | 1.92 | 0 | 3.33 | 3.03 | 2.58 | 2.35 | 5.29 |
| 14 | uniprot:Q92902,uniprot:Q658M9,uniprot:Q8WXE5 | 35.58 | 20.61 | 30.05 | 21.71 | 19.96 | 31.9 | 13.73 | 20.79 | 21.72 | ... | 15.75 | 26.89 | 24 | 26.24 | 12.83 | 19.76 | 22.03 | 12.9 | 32.85 | 17.09 |
| 23 | uniprot:Q8WWQ2 | 0 | 0 | 0 | 1.02 | 0 | 0 | 0 | 2.99 | 0 | ... | 5.8 | 8.38 | 4.23 | 8.96 | 0.71 | 0.71 | 0 | 0 | 7.04 | 5.7 |
5 rows × 71 columns
\n", "