{ "cells": [ { "cell_type": "markdown", "id": "61085d1a", "metadata": {}, "source": [ "# oggmap: Step 4 - other evolutionary indices\n", "\n", "This notebook will demonstrate how to to add a other evolutionary indices to scRNA data." ] }, { "cell_type": "markdown", "id": "921e8052", "metadata": {}, "source": [ "## Notebook file\n", "\n", "Notebook file can be obtained here:\n", "\n", "[https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/evolutionary_indices.ipynb](https://raw.githubusercontent.com/kullrich/oggmap/main/docs/notebooks/evolutionary_indices.ipynb)" ] }, { "cell_type": "markdown", "id": "52bac613", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "24cf6855", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scanpy as sc\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from statannot import add_stat_annotation\n", "# increase dpi\n", "%matplotlib inline\n", "#plt.rcParams['figure.dpi'] = 300\n", "#plt.rcParams['savefig.dpi'] = 300\n", "plt.rcParams['figure.figsize'] = [6, 4.5]\n", "#plt.rcParams['figure.figsize'] = [4.4, 3.3]" ] }, { "cell_type": "markdown", "id": "dfcf5fed", "metadata": {}, "source": [ "## Import oggmap python package submodules" ] }, { "cell_type": "code", "execution_count": 2, "id": "20cab03e", "metadata": {}, "outputs": [], "source": [ "# import submodules\n", "from oggmap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets, ncbitax" ] }, { "cell_type": "markdown", "id": "aaa06d2e", "metadata": {}, "source": [ "## Step 0, Step 1, Step 2 and Step 3" ] }, { "cell_type": "markdown", "id": "a5ef1a54", "metadata": {}, "source": [ "In order to come to Step 4, TEI calculation, one needs to have the results from Step 0, Step 1, Step 2 and Step 3.\n", "\n", "The query species in this part is: __*Caenorhabditis elegans*__ (nematode).\n", "\n", "**Note:** In this tutorial Step 0 and Step 2 are different since other evolutionary indices will be used to weight gene expression. It does not need to be a gene age class but can be any discrete or continuous gene based measurement.\n", "\n", "Other evolutionary indices can be e.g.:\n", "\n", "- Tajima'sD\n", "- Nucleotide diversity (within species)\n", "- Nucleotide divergence (between species)\n", "- F-statistics\n", "\n", "Please have a look at the documentation of [Step 0 - run OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/orthofinder.html) to get to know what information and files are mandatory to extract gene age classes from [OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/https://github.com/davidemms/OrthoFinder) results.\n", "\n", "In [Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) you have already been introduced how to extract query lineage information with `oggmap` and the `qlin.get_qlin()` function.\n", "\n", "In [Step 2 - gene age class assignment](https://oggmap.readthedocs.io/en/latest/tutorials/get_orthomap.html) you have already been introduced how to extract an orthomap (gene age class) from [OrthoFinder](https://oggmap.readthedocs.io/en/latest/tutorials/https://github.com/davidemms/OrthoFinder) results with `oggmap` and the `of2orthomap.get_orthomap()` function or how to import pre-calculated orthomaps with the `orthomap2tei.read_orthomap()` function.\n", "\n", "In [Step 3 - map gene/transcript IDs](https://oggmap.readthedocs.io/en/latest/tutorials/geneset_overlap.html) you have already been introduced how to extract gene IDs from `GTF` file with `orthoamp` and the `gtf2t2g.parse_gtf()` function. You have also been introduced how to use the `orthomap2tei.geneset_overlap()` function to check the overlap between the gene IDs and have learned how to use the `orthomap2tei.replace_by()` function to e.g. reduce isoform gene IDs to gene IDs." ] }, { "cell_type": "markdown", "id": "9a708305", "metadata": {}, "source": [ "## Step 0 - Use different pre-calculated evolutionary indices\n", "\n", "Diversity parameter were pre-calculated ([Ma et al., 2021](https://doi.org/10.1093/gbe/evab048)) and is available here:\n", "\n", "[https://doi.org/10.5281/zenodo.7242263](https://doi.org/10.5281/zenodo.7242263)\n", "\n", "or can be accessed with the `dataset` submodule of `oggmap`\n", "\n", "`datasets.ma21_fst(datapath='data')` (download folder set to `'data'`)." ] }, { "cell_type": "code", "execution_count": 3, "id": "9c44e81f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [..........................................................................] 1049100 / 1049100" ] }, { "data": { "text/plain": [ "'data/Ma2021_Fst.tsv'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets.ma21_fst(datapath='data')" ] }, { "cell_type": "markdown", "id": "a54cf9a1", "metadata": {}, "source": [ "### Step 1 - get taxonomic information\n", "\n", "Please have a look at the documentation of [Step 1 - get taxonomic information](https://oggmap.readthedocs.io/en/latest/tutorials/query_lineage.html) to get further insides." ] }, { "cell_type": "code", "execution_count": 4, "id": "cbc0127d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "query name: Caenorhabditis elegans\n", "query taxID: 6239\n", "query kingdom: Eukaryota\n", "query lineage names: \n", "['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Protostomia(33317)', 'Ecdysozoa(1206794)', 'Nematoda(6231)', 'Chromadorea(119089)', 'Rhabditida(6236)', 'Rhabditina(2301116)', 'Rhabditomorpha(2301119)', 'Rhabditoidea(55879)', 'Rhabditidae(6243)', 'Peloderinae(55885)', 'Caenorhabditis(6237)', 'Caenorhabditis elegans(6239)']\n", "query lineage: \n", "[1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 6231, 119089, 6236, 2301116, 2301119, 55879, 6243, 55885, 6237, 6239]\n" ] } ], "source": [ "# get query species taxonomic lineage information\n", "query_lineage = qlin.get_qlin(q='Caenorhabditis elegans', dbname='data/taxadb.sqlite')" ] }, { "cell_type": "markdown", "id": "04f326a4", "metadata": {}, "source": [ "## Step 2 - gene based measurement (query species evolutionary index)\n", "\n", "Here, an other evolutionary index will be used to weight gene expression. It does not need to be a gene age class but can be any discrete or continuous gene based measurement. Continuous values can be binned first and\n", "used as gene groups to weigh expression.\n", "\n", "Other evolutionary indices can be e.g.:\n", "\n", "- Tajima'sD\n", "- Nucleotide diversity (within species)\n", "- Nucleotide divergence (between species)\n", "- F-statistics" ] }, { "cell_type": "code", "execution_count": 5, "id": "03210014", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | WormBase_ID | \n", "Chr | \n", "Gene | \n", "TajimaD | \n", "NormalizedPi | \n", "FayWu | \n", "FST | \n", "
|---|---|---|---|---|---|---|---|
| 0 | \n", "WBGene00000001 | \n", "I | \n", "aap-1 | \n", "-0.6957 | \n", "0.0002 | \n", "-1.2575 | \n", "0.8062 | \n", "
| 1 | \n", "WBGene00000002 | \n", "IV | \n", "aat-1 | \n", "-0.4724 | \n", "0.0001 | \n", "-1.4628 | \n", "0.8846 | \n", "
| 2 | \n", "WBGene00000003 | \n", "V | \n", "aat-2 | \n", "-1.5266 | \n", "0.0001 | \n", "0.0816 | \n", "0.1691 | \n", "
| 3 | \n", "WBGene00000004 | \n", "X | \n", "aat-3 | \n", "-1.6401 | \n", "0.0003 | \n", "-4.7685 | \n", "0.8129 | \n", "
| 4 | \n", "WBGene00000005 | \n", "IV | \n", "aat-4 | \n", "-1.2137 | \n", "0.0006 | \n", "-0.7617 | \n", "0.3725 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 20217 | \n", "WBGene00271701 | \n", "X | \n", "F10D7.10 | \n", "-0.7428 | \n", "0.0000 | \n", "-1.9308 | \n", "0.1111 | \n", "
| 20218 | \n", "WBGene00271703 | \n", "III | \n", "ZK1010.12 | \n", "-1.3386 | \n", "0.0008 | \n", "-3.2528 | \n", "0.7683 | \n", "
| 20219 | \n", "WBGene00271706 | \n", "II | \n", "D2089.8 | \n", "-0.8312 | \n", "0.0012 | \n", "0.4489 | \n", "0.5551 | \n", "
| 20220 | \n", "WBGene00271707 | \n", "V | \n", "ZK105.14 | \n", "-1.0748 | \n", "0.0004 | \n", "-3.2069 | \n", "0.6433 | \n", "
| 20221 | \n", "WBGene00271715 | \n", "III | \n", "B0244.17 | \n", "-0.6617 | \n", "0.0010 | \n", "-1.1870 | \n", "0.6556 | \n", "
20222 rows × 7 columns
\n", "| \n", " | WormBase_ID | \n", "Chr | \n", "Gene | \n", "TajimaD | \n", "NormalizedPi | \n", "FayWu | \n", "FST | \n", "TajimaD_binned | \n", "TajimaD_bins | \n", "NormalizedPi_binned | \n", "NormalizedPi_bins | \n", "FayWu_binned | \n", "FayWu_bins | \n", "FST_binned | \n", "FST_bins | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "WBGene00000001 | \n", "I | \n", "aap-1 | \n", "-0.6957 | \n", "0.0002 | \n", "-1.2575 | \n", "0.8062 | \n", "8.0 | \n", "-0.84 >= x < -0.67368 | \n", "3.0 | \n", "0.0002 >= x < 0.0004 | \n", "5.0 | \n", "-1.68402 >= x < -0.99135 | \n", "5.0 | \n", "0.6286 < x | \n", "
| 1 | \n", "WBGene00000002 | \n", "IV | \n", "aat-1 | \n", "-0.4724 | \n", "0.0001 | \n", "-1.4628 | \n", "0.8846 | \n", "9.0 | \n", "-0.67368 >= x < -0.20841999999999972 | \n", "2.0 | \n", "0.0001 >= x < 0.0002 | \n", "5.0 | \n", "-1.68402 >= x < -0.99135 | \n", "5.0 | \n", "0.6286 < x | \n", "
| 2 | \n", "WBGene00000003 | \n", "V | \n", "aat-2 | \n", "-1.5266 | \n", "0.0001 | \n", "0.0816 | \n", "0.1691 | \n", "3.0 | \n", "-1.6187 >= x < -1.4602466666666667 | \n", "2.0 | \n", "0.0001 >= x < 0.0002 | \n", "8.0 | \n", "0.03219333333333343 >= x < 0.10182000000000008 | \n", "3.0 | \n", "0.1347 >= x < 0.3595 | \n", "
| 3 | \n", "WBGene00000004 | \n", "X | \n", "aat-3 | \n", "-1.6401 | \n", "0.0003 | \n", "-4.7685 | \n", "0.8129 | \n", "2.0 | \n", "-1.8383 >= x < -1.6187 | \n", "3.0 | \n", "0.0002 >= x < 0.0004 | \n", "2.0 | \n", "-9.455993333333332 >= x < -3.9512999999999994 | \n", "5.0 | \n", "0.6286 < x | \n", "
| 4 | \n", "WBGene00000005 | \n", "IV | \n", "aat-4 | \n", "-1.2137 | \n", "0.0006 | \n", "-0.7617 | \n", "0.3725 | \n", "5.0 | \n", "-1.3196 >= x < -1.1656 | \n", "4.0 | \n", "0.0004 >= x < 0.0011 | \n", "6.0 | \n", "-0.99135 >= x < 0.0 | \n", "4.0 | \n", "0.3595 >= x < 0.6286 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 20217 | \n", "WBGene00271701 | \n", "X | \n", "F10D7.10 | \n", "-0.7428 | \n", "0.0000 | \n", "-1.9308 | \n", "0.1111 | \n", "8.0 | \n", "-0.84 >= x < -0.67368 | \n", "1.0 | \n", "x < 0.0001 | \n", "4.0 | \n", "-2.277410000000001 >= x < -1.68402 | \n", "2.0 | \n", "0.0 >= x < 0.1347 | \n", "
| 20218 | \n", "WBGene00271703 | \n", "III | \n", "ZK1010.12 | \n", "-1.3386 | \n", "0.0008 | \n", "-3.2528 | \n", "0.7683 | \n", "4.0 | \n", "-1.4602466666666667 >= x < -1.3196 | \n", "4.0 | \n", "0.0004 >= x < 0.0011 | \n", "3.0 | \n", "-3.9512999999999994 >= x < -2.277410000000001 | \n", "5.0 | \n", "0.6286 < x | \n", "
| 20219 | \n", "WBGene00271706 | \n", "II | \n", "D2089.8 | \n", "-0.8312 | \n", "0.0012 | \n", "0.4489 | \n", "0.5551 | \n", "8.0 | \n", "-0.84 >= x < -0.67368 | \n", "5.0 | \n", "0.0011 < x | \n", "10.0 | \n", "0.2674900000000009 < x | \n", "4.0 | \n", "0.3595 >= x < 0.6286 | \n", "
| 20220 | \n", "WBGene00271707 | \n", "V | \n", "ZK105.14 | \n", "-1.0748 | \n", "0.0004 | \n", "-3.2069 | \n", "0.6433 | \n", "6.0 | \n", "-1.1656 >= x < -1.0416 | \n", "4.0 | \n", "0.0004 >= x < 0.0011 | \n", "3.0 | \n", "-3.9512999999999994 >= x < -2.277410000000001 | \n", "5.0 | \n", "0.6286 < x | \n", "
| 20221 | \n", "WBGene00271715 | \n", "III | \n", "B0244.17 | \n", "-0.6617 | \n", "0.0010 | \n", "-1.1870 | \n", "0.6556 | \n", "9.0 | \n", "-0.67368 >= x < -0.20841999999999972 | \n", "4.0 | \n", "0.0004 >= x < 0.0011 | \n", "5.0 | \n", "-1.68402 >= x < -0.99135 | \n", "5.0 | \n", "0.6286 < x | \n", "
20222 rows × 15 columns
\n", "| \n", " | orig.ident | \n", "nCount_RNA | \n", "nFeature_RNA | \n", "cell | \n", "n.umi | \n", "time.point | \n", "batch | \n", "Size_Factor | \n", "cell.type | \n", "cell.subtype | \n", "plot.cell.type | \n", "raw.embryo.time | \n", "embryo.time | \n", "embryo.time.bin | \n", "raw.embryo.time.bin | \n", "lineage | \n", "passed_initial_QC_or_later_whitelisted | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACCTGAGACAATAC-300.1.1 | \n", "0 | \n", "1630.0 | \n", "781 | \n", "AAACCTGAGACAATAC-300.1.1 | \n", "1630 | \n", "300_minutes | \n", "Waterston_300_minutes | \n", "1.023195 | \n", "Body_wall_muscle | \n", "BWM_head_row_1 | \n", "BWM_head_row_1 | \n", "360 | \n", "380.0 | \n", "330-390 | \n", "330-390 | \n", "MSxpappp | \n", "1 | \n", "
| AAACCTGAGGGCTCTC-300.1.1 | \n", "0 | \n", "2323.0 | \n", "1116 | \n", "AAACCTGAGGGCTCTC-300.1.1 | \n", "2319 | \n", "300_minutes | \n", "Waterston_300_minutes | \n", "1.458210 | \n", "NA | \n", "NA | \n", "NA | \n", "260 | \n", "220.0 | \n", "210-270 | \n", "210-270 | \n", "MSxapaap | \n", "1 | \n", "
| AAACCTGAGTGCGTGA-300.1.1 | \n", "0 | \n", "3725.0 | \n", "1322 | \n", "AAACCTGAGTGCGTGA-300.1.1 | \n", "3719 | \n", "300_minutes | \n", "Waterston_300_minutes | \n", "2.338283 | \n", "NA | \n", "NA | \n", "NA | \n", "270 | \n", "230.0 | \n", "210-270 | \n", "270-330 | \n", "NA | \n", "1 | \n", "
| AAACCTGAGTTGAGTA-300.1.1 | \n", "0 | \n", "4236.0 | \n", "1747 | \n", "AAACCTGAGTTGAGTA-300.1.1 | \n", "4251 | \n", "300_minutes | \n", "Waterston_300_minutes | \n", "2.659051 | \n", "Body_wall_muscle | \n", "BWM_anterior | \n", "BWM_anterior | \n", "260 | \n", "280.0 | \n", "270-330 | \n", "210-270 | \n", "Dxap | \n", "1 | \n", "
| AAACCTGCAAGACGTG-300.1.1 | \n", "0 | \n", "1003.0 | \n", "621 | \n", "AAACCTGCAAGACGTG-300.1.1 | \n", "1003 | \n", "300_minutes | \n", "Waterston_300_minutes | \n", "0.629610 | \n", "Ciliated_amphid_neuron | \n", "AFD | \n", "AFD | \n", "350 | \n", "350.0 | \n", "330-390 | \n", "330-390 | \n", "ABalpppapav/ABpraaaapav | \n", "1 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| TCTGAGACATGTCGAT-b02 | \n", "0 | \n", "581.0 | \n", "361 | \n", "TCTGAGACATGTCGAT-b02 | \n", "585 | \n", "mixed | \n", "Murray_b02 | \n", "0.364709 | \n", "Rectal_gland | \n", "Rectal_gland | \n", "Rectal_gland | \n", "390 | \n", "700.0 | \n", "> 650 | \n", "390-450 | \n", "NA | \n", "1 | \n", "
| TCTGAGACATGTCTCC-b02 | \n", "0 | \n", "516.0 | \n", "327 | \n", "TCTGAGACATGTCTCC-b02 | \n", "510 | \n", "mixed | \n", "Murray_b02 | \n", "0.323907 | \n", "NA | \n", "NA | \n", "NA | \n", "510 | \n", "470.0 | \n", "450-510 | \n", "510-580 | \n", "NA | \n", "1 | \n", "
| TGGCCAGCACGAAGCA-b02 | \n", "0 | \n", "843.0 | \n", "543 | \n", "TGGCCAGCACGAAGCA-b02 | \n", "843 | \n", "mixed | \n", "Murray_b02 | \n", "0.529174 | \n", "NA | \n", "NA | \n", "NA | \n", "400 | \n", "470.0 | \n", "450-510 | \n", "390-450 | \n", "NA | \n", "1 | \n", "
| TGGCGCACAGGCAGTA-b02 | \n", "0 | \n", "634.0 | \n", "397 | \n", "TGGCGCACAGGCAGTA-b02 | \n", "636 | \n", "mixed | \n", "Murray_b02 | \n", "0.397979 | \n", "NA | \n", "NA | \n", "NA | \n", "330 | \n", "350.0 | \n", "330-390 | \n", "330-390 | \n", "NA | \n", "1 | \n", "
| TGGGCGTTCAGGCCCA-b02 | \n", "0 | \n", "1126.0 | \n", "702 | \n", "TGGGCGTTCAGGCCCA-b02 | \n", "1132 | \n", "mixed | \n", "Murray_b02 | \n", "0.706820 | \n", "NA | \n", "NA | \n", "NA | \n", "260 | \n", "265.0 | \n", "210-270 | \n", "210-270 | \n", "NA | \n", "1 | \n", "
89701 rows × 17 columns
\n", "| \n", " | g1_g2_overlap | \n", "g1_ratio | \n", "g2_ratio | \n", "
|---|---|---|---|
| 0 | \n", "20222 | \n", "1.0 | \n", "1.0 | \n", "
| \n", " | TajimaD | \n", "
|---|---|
| AAACCTGAGACAATAC-300.1.1 | \n", "5.769133 | \n", "
| AAACCTGAGGGCTCTC-300.1.1 | \n", "5.898489 | \n", "
| AAACCTGAGTGCGTGA-300.1.1 | \n", "5.736505 | \n", "
| AAACCTGAGTTGAGTA-300.1.1 | \n", "5.795881 | \n", "
| AAACCTGCAAGACGTG-300.1.1 | \n", "5.746057 | \n", "
| ... | \n", "... | \n", "
| TCTGAGACATGTCGAT-b02 | \n", "5.465485 | \n", "
| TCTGAGACATGTCTCC-b02 | \n", "5.543912 | \n", "
| TGGCCAGCACGAAGCA-b02 | \n", "5.524013 | \n", "
| TGGCGCACAGGCAGTA-b02 | \n", "5.652469 | \n", "
| TGGGCGTTCAGGCCCA-b02 | \n", "5.634361 | \n", "
89701 rows × 1 columns
\n", "| \n", " | Fst | \n", "
|---|---|
| AAACCTGAGACAATAC-300.1.1 | \n", "2.997840 | \n", "
| AAACCTGAGGGCTCTC-300.1.1 | \n", "3.103704 | \n", "
| AAACCTGAGTGCGTGA-300.1.1 | \n", "3.100244 | \n", "
| AAACCTGAGTTGAGTA-300.1.1 | \n", "3.105081 | \n", "
| AAACCTGCAAGACGTG-300.1.1 | \n", "3.011721 | \n", "
| ... | \n", "... | \n", "
| TCTGAGACATGTCGAT-b02 | \n", "3.011302 | \n", "
| TCTGAGACATGTCTCC-b02 | \n", "3.134352 | \n", "
| TGGCCAGCACGAAGCA-b02 | \n", "3.054067 | \n", "
| TGGCGCACAGGCAGTA-b02 | \n", "3.117862 | \n", "
| TGGGCGTTCAGGCCCA-b02 | \n", "3.149005 | \n", "
89701 rows × 1 columns
\n", "| \n", " | NormalizedPi | \n", "
|---|---|
| AAACCTGAGACAATAC-300.1.1 | \n", "2.321107 | \n", "
| AAACCTGAGGGCTCTC-300.1.1 | \n", "2.527517 | \n", "
| AAACCTGAGTGCGTGA-300.1.1 | \n", "2.445354 | \n", "
| AAACCTGAGTTGAGTA-300.1.1 | \n", "2.507382 | \n", "
| AAACCTGCAAGACGTG-300.1.1 | \n", "2.655676 | \n", "
| ... | \n", "... | \n", "
| TCTGAGACATGTCGAT-b02 | \n", "2.267350 | \n", "
| TCTGAGACATGTCTCC-b02 | \n", "2.484896 | \n", "
| TGGCCAGCACGAAGCA-b02 | \n", "2.378922 | \n", "
| TGGCGCACAGGCAGTA-b02 | \n", "2.483049 | \n", "
| TGGGCGTTCAGGCCCA-b02 | \n", "2.470968 | \n", "
89701 rows × 1 columns
\n", "