{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# This notebook shows what data is available for analysis, roughly how it is formatted, and how to make a few plots\n", "## It is meant to be a \"jump start\" for your own exploration of the data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "### Figure80 article\n", "LINK TO BE ADDED\n", "### Paper\n", "https://science.sciencemag.org/content/366/6464/490\n", "### Preprint\n", "https://www.biorxiv.org/content/10.1101/675314v1\n", "### GitHub repository and archived Zenodo copy of the pipeline that goes from raw reads to the processed data available here\n", "https://github.com/mjohnson11/TnSeq_Pipeline and https://zenodo.org/record/3402230#.Xc3brpNKhTY\n", "\n", "(This notebook is essentially a simplified version of the one in these repositories that was used to make the paper figures.)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Requirements if using this locally:\n", "I think a standard Anaconda distribution of Python, which you probably already have, should be enough for everything here: https://www.anaconda.com/distribution/" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Importing various useful libraries / setting up plotting" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as pl\n", "import seaborn as sns\n", "sns.set_style(\"white\")\n", "sns.set_style(\"ticks\")\n", "colors = ['#FFB000', '#648FFF']\n", "%matplotlib notebook" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Reading data:" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## READING DATA\n", "def change_well_format(w):\n", "    # Converts names containing an underscore to the LK<plate>-<row letter><2-digit column> format (12 wells per row)\n", "    if '_' in w:\n", "        plate = int(w[1:3])\n", "        t = 'LK' + str(plate) + '-'\n", "        n = int(w.split('_')[1])\n", "        lets = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']\n", "        l = lets[int(np.floor((n-1)/12))]\n", "        return t + l + str(((n-1) % 12) + 1).zfill(2)\n", "    else:\n", "        return w\n", "\n", "def get_geno_matrix(seg_names):\n", "    # Data from https://www.nature.com/articles/nature11867, can also be downloaded here http://genomics-pubs.princeton.edu/YeastCross_BYxRM/\n", "    d = pd.read_csv('files/BYxRM_GenoData.csv')\n", "    map_genos = {'B': 0, 'R': 1}\n", "    for w in d.keys():\n", "        if change_well_format(w) in seg_names:\n", "            d[change_well_format(w)] = d[w].map(map_genos)\n", "    assert len([s for s in seg_names if s in d.columns]) == len(seg_names)\n", "    return d[['marker'] + seg_names]\n", "\n", "\n", "# Reading segregant fitness in our focal environment (from Jerison et al. 2017)\n", "x_info = pd.read_csv('files/Clones_For_Tn96_Experiment.csv')\n", "# DataFrame.as_matrix was removed from pandas, so build the segregant -> fitness lookup from the two columns directly\n", "seg_to_fit = dict(zip(x_info['segregant'], x_info['initial fitness, YPD 30C']))\n", "# Reading data files containing fitness effect information from the small library experiment (few (~100) mutations in many genetic backgrounds)\n", "tp_all = pd.read_csv('files/TP_data_by_edge.csv')\n", "# This excludes the neutral controls, a few mutations that were unintentionally included in this library, and a few controls that didn't end up getting good enough coverage to really analyze\n", "tp = tp_all.loc[tp_all['Type']=='Experiment']\n", "# Reading the same thing for the large library experiment (many mutations in a few genetic backgrounds)\n", "bt = pd.read_csv('files/BT_data_by_edge.csv')\n", "# Reading aggregate DFE (distribution of fitness effects) statistics for each segregant\n", "tp_dfe = pd.read_csv('files/TP_DFE_statistics.csv')\n", "bt_dfe = pd.read_csv('files/BT_DFE_statistics.csv')\n", "# For 
bioinformatic reasons, the BT (large library) data has a longer edge sequence; truncating it to the first 15 bases so the two experiments are comparable\n", "bt['Long.Edge'] = bt['Edge']\n", "bt['Edge'] = bt['Long.Edge'].str[:15]\n", "# Making a few dictionaries that point from the names I am used to, to the dataframes\n", "dats = {'BT': bt, 'TP': tp, 'BT.DFE': bt_dfe, 'TP.DFE': tp_dfe}\n", "exps = {'BT': 'E1', 'TP': 'E2'}\n", "# Getting a list of segregants in each experiment by looking for columns like segregant.mean.s in the dataframe\n", "segs_all = {exp: [i.split('.')[0] for i in dats[exp] if '.mean.s' in i] for exp in exps}\n", "# Getting genotype information on these segregants using data from Bloom et al. 2013\n", "gm = get_geno_matrix(segs_all['TP'])\n", "# Making restricted lists of segregants that have at least 50 mutations with s measured, for DFE comparisons\n", "segs_use = {exp: [s for s in segs_all[exp] if len(dats[exp].loc[pd.notnull(dats[exp][s + '.mean.s'])])>=50] for exp in exps}\n", "# Sorting those segregants by background fitness, from least fit to most fit\n", "sorted_segs = {exp: sorted(segs_use[exp], key=lambda x: seg_to_fit[x]) for exp in exps}\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Let's just take a look at all this data, to see what's what\n", "- For `tp` and `bt`, each row is a mutation.\n", "- For `tp_dfe` and `bt_dfe`, each row is a DFE statistic (or background fitness, which we can treat similarly).\n", "- For `gm`, each row is a genotyped allele that differs between RM and BY (and takes each state in ~half the segregants)." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DataFrame preview of tp (first 3 rows): 3 rows × 1696 columns
" ], "text/plain": [ " Edge Type Edge.ID Gene.Use Gene_ORF Gene_ORF.nearby \\\n", "0 TGATCATCACGGGAC Experiment 888 in FLC2 FLC2 FLC2 \n", "1 TCGAAAGCACAGTAG Experiment 377 nearby SRB2 NaN SRB2 \n", "2 GTTGAACTGGTTGTT Experiment 197 in STM1 STM1 STM1 \n", "\n", " briefDescription \\\n", "0 Putative calcium channel involved in calcium r... \n", "1 NaN \n", "2 Protein required for optimal translation under... \n", "\n", " briefDescription.nearby chromosome \\\n", "0 Putative calcium channel involved in calcium r... chr01 \n", "1 Subunit of the RNA polymerase II mediator complex chr08 \n", "2 Protein required for optimal translation under... chr12 \n", "\n", " description \\\n", "0 Putative calcium channel involved in calcium r... \n", "1 NaN \n", "2 Protein required for optimal translation under... \n", "\n", " ... full_model_p \\\n", "0 ... 8.990535e-25 \n", "1 ... 5.640931e-01 \n", "2 ... 2.500225e-07 \n", "\n", " full_model_aic full_model_r2_95_conf_low full_model_r2_95_conf_high \\\n", "0 -1049.391200 0.484313 0.664302 \n", "1 -83.796365 0.000082 0.396975 \n", "2 -803.596564 0.103969 0.347860 \n", "\n", " full_model_p_values \\\n", "0 3.3295688484889364e-32;1.8707009014831425e-05;... \n", "1 0.00020013790535836575;0.5640930907526316 \n", "2 1.4054000123711754e-50;0.9145877127045838;4.10... \n", "\n", " full_model_params \\\n", "0 x;locus_9649106_chr14_401893_C_T;locus_9717516... \n", "1 x \n", "2 x;locus_9680363_chr14_433150_G_A \n", "\n", " full_model_coeffs \\\n", "0 -0.009667588479105496;-0.06737728192983725;0.3... \n", "1 -0.049382849134784586;-0.08592702839231847 \n", "2 -0.028546289780384954;-0.0024826738390553073;0... \n", "\n", " qtls \\\n", "0 chr14_401893;chr14_374661;chr14_402771 \n", "1 NaN \n", "2 chr14_433150;chr14_420065;chr14_485550 \n", "\n", " resid.qtls \\\n", "0 chr14_470303;chr14_420065;chr14_492376 \n", "1 NaN \n", "2 chr14_433150;chr14_420065;chr14_492376 \n", "\n", " full.model.qtls \n", "0 chr14_401893;chr14_374661;chr14_402771|chr14_4... 
\n", "1 NaN \n", "2 chr14_433150;chr14_420065;chr14_485550 \n", "\n", "[3 rows x 1696 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tp.iloc[:3]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DataFrame preview of bt (first 3 rows): 3 rows × 246 columns
" ], "text/plain": [ " Edge.ID Edge Gene.Use Gene_ORF Gene_ORF.nearby \\\n", "0 0 TCTCCAAGGGATACT in YMR085W YMR085W YMR085W \n", "1 1 TGTGTCGATTTAGTG in RKM3 RKM3 RKM3 \n", "2 2 TATGGTGCAGAAAAG nearby YNL143C NaN MEP2|YNL143C \n", "\n", " briefDescription \\\n", "0 Putative protein of unknown function \n", "1 Ribosomal lysine methyltransferase \n", "2 NaN \n", "\n", " briefDescription.nearby chromosome \\\n", "0 Putative protein of unknown function chr13 \n", "1 Ribosomal lysine methyltransferase chr02 \n", "2 Ammonium permease involved in regulation of ps... chr14 \n", "\n", " description \\\n", "0 Putative protein of unknown function; YMR085W ... \n", "1 Ribosomal lysine methyltransferase; specific f... \n", "2 NaN \n", "\n", " description.nearby \\\n", "0 Putative protein of unknown function; YMR085W ... \n", "1 Ribosomal lysine methyltransferase; specific f... \n", "2 Ammonium permease involved in regulation of ps... \n", "\n", " ... full_model_r2 full_model_p full_model_aic \\\n", "0 ... 0.010223 0.781072 -80.547553 \n", "1 ... 0.030625 0.567432 -114.446981 \n", "2 ... 0.144110 0.132863 -171.801874 \n", "\n", " full_model_p_values full_model_params \\\n", "0 0.26814250806982043;0.7810722126668491 x \n", "1 0.15944797128735522;0.5674322326468317 x \n", "2 0.10689612722798757;0.13286265303302433 x \n", "\n", " full_model_coeffs qtls resid.qtls full.model.qtls \\\n", "0 0.001490528225059142;0.006953626038516742 NaN NaN NaN \n", "1 0.001162120166529229;-0.009006462611979207 NaN NaN NaN \n", "2 0.0006108064927804045;0.012645873311052722 NaN NaN NaN \n", "\n", " Long.Edge \n", "0 TCTCCAAGGGATACTTAACGTTATTCCTTT \n", "1 TGTGTCGATTTAGTGTTAAAGAATGACGTC \n", "2 TATGGTGCAGAAAAGTGGCTCGGAATGAAC \n", "\n", "[3 rows x 246 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bt.iloc[:3]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DataFrame preview of tp_dfe (first 3 rows): 3 rows × 213 columns
" ], "text/plain": [ " DFE.statistic LK4-G06 LK3-G08 LK2-G07 LK4-H03 LK4-D11 \\\n", "0 background.fitness 0.017536 0.093806 -0.019195 0.022961 -0.060829 \n", "1 mean -0.021606 -0.038078 -0.022692 -0.033220 -0.011229 \n", "2 median -0.017210 -0.022665 -0.012889 -0.025198 -0.012115 \n", "\n", " LK3-E06 LK1-D12 LK6-C05 LK4-C09 \\\n", "0 0.032745 -0.057404 -0.059662 -0.087799 \n", "1 -0.026932 -0.010239 -0.010303 -0.016691 \n", "2 -0.016404 -0.010974 -0.009748 -0.010328 \n", "\n", " ... full_resid_seg_model_r2 \\\n", "0 ... NaN \n", "1 ... NaN \n", "2 ... NaN \n", "\n", " full_resid_seg_model_p full_resid_seg_model_r2_95_conf_low \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " full_resid_seg_model_r2_95_conf_high full_resid_seg_model_p_values \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " full_resid_seg_model_params full_resid_seg_model_coeffs \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " qtls \\\n", "0 chr14_381891;chr14_368185;chr14_393024|chr15_1... \n", "1 chr14_381891;chr14_371959;chr14_393024|chr04_4... \n", "2 chr14_376315;chr14_368185;chr14_393024 \n", "\n", " resid.qtls \\\n", "0 NaN \n", "1 chr14_459667;chr14_414983;chr14_481897 \n", "2 chr14_433150;chr14_414983;chr14_485550 \n", "\n", " full.model.qtls \n", "0 chr14_381891;chr14_368185;chr14_393024|chr15_1... \n", "1 chr14_381891;chr14_371959;chr14_393024|chr04_4... \n", "2 chr14_376315;chr14_368185;chr14_393024|chr14_4... \n", "\n", "[3 rows x 213 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tp_dfe.iloc[:3]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DataFrame preview of bt_dfe (first 3 rows): 3 rows × 85 columns
" ], "text/plain": [ " DFE.statistic LK4-A04 LK3-D08 LK3-G02 LK2-F11 LK1-E09 \\\n", "0 background.fitness 0.036559 0.023625 -0.021251 0.035349 0.001306 \n", "1 mean -0.005015 -0.005416 -0.003760 -0.004865 -0.002509 \n", "2 median 0.001086 0.000173 0.000602 0.000504 0.000646 \n", "\n", " LK4-E02 LK1-C09 LK2-A12 LK4-H11 ... \\\n", "0 -0.021838 0.097835 -0.029612 -0.043350 ... \n", "1 -0.003774 -0.003754 -0.002583 -0.003060 ... \n", "2 0.000713 0.000439 0.000787 0.000182 ... \n", "\n", " full_resid_seg_model_r2 full_resid_seg_model_p \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " full_resid_seg_model_r2_95_conf_low full_resid_seg_model_r2_95_conf_high \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " full_resid_seg_model_p_values full_resid_seg_model_params \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "\n", " full_resid_seg_model_coeffs qtls resid.qtls full.model.qtls \n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "\n", "[3 rows x 85 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bt_dfe.iloc[:3]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DataFrame preview of gm (first 3 rows): 3 rows × 163 columns
" ], "text/plain": [ " marker LK4-G06 LK3-G08 LK2-G07 LK4-A04 LK4-H03 \\\n", "0 27915_chr01_27915_T_C 1 0 1 1 0 \n", "1 28323_chr01_28323_G_A 1 0 1 1 0 \n", "2 28652_chr01_28652_G_T 1 0 1 1 0 \n", "\n", " LK4-D11 LK3-E06 LK1-D10 LK1-D12 ... LK2-D09 LK1-D06 LK3-A01 \\\n", "0 0 0 1 1 ... 0 0 1 \n", "1 0 0 1 1 ... 0 0 1 \n", "2 0 0 1 1 ... 0 0 1 \n", "\n", " LK2-D05 LK1-C11 LK1-G03 LK3-D04 LK3-C06 LK2-A01 LK2-G10 \n", "0 0 1 0 1 1 1 1 \n", "1 0 1 0 1 1 1 1 \n", "2 0 1 0 1 1 1 1 \n", "\n", "[3 rows x 163 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gm.iloc[:3]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Here are some basic statistics on the data from the two experiments:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "tp_seg_edges_measured = [len(tp.loc[pd.notnull(tp[s+'.mean.s'])]) for s in segs_all['TP']]\n", "bt_seg_edges_measured = [len(bt.loc[pd.notnull(bt[s+'.mean.s'])]) for s in segs_all['BT']]" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BT. Tried 20 segregants.\n", "18 segregants included. 413.8888888888889 mutations measured on average, min: 180 max: 614\n", "996 mutations. 710 measured in at least one seg. of those, they are measured in 10.492957746478874 on average.\n", "Measured in 10.492957746478874 segs on average\n", "253 significant in at least one seg, 457 in none, 0.643661971830986 as a fraction of those measured\n" ] } ], "source": [ "print('BT. Tried 20 segregants.')\n", "print(len(segs_all['BT']), ' segregants included.', np.mean(bt_seg_edges_measured), 'mutations measured on average,', 'min:', np.min(bt_seg_edges_measured), 'max:', np.max(bt_seg_edges_measured))\n", "print(len(bt), 'mutations.', len(bt.loc[bt['num.measured']>0]), 'measured in at least one seg.', 'of those, they are measured in', np.mean(bt.loc[bt['num.measured']>0]['num.measured']), 'on average.')\n", "print('Measured in', np.nanmean(bt.loc[bt['num.measured']>0]['num.measured']), 'segs on average')\n", "# The last value printed is a fraction (0 to 1) of the measured mutations, not a percentage\n", "print(len(bt.loc[bt['num.sig']>0]), 'significant in at least one seg,', len(bt.loc[bt['num.measured']>0].loc[bt['num.sig']==0]), 'in none,', \n", "      len(bt.loc[bt['num.measured']>0].loc[bt['num.sig']==0])/len(bt.loc[bt['num.measured']>0]), 'as a fraction of those measured')" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TP. Tried 176 segregants.\n", "162 segregants included. 65.70987654320987 mutations measured on average, min: 1 max: 82\n", "91 mutations. Measured in 116.97802197802197 segs on average\n", "9 is the minimum # segs w measurements for one mutation, and the min # sig effects is: 0\n" ] } ], "source": [ "print('TP. Tried 176 segregants.')\n", "print(len(segs_all['TP']), ' segregants included.', np.mean(tp_seg_edges_measured), 'mutations measured on average,', 'min:', np.min(tp_seg_edges_measured), 'max:', np.max(tp_seg_edges_measured))\n", "print(len(tp), 'mutations.', 'Measured in', np.nanmean(tp['num.measured']), 'segs on average')\n", "print(np.min(tp['num.measured']),'is the minimum # segs w measurements for one mutation, and the min # sig effects is:', np.min(tp['num.sig']))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Additional columns in these datasets describe the results of the modeling described in the paper\n", "They report the R^2, AIC, etc. for various models of epistasis. The models are background fitness (`x`), QTLs (`qtl`), full (both), and resid versions (`resid_qtl` is the QTL model after regressing out `x` effects). The column names look like:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['x_model_r2', 'x_model_p', 'x_model_aic', 'qtl_model_r2', 'qtl_model_p', 'qtl_model_aic', 'resid_qtl_model_r2', 'resid_qtl_model_p', 'resid_qtl_model_aic', 'resid_x_model_r2', 'resid_x_model_p', 'resid_x_model_aic', 'full_model_r2', 'full_model_p', 'full_model_aic']\n" ] } ], "source": [ "print([i for i in tp if 'aic' in i or i[-2:] in ['r2', '_p']])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Plot fitness effect vs. background fitness for a particular mutation" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "def plot_simple(sub, df_row, segs):\n", "    # Keep only the segregants in which this mutation's fitness effect was measured\n", "    measured = [seg for seg in segs if pd.notnull(df_row[seg + '.mean.s'])]\n", "    xs = [seg_to_fit[seg] for seg in measured]\n", "    ys = [df_row[seg + '.mean.s'] for seg in measured]\n", "    ye = [df_row[seg + '.stderr.s'] for seg in measured]\n", "    sub.axhline(y=0, xmin=0, xmax=1, color='#333333', linestyle='dashed', alpha=0.5, lw=1)\n", "    sub.errorbar(x=xs, y=ys, yerr=ye, marker='o', c='k', linestyle='')\n", "    sub.set_xlim([-0.16, 0.12])\n", "    sub.set_ylim([-0.2, 0.08])\n", "    sub.set_xlabel('Background fitness (x)')\n", "    sub.set_ylabel('Fitness effect (s)')\n", "    sns.despine()\n", "\n", "f, sub = pl.subplots(1,1)\n", "dataframe_row = tp[tp['Gene.Use']=='in RPL16A'].iloc[0]\n", "plot_simple(sub, dataframe_row, segs_use['TP'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Plot a couple of DFEs\n", "Note that in the paper we plot combined DFEs by background-fitness quartile to make them less noisy, which is why these plots look different:" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "def get_dfe(df, segname):\n", " use_df = df.loc[pd.notnull(df[segname + '.mean.s'])]\n", " return list(use_df[segname + '.mean.s'])\n", "\n", "def simple_plot_dfe_compare(exp):\n", " f, sub = pl.subplots(1, 1, figsize=(6, 4))\n", " lowest_fit_dfe = get_dfe(dats[exp], sorted_segs[exp][0])\n", " highest_fit_dfe = get_dfe(dats[exp], sorted_segs[exp][-1])\n", " bin_lefts = [(-16.15+i)*0.015-0.005 for i in range(22)]\n", " sub.hist(lowest_fit_dfe, bins=bin_lefts, label='Least Fit Seg', histtype=\"step\", color=\"r\", weights=np.ones_like(lowest_fit_dfe)/float(len(lowest_fit_dfe)))\n", " sub.hist(highest_fit_dfe, bins=bin_lefts, label='Most Fit Seg', histtype=\"step\", color=\"b\", weights=np.ones_like(highest_fit_dfe)/float(len(highest_fit_dfe)))\n", " sub.legend(loc='upper left')\n", " sns.despine()\n", " \n", "simple_plot_dfe_compare('BT')\n", "simple_plot_dfe_compare('TP')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# OK that concludes the jump-start! \n", "\n", "If you want to get deeper into the plotting / data exploration you can also check out the (much longer but less well commented) jupyter notebook on https://github.com/mjohnson11/TnSeq_Pipeline\n", "\n", "Please email me if you have questions: milo.s.johnson.13@gmail.com\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }