--- title: "_longevityTools_: eQTL, eSNP and GWAS Analysis" author: "Author: Dan Evans" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" package: "`r pkg_ver('longevityTools')`" output: BiocStyle::html_document: toc: true toc_depth: 3 fig_caption: yes fontsize: 14pt bibliography: bibtex.bib --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(longevityTools) library(ggplot2) }) ``` # Introduction This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website [here](http://www.longevitygenomics.org/projects/). The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here. Based on muscle gene expression from 15 young (25 year old) and 15 old (65 year old) participants, Sood et al. identified a 150 probe set that accurately classified young and old individuals in external studies with gene expression data collected from tissues other than muscle [@Sood2015-pb]. A gene score based on the classifier was associated with better renal function, increased survival time over follow-up, and decreased Alzheimer's Disease prevalence. Our goal is to perform Mendelian Randomization using the 150 gene set to determine whether SNPs associated with expression (eSNPs) of the 150 genes are associated with the aging phenotypes identified by Sood et al. Hypothesis: eSNPs associated with younger expression profile are markers for younger biological age relative to chronological age, which would be associated with longevity.
[Back to Table of Contents]()
# General Analytical Approach ## Construction of MR eSNP risk score File sood-2015-fileS1.xls reports the expression direction relative to age for each probe from the 150 probe set. See Top 150 worksheet. A gene marked as 'down' in column 'Ratio of Y:O muscle' means that the expression of the gene is lower in young vs old muscle. The goal is to generate an eSNP risk score constructed from eSNP alleles associated with increasing gene expression for genes that more highly expressed in young tissue and from eSNP alleles associated with decreasing gene expression for genes that show decreased expression in young tissue. This eSNP risk score would represent an expression profile that more closely mimics the younger age profile rather than the older age profile. For the 150 probes that map to genes, identify cis-eSNPs using GTEx preanalyzed results from whole blood. Preanalyzed GTEx results correct for multiple testing of SNPs for each gene. To identify independent cis-eSNPs for each gene, prune cis-eSNPs in LD using 1000G EUR population group genotypes. Once independent cis-eSNPs are identified, create a file listing the SNPs to be included in a risk score and which allele is associated with decreasing expression for genes that increase with age (upregulated gene) and which allele is associated with increasing expression for genes that decrease with age (downregulated gene). Upregulated and downregulated genes based on column 'Ratio of Y:O muscle' from Sood et al supplemental file. Individual level eSNP risk scores will be constructed as below. The eSNP risk score represents the MR instrumental variable. $$ eSNP Risk Score_i = \sum_{j=1}^{N} eSNP_j^{AlleleDown} + \sum_{k=1}^{N} eSNP_k^{AlleleUp} $$ # Getting Started ## Installation The R software for running [_`longevityTools`_](https://github.com/tgirke/longevityTools) can be downloaded from [_CRAN_](http://cran.at.r-project.org/). The _`longevityTools`_ package can be installed from the R console using the following _`biocLite`_ install command. ```{r install, eval=FALSE} source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub ```
[Back to Table of Contents]()
## Loading package and documentation ```{r documentation, eval=FALSE} library("longevityTools") # Loads the package library(help="longevityTools") # Lists package info vignette("longevityTools") # Opens vignette ```
[Back to Table of Contents]()
# Setup of working environment ## Import custom functions Preliminary functions that are still under developement, or not fully tested and documented can be imported with the `source` command from the `inst/extdata` directory of the package. Imported is the function alleleCheck(). ```{r source_fct, eval=TRUE} fctpath <- system.file("extdata", "longevityTools_eQTL_Fct.R", package="longevityTools") source(fctpath) ```
[Back to Table of Contents]()
## Create expected directory structure The following creates the directory structure expected by this workflow. Input data will be stored in the `data` directory and results will be written to the `results` directory. All paths are given relative to the present working directory of the user's R session. ```{r work_envir, eval=FALSE} dir.create("data"); dir.create("results") ```
[Back to Table of Contents]()
# Sample data from GTEx The eQTL data used in this vignette can be downloaded from the [GTEx site](http://www.gtexportal.org/home/datasets). Note, the following downloads a 782 MB file, which will take some time. Using patched V6 results. ```{r eQTL_download, eval=FALSE} download.file('http://www.gtexportal.org/static/datasets/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar','./GTEx_Analysis_v6p_eQTL.tar') untar("GTEx_Analysis_v6p_eQTL.tar") ``` # Data import Will begin with whole blood eSNPs, as the Sood paper reported predictive blood gene expression pattern. ```{r data_input, eval=FALSE} library(longevityTools) library(data.table) #use devel version for fwrite() library(stringr) dat_effect<-fread("~/bigdata/GTEx/GTEx_Analysis_v6p_eQTL/Whole_Blood_Analysis.v6p.egenes.txt",colClasses=c(gene_chr="character",chr="character")) sood<-fread("~/U24/MR/GTEx/data/sood150.csv") sood<-sood[!duplicated(Gene_Symbol),] sood <- sood[HU133_Probeset_ID!=Gene_Symbol , ] setnames(sood,"Y:0_muscle","young_old_muscle") query_genes<-sood[,Gene_Symbol] dat_effect[gene_name %in% query_genes,.N] #121 gene matches dat_effect<-merge(dat_effect,sood,by.x="gene_name",by.y="Gene_Symbol" ) sum(duplicated(dat_effect[,gene_name])) #some eSNPs are indels dat_effect[nchar(ref)==1 & nchar(alt)==1,.N] dat_effect[ , KGID:=paste(chr,snp_pos,sep=":") ] dat_effect[nchar(ref)>1 | nchar(alt)>1,.N] #make KGID for indels #dat_effect[nchar(ref)>1 | nchar(alt)>1 , KGID:=paste0(chr,":",snp_pos,"[b37]",ref,",",alt ) ] #remove indels since they don't merge with my 1KG data anyway dat_effect <- dat_effect[nchar(ref)==1 & nchar(alt)==1,] #count non-sig eSNPs dat_effect[qval>0.05 ,.N] #78 dat_effect[pval_nominal>pval_nominal_threshold,.N] #64 dat_effect[pval_perm>0.05 ,.N] #74 dat_effect[pval_perm>0.05 & pval_nominal= 0.4 , ] #95 eSNPs after poorly imputed SNPs removed dat2<-dat2[order(chr,SNPorder) ,] alleleCheck(dat2) #go ahead and switch coded and noncoded alleles now dat2[ ,coded_all_temp:=coded_all ] dat2[ ,noncoded_all_temp:=noncoded_all ] dat2[allele_switch==1, coded_all:=noncoded_all_temp] dat2[allele_switch==1, noncoded_all:=coded_all_temp] dat2[allele_switch==1, AF_coded_all:=1-AF_coded_all ] #output dat2 for reference fwrite(dat2,file="~/U24/MR/GTEx/data/soodSNPs.csv",na="NA") #extract SNPs from dosage file my_chr<-unique(dat2[,chr] ) for(i in 1:length(my_chr)){ snp_annot<-dat2[chr==my_chr[i],] for(j in 1:snp_annot[,.N]) { mySNPorder<-snp_annot[j,SNPorder ] SNPline<-fread(paste0("~/bigdata/U24/GWAScohorts/MrOS/plink1KG/chr",my_chr[i],".plink.dose"),skip=mySNPorder-1,nrows=1,header=F) if(i==1 & j==1 ){ dose_out<-SNPline } else { dose_out<-rbind(dose_out,SNPline) } } } fwrite(dat2,file="~/U24/MR/GTEx/data/snp_info.csv",na="NA") fwrite(dose_out,file="~/U24/MR/GTEx/data/dose.txt",na="NA",sep="\t",col.names=F) ``` # Association analysis with final age and survival time ```{r association, eval=FALSE} #read in dose file, flip dosage data as needed, multiply betas, sum across SNPs doseHeader<-scan("~/bigdata/U24/GWAScohorts/MrOS/vcf1KG/headerDosage.txt",sep="\t",what="character") doseHeader <- sapply(str_split(doseHeader,"_" ),function(x) x[1] ) dose<-fread("~/U24/MR/GTEx/data/dose.txt",header=F,col.names=doseHeader) dose<-as.data.frame(dose) map<-fread("~/U24/MR/GTEx/data/snp_info.csv") identical(map[,SNPID],dose[[1]]) #SNP map in same order as dose file #identify which rows contain SNPs that need to be flipped, then loop through those rows and flip dosage myrows<-which(map[,allele_switch]==1) for(i in myrows){ temp<-unlist(dose[i,4:4618]) temp<-2-temp dose[i,4:4618]<-temp } #multiply beta by dosage for each snp #Not for GTEx, negative and positive will cancel #for(i in 1:length(map$effect)){ # effect1<-map$effect[i] # dose[i,4:4618] <- effect1*unlist(dose[i,4:4618]) #} #sum allele dosage per person IV <- sapply(dose[4:4618],sum ) #4615 length vector IV <- unname(IV) ID <- names(dose)[4:4618] IV_predictor<-data.table(ID=ID,IV=IV) #now merge to phenotype and covariates pheno<-fread("~/U24/MR/HDL/data/MrOSsurvivalPercentiles.csv") pheno<-merge(pheno,IV_predictor,by="ID") pheno[ ,dead90:=-9L ] pheno[control60dead==1 ,dead90:=0L ] pheno[cases90==1 ,dead90:=1L ] pheno[dead90==-9 ,dead90:=NA ] L1<-glm(dead90 ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4 ,data=pheno, family=binomial(link="logit"), maxit=50 , na.action=na.omit ) summary(L1) #survival library(survival) pheno[DADEAD=="M:Not Applicable" ,DADEAD:=NA] pheno[DADEAD=="A:Missing" , DADEAD:="0"] pheno[,DADEAD:=as.integer(DADEAD)] L1 <- coxph(Surv(FUCDTIME,DADEAD) ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4, data=pheno) summary(L1) ``` # SNP clumping using PLINK The *egene datasets have selected the most significant eSNP for each gene. However, all potential eSNPs for a gene are provided in a separate file. These results could potentially be mined for secondary independent eSNP associations for each gene. GTEx V6 analysis results are based on genotypes imputed to 1000 Genomes (1KG) Phase I version 3. Thus, significant results could be LD-filtered using Phase I data. One could make use of the larger sample size in later projects, such as 1KG Phase 3 or the Haplotype Reference Panel, but the SNP identifiers might differ from GTEx results, creating difficulties. ```{r 1KG_download_filter, eval=FALSE} #genotypes download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", "./ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz") untar("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz") #sample ped file download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v2.20130502.ALL.ped", "./integrated_call_samples_v2.20130502.ALL.ped") #sample super population file download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", "./integrated_call_samples_v3.20130502.ALL.panel") #identify EUR unrelated samples from 1KG phase 3 ped2<-read.table("data/integrated_call_samples_v2.20130502.ALL.ped", stringsAsFactors = F, header = T, sep="\t") ped3<-read.table("data/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = F, header = T, sep="\t") samples1KG <- filter_1KGsamples("EUR",ped2,ped3) samples1KG_ID <- samples1KG[,"Individual.ID",drop=F] write.table(samples1KG_ID,file="data/samples1KG.txt",quote=F,row.names=F,col.names=F) ``` Create region file to use with bcftools for LD. ```{r regionsFile, eval=FALSE} regions<-data.frame(chr="chr1",pos=0,pos_to=0,stringsAsFactors = F) regions$chr[1]<-gsub("chr","",result$snp_chrom[1]) #1KG genotype files do not have chr regions$pos[1]<-min(result$snp_pos) regions$pos_to[1]<-max(result$snp_pos) write.table(regions,file="data/regions.txt",quote=F,row.names=F,col.names=F,sep="\t") ``` Filter 1KG genotypes to only include EUR unrelated individuals and eQTL region. ``` >bcftools --version bcftools 1.3 Using htslib 1.3 Copyright (C) 2015 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. ``` ``` bcftools view -Ov -o results/1KGgeno.vcf -S data/samples1KG.txt -R data/regions.txt -v snps data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz #without -v snps, multiple indels with same rsID were outputted, and plink would not read that in. #Quick solution, only include SNPs, not indels. ``` Run PLINK clump command using default settings, but might want to change with different nominal significance thresholds. ``` plink -vcf results/1KGgeno.vcf --clump data/eSNP.assoc ``` PLINK clump command identifies 8 independent eSNPs in the region. Next step, extract independent eSNPs from individual level genotype data, build MR risk score, evaluate for association with survival time.
[Back to Table of Contents]()
# Funding This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).
[Back to Table of Contents]()
# Version information ```{r sessionInfo} sessionInfo() ```
[Back to Table of Contents]()
# References