{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "26a9e083-2461-4b32-9e6d-da2f4ece7bb3",
   "metadata": {},
   "source": [
    "---\n",
    "format:\n",
    "  html:\n",
    "    code-fold: false\n",
    "    toc: true\n",
    "  ipynb:\n",
    "    toc: true\n",
    "    number-sections: false\n",
    "bibliography: references/references_5.bib\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac338aaf-cb78-4ee7-9259-d8a16e259ba4",
   "metadata": {},
   "source": [
    "# Association models with LDAK\n",
    "\n",
    "Tutorial modified from the LDAK creator Doug Speed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddf864fb-8a2e-433c-a211-432cdd1ee979",
   "metadata": {},
   "source": [
    "For these practicals, we will use the les: `human.bed`, `human.bim` and `human.fam` - 1000 Genomes Project SNP data for 424 humans, 3289 SNPs (Chromosomes 21 & 22). `quant.pheno`, `binary.pheno`, `multi.pheno` and `covar.covar` are phenotypes and covariates for these.\n",
    "\n",
    "`hapmap.bed`, `hapmap.bim` and `hapmap.fam` - SNP data for 1184 humans, 1016 SNPs (from HapMap Project). `mice.bed`, `mice.bim`, `mice.fam` and `mice.pheno` - SNP and phenotype data for 1940 mice, 2984 SNPs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc8a99f6-8c0d-473d-9035-7d81faccf78c",
   "metadata": {},
   "outputs": [],
   "source": [
    "wget https://www.dropbox.com/s/5vcfcree3xnvhew/data.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24f61bcf-0a44-43bd-91d9-22352a5d2b1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "unzip data.zip -d LDAKdata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0f05f3f-063f-4a5a-8e82-72497ac982a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "chmod -R 777 LDAKdata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2630fc93-8c7a-4f3f-a925-74a3ac652cda",
   "metadata": {},
   "source": [
    "# About LDAK\n",
    "\n",
    "Designed in the style of PLINK. For example, all options start with two dashes, if you want to provide data in binary `PLINK` format, you use `--bfile`, to select a subset of SNPs you use `--extract`, provide phenotypes with `--pheno`, perform linear regression with `--linear`, etc. \n",
    "\n",
    "LDKA has some similarity with `GCTA`: you merge kinship matrices using `--add-grm`, estimate variance components using `--reml`, and kinship matrices are stored in a compatible format. LDAK performs most features of GCTA (not bivariate and some conditional analyses). \n",
    "\n",
    "LDAK has a number of additional features (e.g., computing SNP weightings, merging datasets, pro le scoring, thinning and clumping, as well as GBAT for set-based association testing, SumHer for summary stats heritability analysis, and MultiBLUP for prediction) LDAK can also perform all the features of LDSC.\n",
    "\n",
    "LDAK can accommodate almost any type of data. The formats are: Binary PLINK (`--bfile`) - the most commonly used format, an [efficient way to store hard genotypes](https://www.cog-genomics.org/plink/1.9/input#bed) (all values are 0, 1, 2 or NA).  [Gen / Chiamo / Oxford Stat Format](http://www.stats.ox.ac. uk/~marchini/software/gwas/file_format.html) (`--gen`) - rows are predictors, columns are SNPs, can set the number of header rows and columns. Typically, used to provide state probabilities. Can also store haplotype/phasing calls or dosage, etc. Files can be raw text or gzipped. \n",
    "\n",
    "SP Format (`--sp`) - Simple text matrix, with a row for each predictor and a column for each individual. Value need not correspond to SNPs. SPEED Format (`--speed`)- binary version of SP Format More info at [http://www.ldak.org/file-formats/](http://www.ldak.org/file-formats/)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad8986cd-3047-49ed-9134-ef07fe96a2a2",
   "metadata": {},
   "source": [
    "Call LDAK by typing the name of the executable (same for PLINK, GCTA), which will provide you informations about the program and which arguments could be missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34b69003-73d3-4209-b7a8-bc73847ec24b",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19049eab-75b6-4a57-ac79-8ab3f989f65f",
   "metadata": {},
   "source": [
    "LDAK options are provided in pairs. One main argument is required, which is followed by the name of the output le. For example, suppose we want to compute statistics, we would use the main argument `--calc-stats`. The command might look like: \n",
    "\n",
    "```{.bash}\n",
    "ldak --calc-stats output --bfile human\n",
    "```\n",
    "\n",
    "This is asking LDAK to read the data stored in Binary PLINK format with prefix `human`, then save the results to les with prefix `output`. Option pairs can be provided in any order, so would be equivalent to use \n",
    "\n",
    "```{.bash}\n",
    "ldak --bfile human--calc-stats output\n",
    "```\n",
    "\n",
    "(The command above would compute MAF, call-rates and possibly info scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49511bff-262e-4ab8-b8d7-70bc100b3d4e",
   "metadata": {},
   "source": [
    "## Single SNP association analysis\n",
    "\n",
    "We wish to test each of the 3289 SNPs stored in the dataset human individually for association with the continuous phenotype contained within quant.pheno (using the model $Y = + alpha_j X_j$). For this we use the main argument `--linear` Each main argument requires di erent options. Documentation is provided at [www.ldak.org](https://www.ldak.org), but typically the easiest way is to run `LDAK` using just the main argument, then it will tell you what options you require So here we type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4db4e92a-3b5f-4b6c-8924-e926d765eb62",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --linear linear"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7848789-9f64-4bb7-b937-0fcdb31d3404",
   "metadata": {},
   "source": [
    "First LDAK tells us we need to provide phenotypes, so we add that"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a3c392b-0a1a-4d7a-8e70-e5f7bf7c0c4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --linear linear --pheno LDAKdata/quant.pheno"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c9bb1ce-f0e4-4938-9711-b36e628bf6bd",
   "metadata": {},
   "source": [
    "Now LDAK tells us we need to provide data les, so we add them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "521bbe13-093d-41c1-a95c-b3e9d1a05b21",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --linear linear --pheno LDAKdata/quant.pheno --bfile LDAKdata/human"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae774276-6e99-4c17-a224-c8a4d681b436",
   "metadata": {},
   "source": [
    ":::{.callout-warning title=\"Always pay attention at the output screen\"}\n",
    "It will tell you what options are required and why there are errors. It will also update you on progress (as will the file `.progress`) and tell you where the results are saved. For the linear regression, the main results are in `linear.assoc`. We will use `linear.summaries` for summary-statistic methods, `linear.pvalues` for clumping, while `linear.score` provides the polygenic risk score prediction model.\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b653b7be-a54b-4d22-bc62-812bbab49bc9",
   "metadata": {},
   "source": [
    "Some extra options we might wish to use are `--keep` (to use only a subset of individuals), `--extract` (to use only a subset of predictors) and `--covar` (to include covariates, such as age, sex, population axes). \n",
    "\n",
    "To achieve the same in PLINK, you would use \n",
    "\n",
    "```{.bash}\n",
    "plink --linear --out linear_plink --bfile LDAKdata/human --pheno LDAKdata/quant.pheno \n",
    "```\n",
    "\n",
    "Note that by default PLINK ignores individuals without sex (add `--allow-no-sex` to continue). If your phenotype is binary, you might prefer logistic regression (use `--logistic` instead of `--linear`).\n",
    "\n",
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e067f24-43b0-40b6-85eb-36e278c5bf88",
   "metadata": {},
   "outputs": [],
   "source": [
    "assoc=read.table(\"linear.assoc\",head=T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2210f872-fbfe-4f11-b106-6f713e25272e",
   "metadata": {},
   "outputs": [],
   "source": [
    "head(assoc,4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52103f44-7e75-4add-8787-44eef45b6a98",
   "metadata": {},
   "outputs": [],
   "source": [
    "which.min(assoc[,7])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe3326fc-5094-4c2a-83d3-89366f27439a",
   "metadata": {},
   "outputs": [],
   "source": [
    "assoc[71,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5f4a355-5c9d-408d-9e86-2d0334fa567c",
   "metadata": {},
   "outputs": [],
   "source": [
    "options(repr.plot.width=14, repr.plot.height=8)\n",
    "\n",
    "par(mfrow=c(1,2)) \n",
    "plot(-log10(assoc[,7]), \n",
    "     col=assoc[,1], \n",
    "     pch=19,\n",
    "     xlab=\"Chromosome\", \n",
    "     ylab=expression(paste(\"-log\"[10],\" p-value\",sep=\"\")), \n",
    "     main=\"A Pretty Manhattan Plot\",axes=F) \n",
    "marks=array(0,23) \n",
    "for(i in 1:22){\n",
    "    a=which(assoc[,1]==i)\n",
    "    marks[i+1]=marks[i]+length(a)\n",
    "} \n",
    "\n",
    "marks2=(marks[-1]+marks[-23])/2 \n",
    "\n",
    "axis(1,at=marks,rep(\"\",23));\n",
    "\n",
    "axis(1,at=marks2,lab=1:22,tick=F) \n",
    "\n",
    "axis(2);abline(h=-log10(5e-8),lwd=3,lty=2) \n",
    "\n",
    "ord=order(assoc[,7])\n",
    "\n",
    "obsP=assoc[ord,7] \n",
    "\n",
    "N=nrow(assoc)\n",
    "\n",
    "expP=1:N/(N+1) \n",
    "\n",
    "plot(-log10(expP), \n",
    "     -log10(obsP),\n",
    "     pch=19,\n",
    "     xlab=\"Expected-log10 P\", \n",
    "     ylab=\"Observed-log10 P\",\n",
    "     main=\"A Less Pretty QQ-Plot\") \n",
    "\n",
    "abline(a=0,b=1,col=2,lwd=3,lty=3) \n",
    "\n",
    "gif=qchisq(median(assoc[,7]),1,lower=F)/qchisq(.5,1) \n",
    "\n",
    "text(1.5,15,paste(\"Genomic Inflation Factor:\",round(gif,2)),cex=1.3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "048c2f2e-845e-4340-8ad0-4d4c5b28a1a8",
   "metadata": {},
   "source": [
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the Bash kernel\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7d41fb0-b1aa-47d5-98ae-1b5b5564178f",
   "metadata": {},
   "source": [
    "To identify genome-wide associated SNPs we can use `awk`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7590d00a-e930-4b1e-b978-93b6f6937842",
   "metadata": {},
   "outputs": [],
   "source": [
    "awk < linear.assoc '($7<5e-8){print$0}' "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82eec1f7-86ed-4c92-b912-fb9bc4031d28",
   "metadata": {},
   "source": [
    "To identify the independent loci, use the command `--thin-tops`; this will find then prune all SNPs with p-values below a specified threshold. `--thin-tops` requires lots of options, butagain, LDAK will walk you through them..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df36dd40-e5a6-45a5-adf3-ebb5f3b7114e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --thin-tops top --bfile LDAKdata/human --pvalues linear.pvalues \\\n",
    "--cutoff 5e-8 --window-cm 1 --window-prune .2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96b6ed09-751f-4586-b827-f6b0c9c21143",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat top.in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa9ed208-d5b5-4a3b-b19c-619ca230cc1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "awk '(NR==FNR){arr[$1];next}($2inarr){print$0}' top.in linear.assoc | head -n 10"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd33f57c-a31e-4c46-90cd-2a9188272fec",
   "metadata": {},
   "source": [
    "## Preparing a GWAS\n",
    "\n",
    "The steps are as follows: \n",
    "\n",
    "1 - Remove poorly genotyped samples and SNPs\n",
    "2 - Remove ethnic outliers\n",
    "3 - Imputation\n",
    "4 - Remove poorly imputed samples and SNPs\n",
    "5 - Remove related individuals\n",
    "\n",
    "Steps 1, 2 & 3 are fairly constant. Steps 4 & 5 depend on the aim of the GWAS; heritability analyses generally require much stricter quality control than single-SNP analysis, and are designed for unrelated individuals. One would also increase QC if studying a binary phenotype. E.g., for a single-SNP analysis, I might use an info score threshold of >0.5 or >0.8, but for heritability analysis require info >0.99."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c89be51b-6810-42fa-91a5-bb2bfb9cccda",
   "metadata": {},
   "source": [
    "### Standard QC steps on the human data\n",
    "\n",
    "We run some predefined thresholds on the human data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "470e3612-2ea6-4e62-b02b-49bcc731b5a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "plink --bfile LDAKdata/human \\\n",
    "    --geno 0.02 \\\n",
    "    --make-bed \\\n",
    "    --out Results/human1 --silent\n",
    "\n",
    "plink --bfile Results/human1 \\\n",
    "    --mind 0.02 \\\n",
    "    --make-bed \\\n",
    "    --out Results/human2 --silent\n",
    "    \n",
    "plink --bfile Results/human2 \\\n",
    "    --maf 0.05 \\\n",
    "    --make-bed  \\\n",
    "    --out Results/human3 --silent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be5572a4-6dfe-4eaf-b910-e48dcf8a7540",
   "metadata": {},
   "source": [
    "### Ethnic outliers and overlapping to a reference\n",
    "\n",
    "Generally we wish to identify and exclude population outliers (the exception is mixed-model association analysis). For this, we will compute a kinship matrix (also referred to as a relatedness matrix/allelic correlations). \n",
    "\n",
    "The command for computing a kinship matrix is `--calc-kins-direct`. You must use either `--weights` or `--ignore-weights YES` to specify SNP weightings, and `--power` to specify how to standardize SNPs.\n",
    "\n",
    "People normally use (in-sample) principal component analysis (PCA) in this way:thin the SNPs, calculate a kinship matrix, perform PCA then plot, and find outliers. \n",
    "\n",
    "However, you cna also project the data onto population axes derived from a diverse reference panel (e.g., HapMap): \n",
    "\n",
    "1 - Find overlap of SNPs between dataset and reference panel \n",
    "2 - Using the reference panel, calculate a kinship matrix, perform PCA and compute SNP loadings for the principal components \n",
    "3 - Project the data onto these principal components, then plot"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cf8abdf-8c35-4e26-9703-66362c69eb41",
   "metadata": {},
   "source": [
    "#### Ethnic outliers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbe7de47-e4ef-4c91-97ce-213ec3883284",
   "metadata": {},
   "outputs": [],
   "source": [
    "#thin for LD \n",
    "ldak --bfile Results/human3 --thin prune \\\n",
    "    --window-prune .2 --window-cm 1 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9633156-cb27-47bd-8192-1dedd6459335",
   "metadata": {},
   "source": [
    "This produces `prune.in` and `prune.out` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1e2eb05-0db6-4139-a0fc-0a6ed401c869",
   "metadata": {},
   "outputs": [],
   "source": [
    "#compute unweighted kinships \n",
    "ldak --calc-kins-direct prune --bfile Results/human3 \\\n",
    "    --ignore-weights YES --power -1 --extract prune.in \\\n",
    "    --kinship-raw YES "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4fffc52-9a39-493b-8189-94708c100f3e",
   "metadata": {},
   "source": [
    "The kinship matrix is stored with stem `prune`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e297f017-8086-4fbf-9704-fc575a05bb0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#calculate 5 principal components \n",
    "ldak --pca prune --grm prune --axes 5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5962ce59-f584-4b3b-a7c6-4838984a425f",
   "metadata": {},
   "source": [
    "Having computed the PCAs, we can plot and identify outliers\n",
    "\n",
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a489dfa9-e266-4655-9c4d-732338bce1d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "datapca=read.table(\"prune.vect\")\n",
    "plot(datapca[,3:4],xlab=\"PC1\",ylab=\"PC2\")\n",
    "abline(v=.07,col=2,lwd=3,lty=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3860413-60e9-465b-8ae9-942dce01d8ed",
   "metadata": {},
   "source": [
    "### Projection from reference\n",
    "\n",
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0f26f04-f8d4-449b-aeb7-017c0e3e7025",
   "metadata": {},
   "outputs": [],
   "source": [
    "#get overlap between human and hapmap, then thin hapmap \n",
    "awk '(NR==FNR){arr[$2];next}($2 in arr){print $2}' \\\n",
    "    Results/human3.bim LDAKdata/hapmap.bim > overlap.txt \n",
    "\n",
    "ldak --thin thin --bfile LDAKdata/hapmap \\\n",
    "    --window-prune .2 --window-kb 1000 \\\n",
    "    --extract overlap.txt \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a46d0b74-6561-4571-bcac-c8887805b50f",
   "metadata": {},
   "outputs": [],
   "source": [
    "#compute kinships, pcas and loadings \n",
    "ldak --calc-kins-direct hapmap --bfile LDAKdata/hapmap \\\n",
    "    --extract thin.in --ignore-weights YES --power -1 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aca86f1-790e-4e6c-8637-e4cf9f5a4101",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --pca hapmap --grm hapmap --axes 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "231e464b-1e4d-41dc-a5a8-adc62c7df373",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --calc-pca-loads hapmap --bfile LDAKdata/hapmap \\\n",
    "    --grm hapmap --pcastem hapmap\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "032567c5-769c-4c8b-8d04-980ea8e9b8a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "#project human onto these loadings \n",
    "ldak --calc-scores project --bfile Results/human3 \\\n",
    "    --scorefile hapmap.load --power 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cba5d5bd-ca77-4e0a-ae9c-64a9f7b5c0f1",
   "metadata": {},
   "source": [
    "Plot the HapMap PCs, project onto these our data, then identify outliers.  Save a list of individuals to keep to `goodpop.keep`\n",
    "\n",
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f33b099-c30d-4c20-8f85-58bd139604a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "happca=read.table(\"hapmap.vect\") \n",
    "proj=read.table(\"project.profile\",head=T) \n",
    "par(mfrow=c(1,2)) \n",
    "plot(happca[,3:4],col=\"light grey\",pch=19,xlab=\"Proj 1\",ylab=\"Proj 2\") \n",
    "text(c(-.04,.02,.02),c(0,.05,-.04),c(\"African\",\"European\",\"Asian\")) \n",
    "plot(happca[,3:4],col=\"light grey\",pch=19,xlab=\"Proj 1\",ylab=\"Proj 2\") \n",
    "points(proj[,c(5,7)],col=2,pch=19) \n",
    "abline(v=.005,col=2,lwd=3,lty=2) \n",
    "abline(h=.025,col=2,lwd=3,lty=2) \n",
    "keep=which(proj[,5]>.005&proj[,7]>.025) \n",
    "write.table(proj[keep,1:2],\"goodpop.keep\",row=F,col=F,quote=F)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02703945-4ba7-43d5-b122-6f9d719dbfbb",
   "metadata": {},
   "source": [
    "Note that when using a proper number of individuals, the clusters will be much tighter. Also, for convenience, here we used [HapMap](ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_ phaseIII/plink_format), but better to use 1000 Genomes Project (see [www.ldak.org/reference-panel](www.ldak.org/reference-panel))\n",
    "\n",
    "Below we plot both the PCA of the data itself and the PCA of the projection on the reference dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a43491da-70c9-4aff-84de-86dc47b45ee4",
   "metadata": {},
   "outputs": [],
   "source": [
    "datapca=read.table(\"prune.vect\") \n",
    "happca=read.table(\"hapmap.vect\") \n",
    "proj=read.table(\"project.profile\",head=T) \n",
    "sites=rep(1,nrow(proj)) \n",
    "a=grep(\"AFR\",proj[,2])\n",
    "sites[a]=2 \n",
    "a=grep(\"EAS\",proj[,2])\n",
    "sites[a]=3\n",
    "a=grep(\"SAS\",proj[,2])\n",
    "sites[a]=4 \n",
    "a=grep(\"EUR\",proj[,2])\n",
    "sites[a]=5\n",
    "a=grep(\"AMR\",proj[,2])\n",
    "sites[a]=6 \n",
    "a=grep(\"FIN\",proj[,2])\n",
    "sites[a]=7\n",
    "par(mfrow=c(1,2)) \n",
    "plot(datapca[,3:4],col=sites,pch=19,xlab=\"PCA 1\",ylab=\"PCA 2\") \n",
    "plot(happca[,3:4],col=\"light grey\",pch=19,xlab=\"Proj 1\",ylab=\"Proj 2\") \n",
    "points(proj[,c(5,7)],col=sites,pch=19)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6befa136-ba23-436e-b783-b9904a50b5b1",
   "metadata": {},
   "source": [
    "We want to create a covariate le to use in subsequent regressions. Generally include as covariates sex plus 10 principal components (5 from in-sample PC, 5 from population projections, from the two plots above). If you have imputed the data, repeat the PCA using that before using the PCA as covariates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34d80942-e9a0-4380-8974-c77301e3d0dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "#can get sex from Column 5 of the fam file \n",
    "fam=read.table(\"LDAKdata/human.fam\") \n",
    "#use match() to ensure samples are aligned \n",
    "m1=match(fam[,1],datapca[,1]) \n",
    "m2=match(fam[,1],proj[,1]) \n",
    "covar=cbind(fam[,c(1,2,5)],datapca[m1,3:7],proj[m2,c(5,7,9,11,13)]) \n",
    "write.table(covar,\"covar.covar\",row=F,col=F,quote=F)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "457597ac-5544-4f7c-8b49-3fbf699e1fe2",
   "metadata": {},
   "source": [
    "## Remove relatedness\n",
    "\n",
    "Remove Relatedness To estimate relatedness we can use the kinship matrix computed when performing PCA (i.e., from thinned SNPs). Ideally you would first remove population outliers. Notice that allelic correlation kinships are distributed around expected values (e.g., 1/2 for full-sibs, 0 for unrelateds ), and there are values above 1 and below 0. (For proper sample sizes, spread will be smaller). The general assumption is if we find identical individuals, these are accidental duplicates, so we want to remove one of each pair. We can do this using `filter`:\n",
    "\n",
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the Bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d93c7e5-797b-412b-9942-afc53a92c8f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --filter dups --grm prune --keep goodpop.keep --max-rel .8"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d945747-685a-460c-8f80-32f6317d513b",
   "metadata": {},
   "source": [
    "(Of course, duplicates may be twins, so check clinical data if possible).\n",
    "\n",
    "If our aim is single-SNP analysis, we normally want to remove close relatedness; a typical cut-off is 0.185, half-way between half-sibs and cousins:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaa46550-489f-4005-86f3-40c6a5da9410",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --filter close --grm prune --keep goodpop.keep --max-rel .375"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caddda92-b76b-4f77-a361-05a0e144a40d",
   "metadata": {},
   "source": [
    "(This is not necessary if performing mixed-model association analysis)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18bc35da-b292-40d1-bf50-506237d6c955",
   "metadata": {},
   "source": [
    "For heritability analyses, we generally require unrelated individuals, so we lter based on estimated relatedness. As a threshold, you can use an arbitrary value (say 0.025 or 0.05) or second cousins (1/32=0.03125), but prefer to use the negative of the smallest value observed (the default in LDAK) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00da7af7-7ccd-41f5-9ab5-6bfbf7deab91",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --filter filter --grm prune --keep goodpop.keep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc366f18-fee8-4b44-8d3a-76ae8c48a21c",
   "metadata": {},
   "source": [
    "In this (unrealistic example), the minimum kinship is -0.176; using this threshold results in removal of 9 individuals, so that 397 remain (stored in `filter.keep`)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ff63a11-6c67-4bad-9932-c7aa7fcce96d",
   "metadata": {},
   "source": [
    "# SingleSNP analysis\n",
    "\n",
    "Having performed quality control, now would be the correct time to perform single-SNP analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "761ea594-4e36-45de-91c5-4bf0210f24bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --linear linear --pheno LDAKdata/quant.pheno --bfile Results/human3  --covar covar.covar"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89b8ae02-073b-448c-9d65-e1cd4fff4b00",
   "metadata": {},
   "source": [
    "Look at the output message, which tells us the results are in the `linear.assoc` file. The file looks like this"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61218803-b405-43ce-adb4-7f482a8daf59",
   "metadata": {},
   "source": [
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94118d24-196a-4bf4-8384-78d1a88896f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "results_log <- read.table(\"linear.assoc\", head=TRUE)\n",
    "head(results_log)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee82443c-9643-44a6-be72-0fecbbeee1a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup to avoid long messages and plot on screen\n",
    "options(warn=-1)\n",
    "options(jupyter.plot_mimetypes = 'image/png')\n",
    "\n",
    "# Load GWAS package qqman\n",
    "suppressMessages(library(\"qqman\"))\n",
    "\n",
    "# Manhattan plot using --logistic results\n",
    "results_log <- read.table(\"linear.assoc\", head=TRUE)\n",
    "#manhattan(results_log, main = \"Manhattan plot: logistic\", cex.axis=1.1)\n",
    "manhattan(\n",
    "  results_log,\n",
    "  chr = \"Chromosome\",\n",
    "  bp = \"Basepair\",\n",
    "  p = \"Wald_P\",\n",
    "  snp = \"Predictor\",\n",
    "  col = c(\"pink\", \"blue\"),\n",
    "  suggestiveline = -log10(1e-7),\n",
    "  genomewideline = 0,\n",
    "  highlight = NULL,\n",
    "  logp = TRUE,\n",
    "  annotatePval = FALSE,\n",
    "  annotateTop = TRUE\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "691b9d51-2d10-4290-b9a4-408d54df4bbd",
   "metadata": {},
   "source": [
    "Which is the SNP with strongest association? We filter for smallest SNPsas in the line on the plot above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "880dd883-2e91-484d-8d3e-906a50728cc5",
   "metadata": {},
   "outputs": [],
   "source": [
    "library(dplyr)\n",
    "results_log %>% filter(Wald_P < .00001) %>% arrange(Wald_P)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f0f98ae-c1e0-4c25-a889-ca10e5f849a8",
   "metadata": {},
   "source": [
    "# Conditional association analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e9cd874-b840-4000-9faf-43e150251eec",
   "metadata": {},
   "source": [
    "We now wish to condition on the most strongly associated SNP (21:15603999) Save this SNP to a file, then add the option `--top-preds`:\n",
    "\n",
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the Bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d71b0789-72a7-40c9-80c5-07b5e624500a",
   "metadata": {},
   "outputs": [],
   "source": [
    "echo \"21:15603999\" > top.snps ldak \n",
    "\n",
    "ldak --linear linear_cond \\\n",
    "    --pheno LDAKdata/quant.pheno \\\n",
    "    --bfile LDAKdata/human \\\n",
    "    --top-preds top.snps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d09a8668-b42f-4bcd-8143-e0894f72971c",
   "metadata": {},
   "source": [
    "This will include the top SNP as a (fixed-effect) covariate in the regression (this is equivalent to making a covariate file containing the SNP values, then using `--covar` instead of `--top-preds`) Previously, we searched for independent loci by pruning; conditional analysis is a more e ective (but slower) way to do this.\n",
    "\n",
    "We see that when we condition on `21:15603999`, the two nearby SNPs are no longer significant, because they were in high LD with it! You could then also condition on the next top SNP (by adding it to the file `top.snps`), and so on\n",
    "\n",
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1195185d-dd4e-4da7-9f57-6843b27ecee9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup to avoid long messages and plot on screen\n",
    "options(warn=-1)\n",
    "options(jupyter.plot_mimetypes = 'image/png')\n",
    "\n",
    "# Load GWAS package qqman\n",
    "suppressMessages(library(\"qqman\"))\n",
    "\n",
    "# Manhattan plot using --logistic results\n",
    "results_log <- read.table(\"linear_cond.assoc\", head=TRUE)\n",
    "#manhattan(results_log, main = \"Manhattan plot: logistic\", cex.axis=1.1)\n",
    "manhattan(\n",
    "  results_log,\n",
    "  chr = \"Chromosome\",\n",
    "  bp = \"Basepair\",\n",
    "  p = \"Wald_P\",\n",
    "  snp = \"Predictor\",\n",
    "  col = c(\"pink\", \"blue\"),\n",
    "  suggestiveline = -log10(1e-7),\n",
    "  genomewideline = 0,\n",
    "  highlight = NULL,\n",
    "  logp = TRUE,\n",
    "  annotatePval = FALSE,\n",
    "  annotateTop = TRUE\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b64c9e58-06d1-4481-88ca-5ce932f27f5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "library(dplyr)\n",
    "results_log %>% filter(Wald_P < .00001) %>% arrange(Wald_P)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99641f76-e726-47ab-a093-7ce99d56f81e",
   "metadata": {},
   "source": [
    "# Mixed model analysis\n",
    "\n",
    "The standard linear model for testing SNP j is \n",
    "\n",
    "$$Y = + \\beta_j X_j$$\n",
    "\n",
    "Mixed-model linear regression extends this to \n",
    "\n",
    "$$Y = + \\beta_j X_j + g where gi N(0, K \\sigma^2_g)$$\n",
    "\n",
    "and K is a kinship matrix The original idea was that including g allows for long-range correlations due to relatedness and population structure. But recently, it has been realised that **g** will also capture the contribution of other causal variants, so in effect we are testing SNP j using a multi-SNP model. Mixed-model association analysis is also implemented in EMMA, FastLMM, GEMMA and GCTA.\n",
    "\n",
    "The mice data contain large amounts of relatedness (all individualsare derived from just 8 founders). Standard single-SNP analysis will give near meaningless results. Look at the plots from the linear test:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7622a90-865b-40c8-bd34-e6ea6a1166c6",
   "metadata": {},
   "source": [
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the Bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "097c333b-83f9-4d58-95f5-77f04f59e065",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --linear stand --pheno LDAKdata/mice.pheno --bfile LDAKdata/mice"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25a12923-e70f-4eef-96a5-8104b1dacffb",
   "metadata": {},
   "source": [
    "<img src=\"Images/bash.png\" alt=\"R\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d151ad77-8f27-482e-9c95-2b3e81f983d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup to avoid long messages and plot on screen\n",
    "options(warn=-1)\n",
    "options(jupyter.plot_mimetypes = 'image/png')\n",
    "\n",
    "# Load GWAS package qqman\n",
    "suppressMessages(library(\"qqman\"))\n",
    "\n",
    "# Manhattan plot using --logistic results\n",
    "results_log <- read.table(\"stand.assoc\", head=TRUE)\n",
    "#manhattan(results_log, main = \"Manhattan plot: logistic\", cex.axis=1.1)\n",
    "manhattan(\n",
    "  results_log,\n",
    "  chr = \"Chromosome\",\n",
    "  bp = \"Basepair\",\n",
    "  p = \"Wald_P\",\n",
    "  snp = \"Predictor\",\n",
    "  col = c(\"pink\", \"blue\"),\n",
    "  suggestiveline = -log10(1e-7),\n",
    "  genomewideline = 0,\n",
    "  highlight = NULL,\n",
    "  logp = TRUE,\n",
    "  annotatePval = FALSE,\n",
    "  annotateTop = TRUE\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa9a6536-c486-43b6-9a16-c275d8b8cb21",
   "metadata": {},
   "outputs": [],
   "source": [
    "qq(results_log$Wald_P, main = \"Q-Q plot of GWAS p-values (log) using linear model on mice\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fb7408d-5aca-4c89-97e6-4faebcde5dfa",
   "metadata": {},
   "source": [
    "However, results are dramatically improved by adding a kinship matrix\n",
    "\n",
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the Bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fad9fda-2316-479c-afc4-491a761e0e04",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --calc-kins-direct kinsm --bfile LDAKdata/mice\\\n",
    "    --ignore-weights YES --power -1 \n",
    "ldak --linear mm --pheno LDAKdata/mice.pheno --bfile LDAKdata/mice --grm kinsm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74cfeb78-70dc-4743-be61-acdb7248e4fb",
   "metadata": {},
   "source": [
    "<img src=\"Images/R.png\" alt=\"Bash\" width=\"80\"> Choose the R kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6963ca59-3be8-499b-b7c0-3801f20d684b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup to avoid long messages and plot on screen\n",
    "options(warn=-1)\n",
    "options(jupyter.plot_mimetypes = 'image/png')\n",
    "\n",
    "# Load GWAS package qqman\n",
    "suppressMessages(library(\"qqman\"))\n",
    "\n",
    "# Manhattan plot using --logistic results\n",
    "results_log <- read.table(\"mm.assoc\", head=TRUE)\n",
    "#manhattan(results_log, main = \"Manhattan plot: logistic\", cex.axis=1.1)\n",
    "manhattan(\n",
    "  results_log,\n",
    "  chr = \"Chromosome\",\n",
    "  bp = \"Basepair\",\n",
    "  p = \"Wald_P\",\n",
    "  snp = \"Predictor\",\n",
    "  col = c(\"pink\", \"blue\"),\n",
    "  suggestiveline = -log10(1e-7),\n",
    "  genomewideline = 0,\n",
    "  highlight = NULL,\n",
    "  logp = TRUE,\n",
    "  annotatePval = FALSE,\n",
    "  annotateTop = TRUE\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01709d35-1378-4d69-91ab-86fff5b6ae58",
   "metadata": {},
   "outputs": [],
   "source": [
    "qq(results_log$Wald_P, main = \"Q-Q plot of GWAS p-values (log) using linear model on mice\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb22d154-ee81-454c-99eb-2f80538f88d2",
   "metadata": {},
   "source": [
    "# Estimate SNP heritability from data\n",
    "\n",
    "To SNP heritability under the LDAK Model, there are three steps: \n",
    "\n",
    "A - Calculate SNP weightings\n",
    "B - Calculate a weighted kinship matrix \n",
    "C - Estimate the genetic and environmental variance via REML or Haseman Elston or PCGC Regression (using only unrelated individuals). Those are some of the possible techniques which are implemented also in other softwares.\n",
    "\n",
    "## SNP weights\n",
    "\n",
    "The simple way to do this is using `--cut-weights` and `--calc-weights-all`. This can take a lot of time and you need to be patient.\n",
    "\n",
    "<img src=\"Images/bash.png\" alt=\"Bash\" width=\"80\"> Choose the bash kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8433e61-787f-4ca4-98f7-e87027f5be8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --bfile LDAKdata/human --cut-weights sections \\\n",
    "    --keep goodpop.keep \n",
    "\n",
    "ldak --bfile LDAKdata/human --calc-weights-all sections \\\n",
    "    --keep goodpop.keep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90f23f50-3a58-4d53-911d-4ef247b489cd",
   "metadata": {},
   "source": [
    "## Weighted Kinship matrix\n",
    "\n",
    "For this, we use `--calc-kins-direct`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "259a93dc-92b6-4b24-a421-c2916bc1c77d",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --calc-kins-direct kins --bfile LDAKdata/human \\\n",
    "    --weights sections/weights.all\\\n",
    "    --power -.25 --kinship-raw YES"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a60e905-1f88-4c23-97b3-f9a2391c63bd",
   "metadata": {},
   "source": [
    "## Estimate variance components\n",
    "\n",
    "First we consider the phenotype `quant.pheno` and use `REML`. Remember to include `--keep` so that we only use unrelated individuals. The main output file is `quant.reml`, but there are many additional files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c906773-99a7-4c8e-bb63-ee3737f2f9e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --reml quant --grm kins\\\n",
    "    --pheno LDAKdata/quant.pheno\\\n",
    "    --keep filter.keep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16ce5a73-7c6e-4bbd-813f-e08b65ec6e41",
   "metadata": {},
   "source": [
    "Note, large efect SNPs should be included as top predictors!\n",
    "\n",
    "Heritability is estimated to be 0.73 +- 0.08"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46e1fab5-c8bc-4cec-8e62-53e4c97eb721",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat quant.reml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c03b045-ddfd-463b-8dbc-c1143633c250",
   "metadata": {},
   "source": [
    "# From multiple phenotypes\n",
    "\n",
    "The file multi.pheno contains 10 traits. You can use the `--reml multi` option for multiple phenotypes. We also need to decompose the kinship matrix, which speeds up a lot the analysis computation. This is especially useful when testing gene expression, where you can have more than 20K phenotypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d7a9751-7daf-4720-8e90-a072bd565490",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldak --decompose kins --grm kins --keep filter.keep\n",
    "\n",
    "ldak --reml multi --grm kins --eigen kins \\\n",
    "    --pheno LDAKdata/multi.pheno --keep filter.keep\\\n",
    "    --mpheno -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac4fd6cd-9eaf-4555-9ffb-da5640ce6dcc",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat multi.10.reml"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}