{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**!!Important!!** \n", "\n", "If running this on CyVerse, please run the following cell first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ[\"PATH\"] = \"/opt/conda/envs/biosci/bin:\" + os.environ[\"PATH\"]\n", "\n", "# test plink\n", "! plink\n", "\n", "# Import and extract tutorial files\n", "! git clone https://github.com/CosiMichele/GWA_tutorial.git\n", "! cd GWA_tutorial && for i in `ls *.zip`; do unzip ${i}; done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 1. Quality Control" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial uses freely available HapMap data: `hapmap3_r3_b36_fwd.consensus.qc`. We simulated a binary outcome measure (i.e., a binary phenotypic trait) and added it to the dataset. The outcome measure was only simulated for the founders in the HapMap data. This data set will be referred to as `HapMap_3_r3_1`. \n", "\n", "The HapMap data, without our simulated outcome measure, can also be obtained from http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-05_phaseIII/plink_format/ \n", "\n", "It is essential for the execution of the tutorial that all scripts belonging to this tutorial are in the same directory on your UNIX workstation.\n", "\n", "Many scripts include comments explaining how they work. Note that, in order to complete the tutorial, it is essential to execute all of its commands.\n", "\n", "This script can also be used for your own data analysis; to do so, replace the name of the HapMap file with the name of your own data file. 
\n", "\n", "Furthermore, this script is based on a binary outcome measure and is therefore not applicable to quantitative outcome measures (these would require some adaptations).\n", "\n", "Note that most GWAS are performed on an ethnically homogeneous population in which population outliers are removed. The HapMap data used for this tutorial contain multiple distinct ethnic groups, which makes the data problematic for analysis.\n", "\n", "Therefore, we have selected only the EUR individuals of the complete HapMap sample for tutorials 1-3. This selection has already been performed in the `HapMap_3_r3_1` file from our GitHub page.\n", "\n", "The Rscripts used in this tutorial are all executed from the Unix command line.\n", "\n", "Therefore, this tutorial and the other tutorials from our GitHub page can be completed simply by copy-and-pasting all commands from the main scripts into the Unix terminal.\n", "\n", "For a thorough theoretical explanation of all QC steps we refer to the article accompanying this tutorial, entitled \"A tutorial on conducting Genome-Wide-Association Studies: Quality control and statistical analysis\" (https://www.ncbi.nlm.nih.gov/pubmed/29484742)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 1.0\n", "\n", "Change directory to a folder on your UNIX device containing all files from `1_QC_GWAS.zip`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If running on the command line use \n", "\n", "# ! cd HOME/{user}/{path/folder containing your files}\n", "\n", "# or\n", "\n", "# ! cd GWA_tutorial/1_QC_GWAS\n", "\n", "# If running on Jupyter Notebook use\n", "os.chdir(\"GWA_tutorial/1_QC_GWAS\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 1.1: Investigate missingness per individual and per SNP and make histograms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
plink --bfile HapMap_3_r3_1 --missing " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "output: `plink.imiss` and `plink.lmiss`.\n", "\n", "These files show, respectively, the proportion of missing SNPs per individual (`plink.imiss`) and the proportion of missing genotypes per SNP (`plink.lmiss`).\n", "\n", "Generate plots to visualize the missingness results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save hist_miss.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete SNPs and individuals with high levels of missingness. An explanation of this and all following steps can be found in Box 1 and Table 1 of the article mentioned at the start of this tutorial.\n", "\n", "The following two QC commands will not remove any SNPs or individuals. However, it is good practice to start the QC with these non-stringent thresholds. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete SNPs with missingness >0.2.\n", "! plink --bfile HapMap_3_r3_1 --geno 0.2 --make-bed --out HapMap_3_r3_2\n", "\n", "# Delete individuals with missingness >0.2.\n", "! plink --bfile HapMap_3_r3_2 --mind 0.2 --make-bed --out HapMap_3_r3_3\n", "\n", "# Delete SNPs with missingness >0.02.\n", "! plink --bfile HapMap_3_r3_3 --geno 0.02 --make-bed --out HapMap_3_r3_4\n", "\n", "# Delete individuals with missingness >0.02.\n", "! plink --bfile HapMap_3_r3_4 --mind 0.02 --make-bed --out HapMap_3_r3_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 1.2: Check for sex discrepancy. \n", "\n", "Subjects who were a priori determined to be female must have an F value <0.2, and subjects a priori determined to be male must have an F value >0.8. This F value is based on the X-chromosome inbreeding (homozygosity) estimate.\n", "\n", "Subjects who do not fulfil these requirements are flagged \"PROBLEM\" by PLINK." 
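] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an illustration of these thresholds, the classification logic can be sketched in plain Python (a toy sketch, not PLINK's internal code):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy sketch of the F-value thresholds; F comes from plink.sexcheck.\n", "def sex_from_F(f):\n", "    if f < 0.2:\n", "        return \"female\"\n", "    if f > 0.8:\n", "        return \"male\"\n", "    return \"PROBLEM\"  # ambiguous: flagged by PLINK\n", "\n", "print(sex_from_F(0.02), sex_from_F(0.99), sex_from_F(0.5))"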
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_5 --check-sex \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate plots to visualize the sex-check results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save gender_check.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These checks indicate that there is one woman with a sex discrepancy (F value of 0.99). (When using other datasets, a few discrepancies will often be found.) \n", "\n", "The following two commands can be used to deal with individuals with a sex discrepancy.\n", "\n", "Delete individuals with a sex discrepancy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This command generates a list of individuals with the status `PROBLEM`.\n", "! grep \"PROBLEM\" plink.sexcheck | awk '{print $1,$2}' > sex_discrepancy.txt\n", "\n", "# This command removes the listed individuals.\n", "! plink --bfile HapMap_3_r3_5 --remove sex_discrepancy.txt --make-bed --out HapMap_3_r3_6 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 1.3: Generate a bfile with autosomal SNPs only and delete SNPs with a low [minor allele frequency (MAF)](https://en.wikipedia.org/wiki/Minor_allele_frequency).\n", "\n", "Select autosomal SNPs only (i.e., from chromosomes 1 to 22).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt\n", "! plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate a plot of the MAF distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
plink --bfile HapMap_3_r3_7 --freq --out MAF_check\n", "! Rscript --no-save MAF_check.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove SNPs with a low MAF." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_7 --maf 0.05 --make-bed --out HapMap_3_r3_8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1073226 SNPs are left.\n", "\n", "A conventional MAF threshold for a regular GWAS is between 0.01 and 0.05, depending on sample size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 1.4: Delete SNPs which are not in Hardy-Weinberg equilibrium (HWE). \n", "\n", "Check the distribution of HWE p-values of all SNPs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_8 --hardy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting SNPs with an HWE p-value below 0.00001 (required for one of the two plots generated by the next Rscript) makes it possible to zoom in on strongly deviating SNPs. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{ if ($9 < 0.00001) print $0 }' plink.hwe > plinkzoomhwe.hwe\n", "! Rscript --no-save hwe.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the `--hwe` option in plink only filters for controls. Therefore, we use two steps: first a stringent HWE threshold for controls, followed by a less stringent threshold for the cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_8 --hwe 1e-6 --make-bed --out HapMap_hwe_filter_step1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The HWE threshold for the cases filters out only SNPs which deviate extremely from HWE. 
\n", "\n", "This second HWE step focuses only on cases, because in the controls all SNPs with an HWE p-value < 1e-6 were already removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_hwe_filter_step1 --hwe 1e-10 --hwe-all --make-bed --out HapMap_3_r3_9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Theoretical background for this step is given in the accompanying article: https://www.ncbi.nlm.nih.gov/pubmed/29484742" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 1.5: Generate a plot of the distribution of the heterozygosity rate of your subjects and remove individuals with a heterozygosity rate deviating more than 3 SD from the mean.\n", "\n", "Checks for heterozygosity are performed on a set of SNPs which are not highly correlated. Therefore, to generate a list of non-(highly)correlated SNPs, we exclude high-LD regions such as inversions (listed in inversion.txt) and prune the SNPs using the command `--indep-pairwise`. The parameters `50 5 0.2` stand, respectively, for: the window size, the number of SNPs to shift the window at each step, and the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_9 --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: don't delete the file indepSNP.prune.in; we will use it in later steps of the tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
plink --bfile HapMap_3_r3_9 --extract indepSNP.prune.in --het --out R_check" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file contains your pruned data set.\n", "\n", "Generate a plot of the heterozygosity rate distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save check_heterozygosity_rate.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code generates a list of individuals who deviate more than 3 standard deviations from the mean heterozygosity rate.\n", "\n", "For data manipulation we recommend using UNIX. However, when performing statistical calculations R might be more convenient, hence the use of the Rscript for this step:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save heterozygosity_outliers_list.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output of the command above: `fail-het-qc.txt`\n", "\n", "When using our example data/the HapMap data this list contains 2 individuals (i.e., two individuals have a heterozygosity rate deviating more than 3 SDs from the mean).\n", "\n", "Adapt this file to make it compatible with PLINK by removing all quotation marks from the file and selecting only the first two columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! sed 's/\"//g' fail-het-qc.txt | awk '{print $1, $2}' > het_fail_ind.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove heterozygosity rate outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_9 --remove het_fail_ind.txt --make-bed --out HapMap_3_r3_10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 1.6: Check for Cryptic Relatedness\n", "\n", "It is essential to check the datasets you analyse for cryptic relatedness. 
Assuming a random population sample, we are going to exclude all individuals above a pihat threshold of 0.2 in this tutorial.\n", "\n", "Check for relationships between individuals with a pihat > 0.2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_10 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The HapMap dataset is known to contain parent-offspring relations. The following commands will specifically visualize these parent-offspring relations, using the Z values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{ if ($8 > 0.9) print $0 }' pihat_min0.2.genome > zoom_pihat.genome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate a plot to assess the type of relationship." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save Relatedness.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated plots show a considerable number of related individuals (in the plot legend: PO = parent-offspring, UN = unrelated individuals) in the HapMap data. This is expected, since the dataset was constructed as such.\n", "\n", "Normally, family-based data should be analyzed using specific family-based methods. In this tutorial, for demonstrative purposes, we treat the relatedness as cryptic relatedness in a random population sample.\n", "\n", "In this tutorial, we aim to remove all 'relatedness' from our dataset.\n", "\n", "To demonstrate that the majority of the relatedness was due to parent-offspring relations, we only include founders (individuals without parents in the dataset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
plink --bfile HapMap_3_r3_10 --filter-founders --make-bed --out HapMap_3_r3_11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will look again for individuals with a pihat > 0.2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_11 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file `pihat_min0.2_in_founders.genome` shows that, after exclusion of all non-founders, only 1 individual pair with a pihat greater than 0.2 remains in the HapMap data.\n", "\n", "Based on the Z values, this is likely to be a full-sib or DZ twin pair. Notably, they were not given the same family identity (FID) in the HapMap data.\n", "\n", "For each pair of 'related' individuals with a pihat > 0.2, we recommend removing the individual with the lower call rate. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_11 --missing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a UNIX text editor (e.g., vi(m)) to check which individual of the 'related pair' has the lower call rate. 
\n", "\n", "Generate a list with the FID and IID of the individual(s) with a pihat above 0.2, recording who had the lower call rate of the pair.\n", "\n", "In our dataset the individual 13291 NA07045 had the lower call rate.\n", "\n", "```\n", "> vim 0.2_low_call_rate_pihat.txt\n", "> i \n", "> 13291 NA07045\n", "```\n", "\n", "Press Esc on your keyboard, then type:\n", "\n", "```\n", ":x\n", "```\n", "\n", "and press Enter.\n", "\n", "In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.\n", "\n", "Delete the individuals with the lower call rate in 'related' pairs with a pihat > 0.2. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_11 --remove 0.2_low_call_rate_pihat.txt --make-bed --out HapMap_3_r3_12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "CONGRATULATIONS!! You've just successfully completed the first tutorial! You are now able to conduct a proper genetic QC. \n", "\n", "For the next tutorial, using the script 2_Main_script_MDS.txt, you need the following files:\n", "\n", "- The bfile `HapMap_3_r3_12` (i.e., `HapMap_3_r3_12.fam`, `HapMap_3_r3_12.bed`, and `HapMap_3_r3_12.bim`)\n", "- `indepSNP.prune.in`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "---\n", "\n", "# Section 2: Multidimensional Scaling (MDS) \n", "\n", "This is the main script for the second tutorial from our comprehensive tutorial on GWAS and PRS.\n", "\n", "To run this script the following (b)files from the first tutorial are required: HapMap_3_r3_12 (this bfile comprises HapMap_3_r3_12.fam, HapMap_3_r3_12.bim, and HapMap_3_r3_12.bed; you need all three) and indepSNP.prune.in. 
In this tutorial we are going to check for population stratification.\n", "\n", "We will do this as follows: the bfile (HapMap_3_r3_12) generated at the end of the previous tutorial (1_QC_GWAS) will be checked for population stratification using data from the 1000 Genomes Project. Individuals with a non-European ethnic background will be removed.\n", "\n", "Furthermore, this tutorial will generate a covariate file which helps to adjust for remaining population stratification within the European subjects.\n", "\n", "In order to complete this tutorial it is necessary to have generated the bfile `HapMap_3_r3_12` and the file `indepSNP.prune.in` from the previous section.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2.0\n", "\n", "Copy the files from the previous tutorial to the current directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If running on the command line use \n", "\n", "#! cp HOME/{user}/{path/1_QC_GWAS}/HapMap_3_r3_12.* HOME/{user}/{path/2_Population_stratification}\n", "#! cp HOME/{user}/{path/1_QC_GWAS}/indepSNP.prune.in HOME/{user}/{path/2_Population_stratification}\n", "\n", "# or\n", "\n", "# If running on Jupyter Notebook use\n", "import os\n", "os.chdir(\"/home/jovyan/data-store/GWA_tutorial/2_Population_stratification\")\n", "! cp ../1_QC_GWAS/HapMap_3_r3_12.* . && cp ../1_QC_GWAS/indepSNP.prune.in ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the 1000 Genomes data.\n", "\n", "This file from the 1000 Genomes Project contains genetic data of 629 individuals from different ethnic backgrounds.\n", "\n", "**Note, this file is quite large (>60 gigabytes!!)** " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the VCF to PLINK format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the file `ALL.2of4intersection.20100804.genotypes.bim` contains SNPs without an rs-identifier; these SNPs are indicated with `.`. This can also be observed in the file `ALL.2of4intersection.20100804.genotypes.vcf.gz`. To check this file use this command: \n", "\n", "`zmore ALL.2of4intersection.20100804.genotypes.vcf.gz` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The missing rs-identifiers in the 1000 Genomes data are not a problem for this tutorial.\n", "However, for good practice, we will assign unique identifiers to the SNPs with a missing rs-identifier (i.e., the SNPs with `.`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile ALL.2of4intersection.20100804.genotypes --set-missing-var-ids @:#[b37]\\$1,\\$2 --make-bed --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2.1: QC on 1000 Genomes data\n", "\n", "From the 1000 Genomes data, remove variants based on missing genotype data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile ALL.2of4intersection.20100804.genotypes_no_missing_IDs --geno 0.2 --allow-no-sex --make-bed --out 1kG_MDS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove individuals based on missing genotype data." 
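] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For intuition: `--geno` filters variants (columns) on missingness, while `--mind` filters individuals (rows). A toy Python sketch of the same idea (not PLINK code; the genotype matrix is made up for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy genotype matrix: rows = individuals, columns = variants; None = missing call.\n", "G = [[0, 1, None], [1, None, None], [2, 1, 0]]\n", "n_ind, n_var = len(G), len(G[0])\n", "var_miss = [sum(G[i][j] is None for i in range(n_ind)) / n_ind for j in range(n_var)]\n", "ind_miss = [sum(x is None for x in row) / n_var for row in G]\n", "keep_vars = [j for j in range(n_var) if var_miss[j] <= 0.2]  # like --geno 0.2\n", "keep_inds = [i for i in range(n_ind) if ind_miss[i] <= 0.2]  # like --mind 0.2\n", "print(var_miss, ind_miss, keep_vars, keep_inds)"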
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile 1kG_MDS --mind 0.2 --allow-no-sex --make-bed --out 1kG_MDS2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove variants based on missing genotype data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile 1kG_MDS2 --geno 0.02 --allow-no-sex --make-bed --out 1kG_MDS3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove individuals based on missing genotype data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile 1kG_MDS3 --mind 0.02 --allow-no-sex --make-bed --out 1kG_MDS4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove variants based on MAF." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile 1kG_MDS4 --maf 0.05 --allow-no-sex --make-bed --out 1kG_MDS5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the variants present in the HapMap dataset from the 1000 Genomes dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2}' HapMap_3_r3_12.bim > HapMap_SNPs.txt\n", "! plink --bfile 1kG_MDS5 --extract HapMap_SNPs.txt --make-bed --out 1kG_MDS6\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the variants present in the 1000 Genomes dataset from the HapMap dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2}' 1kG_MDS6.bim > 1kG_MDS6_SNPs.txt\n", "! plink --bfile HapMap_3_r3_12 --extract 1kG_MDS6_SNPs.txt --recode --make-bed --out HapMap_MDS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The datasets now contain exactly the same variants. They must also be on the same build; change the build of the 1000 Genomes data to match the HapMap data." 
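] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file fed to `--update-map` simply lists one SNP identifier and one base-pair position per line, which is what the next awk command extracts from the .map file. A minimal Python sketch of writing such a file (toy SNP names, toy positions, and a demo filename, for illustration only):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write an update-map style file: one 'SNP-id bp-position' pair per line.\n", "# Toy data and a demo filename; the tutorial itself uses buildhapmap.txt.\n", "toy_positions = {\"rs0001\": 10583, \"rs0002\": 20567}\n", "with open(\"buildmap_demo.txt\", \"w\") as fh:\n", "    for snp, bp in toy_positions.items():\n", "        fh.write(f\"{snp} {bp}\\n\")\n", "print(open(\"buildmap_demo.txt\").read())"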
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2,$4}' HapMap_MDS.map > buildhapmap.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`buildhapmap.txt` contains one SNP-id and physical position per line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile 1kG_MDS6 --update-map buildhapmap.txt --make-bed --out 1kG_MDS7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`1kG_MDS7` and `HapMap_MDS` now have the same build." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2.2: Merge the HapMap and 1000 Genomes data sets.\n", "\n", "Prior to merging the 1000 Genomes data with the HapMap data we want to make sure that the files are mergeable; for this we conduct 3 steps:\n", "\n", "- 2.2.1. Make sure the reference genome is similar in the HapMap and the 1000 Genomes Project datasets.\n", "- 2.2.2. Resolve strand issues.\n", "- 2.2.3. Remove the SNPs which after the previous two steps still differ between datasets.\n", "\n", "The following steps may seem quite technical in terms of commands, but essentially we just compare the two data sets and make sure they correspond." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.1: Set reference genome " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2,$5}' 1kG_MDS7.bim > 1kg_ref-list.txt\n", "! plink --bfile HapMap_MDS --reference-allele 1kg_ref-list.txt --make-bed --out HapMap-adj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1kG_MDS7 and HapMap-adj now have the same reference genome for all SNPs.\n", "\n", "This command will generate some warnings for impossible A1 allele assignment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.2: Resolve strand issues.\n", "\n", "Check for potential strand issues." 
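] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resolving a strand issue means complementing a SNP's alleles, which is what PLINK's `--flip` does for the SNPs you list. A minimal sketch of the idea (toy code, not PLINK itself):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch of strand flipping: each allele is replaced by its complement.\n", "COMP = {\"A\": \"T\", \"T\": \"A\", \"C\": \"G\", \"G\": \"C\"}\n", "def flip(allele):\n", "    return COMP[allele]\n", "\n", "# An A/C SNP reported on the opposite strand appears as T/G:\n", "print(flip(\"T\"), flip(\"G\"))"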
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2,$5,$6}' 1kG_MDS7.bim > 1kGMDS7_tmp\n", "! awk '{print $2,$5,$6}' HapMap-adj.bim > HapMap-adj_tmp\n", "! sort 1kGMDS7_tmp HapMap-adj_tmp | uniq -u > all_differences.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 1624 differences between the files; some of these might be due to strand issues.\n", "\n", "Flip SNPs to resolve strand issues. \n", "\n", "Print the SNP-identifiers and remove duplicates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $1}' all_differences.txt | sort -u > flip_list.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This generates a file of 812 SNPs. These are the non-corresponding SNPs between the two files. Flip the 812 non-corresponding SNPs. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap-adj --flip flip_list.txt --reference-allele 1kg_ref-list.txt --make-bed --out corrected_hapmap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for SNPs which are still problematic after they have been flipped." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $2,$5,$6}' corrected_hapmap.bim > corrected_hapmap_tmp\n", "! sort 1kGMDS7_tmp corrected_hapmap_tmp | uniq -u > uncorresponding_SNPs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file demonstrates that there are 84 differences between the files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.3 Remove problematic SNPs from HapMap and 1000 Genomes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
awk '{print $1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exclusion.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The command above generates a list of the 42 SNPs which caused the 84 differences between the HapMap and the 1000 Genomes data sets after flipping and setting of the reference genome. \n", "\n", "Remove the 42 problematic SNPs from both datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile corrected_hapmap --exclude SNPs_for_exclusion.txt --make-bed --out HapMap_MDS2\n", "! plink --bfile 1kG_MDS7 --exclude SNPs_for_exclusion.txt --make-bed --out 1kG_MDS8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merge the HapMap and 1000 Genomes data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_MDS2 --bmerge 1kG_MDS8.bed 1kG_MDS8.bim 1kG_MDS8.fam --allow-no-sex --make-bed --out MDS_merge2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note, we are fully aware of the sample overlap between the HapMap and 1000 Genomes datasets. However, for the purpose of this tutorial this is not important.**\n", "\n", "Perform MDS on the HapMap-CEU data anchored by the 1000 Genomes data, using a set of pruned SNPs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile MDS_merge2 --extract indepSNP.prune.in --genome --out MDS_merge2\n", "! plink --bfile MDS_merge2 --read-genome MDS_merge2.genome --cluster --mds-plot 10 --out MDS_merge2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2.3: MDS-plot\n", "\n", "Download the file with population information of the 1000 Genomes dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file `20100804.ALL.panel` contains the population codes of the 1000 Genomes individuals.\n", "\n", "Convert population codes into superpopulation codes (i.e., AFR, AMR, ASN, and EUR)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $1,$1,$2}' 20100804.ALL.panel > race_1kG.txt\n", "! sed 's/JPT/ASN/g' race_1kG.txt > race_1kG2.txt\n", "! sed 's/ASW/AFR/g' race_1kG2.txt > race_1kG3.txt\n", "! sed 's/CEU/EUR/g' race_1kG3.txt > race_1kG4.txt\n", "! sed 's/CHB/ASN/g' race_1kG4.txt > race_1kG5.txt\n", "! sed 's/CHD/ASN/g' race_1kG5.txt > race_1kG6.txt\n", "! sed 's/YRI/AFR/g' race_1kG6.txt > race_1kG7.txt\n", "! sed 's/LWK/AFR/g' race_1kG7.txt > race_1kG8.txt\n", "! sed 's/TSI/EUR/g' race_1kG8.txt > race_1kG9.txt\n", "! sed 's/MXL/AMR/g' race_1kG9.txt > race_1kG10.txt\n", "! sed 's/GBR/EUR/g' race_1kG10.txt > race_1kG11.txt\n", "! sed 's/FIN/EUR/g' race_1kG11.txt > race_1kG12.txt\n", "! sed 's/CHS/ASN/g' race_1kG12.txt > race_1kG13.txt\n", "! sed 's/PUR/AMR/g' race_1kG13.txt > race_1kG14.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a racefile of your own data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print $1,$2,\"OWN\"}' HapMap_MDS.fam > racefile_own.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate racefiles." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
(echo \"FID IID race\"; cat race_1kG14.txt racefile_own.txt) > racefile.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(The subshell with `echo` prepends the header line; it is equivalent to the `sed '1i'` approach but also works with BSD sed, which rejects the GNU-style `1i\\text` syntax.)\n", "\n", "Generate the population stratification plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript MDS_merged.R " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output file `MDS.pdf` demonstrates that our own data fall within the European group of the 1000 Genomes data. Therefore, we do not have to remove subjects.\n", "\n", "For educational purposes, however, we provide the commands below to filter out population-stratification outliers. Please execute them in order to generate the appropriate files for the next tutorial.\n", "\n", "Exclude ethnic outliers.\n", "\n", "Select individuals in the HapMap data below the cut-off thresholds. The cut-off levels are not fixed thresholds but have to be determined based on the visualization of the first two dimensions. To exclude ethnic outliers, the thresholds need to be set around the cluster of the population of interest.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{ if ($4 < -0.04 && $5 > 0.03) print $1,$2 }' MDS_merge2.mds > EUR_MDS_merge2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract these individuals from the HapMap data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_12 --keep EUR_MDS_merge2 --make-bed --out HapMap_3_r3_13" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note, since our HapMap data did not include any ethnic outliers, no individuals were removed at this step. 
However, if our data had included individuals outside of the thresholds we set, these individuals would have been removed.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create covariates based on MDS.\n", "\n", "Perform an MDS ONLY on HapMap data without ethnic outliers. The values of the 10 MDS dimensions are subsequently used as covariates in the association analysis in the third tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_13 --extract indepSNP.prune.in --genome --out HapMap_3_r3_13\n", "! plink --bfile HapMap_3_r3_13 --read-genome HapMap_3_r3_13.genome --cluster --mds-plot 10 --out HapMap_3_r3_13_mds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the format of the `.mds` file into a plink covariate file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! awk '{print$1, $2, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' HapMap_3_r3_13_mds.mds > covar_mds.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values in `covar_mds.txt` will be used as covariates, to adjust for remaining population stratification, in the third tutorial where we will perform a genome-wide association analysis." 
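] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional aside (not part of the original tutorial), the `awk` reformatting step above can also be expressed in plain Python. The sketch below assumes the `.mds` layout produced by plink, i.e. whitespace-separated columns `FID IID SOL C1 ... C10`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: keep FID, IID and the 10 MDS components (fields 1, 2 and 4-13 of\n", "# each .mds line), dropping the SOL column, mirroring the awk one-liner above.\n", "def mds_to_covar(line):\n", "    f = line.split()\n", "    return \" \".join(f[0:2] + f[3:13])\n", "\n", "# Example usage on the file produced by the previous plink step:\n", "# with open(\"HapMap_3_r3_13_mds.mds\") as src, open(\"covar_mds.txt\", \"w\") as dst:\n", "#     for line in src:\n", "#         dst.write(mds_to_covar(line) + \"\\n\")"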
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "CONGRATULATIONS you have successfully controlled your data for population stratification!\n", "\n", "For the next tutorial you need the following files:\n", "\n", "- `HapMap_3_r3_13` (the bfile, i.e., `HapMap_3_r3_13.bed`, `HapMap_3_r3_13.bim`, and `HapMap_3_r3_13.fam`)\n", "- `covar_mds.txt`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "---\n", "# Step 3: Association Analyses\n", "\n", "Just as with the previous tutorials, this tutorial can be completed simply by copy-and-pasting all commands from this main script into the Unix terminal.\n", "\n", "In order to run this script you need the following files from the previous tutorial: `covar_mds.txt` and `HapMap_3_r3_13` (bfile, i.e., `HapMap_3_r3_13.bed`, `HapMap_3_r3_13.bim`, `HapMap_3_r3_13.fam`).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3.0\n", "\n", "Copy the bfile `HapMap_3_r3_13` and the `covar_mds.txt` from the previous tutorial into the current directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If running on the command line use \n", "\n", "#! cp HOME/{user}/{path/ 2_Population_stratification}/HapMap_3_r3_13.* HOME/{user}/{path/3_Association_GWAS/}\n", "#! cp HOME/{user}/{path/ 2_Population_stratification}/covar_mds.txt HOME/{user}/{path/3_Association_GWAS/}\n", "\n", "# or\n", "\n", "# If running on Jupyter Notebook use\n", "\n", "import os\n", "os.chdir(\"/home/jovyan/data-store/GWA_tutorial/3_Association_GWAS\")\n", "! cp ../2_Population_stratification/HapMap_3_r3_13.* . 
&& cp ../2_Population_stratification/covar_mds.txt .\n", "\n", "# If you are not following the tutorial fully, and just want to run this section, do the following:\n", "# go to the command line and do `gocmd init`\n", "# Then do gocmd get --progress iplant/home/shared/cyverse_training/datalab/biosciences/gwas_step3_tut.tar.gz && tar -xvf gwas_step3_tut.tar.gz && cd gwas_step3_raw\n", "# move files according to your folder structure (look at the above os.chdir directions!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For binary traits:\n", "\n", "association" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_13 --assoc --out assoc_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note, the `--assoc` option does not allow correction for covariates such as principal components (PCs) or MDS components, which makes it less suited for association analyses.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "logistic\n", "\n", "We will be using 10 MDS components as covariates in this logistic analysis. We use the MDS components calculated from the previous tutorial: `covar_mds.txt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic --hide-covar --out logistic_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, we use the option `--hide-covar` to only show the additive results of the SNPs in the output file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove `NA` values, as they might cause problems when generating plots in later steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
awk '!/NA/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results obtained from these GWAS analyses will be visualized in the last step. This will also show if the data set contains any genome-wide significant SNPs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note, in case of a quantitative outcome measure the option `--logistic` should be replaced by `--linear`. The use of the `--assoc` option is also possible for quantitative outcome measures (as mentioned previously, this option does not allow the use of covariates).**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3.1: Multiple testing\n", "\n", "There are various ways to deal with multiple testing outside of the conventional genome-wide significance threshold of 5.0E-8; below we present a couple.\n", "\n", "adjust" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_13 --assoc --adjust --out adjusted_assoc_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file gives a Bonferroni-corrected p-value, along with FDR and other corrections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Permutation\n", "\n", "This is a computationally intensive step. Further pros and cons of this method, which can be used for association and dealing with multiple testing, are described in our article corresponding to this tutorial (https://www.ncbi.nlm.nih.gov/pubmed/29484742).\n", "\n", "To reduce computational time we only perform this test on a subset of the SNPs from chromosome 22.\n", "\n", "The EMP2 column provides the p-value corrected for multiple testing.\n", "\n", "Generate a subset of SNPs.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
awk '{ if ($1 == 22 && $4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filter your bfile based on the subset of SNPs generated in the step above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_3_r3_13 --extract subset_snp_chr_22.txt --make-bed --out HapMap_subset_for_perm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform 1,000,000 permutations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! plink --bfile HapMap_subset_for_perm --assoc --mperm 1000000 --out subset_1M_perm_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order your data, from lowest to highest p-value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! sort -gk 4 subset_1M_perm_result.assoc.mperm > sorted_subset.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the ordered permutation results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! head sorted_subset.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3.2: Generate Manhattan and QQ plots\n", "\n", "These scripts assume R >= 3.0.0.\n", "\n", "If you changed the name of the `.assoc` or `.assoc.logistic` file, make sure to also update those names in the Rscripts for the Manhattan and QQ plots; otherwise the scripts will not run.\n", "\n", "The following Rscripts require the R package qqman; the scripts provided will automatically download this package and install it in `/home/{user}/`. 
Additionally, the scripts load the qqman library and can therefore, similar to all other Rscripts on this GitHub page, be executed from the command line.\n", "\n", "This location can be changed to your desired directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! Rscript --no-save Manhattan_plot.R\n", "! Rscript --no-save QQ_plot.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please read below if you encounter an error:\n", "\n", "**Note, the mirror used to download the package qqman may no longer be active, which will result in an error (see below for the specific error).**\n", "\n", "This error can simply be resolved by changing the addresses in the scripts: `Manhattan_plot.R` and `QQ_plot.R`.\n", "\n", "Simply change the address (http://cran...) in the line below, in both Rscripts (`Manhattan_plot.R` and `QQ_plot.R`), to (for example) https://cran.univ-paris1.fr/ .\n", "\n", "`install.packages(\"qqman\",repos=\"http://cran.cnr.berkeley.edu/\",lib=\"~\" )` The installation location can be changed, but it has to correspond with the library location.\n", "\n", "Example error: \n", "```\n", "\"Error in library(\"qqman\", lib.loc = \"-\") :\n", " there is no package called 'qqman'\n", "Execution halted\"\n", "```\n", "\n", "If using CyVerse, the qqman package is already installed. Modify the R files as follows:\n", "\n", "1. Remove the installation line\n", "2. Remove the `lib=\"~\"` argument" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**CONGRATULATIONS** you have successfully conducted a GWAS analysis!\n", "\n", "If you are also interested in learning how to conduct a polygenic risk score (PRS) analysis, please see our fourth tutorial.\n", "The tutorial explaining PRS is independent of the previous tutorials." 
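] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a final optional aside on Step 3.1 (a sketch, not part of the original tutorial): the Bonferroni-corrected per-test threshold is simply the desired alpha divided by the number of tests. The conventional genome-wide threshold of 5.0E-8 corresponds to alpha = 0.05 over roughly one million independent common variants." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: Bonferroni-corrected per-test significance threshold.\n", "def bonferroni_threshold(alpha, n_tests):\n", "    return alpha / n_tests\n", "\n", "# The conventional genome-wide threshold of 5.0E-8:\n", "print(bonferroni_threshold(0.05, 1_000_000))"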
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 2 }