{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: PCA and other dimensionality reduction\n", "\n", "[View as notebook](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/newdocs/API-analysis/cookbook-pca.ipynb)\n", "\n", "Principal component analysis is a dimensionality reduction method used to transform and project data points onto fewer orthogonal axes that explain the greatest amount of variance in the data. While there are many tools available to implement PCA, the ipyrad tool provides options designed specifically to deal with missing data, to which PCA analyses are *very sensitive*. The `ipyrad.pca` tool makes it easy to perform PCA on RAD-seq data by filtering and/or imputing missing data, and by allowing easy subsampling of the individuals included in analyses. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda\n", "# conda install scikit-learn -c bioconda\n", "# conda install toyplot -c eaton-lab" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import pandas as pd\n", "import toyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required input data files\n", "Your input data should be a `.snps.hdf5` database file produced by ipyrad. If you do not have this file, you can generate it from any VCF file following the [vcf2hdf5 tool tutorial](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-vcf2hdf5.html). The database file contains the genotype call information as well as the linkage information used for subsampling unlinked SNPs and for bootstrap resampling."
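For intuition about what such a database holds: the SNP data amount to a samples-by-SNPs matrix of genotypes coded 0/1/2 (copies of one allele), with some calls missing. A toy numpy sketch of such a matrix and of measuring per-sample missingness (hypothetical values, not ipyrad code or the actual HDF5 layout):

```python
import numpy as np

# hypothetical 3-sample x 4-SNP genotype matrix; np.nan marks missing calls
geno = np.array([
    [0, 1, 2, np.nan],
    [0, np.nan, 2, 2],
    [1, 1, np.nan, np.nan],
], dtype=float)

# proportion of missing genotype calls per sample
missing = np.isnan(geno).mean(axis=1)
print(missing)  # one value per sample: 0.25, 0.25, 0.5
```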
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# the path to your .snps.hdf5 database file\n", "data = \"/home/deren/Downloads/ref_pop2.snps.hdf5\"\n", "#data = \"/home/deren/Downloads/denovo-min50.snps.hdf5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input data file and population assignments\n", "Population assignments (imap dictionary) are optional, but can be used in a number of ways by the `pca` tool. First, you can filter your data to require at least N coverage in each population. Second, you can use the frequency of genotypes within populations to *impute* missing data for other samples. Finally, population assignments can be used to color points when plotting your results. You can assign individual samples to populations using an `imap` dictionary like below. We also create a `minmap` dictionary stating that we want to require 50% coverage in each population. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# group individuals into populations\n", "imap = {\n", " \"virg\": [\"TXWV2\", \"LALC2\", \"SCCU3\", \"FLSF33\", \"FLBA140\"],\n", " \"mini\": [\"FLSF47\", \"FLMO62\", \"FLSA185\", \"FLCK216\"],\n", " \"gemi\": [\"FLCK18\", \"FLSF54\", \"FLWO6\", \"FLAB109\"],\n", " \"bran\": [\"BJSL25\", \"BJSB3\", \"BJVL19\"],\n", " \"fusi\": [\"MXED8\", \"MXGT4\", \"TXGR3\", \"TXMD3\"],\n", " \"sagr\": [\"CUVN10\", \"CUCA4\", \"CUSV6\"],\n", " \"oleo\": [\"CRL0030\", \"HNDA09\", \"BZBB1\", \"MXSA3017\"],\n", "}\n", "\n", "# require that 50% of samples have data in each group\n", "minmap = {i: 0.5 for i in imap}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# ipa.snps_extracter(data).names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Enter data file and params\n", "The `pca` analysis object takes input data as the *.snps.hdf5* file produced by ipyrad. 
All other parameters are optional. The **imap** dictionary groups individuals into populations and **minmap** can be used to filter SNPs to only include those that have data for at least some proportion of samples in every group. The **mincov** option works similarly: it filters SNPs that are shared by less than some proportion of all samples (in contrast to minmap, it does not use the imap groupings). \n", "\n", "When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled by imputation. The options for imputing data are listed further down in this tutorial. Here we are using the \"sample\" method, which I generally recommend. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 110150\n", "Filtered (minmap): 112898\n", "Filtered (combined): 138697\n", "Sites after filtering: 211217\n", "Sites containing missing values: 183722 (86.98%)\n", "Missing values in SNP matrix: 501031 (8.79%)\n", "Imputation: 'sampled'; (0, 1, 2) = 77.1%, 10.7%, 12.2%\n" ] } ], "source": [ "# init pca object with input data and (optional) parameter options\n", "pca = ipa.pca(\n", " data=data,\n", " imap=imap,\n", " minmap=minmap,\n", " mincov=0.75,\n", " impute_method=\"sample\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run PCA\n", "Call `.run()` to generate the PC axes and the variance explained by each axis. The results are stored in your analysis object as dictionaries under the attributes `.pcaxes` and `.variances`. Feel free to take these data and plot them using any method you prefer. 
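For example, axes and variance proportions of this kind can be computed from any centered data matrix with a plain SVD; a minimal numpy sketch on toy random data (the generic PCA algorithm, not ipyrad's internal code):

```python
import numpy as np

# toy genotype matrix: 12 samples x 200 SNPs coded 0/1/2
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(12, 200)).astype(float)

# center each SNP column, then decompose
X = geno - geno.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

pcaxes = U * S                    # sample coordinates on the PC axes
variances = S**2 / (S**2).sum()   # proportion of variance per axis

print(pcaxes.shape, variances[:2])
```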
The code cell below shows how to save the data to a CSV file and how to view the PC data as a table. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling SNPs: 28369/211217\n" ] } ], "source": [ "# run the PCA analysis\n", "pca.run()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9\n", "BJSB3 45.26 52.07 -10.97 -35.64 2.39 -0.39 3.12 0.14 2.59 0.07\n", "BJSL25 43.05 48.69 -9.43 -30.46 1.44 0.72 2.19 -0.17 1.07 0.63\n", "BJVL19 43.01 48.83 -10.85 -31.74 2.47 0.47 2.57 -0.45 2.36 -0.62\n", "BZBB1 39.00 -48.77 4.83 0.08 10.37 -22.45 2.96 -2.56 3.47 -0.63\n", "CRL0030 39.69 -49.19 3.03 -0.23 8.56 -13.14 0.13 1.22 2.31 0.07\n", "CUCA4 12.60 -34.85 -3.61 -4.60 -21.33 40.63 -0.54 15.26 0.97 -4.93\n", "CUSV6 8.41 -33.68 -3.85 -5.30 -21.44 42.09 3.06 -18.19 0.33 8.01\n", "CUVN10 13.45 -35.30 -1.01 -3.65 -14.59 27.80 1.94 1.44 -3.39 -7.37\n", "FLAB109 -31.42 -0.51 -20.02 -3.05 -25.75 -16.48 -2.72 -3.17 1.67 -1.13\n", "FLBA140 -30.57 2.87 25.80 -8.74 3.68 0.65 -0.59 3.00 -0.39 0.15" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# store the PC axes as a dataframe\n", "df = pd.DataFrame(pca.pcaxes[0], index=pca.names)\n", "\n", "# write the PC axes to a CSV file\n", "df.to_csv(\"pca_analysis.csv\")\n", "\n", "# show the first 10 samples and the first 10 PC axes\n", "df.iloc[:10, :10].round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run PCA and plot results\n", "When you call `.run()` a PCA model is fit to the data and two results are generated: (1) sample weightings on the component axes; (2) the proportion of variance explained by each axis. For convenience we have developed a plotting function that can be called as `.draw()` to plot these results (generated with [`toyplot`](https://toyplot.rtfd.io)). The first two arguments to this function are the two axes to be plotted. By default this plotting function will use the `imap` information to color points and create a legend." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot PCA scatterplot: PC0 (15.0%) explained vs PC2 (6.7%) explained; legend: virg, mini, gemi, bran, fusi, sagr, oleo]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot PC axes 0 and 2\n", "pca.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subsampling SNPs\n", "By default `run()` will randomly subsample one SNP per RAD locus to reduce the effect of linkage on your results. This can be turned off by setting `subsample=False`, as in the next example below. When using subsampling you can set the random seed to make your results repeatable. The run above subsampled 28K SNPs from a possible 211K SNPs, yet the results below with subsampling turned off are quite similar." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot PCA scatterplot, no subsampling: PC0 (13.8%) explained vs PC2 (6.6%) explained]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot PC axes 0 and 2 with no subsampling\n", "pca.run(subsample=False)\n", "pca.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subsampling with replication\n", "Subsampling *unlinked* SNPs is generally a good idea for PCA analyses since you want to remove the effects of linkage from your data. It also presents a convenient way to explore the confidence in your results. By using the option `nreplicates` you can run many replicate analyses that subsample a different random set of unlinked SNPs each time. The replicate results are drawn with a lower opacity and the centroid of all the points for each sample is plotted as a black point. You can hover over the points with your cursor to see the sample names pop up. \n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling SNPs: 28369/211217\n" ] }, { "data": { "text/html": [ "
[toyplot PCA scatterplot, 25 replicate subsamples: PC0 (14.8%) explained vs PC2 (6.7%) explained]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot PC axes 0 and 2 with many replicate subsamples\n", "pca.run(nreplicates=25, seed=12345)\n", "pca.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------\n", "\n", "### Advanced: Imputation algorithms:\n", "\n", "We offer three algorithms for *imputing* missing data:\n", "\n", "1. **sample**: Randomly sample genotypes based on the frequency of alleles within (user-defined) populations (imap). \n", "\n", "\n", "2. **kmeans**: Randomly sample genotypes based on the frequency of alleles in (kmeans cluster-generated) populations. \n", "\n", "\n", "3. **None**: All missing values are imputed with zeros (ancestral allele)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### No imputation (None)\n", "The None option will almost always be a *bad choice* when there is any reasonable amount of missing data. Missing values will all be filled as zeros (ancestral allele) -- this is what many other PCA tools do as well. I show it here for comparison to the imputed results, which are better. The two points near the top of the plot are samples with the most missing data that are erroneously grouped together. The rest of the samples also form much less clear clusters than in the other examples where we use imputation or stricter filtering options." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 9517\n", "Filtered (minmap): 112898\n", "Filtered (combined): 121048\n", "Sites after filtering: 228866\n", "Sites containing missing values: 201371 (87.99%)\n", "Missing values in SNP matrix: 640419 (10.36%)\n", "Imputation (null; sets to 0): 100.0%, 0.0%, 0.0%\n", "Subsampling SNPs: 29695/228866\n" ] }, { "data": { "text/html": [ "
[toyplot PCA scatterplot, no imputation: PC0 (12.0%) explained vs PC2 (5.1%) explained]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# init pca object with input data and (optional) parameter options\n", "pca1 = ipa.pca(\n", " data=data,\n", " imap=imap,\n", " minmap=minmap,\n", " mincov=0.25,\n", " impute_method=None,\n", ")\n", "\n", "# run and draw results for impute_method=None\n", "pca1.run(nreplicates=25, seed=123)\n", "pca1.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### No imputation but stricter filtering (mincov)\n", "Here I do not allow for any missing data (`mincov`=1.0). You can see that this reduces the number of total SNPs from 349K to 27K. The final result is not too different from our first example, but seems a little less smooth. In most data sets it is probably better to include more data by imputing some values, though. Many data sets may not have as many SNPs without missing data as this one." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 321628\n", "Filtered (minmap): 112898\n", "Filtered (combined): 322419\n", "Sites after filtering: 27495\n", "Sites containing missing values: 0 (0.00%)\n", "Missing values in SNP matrix: 0 (0.00%)\n", "Subsampling SNPs: 6675/27495\n" ] }, { "data": { "text/html": [ "
[toyplot PCA scatterplot, mincov=1.0: PC0 (14.8%) explained vs PC2 (6.7%) explained]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# init pca object with input data and (optional) parameter options\n", "pca2 = ipa.pca(\n", " data=data,\n", " imap=imap,\n", " minmap=minmap,\n", " mincov=1.0,\n", " impute_method=None,\n", ")\n", "\n", "# run and draw results for impute_method=None and mincov=1.0\n", "pca2.run(nreplicates=25, seed=123)\n", "pca2.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Kmeans imputation (integer)\n", "\n", "The *kmeans* clustering method allows imputing values based on population allele frequencies (like the *sample* method) but without having to assign individuals to populations *a priori*. In other words, it is meant to reduce the bias introduced by assigning individuals yourself. Instead, this method uses kmeans clustering to group individuals into \"populations\" and then imputes values based on those population assignments. This is accomplished through **iterative clustering**, starting by using only SNPs that are present across 90% of all samples (this can be changed with the topcov param) and then allowing more missing data in each iteration until it reaches the mincov parameter value. \n", "\n", "This method works especially well if you have a lot of missing data and fear that user-defined population assignments will bias your results. Here it gives very similar results to our first plots using the \"sample\" impute method, suggesting that our population assignments are not greatly biasing the results. To use K=7 clusters you simply enter `impute_method=7`. 
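Conceptually, both the *sample* and *kmeans* methods fill a missing call by drawing from the genotype frequencies observed in that sample's group (user-defined or cluster-inferred). A minimal numpy sketch of that sampling step on hypothetical toy data (not the ipyrad implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy 4-sample x 3-SNP genotype matrix; np.nan marks missing calls
geno = np.array([
    [0, 1, np.nan],
    [0, np.nan, 2],
    [2, 2, 2],
    [np.nan, 2, 1],
], dtype=float)
groups = {"A": [0, 1], "B": [2, 3]}  # row indices per (assumed) population

for rows in groups.values():
    idx = np.array(rows)
    for j in range(geno.shape[1]):
        col = geno[idx, j]
        miss = np.isnan(col)
        obs = col[~miss]
        if miss.any() and obs.size:
            # draw replacements from genotypes observed within the group
            col[miss] = rng.choice(obs, size=miss.sum())
            geno[idx, j] = col  # fancy indexing copies, so write back

print(np.isnan(geno).sum())  # 0: every missing call was imputed
```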
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Kmeans clustering: iter=0, K=7, mincov=0.9, minmap={'global': 0.5}\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 222081\n", "Filtered (minmap): 29740\n", "Filtered (combined): 225958\n", "Sites after filtering: 123956\n", "Sites containing missing values: 96461 (77.82%)\n", "Missing values in SNP matrix: 142937 (4.27%)\n", "Imputation: 'sampled'; (0, 1, 2) = 76.7%, 15.0%, 8.3%\n", "{0: ['FLCK216', 'FLSA185'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 3: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 4: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['CUCA4', 'CUSV6']}\n", "\n", "Kmeans clustering: iter=1, K=7, mincov=0.8, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 131220\n", "Filtered (minmap): 111129\n", "Filtered (combined): 150798\n", "Sites after filtering: 199116\n", "Sites containing missing values: 171621 (86.19%)\n", "Missing values in SNP matrix: 427659 (7.95%)\n", "Imputation: 'sampled'; (0, 1, 2) = 77.6%, 10.0%, 12.5%\n", "{0: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 1: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 2: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 3: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 4: ['BJSB3', 'BJSL25', 'BJVL19'], 5: ['FLCK216', 'FLSA185'], 6: ['CUCA4', 'CUSV6']}\n", "\n", "Kmeans clustering: iter=2, K=7, mincov=0.7, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", 
"Filtered (mincov): 76675\n", "Filtered (minmap): 111129\n", "Filtered (combined): 124159\n", "Sites after filtering: 225755\n", "Sites containing missing values: 198260 (87.82%)\n", "Missing values in SNP matrix: 606805 (9.96%)\n", "Imputation: 'sampled'; (0, 1, 2) = 77.4%, 10.1%, 12.5%\n", "{0: ['FLCK216', 'FLSA185'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 3: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 4: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['CUCA4', 'CUSV6']}\n", "\n", "Kmeans clustering: iter=3, K=7, mincov=0.6, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 52105\n", "Filtered (minmap): 111129\n", "Filtered (combined): 119932\n", "Sites after filtering: 229982\n", "Sites containing missing values: 202487 (88.04%)\n", "Missing values in SNP matrix: 646076 (10.40%)\n", "Imputation: 'sampled'; (0, 1, 2) = 77.3%, 10.1%, 12.5%\n", "{0: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 3: ['FLCK216', 'FLSA185'], 4: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 5: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 6: ['CUCA4', 'CUSV6', 'CUVN10']}\n", "\n", "Kmeans clustering: iter=4, K=7, mincov=0.5, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 29740\n", "Filtered (minmap): 115494\n", "Filtered (combined): 123595\n", "Sites after filtering: 226319\n", "Sites containing missing values: 198824 (87.85%)\n", "Missing values in SNP matrix: 627039 (10.26%)\n", "Imputation: 'sampled'; (0, 1, 2) = 77.2%, 10.4%, 12.4%\n", "Subsampling SNPs: 29415/226319\n" ] }, { 
"data": { "text/html": [ "
[toyplot PCA scatterplot, kmeans imputation: PC0 (14.8%) explained vs PC2 (6.9%) explained]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# kmeans imputation \n", "pca3 = ipa.pca(\n", " data=data,\n", " imap=imap,\n", " minmap=minmap,\n", " mincov=0.5,\n", " impute_method=7,\n", ")\n", "\n", "# run and draw results for kmeans clustering into 7 groups\n", "pca3.run(nreplicates=25, seed=123)\n", "pca3.draw(0, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save plot to PDF\n", "\n", "You can save the figure as a PDF or SVG automatically by passing an `outfile` argument to the `.draw()` function." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# The outfile must end in either `.pdf` or `.svg`\n", "pca.draw(outfile=\"mypca.pdf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced: Missing data per sample\n", "\n", "You can view the proportion of missing data per sample by accessing the `.missing` data table from your `pca` analysis object. You can see that most samples in this data set had 10% missing data or less, but a few had 20-50% missing data. You can hover your cursor over the plot above to see the sample names. It seems pretty clear that samples with huge amounts of missing data do not stand out as outliers in these plots as they did in the no-imputation plot, which is great!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " missing\n", "BJSL25 0.03\n", "BJVL19 0.03\n", "FLBA140 0.03\n", "CRL0030 0.04\n", "LALC2 0.04\n", "FLSF54 0.04\n", "CUVN10 0.06\n", "FLAB109 0.06\n", "MXGT4 0.07\n", "MXED8 0.08\n", "CUSV6 0.08\n", "HNDA09 0.08\n", "FLSF33 0.08\n", "BJSB3 0.09\n", "FLSF47 0.09\n", "MXSA3017 0.09\n", "FLMO62 0.10\n", "TXMD3 0.10\n", "FLWO6 0.11\n", "FLCK18 0.11\n", "TXGR3 0.11\n", "BZBB1 0.11\n", "FLCK216 0.11\n", "FLSA185 0.13\n", "CUCA4 0.14\n", "SCCU3 0.23\n", "TXWV2 0.55" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# .missing is a pandas DataFrame\n", "pca3.missing.sort_values(by=\"missing\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced: TSNE and other dimensionality reduction methods\n", "\n", "While PCA plots are very informative, it is sometimes difficult to visualize just how well separated all of your samples are, since the results span many dimensions. A popular tool to further examine the separation of samples is t-distributed stochastic neighbor embedding (TSNE). We've implemented this in the `pca` tool as well: it first decomposes the data using PCA and then runs TSNE on the PC axes. The results will vary depending on the parameters and random seed, and so you cannot plot replicate runs using this method. It is important to explore parameter values to find settings that work well." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling SNPs: 28369/211217\n" ] } ], "source": [ "pca.run_tsne(subsample=True, perplexity=4.0, n_iter=100000, seed=123)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
[toyplot scatterplot: TSNE component 1 vs TSNE component 2]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pca.draw();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced: UMAP dimensionality reduction\n", "From the [UMAP docs](https://umap.scikit-tda.org/parameters.html): \"low values of n_neighbors will force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values will push UMAP to look at larger neighborhoods of each point when estimating the manifold structure of the data\"\n", "\n", "The `min_dist` parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low-dimensional representation." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "pca.run_umap(subsample=False, n_neighbors=12, min_dist=0.1)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot scatterplot: UMAP component 1 vs UMAP component 2]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pca.draw();" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }