{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: sratools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reproducibility, it is useful to be able to download the raw data for your analysis from an online repository like NCBI with a simple script at the top of your notebook. We've written a simple wrapper for the sratools command line program (which is notoriously difficult to use and poorly documented) to make this easier. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda \n", "# conda install sra-tools -c bioconda" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetch info for a published data set by its accession ID\n", "You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad can take as input one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.). \n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# init sratools object with an accessions argument\n", "sra = ipa.sratools(accessions=\"SRP065788\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "Fetching project data..." ] } ], "source": [ "# fetch info for all samples from this study, save as a dataframe\n", "stable = sra.fetch_runinfo()\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>Run</th><th>ReleaseDate</th><th>LoadDate</th><th>spots</th><th>bases</th><th>spots_with_mates</th><th>avgLength</th><th>size_MB</th><th>AssemblyName</th><th>download_path</th><th>...</th><th>SRAStudy</th><th>BioProject</th><th>Study_Pubmed_id</th><th>ProjectID</th><th>Sample</th><th>BioSample</th><th>SampleType</th><th>TaxID</th><th>ScientificName</th><th>SampleName</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>SRR2895732</td><td>2015-11-04 15:50:01</td><td>2015-11-04 17:19:15</td><td>2009174</td><td>182834834</td><td>0</td><td>91</td><td>116</td><td>NaN</td><td>https://sra-download.ncbi.nlm.nih.gov/sos/sra-...</td><td>...</td><td>SRP065788</td><td>PRJNA299402</td><td>NaN</td><td>299402</td><td>SRS1146158</td><td>SAMN04202163</td><td>simple</td><td>224736</td><td>Viburnum betulifolium</td><td>Lib1_betulifolium</td></tr>\n",
"<tr><th>1</th><td>SRR2895743</td><td>2015-11-04 15:50:01</td><td>2015-11-04 17:18:35</td><td>2452970</td><td>223220270</td><td>0</td><td>91</td><td>140</td><td>NaN</td><td>https://sra-download.ncbi.nlm.nih.gov/sos/sra-...</td><td>...</td><td>SRP065788</td><td>PRJNA299402</td><td>NaN</td><td>299402</td><td>SRS1146171</td><td>SAMN04202164</td><td>simple</td><td>1220044</td><td>Viburnum bitchiuense</td><td>Lib1_bitchiuense_combined</td></tr>\n",
"<tr><th>2</th><td>SRR2895755</td><td>2015-11-04 15:50:01</td><td>2015-11-04 17:18:46</td><td>4640732</td><td>422306612</td><td>0</td><td>91</td><td>264</td><td>NaN</td><td>https://sra-download.ncbi.nlm.nih.gov/sos/sra-...</td><td>...</td><td>SRP065788</td><td>PRJNA299402</td><td>NaN</td><td>299402</td><td>SRS1146182</td><td>SAMN04202165</td><td>simple</td><td>237927</td><td>Viburnum carlesii</td><td>Lib1_carlesii_D1_BP_001</td></tr>\n",
"<tr><th>3</th><td>SRR2895756</td><td>2015-11-04 15:50:01</td><td>2015-11-04 17:20:18</td><td>3719383</td><td>338463853</td><td>0</td><td>91</td><td>214</td><td>NaN</td><td>https://sra-download.ncbi.nlm.nih.gov/sos/sra-...</td><td>...</td><td>SRP065788</td><td>PRJNA299402</td><td>NaN</td><td>299402</td><td>SRS1146183</td><td>SAMN04202166</td><td>simple</td><td>237928</td><td>Viburnum cinnamomifolium</td><td>Lib1_cinnamomifolium_PWS2105X</td></tr>\n",
"<tr><th>4</th><td>SRR2895757</td><td>2015-11-04 15:50:01</td><td>2015-11-04 17:20:06</td><td>3745852</td><td>340872532</td><td>0</td><td>91</td><td>213</td><td>NaN</td><td>https://sra-download.ncbi.nlm.nih.gov/sos/sra-...</td><td>...</td><td>SRP065788</td><td>PRJNA299402</td><td>NaN</td><td>299402</td><td>SRS1146181</td><td>SAMN04202167</td><td>simple</td><td>237929</td><td>Viburnum clemensae</td><td>Lib1_clemensiae_DRY6_PWS_2135</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>5 rows × 30 columns</p>\n", "</div>
" ], "text/plain": [ " Run ReleaseDate LoadDate spots bases \\\n", "0 SRR2895732 2015-11-04 15:50:01 2015-11-04 17:19:15 2009174 182834834 \n", "1 SRR2895743 2015-11-04 15:50:01 2015-11-04 17:18:35 2452970 223220270 \n", "2 SRR2895755 2015-11-04 15:50:01 2015-11-04 17:18:46 4640732 422306612 \n", "3 SRR2895756 2015-11-04 15:50:01 2015-11-04 17:20:18 3719383 338463853 \n", "4 SRR2895757 2015-11-04 15:50:01 2015-11-04 17:20:06 3745852 340872532 \n", "\n", " spots_with_mates avgLength size_MB AssemblyName \\\n", "0 0 91 116 NaN \n", "1 0 91 140 NaN \n", "2 0 91 264 NaN \n", "3 0 91 214 NaN \n", "4 0 91 213 NaN \n", "\n", " download_path ... SRAStudy \\\n", "0 https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 \n", "1 https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 \n", "2 https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 \n", "3 https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 \n", "4 https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... 
SRP065788 \n", "\n", " BioProject Study_Pubmed_id ProjectID Sample BioSample \\\n", "0 PRJNA299402 NaN 299402 SRS1146158 SAMN04202163 \n", "1 PRJNA299402 NaN 299402 SRS1146171 SAMN04202164 \n", "2 PRJNA299402 NaN 299402 SRS1146182 SAMN04202165 \n", "3 PRJNA299402 NaN 299402 SRS1146183 SAMN04202166 \n", "4 PRJNA299402 NaN 299402 SRS1146181 SAMN04202167 \n", "\n", " SampleType TaxID ScientificName \\\n", "0 simple 224736 Viburnum betulifolium \n", "1 simple 1220044 Viburnum bitchiuense \n", "2 simple 237927 Viburnum carlesii \n", "3 simple 237928 Viburnum cinnamomifolium \n", "4 simple 237929 Viburnum clemensae \n", "\n", " SampleName \n", "0 Lib1_betulifolium \n", "1 Lib1_bitchiuense_combined \n", "2 Lib1_carlesii_D1_BP_001 \n", "3 Lib1_cinnamomifolium_PWS2105X \n", "4 Lib1_clemensiae_DRY6_PWS_2135 \n", "\n", "[5 rows x 30 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the dataframe has all information about this study\n", "stable.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File names\n", "You can select columns by their index number to use for file names. See below." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>Run</th><th>ScientificName</th><th>SampleName</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>SRR2895732</td><td>Viburnum betulifolium</td><td>Lib1_betulifolium</td></tr>\n",
"<tr><th>1</th><td>SRR2895743</td><td>Viburnum bitchiuense</td><td>Lib1_bitchiuense_combined</td></tr>\n",
"<tr><th>2</th><td>SRR2895755</td><td>Viburnum carlesii</td><td>Lib1_carlesii_D1_BP_001</td></tr>\n",
"<tr><th>3</th><td>SRR2895756</td><td>Viburnum cinnamomifolium</td><td>Lib1_cinnamomifolium_PWS2105X</td></tr>\n",
"<tr><th>4</th><td>SRR2895757</td><td>Viburnum clemensae</td><td>Lib1_clemensiae_DRY6_PWS_2135</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Run ScientificName SampleName\n", "0 SRR2895732 Viburnum betulifolium Lib1_betulifolium\n", "1 SRR2895743 Viburnum bitchiuense Lib1_bitchiuense_combined\n", "2 SRR2895755 Viburnum carlesii Lib1_carlesii_D1_BP_001\n", "3 SRR2895756 Viburnum cinnamomifolium Lib1_cinnamomifolium_PWS2105X\n", "4 SRR2895757 Viburnum clemensae Lib1_clemensiae_DRY6_PWS_2135" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stable.iloc[:5, [0, 28, 29]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the data\n", "From an sratools object you can fetch just the info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. There are arguments for how to name the files according to name fields in the fetch_runinfo table. The accessions argument here is a list of the first five SRR sample IDs in the table above." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 SRR2895732\n", "1 SRR2895743\n", "2 SRR2895755\n", "3 SRR2895756\n", "4 SRR2895757\n", "Name: Run, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select first 5 samples\n", "list_of_srrs = stable.Run[:5]\n", "list_of_srrs" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parallel connection | oud: 4 cores\n", "[####################] 100% 0:02:07 | downloading/extracting fastq data \n", "\n", "5 fastq files downloaded to /home/deren/Documents/ipyrad/newdocs/cookbook/downloaded\n" ] } ], "source": [ "# new sra object\n", "sra2 = ipa.sratools(accessions=list_of_srrs, workdir=\"downloaded\")\n", "\n", "# call download (run) function\n", "sra2.run(auto=True, name_fields=(1,30))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the data files\n", "You can see that the 
files were named according to the SRR accession and the SampleName field in the table. The intermediate .sra files were removed and only the fastq files were saved. \n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 6174784\r\n", "-rw-rw-r-- 1 deren deren 1372440058 Aug 17 16:36 SRR2895732_Lib1_betulifolium.fastq\r\n", "-rw-rw-r-- 1 deren deren 1422226640 Aug 17 16:36 SRR2895743_Lib1_bitchiuense_combined.fastq\r\n", "-rw-rw-r-- 1 deren deren 759216310 Aug 17 16:37 SRR2895755_Lib1_carlesii_D1_BP_001.fastq\r\n", "-rw-rw-r-- 1 deren deren 1812215534 Aug 17 16:36 SRR2895756_Lib1_cinnamomifolium_PWS2105X.fastq\r\n", "-rw-rw-r-- 1 deren deren 956848184 Aug 17 16:36 SRR2895757_Lib1_clemensiae_DRY6_PWS_2135.fastq\r\n" ] } ], "source": [ "! ls -l downloaded" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }