{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# _ipyrad_ testing tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting started\n", "Import _ipyrad_ and remove previous test files if they are already present." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "version 0.1.47\n" ] } ], "source": [ "## import modules\n", "import ipyrad as ip\n", "print \"version\", ip.__version__ ## print version" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "## clear data from test directory if it already exists\n", "import shutil\n", "import os\n", "if os.path.exists(\"./test_rad/\"):\n", "    shutil.rmtree(\"./test_rad/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assembly and Sample objects\n", "\n", "Assembly and Sample objects are used by _ipyrad_ to access data stored on disk and to manipulate it. Each biological sample in a data set is represented in a Sample object, and a set of Samples is stored inside an Assembly object. The Assembly object has functions to assemble the data, and stores a log of all steps performed and the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assemblies going forward. Examples of this are shown below.\n", "\n", "We'll begin by creating a single Assembly object named \"data1\". It is created with a set of default assembly parameters and without any Samples linked to it. The name provided will be used in the output files that this Assembly creates. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " New Assembly: data1\n" ] } ], "source": [ "## create an Assembly object named data1. \n", "data1 = ip.Assembly(\"data1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying assembly parameters\n", "An Assembly object's parameter settings can be viewed using its `get_params()` function. To get more detailed information about all parameters use the function `ip.get_params_info()` or select a single parameter with `ip.get_params_info(N)`, where N is the number or string representation of a parameter. Assembly objects have a function `set_params()` that is used to modify parameters, like below." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 assembly_name data1 \n", " 1 project_dir ./test_rad \n", " 2 raw_fastq_path ./data/sim_rad_test_R1_.fastq.gz \n", " 3 barcodes_path ./data/sim_rad_test_barcodes.txt \n", " 4 sorted_fastq_path \n", " 5 assembly_method denovo \n", " 6 reference_sequence \n", " 7 datatype rad \n", " 8 restriction_overhang ('TGCAG', '') \n", " 9 max_low_qual_bases 5 \n", " 10 phred_Qscore_offset 33 \n", " 11 mindepth_statistical 6 \n", " 12 mindepth_majrule 6 \n", " 13 maxdepth 1000 \n", " 14 clust_threshold 0.85 \n", " 15 max_barcode_mismatch 1 \n", " 16 filter_adapters 0 \n", " 17 filter_min_trim_len 35 \n", " 18 max_alleles_consens 2 \n", " 19 max_Ns_consens (5, 5) \n", " 20 max_Hs_consens (8, 8) \n", " 21 min_samples_locus 4 \n", " 22 max_SNPs_locus (100, 100) \n", " 23 max_Indels_locus (5, 99) \n", " 24 max_shared_Hs_locus 0.25 \n", " 25 edit_cutsites (0, 0) \n", " 26 trim_overhang (1, 2, 2, 1) \n", " 27 output_formats * \n", " 28 pop_assign_file \n", " 29 excludes \n", " 30 outgroups \n" ] } ], "source": [ "## modify parameters for this Assembly object\n", 
"data1.set_params('project_dir', \"./test_rad\")\n", "data1.set_params('raw_fastq_path', \"./data/sim_rad_test_R1_.fastq.gz\")\n", "data1.set_params('barcodes_path', \"./data/sim_rad_test_barcodes.txt\")\n", "data1.set_params('filter_adapters', 0)\n", "data1.set_params('datatype', 'rad')\n", "\n", "## test on real data\n", "#data1.set_params(2, \"~/Dropbox/UO_C353_1.fastq.part-aa.gz\")\n", "#data1.set_params(3, \"/home/deren/Dropbox/Viburnum_revised.barcodes\")\n", "\n", "## print the new parameters to screen\n", "data1.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting data\n", "If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create new Sample objects from them, or link them to existing Sample objects based on the file names (or pair of fastq files for paired data files). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the step1 function below to demultiplex the raw data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## This would link fastq files from the 'sorted_fastq_path' if present\n", "## Here it raises an error because there are no files in the sorted_fastq_path\n", "\n", "## data1.link_fastqs() #path=\"./test_rad/data1_fastqs/*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Demultiplexing raw data files\n", "Step1 uses barcode information to demultiplex data files found in param 2 ['raw_fastq_path']. It will create a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The `stats` attribute of an Assembly object is returned as a `pandas` data frame." 
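] }, {
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, because `stats` is a regular `pandas` data frame it can be sliced and summarized with ordinary `pandas` calls once step 1 has run. A minimal sketch with a toy stand-in frame (the sample names, read counts, and 10,000-read threshold below are illustrative, not real output):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"## toy stand-in for an Assembly.stats frame (values are made up)\n",
"stats = pd.DataFrame(\n",
"    {\"state\": [1, 1, 1], \"reads_raw\": [20099, 19977, 5000]},\n",
"    index=[\"1A_0\", \"1B_0\", \"low_cov\"])\n",
"\n",
"## keep only samples with at least 10,000 raw reads\n",
"keep = stats[stats.reads_raw >= 10000]\n",
"print(keep.index.tolist())  ## ['1A_0', '1B_0']\n",
"```"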
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n", " state reads_raw\n", "1A_0 1 20099\n", "1B_0 1 19977\n", "1C_0 1 20114\n", "1D_0 1 19895\n", "2E_0 1 19928\n", "2F_0 1 19934\n", "2G_0 1 20026\n", "2H_0 1 19936\n", "3I_0 1 20084\n", "3J_0 1 20011\n", "3K_0 1 20117\n", "3L_0 1 19901\n" ] } ], "source": [ "## run step 1 to demultiplex the data\n", "data1.step1(force=True)\n", "\n", "## print the results for each Sample in data1\n", "print data1.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Filter reads \n", "If for some reason we wanted to execute on just a subsample of our data, we could do this by selecting only certain samples to call the `step2` function on. Because `step2` is a function of `data`, it will always execute with the parameters that are linked to `data`. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n", " reads_raw filtered_by_qscore filtered_by_adapter reads_passed\n", "1A_0 20099 0 0 20099\n", "1B_0 19977 0 0 19977\n", "1C_0 20114 0 0 20114\n", "1D_0 19895 0 0 19895\n", "2E_0 19928 0 0 19928\n", "2F_0 19934 0 0 19934\n", "2G_0 20026 0 0 20026\n", "2H_0 19936 0 0 19936\n", "3I_0 20084 0 0 20084\n", "3J_0 20011 0 0 20011\n", "3K_0 20117 0 0 20117\n", "3L_0 19901 0 0 19901\n" ] } ], "source": [ "## example of ways to run step 2 to filter and trim reads\n", "data1.step2() #[\"1B_0\", \"1C_0\"], force=True) ## run on one or more samples\n", "\n", "## print the results\n", "print data1.statsfiles.s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branching Assembly objects\n", "Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. 
We will try 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. The branch will inherit the locations of the data linked in the first object, but diverge as further steps are applied to each. Thus, the two Assembly objects can share the same working directory, and inherit shared files, but will diverge in creating new files linked to only one or the other. You can view the directories linked to an Assembly object with the `.dirs` attribute, shown below. The name of the new object, which is used to prefix its output files, is set from the `newname` passed to `branch`. \n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 assembly_name data1 \n", " 1 project_dir ./test_rad \n", " 2 raw_fastq_path ./data/sim_rad_test_R1_.fastq.gz \n", " 3 barcodes_path ./data/sim_rad_test_barcodes.txt \n", " 4 sorted_fastq_path \n", " 5 assembly_method denovo \n", " 6 reference_sequence \n", " 7 datatype rad \n", " 8 restriction_overhang ('TGCAG', '') \n", " 9 max_low_qual_bases 5 \n", " 10 phred_Qscore_offset 33 \n", " 11 mindepth_statistical 6 \n", " 12 mindepth_majrule 6 \n", " 13 maxdepth 1000 \n", " 14 clust_threshold 0.9 \n", " 15 max_barcode_mismatch 1 \n", " 16 filter_adapters 0 \n", " 17 filter_min_trim_len 35 \n", " 18 max_alleles_consens 2 \n", " 19 max_Ns_consens (5, 5) \n", " 20 max_Hs_consens (8, 8) \n", " 21 min_samples_locus 4 \n", " 22 max_SNPs_locus (100, 100) \n", " 23 max_Indels_locus (5, 99) \n", " 24 max_shared_Hs_locus 0.25 \n", " 25 edit_cutsites (0, 0) \n", " 26 trim_overhang (1, 2, 2, 1) \n", " 27 output_formats * \n", " 28 pop_assign_file \n", " 29 excludes \n", " 30 outgroups \n" ] } ], "source": [ "## create a branch of our Assembly object\n", "data2 = data1.branch(newname=\"data2\")\n", "\n", "## set clustering threshold to 0.90\n", "data2.set_params(\"clust_threshold\", 0.90)\n", "\n", "## look at inherited parameters\n", "data2.get_params()" ] 
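}, {
"cell_type": "markdown",
"metadata": {},
"source": [
"The essence of branching is copy-then-diverge: the new object inherits every parameter, and only the settings you change afterwards differ between branches. A toy sketch of the idea in plain Python (dicts standing in for the two objects' parameter sets; this does not call _ipyrad_ itself):\n",
"\n",
"```python\n",
"## toy stand-in: parameter dicts for two hypothetical branches\n",
"params1 = {\"clust_threshold\": 0.85, \"mindepth_majrule\": 6}\n",
"params2 = dict(params1)            ## \"branching\" copies everything\n",
"params2[\"clust_threshold\"] = 0.90  ## then one branch diverges\n",
"\n",
"## report which parameters differ between the branches\n",
"diff = {k: (params1[k], params2[k])\n",
"        for k in params1 if params1[k] != params2[k]}\n",
"print(diff)  ## {'clust_threshold': (0.85, 0.9)}\n",
"```"
]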
}, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: clustering within-samples\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " loading Assembly: data1 [~/Documents/ipyrad/tests/test_rad/data1.json]\n" ] } ], "source": [ "import ipyrad as ip\n", "data1 = ip.load_json(\"test_rad/data1.json\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n", " avg_depth_mj avg_depth_stat clusters_hidepth clusters_total \\\n", "1A_0 20.099 20.099 1000 1000 \n", "1B_0 19.977 19.977 1000 1000 \n", "1C_0 20.114 20.114 1000 1000 \n", "1D_0 19.895 19.895 1000 1000 \n", "2E_0 19.928 19.928 1000 1000 \n", "2F_0 19.934 19.934 1000 1000 \n", "2G_0 20.026 20.026 1000 1000 \n", "2H_0 19.936 19.936 1000 1000 \n", "3I_0 20.084 20.084 1000 1000 \n", "3J_0 20.011 20.011 1000 1000 \n", "3K_0 20.117 20.117 1000 1000 \n", "3L_0 19.901 19.901 1000 1000 \n", "\n", " avg_depth_total \n", "1A_0 20.099 \n", "1B_0 19.977 \n", "1C_0 20.114 \n", "1D_0 19.895 \n", "2E_0 19.928 \n", "2F_0 19.934 \n", "2G_0 20.026 \n", "2H_0 19.936 \n", "3I_0 20.084 \n", "3J_0 20.011 \n", "3K_0 20.117 \n", "3L_0 19.901 \n" ] } ], "source": [ "## run step 3 to cluster reads within samples using vsearch\n", "data1.step3(force=True)\n", "\n", "## print the results\n", "print data1.statsfiles.s3" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n", " state reads_raw reads_filtered clusters_total clusters_hidepth\n", "1A_0 3 20099 20099 1000 1000\n", "1B_0 3 19977 19977 1000 1000\n", "1C_0 3 20114 20114 1000 1000\n", "1D_0 3 19895 19895 1000 1000\n", "2E_0 3 19928 19928 1000 1000\n", "2F_0 3 19934 19934 1000 1000\n", 
"2G_0 3 20026 20026 1000 1000\n", "2H_0 3 19936 19936 1000 1000\n", "3I_0 3 20084 20084 1000 1000\n", "3J_0 3 20011 20011 1000 1000\n", "3K_0 3 20117 20117 1000 1000\n", "3L_0 3 19901 19901 1000 1000\n" ] } ], "source": [ "## run step 3 to cluster reads in data2 at 0.90 sequence similarity\n", "data2.step3(force=True)\n", "\n", "## print the results\n", "print data2.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branched Assembly objects\n", "You can see below that the two Assembly objects are now working with several shared directories (project, fastqs, edits) but with different clust directories (data1_clust_0.85 and data2_clust_0.9). " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data1 directories:\n", "project\t/home/deren/Documents/ipyrad/tests/test_rad\n", "edits\t/home/deren/Documents/ipyrad/tests/test_rad/data1_edits\n", "fastqs\t/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs\n", "outfiles\t\n", "clusts\t/home/deren/Documents/ipyrad/tests/test_rad/data1_clust_0.85\n", "consens\t\n", "\n", "data2 directories:\n", "consens\t\n", "project\t/home/deren/Documents/ipyrad/tests/test_rad\n", "edits\t/home/deren/Documents/ipyrad/tests/test_rad/data1_edits\n", "fastqs\t/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs\n", "outfiles\t\n", "clusts\t/home/deren/Documents/ipyrad/tests/test_rad/data2_clust_0.9\n" ] } ], "source": [ "print \"data1 directories:\"\n", "for (i, j) in data1.dirs.items():\n", "    print \"{}\\t{}\".format(i, j)\n", "\n", "print \"\\ndata2 directories:\"\n", "for (i, j) in data2.dirs.items():\n", "    print \"{}\\t{}\".format(i, j)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving stats outputs\n", "Example: two simple ways to save the stats data frame to a file." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data1.stats.to_csv(\"data1_results.csv\", sep=\"\\t\")\n", "data1.stats.to_latex(\"data1_results.tex\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example of plotting with _ipyrad_\n", "There are a few simple plotting functions in _ipyrad_ that are useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
[interactive depth plot: per-sample read-depth histograms for samples 1A_0 through 3L_0 (figure markup omitted)]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import ipyrad.plotting as iplot\n", "\n", "## plot for one or more selected samples\n", "#iplot.depthplot(data1, [\"1A_0\", \"1B_0\"])\n", "\n", "## plot for all samples in data1\n", "iplot.depthplot(data1)\n", "\n", "## save plot as pdf and html\n", "#iplot.depthplot(data1, outprefix=\"testfig\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### Step 4: Joint estimation of heterozygosity and error rate\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n", " clusters_hidepth clusters_total error_est hetero_est reads_filtered \\\n", "1A_0 1000 1000 0.000756 0.002223 20099 \n", "1B_0 1000 1000 0.000775 0.001910 19977 \n", "1C_0 1000 1000 0.000751 0.002260 20114 \n", "1D_0 1000 1000 0.000731 0.001876 19895 \n", "2E_0 1000 1000 0.000770 0.001809 19928 \n", "2F_0 1000 1000 0.000725 0.002103 19934 \n", "2G_0 1000 1000 0.000707 0.001910 20026 \n", "2H_0 1000 1000 0.000755 0.002215 19936 \n", "3I_0 1000 1000 0.000784 0.001877 20084 \n", "3J_0 1000 1000 0.000741 0.001698 20011 \n", "3K_0 1000 1000 0.000767 0.001821 20117 \n", "3L_0 1000 1000 0.000755 0.002003 19901 \n", "\n", " reads_raw state \n", "1A_0 20099 4 \n", "1B_0 19977 4 \n", "1C_0 20114 4 \n", "1D_0 19895 4 \n", "2E_0 19928 4 \n", "2F_0 19934 4 \n", "2G_0 20026 4 \n", "2H_0 19936 4 \n", "3I_0 20084 4 \n", "3J_0 20011 4 \n", "3K_0 20117 4 \n", "3L_0 19901 4 \n" ] } ], "source": [ "## run step 4\n", "data1.step4(force=True)\n", "\n", "## print the results\n", "print data1.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Consensus base calls\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
" Saving Assembly.\n", " clusters_hidepth clusters_total error_est hetero_est reads_consens \\\n", "1A_0 1000 1000 0.000756 0.002223 1000 \n", "1B_0 1000 1000 0.000775 0.001910 1000 \n", "1C_0 1000 1000 0.000751 0.002260 1000 \n", "1D_0 1000 1000 0.000731 0.001876 1000 \n", "2E_0 1000 1000 0.000770 0.001809 1000 \n", "2F_0 1000 1000 0.000725 0.002103 1000 \n", "2G_0 1000 1000 0.000707 0.001910 1000 \n", "2H_0 1000 1000 0.000755 0.002215 1000 \n", "3I_0 1000 1000 0.000784 0.001877 1000 \n", "3J_0 1000 1000 0.000741 0.001698 1000 \n", "3K_0 1000 1000 0.000767 0.001821 1000 \n", "3L_0 1000 1000 0.000755 0.002003 1000 \n", "\n", " reads_filtered reads_raw state \n", "1A_0 20099 20099 5 \n", "1B_0 19977 19977 5 \n", "1C_0 20114 20114 5 \n", "1D_0 19895 19895 5 \n", "2E_0 19928 19928 5 \n", "2F_0 19934 19934 5 \n", "2G_0 20026 20026 5 \n", "2H_0 19936 19936 5 \n", "3I_0 20084 20084 5 \n", "3J_0 20011 20011 5 \n", "3K_0 20117 20117 5 \n", "3L_0 19901 19901 5 \n" ] } ], "source": [ "## run step 5\n", "data1.step5(force=True)\n", "\n", "## print the results\n", "print data1.stats" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Saving Assembly.\n" ] } ], "source": [ "## run step 6\n", "data1.step6(force=True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Outfiles written to: ~/Documents/ipyrad/tests/test_rad/data1_outfiles\n", " Saving Assembly.\n" ] } ], "source": [ "## run step 7\n", "data1.step7(force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other stuff" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import h5py\n", "import numpy as np\n", "\n", "## load the consensus sequence array from the HDF5 database\n", "ioh5 = h5py.File(data1.database, 'r')\n", "superseqs = ioh5[\"seqs\"][:]\n", "superseqs[superseqs == \"\"] = \"N\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## IUPAC ambiguity codes and their two-base resolutions\n", "AMBIGS = {\"R\": (\"G\", \"A\"),\n", "          \"K\": (\"G\", \"T\"),\n", "          \"S\": (\"G\", \"C\"),\n", "          \"Y\": (\"T\", \"C\"),\n", "          \"W\": (\"T\", \"A\"),\n", "          \"M\": (\"C\", \"A\")}" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def ucount(sitecol):\n", "    \"\"\"\n", "    Count the number of unique bases at a site for the snpstring.\n", "    Returns a snpstring character: '*', '-', or ' '.\n", "    \"\"\"\n", "    ## make into set\n", "    site = set(sitecol)\n", "\n", "    ## resolve ambiguities into their two bases\n", "    for iupac in \"RSKYWM\":\n", "        if iupac in site:\n", "            site.discard(iupac)\n", "            site.update(AMBIGS[iupac])\n", "\n", "    ## remove - and N\n", "    site.discard(\"N\")\n", "    site.discard(\"-\")\n", "\n", "    ## if the site is invariant return a space\n", "    if len(site) < 2:\n", "        return \" \"\n", "    else:\n", "        ## count how many bases occur more than once\n", "        ccx = np.sum(np.array([np.sum(sitecol == base) for base in site]) > 1)\n", "        ## if more than one base occurs more than once return *\n", "        if ccx > 1:\n", "            return \"*\"\n", "        else:\n", "            return \"-\"\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "snps = np.apply_along_axis(ucount, 1, superseqs)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(100, 150, 2)" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snpsarr = np.zeros((superseqs.shape[0], superseqs.shape[2], 2), dtype=np.bool)\n", "snpsarr[:, :, 0] = snps == \"-\"\n", "snpsarr[:, :, 1] = snps == \"*\"\n", 
"snpsarr.shape" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a', 'b'}" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import os\n", "a = np.array(['a', 'b'])\n", "set(a)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(100, 150, 2)" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = os.path.join(data1.dirs.outfiles, data1.name+\"_tmpchunks\", \"snpsarr.0.npy\")\n", "\n", "np.load(path).shape" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(100,)" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = os.path.join(data1.dirs.outfiles, data1.name+\"_tmpchunks\", \"hetf.0.npy\")\n", "\n", "np.load(path).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assembly finished" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 9068\r\n", "-rw-rw-r-- 1 deren 2600999 Jan 23 02:03 data1.alleles\r\n", "-rw-rw-r-- 1 deren 1252625 Jan 23 02:03 data1.full.vcf\r\n", "-rw-rw-r-- 1 deren 1250895 Jan 23 02:03 data1.gphocs\r\n", "-rw-rw-r-- 1 deren 1353000 Jan 23 02:03 data1.loci\r\n", "-rw-rw-r-- 1 deren 1269535 Jan 23 02:03 data1.nex\r\n", "-rw-rw-r-- 1 deren 1128105 Jan 23 02:03 data1.phy\r\n", "-rw-rw-r-- 1 deren 21655 Jan 23 02:03 data1.phy.partitions\r\n", "-rw-rw-r-- 1 deren 34 Jan 23 02:03 \u001b[0m\u001b[01;31mdata1.treemix.gz\u001b[0m\r\n", "-rw-rw-r-- 1 deren 388482 Jan 23 02:03 data1.vcf\r\n" ] } ], "source": [ "ll test_rad/outfiles/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick parameter explanations are 
always on-hand" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " (10) phred_Qscore_offset ---------------------------------------------\n", " Examples:\n", " ----------------------------------------------------------------------\n", " data.set_params(10) = 33\n", " data.set_params(\"phred_Qscore_offset\") = 33\n", " ----------------------------------------------------------------------\n", " \n" ] } ], "source": [ "ip.paramsinfo(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Log history\n", "A common problem at the end of an analysis, or while troubleshooting it, is finding that you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, _ipyrad_ also stores a log history that time-stamps all modifications to Assembly objects. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in data1.log:\n", "    print i\n", "\n", "print \"\\ndata2 log includes its pre-branching history with data1\"\n", "for i in data2.log:\n", "    print i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving Assembly objects\n", "Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The saved files are stored in JSON format. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. 
Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## save assembly object\n", "#ip.save_assembly(\"data1.p\")\n", "\n", "## load assembly object\n", "#data = ip.load_assembly(\"data1.p\")\n", "#print data.name" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isinstance(0.25, float)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }