\n", " Hypothesis $\\rightarrow$\n", " | \n", "\n",
" Reject H0 \n", " ($cost_{lab} \\neq cost_{dragon}$)\n", " | \n",
" \n",
" Fail to Reject H0 \n", " ($cost_{lab} = cost_{dragon}$)\n", " | \n",
"
\n", " Truth ↓\n", " | \n", "||
\n", " $cost_{lab} = cost_{dragon}$\n", " | \n", "\n",
" False Positive\n", " | \n",
" \n", " Correct\n", " | \n", "
\n", " $cost_{lab} \\neq cost_{dragon}$ \n", " | \n", "\n", " Correct\n", " | \n", "\n", " False Negative\n", " | \n", "
\n",
" overwrite (boolean)\n", " | \n",
" \n",
" When overwrite is True, the notebook will also generate new \n",
" power calculations, even if saved results already exist. Whether or \n",
" not these power calculation results are saved can be \n",
" controlled by setting save_intermediates.\n",
" | \n",
"
\n",
" save_intermediates (boolean)\n", " | \n",
" \n", " The current power code is computationally expensive and takes \n", " a long time to compute, due to the iterative nature. While \n", " code optimization may come in the future, it can be \n", " advantageous to save the power calculation results somewhere, \n", " so they can be retrieved later.\n", " | \n", "
\n",
" save_images (boolean)\n", " | \n",
" \n",
" This notebook will generate images of the power curves. By \n",
" default, these will be displayed inside the notebook. However, \n",
" some users also find it advantageous to save the image files. \n",
" The file format is set in \n",
" image_pattern.\n",
" | \n",
"
\n",
" txt_delim (string)\n", " | \n",
" \n",
" txt_delim specifies the way \n",
" columns are separated in the files. QIIME typically consumes \n",
" and produces tab-delimited \n",
" (\"\\t\") \n",
" text files (.txt) for metadata and results generation.\n",
" | \n",
"
\n",
" map_index (string)\n", " | \n",
" \n",
" The name of the column containing the sample ID. In \n",
" QIIME, this column is called \n",
" #SampleID.\n",
" | \n",
"
\n",
" map_nas (list of strings)\n", " | \n",
" \n",
" It is possible a mapping file may be missing values, since \n",
" American Gut participants are free to skip any question. The \n",
" pandas package is able to omit these missing values from \n",
" analysis. In raw American Gut files, missing values are \n",
" typically denoted as \n",
" “NA”, \n",
" “no_data”, \n",
" “unknown”, \n",
" and empty strings (“”).\n",
" | \n",
"
\n",
" write_na (string)\n", " | \n",
" \n",
" The value to denote missing values when text files are written \n",
" from pandas data frames. Using an empty string \n",
" (“”) will allow \n",
" certain QIIME scripts, like \n",
" [group_significance.py](http://qiime.org/scripts/group_significance.html), \n",
" to ignore the missing values.\n",
" | \n",
"
\n",
" date_cols (list of strings)\n", " | \n",
" \n",
" Columns containing temporal data can be identified using \n",
" date_cols.\n",
" | \n",
"
\n",
" a_div_metric (string)\n", " | \n",
" \n",
" The alpha diversity metric to be used in the analysis. Mapping \n",
" files generated by the Preprocessing Notebook have a set of \n",
" mapping columns appended which provide the mean for several \n",
" metrics. These are labeled as the metric name with \n",
" “_mean” to\n",
" indicate the values are the mean of 10 rarefactions.\n",
" | \n",
"
\n",
" a_title (string)\n", " | \n",
" \n", " The title to be displayed on the alpha diversity power curve.\n", " | \n", "
\n",
" a_suffix (string)\n", " | \n",
" \n", " If files are saved, this string is used to differentiate alpha \n", " diversity files from beta diversity.\n", " | \n", "
\n",
" b_div_metric (string)\n", " | \n",
" \n", " This identifies the beta diversity metric to be used in the \n", " analysis. This name will appear at the beginning of the \n", " distance matrix file.\n", " | \n", "
\n",
" b_num_iter (int)\n", " | \n",
" \n",
" Differences in beta diversity are frequently tested using a \n",
" permutation test [[23](#Bondini)]. \n",
" This takes care of many of the statistical constraints associated with distance matrices. \n",
" b_num_iter sets the number of \n",
" permutations performed on a distance matrix during beta \n",
" diversity power calculation. A large number can slow \n",
" processing considerably, since we must perform the permutation \n",
" test several times. \n",
" | \n",
"
\n",
" b_title (string)\n", " | \n",
" \n", " The title to be displayed on the beta diversity power curve.\n", " | \n", "
\n",
" b_suffix (string)\n", " | \n",
" \n", " If files are saved, this string is used to differentiate alpha \n", " diversity files from beta diversity, and different beta \n", " diversity metrics.\n", " | \n", "
\n",
"\t\t\tnum_iter (int)\n", "\t\t | \n",
"\t\t\n", "\t\t\tThe number of times data should be subsampled at each sampling \n", " depth to calculate the statistical power for the sample.\n", "\t\t | \n", "\t
\n",
"\t\t\tnum_runs (int)\n", "\t\t | \n",
"\t\t\n", "\t\t\tThe number of times paired samples should be drawn for \n", " confidence interval calculation.\n", "\t\t | \n", "\t
\n",
"\t\t\tp_crit (float)\n", "\t\t | \n",
"\t\t\n", "\t\t\tThe value of $\\alpha$ (the probability of a false positive) \n", " acceptable for these power calculations. Empirical power will \n", " be based on the number of iterations for a sample set that are \n", " less than this value. For historical and cultural reasons, \n", " 0.05 is often used.\n", "\t\t | \n", "\t
\n",
"\t\t\tmin_counts (int)\n", "\t\t | \n",
"\t\t\n", "\t\t\tThe minimum number of samples drawn from each group during \n", " statistical testing. This should be set based on the expected \n", " effect size and number of available samples. \n", "\t\t | \n", "\t
\n",
"\t\t\tmax_counts (int)\n", "\t\t | \n",
"\t\t\n", "\t\t\tThe maximum number of samples drawn from each group during \n", " statistical testing. This should be set based on the expected \n", " effect size and number of available samples and should not \n", " exceed the size of the smallest group.\n", "\t\t | \n", "\t
\n",
"\t\t\tcounts_interval (int)\n", "\t\t | \n",
"\t\t\n",
"\t\t\tA sampling interval used to determine the number of samples \n",
" which should be drawn during statistical testing. Samples will \n",
" be drawn in a size increasing from the \n",
" min_counts , to \n",
" min_counts + counts_interval , \n",
" min_counts + 2*counts_interval , \n",
" and so on, up to max_counts .\n",
"\t\t | \n",
"\t
\n",
" all_cat (string)\n", " | \n",
" \n", " The metadata category used for body site comparison.\n", " | \n", "
\n",
" all_order (string)\n", " | \n",
" \n", " The body sites being analyzed.\n", " | \n", "
\n",
" all_controls (string)\n", " | \n",
" \n", " The metadata categories used to identify matched samples.\n", " | \n", "
\n",
" all_min_counts (int)\n", " | \n",
" \n", " The minimum number of samples drawn from each group during \n", " statistical testing. This should be set based on the expected \n", " effect size and number of available samples. \n", " | \n", "
\n",
" all_max_counts (int)\n", " | \n",
" \n", " The maximum number of samples drawn from each group during \n", " statistical testing. This should be set based on the expected \n", " effect size and number of available samples and should not \n", " exceed the size of the smallest group.\n", " | \n", "
\n",
" all_counts_interval (int)\n", " | \n",
" \n", " A sampling interval used to determine the number of samples\n", " which should be drawn during statistical testing.\n", " | \n", "
\n",
" fecal_cats (list of tuples)\n", " | \n",
" \n",
" A list of tuples which follow the format (category, order). For example, to look at inflammatory bowel disease status, this might be (‘IBD’, [‘I do not have IBD’, ‘IBD’]). The order list allows us to select which groups we’ll compare. To analyze all groups in a category, the order position may take a value of None.\n",
" | \n",
"
\n",
" fecal_control_cats (list of strings)\n", " | \n",
" \n", " The categories used to identify matched samples. So, if we are \n", " comparing in category A, but control for B, C, and D, samples \n", " will be selected where A is different but B, C, and D are the \n", " same.\n", " | \n", "
\n",
" plot_counts (array)\n", " | \n",
" \n", " The number of samples which should be drawn to plot the curve. \n", " The minimum of this should not be less than two, although the \n", " maximum can exceed the number of samples in any group.\n", " | \n", "
\n",
" plot_colormap (array, None)\n", " | \n",
" \n",
" The colors used for the lines. If None is specified, the \n",
" default colors from Statsmodels will be used.\n",
" When a custom colormap is passed, it should have at least as \n",
" many colors as there are categories in \n",
" fecal_cats.\n",
" | \n",
"
\n",
" legend_size (int)\n", " | \n",
" \n", " The size of the text appearing in the figure legend.\n", " | \n", "
\n",
" label_size (array of strings)\n", " | \n",
" \n", " The way each category should appear in the final legend. This \n", " should include body site.\n", " | \n", "
\n",
" figure_size (tuple)\n", " | \n",
" \n", " The height and width of the final figure, in inches. \n", " | \n", "
\n",
" legend_position (tuple)\n", " | \n",
" \n", " Where the legend should be placed in the final figure. The \n", " tuple gives (left, bottom) as a fraction of the axis size.\n", " | \n", "
\n",
" print_position (tuple)\n", " | \n",
" \n", " A four-element description of the size of the axis in the \n", " figure. This is given in inches. The tuple is given as (left, \n", " bottom, width, height).\n", " | \n", "
\n",
" space_position (tuple)\n", " | \n",
" \n", " To render the legend correctly, we have to create a dummy \n", " axis. This gives the location of the dummy axis within the \n", " figure in inches from the bottom left corner. Positions are \n", " (left, bottom, width, height).\n", " | \n", "
\n",
" save_pad (tuple)\n", " | \n",
" \n", " The extra space (in inches) for display around the edge of the \n", " figure.\n", " | \n", "
\n",
" save_bbox (tuple, str)\n", " | \n",
" \n",
" The size of the image to be saved. Using a value of 'tight' will display the entire figure and allow padding.\n",
" | \n",
"
check_dir function. This will create the directories we identify if they do not exist.\n",
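
As a rough illustration of what such a helper does, the sketch below creates a directory only when it is missing. The name `ensure_dir` is hypothetical and is not the notebook's `check_dir`; it assumes nothing beyond the Python standard library.

```python
import os
import tempfile


def ensure_dir(dir_path):
    """Create dir_path (and any missing parents) if it does not exist.

    Hypothetical stand-in for illustration; not the notebook's check_dir.
    """
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    return dir_path


# Demonstrate on a throwaway location so no real analysis folder is touched.
scratch = tempfile.mkdtemp()
demo_dir = ensure_dir(os.path.join(scratch, 'agp_analysis', 'sample_data'))

# Calling the helper again is harmless because the directory already exists.
ensure_dir(demo_dir)
```

Checking before creating means the helper can be called on every run of the notebook without raising an error for directories made on an earlier run.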
"\n",
"\n",
"\n",
"### Base and Working Directories\n",
"We need a general location to do all our analysis; this is the base_dir
. All our other directories will exist within the base_dir.\n",
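
A minimal sketch of this nesting, assuming only the standard library. The folder names `agp_analysis` and `sample_data` follow the suggestions given for `base_dir` and `working_dir` in this section; `analysis_results` is a placeholder name chosen purely for illustration.

```python
import os

# Suggested location: a folder named agp_analysis next to the notebooks.
base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')

# Output of the Preprocessing Notebook is expected in sample_data.
working_dir = os.path.join(base_dir, 'sample_data')

# Placeholder folder name for analysis output (an assumption, not a default).
analysis_dir = os.path.join(base_dir, 'analysis_results')

# Every other directory is built from base_dir, so moving the whole
# analysis only requires changing base_dir.
for path in (working_dir, analysis_dir):
    print(path)
```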
"\n",
"\n",
" base_dir (string)\n", " | \n",
" \n", " The filepath for the directory where any files associated with the analysis should be saved. It is suggested this be a directory called agp_analysis, and be located in the same directory as the IPython notebooks.\n", " | \n", "
\n",
" working_dir (string)\n", " | \n",
" \n",
" The file path for the directory where all data files associated with this analysis have been stored. This should contain the results of the Preprocessing Notebook. \n", "The working_dir is expected to be a directory called sample_data in the base_dir.\n",
" | \n",
"
\n",
" analysis_dir (string)\n", " | \n",
" \n",
" The file path where analysis results should be stored. This is expected to be a folder in the base_dir.\n",
" | \n",
"
\n",
" all_dir (string)\n", " | \n",
" \n",
" The filepath for the directory where all body site files are \n",
" stored. This should be a directory in the \n",
" working_dir.\n",
" | \n",
"
\n",
" all_map_fp (string)\n", " | \n",
" \n", " The filepath for the metadata file associated with all \n", " samples. This is expected to be a processed metadata file \n", " generated by the preprocessing notebook, and contain columns \n", " describing alpha diversity.\n", " | \n", "
\n",
" all_uud_fp (string)\n", " | \n",
" \n", " The filepath for the unweighted UniFrac distance matrix \n", " associated with the all sample file.\n", " | \n", "
\n",
" site_dir (string)\n", " | \n",
" \n",
" The filepath for the directory where data sets from fecal \n",
" samples are stored. This should be a directory in the \n",
" working_dir.\n",
" | \n",
"
\n",
" data_dir (string)\n", " | \n",
" \n",
" The filepath of the all participant single sample directory. \n",
" This should be a folder in the \n",
" site_dir.\n",
" | \n",
"
\n",
" data_map_fp (string)\n", " | \n",
" \n", " The filepath for the metadata file associated with the fecal \n", " samples. This is expected to be a processed metadata file \n", " generated by the preprocessing notebook, and contain columns \n", " describing alpha diversity.\n", " | \n", "
\n",
" data_uud_fp (string)\n", " | \n",
" \n", " The filepath for the unweighted UniFrac distance matrix \n", " associated with the fecal sample dataset.\n", " | \n", "
\n",
" results_dir (string)\n", " | \n",
" \n",
" A folder where files summarizing the power calculation results for each run should be stored. This is expected to be a folder in the analysis_dir.\n",
" | \n",
"
\n",
" site_pickle_pattern (string)\n", " | \n",
" \n",
" Individual power analyses (numpy arrays of the power curve \n",
" results) are saved using the Python \n",
" pickle \n",
" module. The file pattern contains blanks which can be filled in with \n",
" information about the specific sample. The blanks specify the \n",
" diversity metric used for comparison, the metadata category, \n",
" and the two groups within that category.\n", " | \n",
"
\n",
" image_dir (string)\n", " | \n",
" \n",
" If power curves are being saved as images, this specifies the \n",
" directory where all images should be saved. This is expected \n",
" to be a folder in the \n",
" analysis_dir.\n",
" | \n",
"
\n",
" power_image_dir (string)\n", " | \n",
" \n",
" This directory allows us to separate power curve images from \n",
" other images generated during the course of analysis. This is \n",
" expected to be a directory in the \n",
" image_dir.\n",
" | \n",
"
\n",
" image_pattern (string)\n", " | \n",
" \n", " The file name pattern where images generated by this notebook \n", " should be saved. The blank indicates the type of diversity \n", " metric used to generate the image.\n", " | \n", "