output: /nas/is1/zhangz/projects/barr/2015-11_Rat_Brain_Development/result/gene_clustering/Juvenile_SC_injury # Location of output files
input: # Location of input files
  data: /nas/is1/zhangz/projects/barr/2015-11_Rat_Brain_Development/r/Juvenile_SC_injury/expr18_rescaled.rds ## a matrix of data, row=gene, column=sample 
    # The order of the samples in the matrix matters, order them based on the order of the groups
    # It's preferrable to rescale the data matrix to make the mean and SD of each row 0 and 1.0 respectively
    # No missing values allowed, remove rows with missing values or run imputation if estimate missing values
  annotation: /nas/is1/zhangz/projects/barr/2015-11_Rat_Brain_Development/r/Juvenile_SC_injury/anno.rds ## a data frame of gene annotation, match row names
  sample: /nas/is1/zhangz/projects/barr/2015-11_Rat_Brain_Development/r/Juvenile_SC_injury/smpl18.rds ## a data frame of sample manifest
  geneset: /nas/is1/rchive/data/gene.set/r/default_set_rat_5-1000.rds ## a list of 2 members, the first is metadata of gene sets, the second is a list of gene set-gene mapping
  template: https://raw.githubusercontent.com/zhezhangsh/DEGandMore/master/examples/MultiGroupCluster/ClReport.Rmd
  subtemplate: https://raw.githubusercontent.com/zhezhangsh/DEGandMore/master/examples/MultiGroupCluster/ClDetail.Rmd
  remote: yes
parameter: # All parameters
  term: Group ## Corresponding to a column name in sample; based on which samples will be grouped
  selection: ## How significant genes will be selected as seeds for initiating clusters
    min: 500 ### Minimal number of genes for further selection (no further selection if number is lower than it)
    max: 2000 ### Maximum number of genes to be selected
    fdr: 0.2 ### First selection criteria, maximum FDR value (ANOVA p value adjusted by BH method)
    p: 0.01 ### Second selection criteria, maximum ANOVA p value
    range: 2.0 ### Third selection criteria, the difference of max and min values
  cluster: ## How initial clusters will be generated
    height: 1.6 ### The height to cut clustering tree. Lower value generates more clusters
    size: 0.2 ### The minimum size of a cluster (relative ratio to expected size)
    corr: 0.6 ### The minimum correlation coefficient to the cluster centroid for a gene to be kept in the cluster
    merge: ### How to merge similar clusters
      corr: 0.5 #### The minimum correlation coefficient between 2 centroids to merge two clusters
      p: 0.05 #### The maximum p value between 2 centroids to merge two clusters
  recluster: ## How to refine clusters via reclustering cycles
    p: 0.02 ### Maximum ANOVA p value to include a gene
    r: 0.6 ### Minimum correlation coefficient between a gene and centroid to inlude a gene in the cluster
    diff: 0.2 ### Minimum correlation coefficient difference between and best and the second best
    times: 20.0 ### Number of reclustering cycles
  plot: ## How to plot results
    rescale: no ### whether to rescale data matrix
    zero: yes ### whether to use the first group as background and set its mean to 0 for all genes
    ylab: Average expression (# of SD)  ### y-axis label in plots
project: # Project background
  Title: Response of juvenile mice to spinal cord injury ## Project title
  Description: This project uses gene expression microarray data to track transcriptome
    of young mice after spinal cord injury. Measurements were made in control and
    5 time points after injury from 6 hours to 2 weeks. The goal of this analysis
    is to identify gene clusters that were signficantly different across time points. ## Full description
methods: # Data processing and analysis methods
  Processing of microarray data: All [Affymetrix](http://www.affymetrix.com) probes
    were re-grouped into unique Entrez gene IDs using custom library file downloaded
    from [BRAINARRAY](http://brainarray.mbni.med.umich.edu) database. The raw data
    in .CEL files were normalized and summarized by [RMA](http://www.ncbi.nlm.nih.gov/pubmed/12925520)
    (Robust Multichip Averaging) method to generate an N by M matrix, where N is the
    number of unique Entrez genes and M is the number of samples. The normalized data
    were log2-transformed to get final measurements mostly ranging between 1 and 16,
    so every increase or decrease of the measurements by 1.0 corresponds to a 2-fold
    difference. All data processing steps were performed within the [R](https://www.r-project.org)
    environment. The following customized code can be applied to any types of [Affymetrix](http://www.affymetrix.com/estore/browse/level_three_category_and_children.jsp?category=35868&categoryIdClicked=35868&expand=true&parent=35617)
    platforms (3'IVT, Exon, Gene, etc.) as long as the raw data were stored in [CEL](http://media.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cel.html)
    format and [BRAINARRAY](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp)
    provides the custom library file in CDF format.
  Preparation of data for gene clustering: Processed microarray data set were re-scaled
    to make the mean and standard deviation of each gene equal to 0 and 1.0. Control
    samples were not included for this analysis because they are very different from
    all the injuried samples, and the difference will dominate the analysis.
  Selection of differentially expressed gene: ANOVA was applied to give each gene
    a p value for its differential expression across sample groups. Corresponding
    false discovery rate was calculated using the p values. Genes for clustering analysis
    were sequentially selected by FDR, p value, and range (difference between maximum
    and minimum). The selection would be stopped if the number of qualifying genes
    was already smaller than a given minimal number.
  Gene clustering analysis: The _hclust{stat}_ function was applied to selected genes
    to generate  hierchical clustering tree was first made by  The tree was cut at
    a given height to get a number of initial clusters. The cutting height was adjusted
    to lower value to generate more clusters if the number of clusters was less than
    the number of sample groups. The initial clusters were then filtered by removing
    clusters that are too small and genes not close enough to the cluster centroid.
    Two initial clusters could also be merged if their centroids were significantly
    similar to each other. Next, the clusters were refined recursively by including
    more less differentially expressed genes across groups. The genes were classified
    into a cluster if it was similar enough to the cluster centroid. The re-clustering
    procedure was repeated for given number of times or until the clusters were stabalized.
  Gene set enrichment analysis: We made a comprehensive collection of predefined gene
    sets from resources such as [KEGG](http://www.genome.jp/kegg/) and
    [BioSystems](http://www.ncbi.nlm.nih.gov/biosystems). The collection was precompiled
    into an R data object. Gene set collections of a few model animals can be downloaded from: 
    https://github.com/zhezhangsh/DEGandMore/tree/master/examples/geneset_collections