# Frequently asked questions

## Table of Contents

[Where does the bias correction algorithm come from?](#where-does-the-bias-correction-algorithm-come-from)

[What kind of data can be corrected by rBiasCorrection/BiasCorrector?](#what-kind-of-data-can-be-corrected-by-rBiasCorrectionbiascorrector)

[Do my input files need to be in a special format?](#do-my-input-files-need-to-be-in-a-special-format)

[Are there any requirements for naming the files?](#are-there-any-requirements-for-naming-the-files)

[What is exactly done during rBiasCorrection's/BiasCorrector's data preprocessing?](#what-is-exactly-done-during-rBiasCorrectionsbiascorrectors-data-preprocessing)

[What are the regression statistics?](#what-are-the-regression-statistics)

[What are 'substitutions' in my final results?](#what-are-substitutions-in-my-final-results)

## Where does the bias correction algorithm come from?

rBiasCorrection/BiasCorrector is the user-friendly implementation of the algorithms described by Moskalev et al. in their article *'Correction of PCR-bias in quantitative DNA methylation studies by means of cubic polynomial regression'*, published in 2011 in *Nucleic Acids Research, Oxford University Press* (DOI: [https://doi.org/10.1093/nar/gkr213](https://doi.org/10.1093/nar/gkr213)).

### Citation:

```
@article{10.1093/nar/gkr213,
    author = {Moskalev, Evgeny A. and Zavgorodnij, Mikhail G. and Majorova, Svetlana P. and Vorobjev, Ivan A. and Jandaghi, Pouria and Bure, Irina V. and Hoheisel, Jörg D.},
    title = "{Correction of PCR-bias in quantitative DNA methylation studies by means of cubic polynomial regression}",
    journal = {Nucleic Acids Research},
    volume = {39},
    number = {11},
    pages = {e77-e77},
    year = {2011},
    month = {04},
    abstract = "{DNA methylation profiling has become an important aspect of biomedical molecular analysis. Polymerase chain reaction (PCR) amplification of bisulphite-treated DNA is a processing step that is common to many currently used methods of quantitative methylation analysis. Preferential amplification of unmethylated alleles—known as PCR-bias—may significantly affect the accuracy of quantification. To date, no universal experimental approach has been reported to overcome the problem. This study presents an effective method of correcting biased methylation data. The procedure includes a calibration performed in parallel to the analysis of the samples under investigation. DNA samples with defined degrees of methylation are analysed. The observed deviation of the experimental results from the expected values is used for calculating a regression curve. The equation of the best-fitting curve is then used for correction of the data obtained from the samples of interest. The process can be applied irrespective of the locus interrogated and the number of sites analysed, avoiding an optimization of the amplification conditions for each individual locus.}",
    issn = {0305-1048},
    doi = {10.1093/nar/gkr213},
    url = {https://dx.doi.org/10.1093/nar/gkr213},
    eprint = {http://oup.prod.sis.lan/nar/article-pdf/39/11/e77/16775711/gkr213.pdf},
}
```

## What kind of data can be corrected by rBiasCorrection/BiasCorrector?

Currently, both R packages, `rBiasCorrection` and `BiasCorrector`, can correct measurement biases in DNA methylation data of the type "one locus in many biological samples". The programme has been tested on data derived by bisulphite pyrosequencing, next-generation sequencing, and oligonucleotide microarrays. A future implementation is planned for correcting data of the type "many loci in one biological sample". However, with some effort, the latter can be transformed into data of the first type in order to be corrected with `BiasCorrector`.

## Do my input files need to be in a special format?

Yes, rBiasCorrection/BiasCorrector places very strict requirements on the file format. Below is a description of the exact requirements.

All uploaded files must

* be in CSV format [file endings: \*.csv and \*.CSV]
* contain the column headers in the first row
* have the same number of CpG sites per locus ID in every provided file

* Experimental data:
  + the first column contains the sample IDs (alpha-numeric characters only)
  + the other columns contain the results of your methylation analysis with one column for each CpG site
  + sample IDs may occur more than once, indicating repeated measurements of the same sample (in this case, the mean values of the replicates will be used for bias correction)

* Calibration data:
  + the first column contains the percentages of the actual methylation of the calibration samples (calibration steps, numeric)
  + these calibration steps (CS) must be in the range 0 <= CS <= 100
  + a minimum of four distinct calibration steps is required
  + the other columns contain the calibration samples' results of the methylation analysis with one column for each CpG site
  + calibration steps may occur more than once, indicating repeated measurements of the same calibration sample (in this case, the mean values of the repeated measurements will be used for calculation of the calibration curve)

(BiasCorrector currently requires the data to be in the format "one experiment per locus, multiple samples per experiment". Results of high-throughput analyses that have a different shape (e.g. one CSV file per calibration step) therefore need to be reformatted as described above before BiasCorrector can be applied, i.e. one CSV file with the experimental results and one CSV file with the calibration results, both files having equal column names.)

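The calibration-file rules above can be illustrated with a short, self-contained sketch. It is written in Python purely for illustration and is not part of either R package. The function name `load_calibration`, the column names, and the sample values are hypothetical. The sketch checks the header row, the range 0 <= CS <= 100, and the minimum of four distinct calibration steps, and it averages repeated measurements of the same step.

```python
import csv
import io
from collections import defaultdict


def load_calibration(csv_text):
    """Parse calibration data; return (cpg_names, {step: [mean per CpG]})."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        raise ValueError("expected a header row followed by data rows")
    header, data = rows[0], rows[1:]
    cpg_names = header[1:]  # first column holds the calibration steps

    grouped = defaultdict(list)
    for row in data:
        if len(row) != len(header):
            raise ValueError("row length does not match the header")
        step = float(row[0])  # calibration step (CS) must be numeric
        if not 0 <= step <= 100:
            raise ValueError(f"calibration step {step} is outside 0 <= CS <= 100")
        grouped[step].append([float(v) for v in row[1:]])

    if len(grouped) < 4:
        raise ValueError("at least four distinct calibration steps are required")

    # Average repeated measurements of the same step, column by column.
    means = {
        step: [sum(col) / len(col) for col in zip(*reps)]
        for step, reps in sorted(grouped.items())
    }
    return cpg_names, means


# Hypothetical calibration data with a repeated measurement at 25%.
example = """CS,CpG1,CpG2
0,1.0,2.0
25,20.0,22.0
25,22.0,24.0
50,41.0,45.0
100,97.0,99.0
"""
names, means = load_calibration(example)
print(names)        # ['CpG1', 'CpG2']
print(means[25.0])  # [21.0, 23.0]
```

The packages perform further checks (for example on file names and on the consistency between calibration and experimental files) that this sketch does not attempt to reproduce.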
### Example files

Example files are available for download to demonstrate how to prepare files appropriately:

* calibration data: [Example_calibration.csv](tests/testthat/testdata/cal_type_1.csv)
* experimental data: [Example_experimental.csv](tests/testthat/testdata/exp_type_1.csv)

### Template files

Template files are available if you want to copy and paste your data. Please note that you might have to adjust the column headers and sample IDs or calibration steps:

* calibration data: [Template_calibration.csv](inst/template_calibration.csv)
* experimental data: [Template_experimental.csv](inst/template_experimental.csv)

## Are there any requirements for naming the files?

The filename must not contain additional dots (".") beyond the one in the file ending.

## What is exactly done during rBiasCorrection's/BiasCorrector's data preprocessing?

During the preprocessing, all requirements of the input files as stated in [Do my input files need to be in a special format?](#do-my-input-files-need-to-be-in-a-special-format) are checked. Furthermore, the mean methylation percentages of all CpG columns are calculated for every provided file.
If any of the above-mentioned file requirements is not met, an error will occur. For example, an error message will pop up if any calibration step is not within the range 0 <= CS <= 100 or if you provided fewer than four calibration steps in your input data.

## What are the regression statistics?

The regression statistics table shows the regression parameters of the hyperbolic and the cubic polynomial regression.

- Column 1 presents the CpG site's ID.
- Column 2 contains the mean of the relative absolute errors for every interrogated CpG site.
- Columns 3-9 comprise the sum of squared errors of the hyperbolic regression ('SSE [h]') and the coefficients of the hyperbolic equation that describes the hyperbolic regression curves for the respective CpG sites.
- Columns 10-15 summarise the sum of squared errors of the cubic polynomial regression ('SSE [c]') and the coefficients of the cubic polynomial equations.
- The rows highlighted with a green background colour indicate the regression method (hyperbolic or cubic polynomial) that is suggested by BiasCorrector for correcting data. This automatic choice of the regression method relies on either minimising the value of SSE (the default setting) or minimising the average relative error, as selected by the user in the Settings tab.

## What are 'substitutions' in my final results?

Substitutions occur if no result is found in the range of plausible values between 0 and 100 during the bias correction. A 'border zone' of 10 percentage points is implemented below 0% and above 100%. If a result is in the range -10% < x < 0% or 100% < x < 110%, the value is substituted in the final results with 0% or 100%, respectively. Values beyond these border zones will be substituted with a blank value in the final output, as they seem implausible and could indicate substantial errors in the underlying data.
For detailed feedback, the substitutions table shows the results of the algorithm ('BiasCorrected value') and the corresponding substitution ('Substituted value') for the respective CpG site.
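
The substitution rule described above can be sketched as follows. This is an illustrative Python snippet, not code from either R package. The function name `substitute` and its parameters are hypothetical. It simplifies the procedure to a single candidate value per CpG site, and it assumes that values lying exactly on the outer edge of a border zone (-10% or 110%) are treated as implausible, which follows from the strict inequalities in the text but is not stated explicitly.

```python
def substitute(value, lower=0.0, upper=100.0, border=10.0):
    """Return the value reported after substitution, or None for a blank."""
    if lower <= value <= upper:
        return value   # plausible result, reported unchanged
    if lower - border < value < lower:
        return lower   # -10% < x < 0% is substituted with 0%
    if upper < value < upper + border:
        return upper   # 100% < x < 110% is substituted with 100%
    return None        # beyond the border zones: blank in the final output


print(substitute(42.5))   # 42.5
print(substitute(-3.2))   # 0.0
print(substitute(104.0))  # 100.0
print(substitute(120.0))  # None
```

Returning `None` stands in for the blank cell in the final output. A substitutions table like the one described above would pair each input value with the value returned here.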