3.3.1
- Fixed alignment concat where results could be truncated if several empty slices followed one another (e.g., if concat A,B,C and A and B are empty, goby ca could yield an empty alignment, completely omitting alignments in part C).

3.3.0
- Substantially reduced memory utilization for discover-sequence-variant (all modes).
- discover-sequence-variant could in some rare cases output the same base twice (when indels were extending prior to the beginning of the read after equivalent indel region calculation). This fix improved indel performance when training models with variationanalysis 1.3.3+.
- Initial work to develop models for genomic segments (see .ssi format and concurrent work in variationanalysis). This is work in progress. Protobuf schema is in goby-io/protobuf/SegmentInformationRecords.proto. Models are developed in parallel with Keras (in goby3/python/dl) and DL4J (in variationanalysis).
- Updated genotyping model to state of the art (models/genotyping/1510204519948/, see evaluation results in the folder).

3.2.7
- Somatic output format: report predicted somatic allele in VCF.
- Variant.FromTo: defined SerializeID. This requires regenerating varmaps.
- sbi output: Set position and reference base on list copy. Fix for reference base being '\0' in sbi files.
- vcf-to-genotype-map: Fix VCF to varmap. Incorrect genotypes were added prior to this commit (since refactoring to use VCF reader from HTSJDK in version 3.2.6). Show better statistics when creating the map. Fix for indels not imported in varmap.
- GenotypesOutputFormat: Complete rewrite. Fix VCF coding of het sites. Also, when using a model, we now check sampleCount, in case the model does not use the matchesRef feature, because such models may return a default non-reference base for sites with no coverage.
- Add usage to goby wrapper. Do not attempt to configure R unless the variable GOBY_USE_RJAVA is configured.
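Illustration for the 3.3.1 concat fix: a minimal Python sketch (not Goby's Java code) of a concatenating iterator that keeps advancing past any number of consecutive empty parts. The name concat_entries and the plain lists standing in for alignment slices are hypothetical; these notes do not say how the original bug arose inside Goby.

```python
from typing import Iterable, Iterator, Sequence

# Unique marker meaning "this part has no more entries".
_EXHAUSTED = object()


def concat_entries(parts: Sequence[Iterable[str]]) -> Iterator[str]:
    """Yield the entries of every part in order, skipping empty parts."""
    iterators = [iter(part) for part in parts]
    index = 0
    while index < len(iterators):
        entry = next(iterators[index], _EXHAUSTED)
        if entry is _EXHAUSTED:
            # Move on and re-check: the next part may be empty as well.
            index += 1
            continue
        yield entry


if __name__ == "__main__":
    # Parts A and B are empty; the entries of part C must still come out.
    print(list(concat_entries([[], [], ["c1", "c2"]])))
```

The loop re-tests the next part after each exhausted one, so a run of empty parts cannot end the iteration early.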
3.2.6
- Updated models for compatibility with latest code: genotyping model and somatic models are updated.
- Tested that models produced with variationanalysis (genotype and somatic) load in Goby and can be used with the modes to generate VCF.
- Various bug fixes to last-to-compact mode. Bugs were triggered by output from more recent versions of Last than tested previously.
- Discover-sequence-variations mode: fix VCF output for indels. Genotypes format mostly rewritten. Was previously writing incorrect indels. Latest code produces VCF files tested for compatibility with RTG vcfeval.
- Discover-sequence-variations mode: Add minimum-P and stringent-P options to Genotypes output format.
- Rewrote VCFToGenotypeMapMode to use HTSJDK VCF parser. This should enable using BCF files as input as well.
- Fix for count of indels. The first equivalent indel region did not increment the count. Counts on forward and reverse now match the number of supporting entries on each strand.
- Add supporting entry the first time an indel is created in a SampleCountInfo. The supporting entry was not set on the first one.
- Apply count fixer to remove bases matching ref from list, when the mandatory filter has determined the base should be removed. Previously was only removed from counts, but not from list of bases. One possible candidate for indel performance problems we have tried to fix for a while.

3.2.5
- Fix issue with toProto that prevented using more than one sample for genotyping with goby.
- alignment conversion to goby: ignore missing MD tags (it is possible only some reads are missing them and we still need to convert the other aligned reads).
- Upgrade goby to DL4J 0.8.0.
- fasta-to-compact: Do not use an assertion, but instead reset read index to zero and explain how to avoid the problem.
- SBI format: add distance from start of read and end of read. Will be mapped to a density in next genotype mapper.
  Should help variationanalysis models detect cases where end of alignment is fully contained within homopolymer region.

3.2.4
- Fix tally-reads mode.
- Some fixes to realignment of SNPs around indels.
- Improvements to barcode remover (to trim bases from 5' end before removing barcode).
- Goby version now reports the commit that produced the distribution.
- Goby version, including commit, now written to generated .sbi files.
- Introduce CommitPropertyHelper to record the specific commit that produced the version of Goby being used.

3.2.3
- Fix SNP bug in realignment around read insertion.
- Add queryPosition field to SBI output.
- Prevent the writing of sbi entries when AddTrueGenotypeHelper indicated the entry should not be added.

3.2.2
- Fix frequency of bases when indels are also present. Now correctly removes bases that support the flanking sequence of the indel and does not double count.
- Many changes to how we store varmaps introduced to support indels (vcf-to-varmap). The serialization format is incompatible with previous versions, so make sure you regenerate varmaps from VCF.
- Adjust VCF output for compatibility with REF/ALT conventions. This makes it possible to measure performance with standard tools such as RTG vcfeval (http://realtimegenomics.com/products/rtg-tools/).
- Keep counts of indels separately for forward and reverse strand.
- vcf-to-varmap mode: improved semantics of the --chromosome-prefix option, which now allows removing (e.g., -chr) or adding (+chr) a prefix to chromosome names.

3.2.1
- fasta-to-compact: fix a bug introduced on 10/6/2016 which created negative read entries.
- Catch a number of exceptions that can be thrown by HTSJDK when processing BAM files. Exceptions are caught so that an error on one alignment does not interrupt processing of an entire alignment. Errors are shown in log.
- vcf-to-genotype-map mode now supports (b)gzipped vcf input.
- vcf-to-genotype-map: fix bug that manifested itself when the vcf had a single genotype field.
- vcf-to-genotype-map: add chromosome-prefix argument to help import VCF where the chr prefix is missing.

3.2
- Remove memory leak when reading SAM/BAM files. This was the likely cause of the out-of-memory error in compression benchmarks (it had nothing to do with compression but with the conversion of SAM/BAM to goby representation).
- Disabled tests that could not succeed anymore (because of choices we made in Goby 3, such as lack of auto-upgrade for alignments produced with Goby 1 and 2).
- BAM/CRAM support. Added an option to bypass the header check on SO:COORDINATE. Use -x HTSJDKReaderImpl:force-sorted=true to force Goby to consider an alignment sorted.
- SBI format: add ability to add true labels while writing the file. Add support for downsampling sites without variants.
- Genotype format: reorganization to support calling with deep learning models trained with variation analysis.

3.1
- Reorganize model prediction to facilitate installing new versions of the variationAnalysis jars. Goby 3.1 is now compatible with variationanalysis 1.1.1.
- Replace models with versions trained with variationanalysis 1.1.1.
- Add somatic mutation models trained with whole genome data (ICGC GoldSet).

3.0.0
- Support reading BAM alignments directly with Goby APIs.
- Support probabilistic models for calling somatic variations, trained with deep learning.

2.3.6
- Improve performance of realignment around indels when processing RNA-Seq reads. Previous versions of Goby had scalability issues and kept data around from previous chromosomes. This was OK when processing DNA-Seq inside GobyWeb, which splits data into genomic slices, but not when trying to process one or more RNA-Seq alignment files. Performance has also been dramatically improved by fixing a bug on indel equality.

2.3.5
- Add a mode to infer sex of samples from data (tested on exome data). Useful as a quality control to check that the data you get is consistent with what is known about the samples.
  See --mode infer-sex. Works faster on sorted alignments where the index is used to jump quickly to the human sex chromosome.
- Prevent AbstractAlignmentToCompactMode from printing more than 10 warnings if quality scores are not available in an alignment.
- suggest-position-slices: fix a bug that caused some slices to overlap. Found with a job with hundreds of alignments, so not common.

2.3.4.1
- Add an option to the fasta-to-compact mode that will convert a set of files and concatenate the result to a single compact-reads file (see new --concat option).
- Add a mode to test that the connection from Goby to R is working (requires JRI and R built with shared library support). The mode is called test-r-connection (tcr).
- Restore STRICT_SOMATIC filter.
- Close files opened when loading Goby Alignment header and index files. This fixes a "too many files" error that could occur when loading hundreds of alignments simultaneously.
- Allow lenient import mode for TSV files. This makes it possible to convert TSV files to lucene.index when they have been created with Goby in the past with a \t character as last character of the column line.
- Fix a bug that caused some slices to occur within annotations, despite the --annotation option being given on the command line. The problem was that the chromosome index was not obtained from the genome and was always set to zero.

2.3.4
- Optimize the speed of genotyping when some sites have very high coverage (>500M bases). Now sub-sampling to keep a random set of 10,000 bases for such sites. Expose the default sub-sample size with a dynamic option called sub-sample-size in IterateSortedAlignmentsListImpl. (-x IterateSortedAlignmentsListImpl:sub-sample-size )
- LastToCompact mode now supports the import of paired end alignments produced by Last's last-pair-probs.sh.
- LastToCompact mode now supports the import of quality scores (lastal must be run with -Q1 since the import assumes Phred quality scores on the q lines).
- Add two methods to AlignmentReader to determine the minimum and maximum genomic locations represented in the reader. This is useful when suggesting slices to split a set of alignments. This commit includes a fix for possible null start or end positions in slices generated with suggest-position-slices.
- Fix a problem with run-in-parallel where some threads would never finish when they do not detect the keyword. Now indicate that the thread finished so that others can start when the processing completes.
- reads-file-stats: remove any path from basename in the output.

2.3.3
- IterateSortedAlignmentsListImpl: Use a WarningCounter to limit warnings to 10 instances. This is needed to avoid writing gigabytes of log output when the threshold is met.
- discover-sequence-variants somatic output: Make it possible to run a simple trio design by removing the requirement for a germline sample.
- discover-sequence-variants somatic output: Earlier versions were reporting somatic variation candidates when two parents are homozygotes and the somatic sample was Het (the fisher p-value with each parent is very significant in this case, but does not indicate a somatic change). This also improves q-values because there are fewer results that need to be corrected.
- discover-sequence-variants somatic output: Add an error message when a sample is misspelled in the covariates file.
- Refactor code base to keep base counts for forward and reverse strands separately in SampleCountInfo.
- Normalize somatic priority score by number of mapped reads, and number of parents and germline samples used in the calculation.
- Add a StrandBiasFilter in somatic analyses. The filter rejects variations that are not represented on both strands when at least j reads support the variation. The value of j is set to 9 by default, so a variation with 10 bases needs to have at least the two strands represented.
- Remove candidate somatic variations that can occur when the germline samples have less coverage than the somatic sample. Now require at least twice the coverage in the somatic sample compared to the minimum coverage in the germline samples.
- Add a STRICT_SOMATIC filter that flags genomic sites where some bases appear in support of the variation in the parents or germline samples. Please note the VCF spec semantic: PASS indicates that all filters passed. This means that lines with the STRICT_SOMATIC value in the FILTER column failed that test.
- Fix a bug in FDR mode that would not handle vcf files with non-default FILTER values.

2.3.2
- run-parallel-mode now supports paired input files.
- fasta-to-compact: add --force-quality-encoding option to force the quality values within the specified encoding range.
- suggest-position-slices: fix problem where first slice of genome was omitted from output (with new split by number of bytes option introduced in 2.3).

2.3.1
- Fix for https://github.com/CampagneLaboratory/goby/issues/3
- Upgrade commons-io and dsiutils to latest jar versions. Log messages when scanning reads file with cfs mode.
- DistinctValueCounterBitSet: now grows to biggest size at construction time.
- Fixed a performance problem. When reading large reads files (>10GB), performance of ReadsReader would degrade over time. This was due to caching of data in static protobuf methods of ReadCollection. We now create a builder instance that gets garbage collected when it is no longer used. This fixes a subtle performance problem. The same fix has been applied to alignment readers.

2.3
- concatenate-alignments mode: add ability to restrict output to a genomic slice (see -s and -e options).
- API change: AlignmentSliceHelper makes it easier to parse and process genomic slices for sets of alignments.
- concatenate-alignments mode: now transfers read groups to output in the same way that non-sorted concat does.
- concatenate-alignments mode: Add a mechanism to override/define read groups/read origin info on the fly when reading alignments that did not include them. Coupled with changes to compact-to-sam, this makes it possible to get BAM files with read groups directly from Goby alignments.
- compact-to-sam mode: fixed output of read groups, which were not correctly written for platform, platform unit, and library.
- suggest-position-slices: add --restrict-per-chromosome option. When this switch is provided, slices will be restricted to start and end on the same chromosome. This is useful to produce intervals to give Mutect, for instance.
- Trim mode: add --trim-left --trim-right parameters to control trimming of specific sequence extremities.
- Trim mode: add --verbose flag.

2.2.1
- FDR mode: add ability to read groups from VCF file and adjust columns/fields marked as p-value. Mark adjusted columns with group q-value.
- Somatic variation output format: annotate somatic p-value column with 'p-value' group. Fix the type of the p-value column to be a number (was String in release 2.2).
- Somatic variation output format: handle unrecognized sample-ids in the parents column.
- discover-sequence-variants mode: add assertion to give a hint to the user that the syntax is incorrect for the -s and -e options.
- compact-file-stats mode: print progress when scanning reads files. Use a buffered reader to improve read file parsing performance.
- discover-sequence-variants: adjust multiplier for left-over filter for somatic variations output format.
- discover-sequence-variants: Add a new filter to remove indels at a site where a sample shows lots of distinct possible indels. Indels at these sites are very likely to be artefactual. We count the number of samples where three distinct indel genotypes are seen. If more than 1/4 of the samples have likely indel artifacts, we remove all indel candidates at the site.
    maxIndelPerSite:Maximum number of distinct indels at a given genomic site.:1
  Additional filter:
    fractionOfSamples: Maximum fraction of samples that can have an indel candidate for the indel to be considered (indel candidates that occur in many samples are more likely to be spurious).:0.25
  This filter is added to the somatic variations output format. See dynamic options for this filter with --x-help

2.2
- Remove threshold effects when calling genotypes in several samples. Modified the filters to not remove bases in specific samples when the genotype survived filters in at least another sample. Previous versions reported these threshold edge effects as differences, which could be confusing. This version simply shows the marginal raw base counts in samples where the genotype could have been removed by a filter, which makes it easier to compare the strength of the genotype support across samples. This adjustment was done for both base genotypes and indel genotypes.
- LeftOverFilter: now uses minVariationSupport as minimum threshold.
- Mode suggest-position-slices: add option number-of-bytes to suggest slices with a uniform number of compressed bytes. This option aims to provide more balanced slices in cases where the genome has very non-uniform coverage by position. With this option, the number of slices is determined to yield slices that need to decompress about the number of bytes indicated on the command line.
- Framework API change: introduce class PositionToBasesMap to use as type for positionToBases. The class provides methods to get the range of positions described in the map. This unfortunately requires changes to all clients/implementations of IterateSortedAlignments.
- Mode discover-sequence-variants: Fix various problems that prevented reporting genotypes for deletions (i.e., C/-).
- Fix a potential NPE in GroupAssociations when samples are null.
- Fix for issue #2, see https://github.com/CampagneLaboratory/goby/issues/2
- Expose comparator in SortedAnnotations.

2.1.2
- Upgrade xstream to version 1.4.3. This fixes the compatibility problem seen when running goby 2.1.1 with java 1.7+. Goby 2.1.2 should run with Java 1.7+, but more testing will be needed to rule out other migration problems. If you are running JDK 1.7+ please let us know of any issues you encounter.
- Fix VCFParser issue https://github.com/CampagneLaboratory/goby/issues/1. The issue could be triggered when the FORMAT column changed from line to line.
- VCFWriter: improve support for VCF group associations. The Goby VCF parser makes it possible to associate columns to groups (these associations are written in a ##FieldGroupAssociations field).
- Methylation rate VCF output: mark the context column with group 'indexed'.
- Do not try to upgrade alignments when reading the header to concatenate permutations. This is not necessary and can open too many files when we are trying to concatenate alignments.

2.1.1
- Add extract-splicing-events mode. This mode is used by GobyWeb 1.9 to extract splicing events from spliced Goby alignments (generated either by GSNAP or STAR at this time).
- Trim mode: Fix bug that caused quality scores to be duplicated (the bug triggered the assertion that checks that sequence length equals quality length).
- Trim mode: Some sequence must remain after trimming to append to the output.
- Fix bug in alignment-to-annotation-counts where counts would be zero for samples whose name contained a period '.'. The code was incorrectly stripping alignment extensions twice.
- alignment-to-annotation-counts: add comparison description to t-test statistic column name (e.g. t-test[A/B] rather than t-test). This change makes it possible to retrieve the t-test p-values when more than one comparison is performed.
- Fix a bug where RandomAccessAnnotations could return results on a different chromosome.
- Add annotation loading test and fix for when annotation file is truncated. Goby now loads annotations up to the truncation and logs truncated lines.
- Correct calculation for fold-change-magnitude column in goby diff exp mode. Previous calculation under-estimated magnitude when comparing low rpkms.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not return 404 errors for missing content).

2.1
- Improve compression of hybrid-1 codec by about 8% on average at similar speed. You can enable this improvement with option -x AlignmentCollectionHandler:symbol-modeling=plus. This option will be made the default in a future release. It is not currently the default since Goby 2.1 has not been integrated into IGV and will need time to propagate from IGV dev to production builds.
- Remove import of NH:i bam tags as read-origin-index, since the NH tag seems to contain different types of data depending on the aligner that produced the alignment.
- compact-to-sam mode: fix bug where bam tags containing a colon character (:) would be truncated after the first colon. Thanks to Vadim Zalunin for reporting this problem.
- compact-file-stats: Add a feature to scan only alignment headers.
- VCFParser group associations: Make it possible to look up an INFO column by either INFO/colname or colname.
- NonAmbiguousAlignmentReader: fix an NPE when reading alignments where all entries have the ambiguity field.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not return 404 errors for missing content). Thanks to Jim Robinson and Helga Thorvaldsdottir for reporting this issue.
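Illustration for the 2.1 compact-to-sam tag fix: SAM optional fields have the form TAG:TYPE:VALUE, and VALUE may itself contain colons. The Python sketch below (not Goby's Java code) shows how splitting on every colon truncates such a value, and how limiting the split to the first two colons preserves it. The function names and the XX tag are hypothetical.

```python
from typing import Tuple


def naive_tag_value(field: str) -> str:
    """Buggy variant: splits on every colon, so the value is cut short."""
    return field.split(":")[2]


def parse_sam_tag(field: str) -> Tuple[str, str, str]:
    """Split TAG:TYPE:VALUE on the first two colons only."""
    parts = field.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected TAG:TYPE:VALUE, got %r" % field)
    tag, type_code, value = parts
    return tag, type_code, value


if __name__ == "__main__":
    field = "XX:Z:chr1:100:+"
    print(naive_tag_value(field))   # chr1
    print(parse_sam_tag(field))     # ('XX', 'Z', 'chr1:100:+')
```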
2.0.1
- Release Goby C/C++ APIs under the LGPL license version 3 to make it possible for companies to incorporate support for Goby formats in their tools. Thanks to Collin Hercus for the suggestion. Please note that parts of the Goby Java APIs are already licensed under the LGPL (anything packaged under the Goby-io.jar file).
- C++ API: Support to set placed unmapped (i.e., mate that does not map is recorded with the read that mapped) and clipleft/clipright with quality scores.
- Fix problem when using a genome backed by a samtools/picard faidx file. In some cases, read bases would be returned shifted by one position. Thanks to James Bonfield for reporting this problem.
- SAM/BAM tags start at column 12, index 11. --preserve-all-tags could skip the first tag on some datasets (e.g., datasets where the first tag was not a MD:Z or RG:Z). Thanks to James Bonfield for reporting this problem.
- Introduce interface for ReadsWriter. Introduce mock implementation to write reads to text. This is useful to write more intelligible JUnit tests.
- mode sam-to-compact now supports option --read-names-are-query-indices to indicate that the read names are integers (typically produced by compact-to-fasta from a chunk of a large file).
- Fix a bug in reformat-compact-reads which did not trim quality scores for paired end reads correctly.

2.0
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- Added a mode sam-comparison to compare a source SAM/BAM file with one generated after sam-to-compact then compact-to-sam.
- Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries in memory to re-sort these entries by genomic position.
  This feature is used when importing already sorted SAM/BAM files to create sorted Goby alignments and the files contain spliced alignments that would cause mis-ordering during conversion.
- Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int
- Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file. The concatenate mode automatically reassigns read_origin indices (see field read_origin_index) to prevent conflicts when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin information, and let the client decide what origins/groups are equivalent given the type of analysis at hand. Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
- Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact to enable this.
- Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
- Supports bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2 dynamic option).
- Renamed SortMode to Sort1Mode. Renamed SortLargeMode to SortMode.
- Added SortLargeMode which can sort compact alignments of any size, multithreaded.
- Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode, throwing at it various input BAM files, sorted or not, and fixed the bugs we found.
  For instance, the --sorted option would not work in some 1.9 versions of Goby after samtools/picard changed the semantic of the record comparator Goby relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM files as sorted Goby alignments.
- Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that will make it easier to diagnose problems on a command line without having to scroll back up the console.
- Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true when the logging system is not initialized. For SamHelper, this means calling String.Format millions of times to create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode when logging is not properly configured.
- Refactor dynamic options with a central registry, and make GobyDriver handle option parsing. This removes duplication of code parsing for each mode that would need dynamic options.
- methylation region can now estimate empirical p-values. Empirical P-values require biological replicates in at least one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is used to estimate the p-value of observing the between group differences. Such empirical p-values can control FWER in the strong sense.
- Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant sites were found in a moving window that ends at the site (significance is judged according to a configurable threshold on the empirical p-value).
- New empirical-p mode to estimate p-values from data in text files.
  This makes it easier to derive p-values for simulated data or counts generated by tools other than Goby.
- Make it possible to open Goby alignments through HTTP. Simply specify a URL as a basename as argument to the goby tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
- Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
- vcf methylation format: removed space in name of C and Cm group INFO fields.
- Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
- Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
- Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and to limit the number of alignment entries to output (-n).
- The RandomAccessSequenceCache had problems with bases that weren't G/A/T/C/N. Such bases would be skipped silently, causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where an R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following bases. Please note that the problem was in a library used by RandomAccessSequenceCache. We updated the library in this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
- last-to-compact: add option to substitute some bases with others in the aligned read.
- Add test and fix for bug that went back to start of alignment file, even though iterate alignment was created for a slice of input.
  The problem only affected the IterateAlignments class because it was calling reposition(0,0) and the method did not enforce slice limits.
- The code base was simplified by removing the now obsolete align mode.
- Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her data on GobyWeb.
- DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The java BitSet implementation was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g., 2,080,948 which can be stored with about 230 MB).
- AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be stored in this field.
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of the quality scores contained in the file.
- Added a mode sam-comparison to compare a source SAM/BAM file with one generated after sam-to-compact then compact-to-sam.
- Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
- version mode now prints an official version number if the jar contains a VERSION.txt file.

1.9.8.3.1
- Fix a bug related to writing paired end alignments in the Gsnap parser (C API).

1.9.8.3
- Added a methylation_region format capable of averaging methylation rates for different cytosine contexts over arbitrarily defined regions.
- Added a diploid genotype filter to use when calling genotypes in a diploid genome.
- discover-sequence-variants format compare_groups: Write distinct fisher p-values for each comparison pair.
- Fix FDR mode output for TSV format. Make open --column-selection-filter work.
- Fix bug that prevented methylation vcf output from writing any line.

1.9.8.2.1
- Fix bug in GenotypesOutputFormat that caused GenotypesOutputFormat to throw an exception when processing some sites.

1.9.8.2
- Make it possible to activate indel calling without recompilation. Mode discover-sequence-variants now accepts the boolean argument --call-indels true/false.
- Preliminary support for calling indels with discover-sequence-variants. Candidate indels are now written in the formats that use GenotypeOutputFormat (e.g., genotypes, compare_groups, allele_frequency). The method of Krawitz et al is used to determine the equivalent indel region (EIR) for each possible candidate. After possible realignment, and filtering to remove possible errors, EIR are reported with their frequencies. Please be advised that the VCF spec(s) are rather vague and as a result often interpreted differently by different programmers. This is especially true of the parts of the specification(s) that describe how to report indels. As a result of this situation, you might run into problems when trying to load indel-containing VCF files generated with Goby into other tools.
- vcf-subset: Add ability to exclude positions at which all samples match the reference.
- Add a replacement for the VCF-tools VCF-subset program. The Goby tool is orders of magnitude faster.
- Improve vcf-compare mode. Now has the ability to provide a random sample of the positions that differ between the files being compared. Random samples are calculated for each kind of difference (missing from one file, missing one allele, two alleles, different genotypes).
- vcf-compare now outputs Ti/Tv ratios for each sample in input file (in the output file only).
- Fix scalability problem with local realignment code.
Local realignment around indels would slow down as more entries were processed. This is now fixed so that speed is constant across large alignments. - Fixed index file writing. In some conditions, parts of the alignment past the 2GB mark were not accessible with skipTo when reading files larger than 2GB. Use the upgrade mode to fix old alignments at a specific time, or use Goby as usual to have alignments upgraded on the fly. - Add mechanism to upgrade/fix large alignments indices with Goby 1.9.8.2. The upgrade mechanism uses concatenate alignment to rewrite an alignment index file if the size of the entries file exceeds 2GB. This is rather slow as the process reads and writes large alignments to produce the new index file. While slow, upgrading is still faster than aligning the reads again. The process also requires approximately double the alignment size as the new alignment files are written. Alignments smaller than 2GB are quietly ignored since they were not affected by the bug. - Codecs: Add support to decode alignments with a codec in AlignmentReader. - Improved ReadsReader to find a suitable decoder when several codecs exist. - Prevents local realignment from running out of memory when processing positions where clonal reads create huge peaks. - Make filterIndels remove from sample count info object, not just from list of bases. - Fix VCF genotypes that could look like 0/0/1/1 to be 0/1 (seen with indels only). - Only write allele base count in VCF BC field when the count is not zero (useful with indels). 1.9.8.1 - Discover-sequence-variants: add ability to describe zero, one or more group comparisons. Syntax is A/B,A/C to compare group A to B and group A to C. Additional pairs can be described, separated by commas. - Extend methyl-stats mode to estimate fraction of methylated cytosine observed in CpX contexts. - Discover-sequence-variants, genotype format: Fix a bug where alleleSet was cleared in each sample, rather than before any sample is processed. 
This made it possible for some positions to be ignored erroneously when samples were given in a specific order on the command line. Specifically, positions would be ignored if they were not typed (i.e., not enough good bases) in the last sample given on the command line. - Optimize merging of TMH when the files are large (>100M compressed). - Fixed a major bug where NonAmbiguousAlignmentReader would stop iterating after encountering an ambiguous alignment. Alignments with shorter reads were much more likely to be affected. - Fix sam-extract-reads for paired-end BAM files. Each BAM file contains both pairs. To convert to compact reads, the input BAM file must be sorted by read name, since this is the only way we can put the pairs back together in one Goby record. - Mode discover-sequence-variants now limits the maximum coverage per site in order to limit the impact on peak memory of a few very high coverage sites. The default setting is 500,000x and can be changed with option --max-coverage-per-site - Switched IndexedIdentifier to an AVLTreeMap to help scale when we have millions of elements to compare in diff exp. - Fixed a subtle bug in IterateSortedAlignment that would cause iteration to return partial results for some alignments when restricting results to a window. The problem would manifest more clearly for alignments against genomes where contigs have smaller indices than chromosomes and chromosome sequences are listed in non-increasing order (e.g., chr 16 appearing before chr 10) and restricting to window from chr16 to MT (which should include chr 10 in that genome, but returned no result on chr 10). - Trim mode: Fix exception that could occur when trimming reads with no quality scores. - Change goby script to request the bash shell explicitly. This is needed on systems where /bin/sh is not a synonym for bash. Thanks to Martin Frith for catching this on Ubuntu. - Change how targetLengths are concatenated. 
It turns out that last-to-compact needs alignment entries matching the target to record the length in the alignment. We need to keep any length seen when we concat because the first chunk may just not have the length for the remaining parts. - Improved logic for --paired-end filename support in the fastaToCompactMode. - Fix a NPE in suggest-position-slices that could occur with very small alignment files. 1.9.8 - The BaseStats utility was transformed into a Goby mode (base-stats). The new mode has the ability to tally occurrence of CpX motifs in reads. Useful as a proxy to the amount of unconverted Cs in bisulfite converted reads. - The methyl-stats mode takes a VCF file produced by Goby methylation output and a genome and calculates various statistics about the distribution of fragment lengths between CpG interrogated by the assay. - FDR mode now accepts --column-selection-filter to select columns matching string. - Proof of principle that protocol buffer can seamlessly cohabit with data-specific compression schemes. The --codec option on fasta-to-compact is introduced to activate compression of reads when writing compact reads. The codec provided (called read-codec-1) achieves about 10-12% better compression of read files than pure protocol-buffer encoding. This read-codec-1 codec stores bases and quality scores with an arithmetic coder in a protocol buffer field called 'compressed_data'. Please note that we do not recommend using this option at this stage since the C/C++ APIs cannot load data encoded with this codec at this time. - Add ability to run alignment-to-annotation-counts on a specific genomic region (see --start-position and --end-position). - alignment-to-annotation mode has a new option (--remove-shared-segments). When active, this option will remove annotation segments when they partially overlap with more than one primary annotation id. 
When this option is selected and the primary id is a gene, and secondary id is an exon, the mode will remove exons that are associated with several genes. When the option is used with transcript id as primary and exon as secondary, exons are removed that are shared across different transcripts of the same gene. - mode base-stats now supports multiple input files. - VCFParser will now set column type when reading TSV files by using TabToColumnInfoMode to scan the actual values stored in the TSV file. The first time this is done for each file, a .colinfo file will be created and then used if the file is read again by VCFParser in the future. - Added the mode tab-to-column-info to read the data from TSV files to determine the column types (double/integer/string). Write a .colinfo file detailing the column names and types. - Upgraded to SAM JDK 1.52 - Modes sam-to-compact and sam-extract-reads now set SILENT validation before reading file header. This is required because the SAM JDK validation rules are more stringent than required by the specification. This means that some valid SAM files (per the SAM spec) cannot be parsed without error when the strict validation is used. - Fixed a bug with ReadsQualityStatsMode when SampleFraction == 1.0d, such as for files with a small number of reads. - Mode sam-extract-reads now supports extracting reads from paired samples. See the new options --paired-end and --pair-indicator. These options work similarly to the fasta-to-compact options. - Fix problem with suggest-position-slices that could create empty slices. - Fix bug in discover-sequence-variants methylation format that wrote methylation rates only for up to two samples. - Fix bug in alignment-to-counts that caused problems with large alignments. 1.9.7.3 - Fix allele frequency format to write genotype first in FORMAT per vcf spec. - Add new INFO fields in compare group vcf format to show allele counts in each group. 
- Ability to support short versions of mode names, such as "compact-file-stats" has the short mode name "cfs". There is a default short mode name generation implementation in AbstractCommandLineMode.getShortModeName() but each mode class can override this method in the case of short mode name collisions. In the case of collisions, the command line parser will not offer/accept ANY short mode names for the classes in question. - SamToCompact: Generate sorted goby alignments when a sorted BAM file is provided as input (use --sorted flag to activate this option). Thanks to Bradford Powell for the suggestion and draft implementation. - Fixed a bug in tally-reads that was triggered by reads of different lengths. Thanks to Adrian Platts for the bug report. 1.9.7.2 - Fix realignment around indels bug that prevented reads from being realigned to the left in exome data. Now correctly updates the start position of the moving window. - Renamed AlignmentEntry.splicedAlignmentLink to AlignmentEntry.splicedForwardAlignmentLink and added AlignmentEntry.splicedBackwardAlignmentLink so splice links can be both bidirectional and more than two segments long. This change is included in the C/C++ APIs and makes it possible for GSNAP to write splice information to Goby alignment files. - FDR mode now supports reporting the top n hits irrespective of corrected q-value threshold (top n hits are defined by the ranking produced by ordering the hits by increasing p-value, for the last column adjusted). - Significantly reduced memory consumption when performing FDR BH adjustment on hundreds of millions of elements. - VCFWriter now writes missing value '.' in ID, ALT and FILTER fields, as required by VCF 4.1 documentation (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41) This change is required to read the files generated by Goby with the latest version of Tribble used in IGV EA. - AlignmentToTextMode will now display splice information. 
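The short mode name rule described in 1.9.7.3 can be sketched as follows. This is an illustrative Python sketch, not the Java code in AbstractCommandLineMode: the first-letter rule is an assumption inferred from the single example "compact-file-stats" -> "cfs", the function names are hypothetical, and per-class overrides of getShortModeName() are not modeled.

```python
from collections import defaultdict


def short_mode_name(mode_name):
    """Build a short name from the first letter of each hyphen-separated word.

    Assumption: this rule is inferred from "compact-file-stats" -> "cfs";
    the real default implementation may differ.
    """
    return "".join(word[0] for word in mode_name.split("-") if word)


def usable_short_names(mode_names):
    """Map each accepted short name to its mode name.

    A short name shared by two or more modes is dropped for all of them,
    mirroring the changelog statement that no short name is offered or
    accepted for colliding modes.
    """
    by_short = defaultdict(list)
    for name in mode_names:
        by_short[short_mode_name(name)].append(name)
    return {short: names[0] for short, names in by_short.items() if len(names) == 1}
```

With this sketch, two modes that abbreviate to the same letters both lose their short name, and users must type the full mode name for either of them.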
1.9.7.1 - alignment-to-counts now generates indexed base-level histogram files. Indexing makes it possible to jump quickly to a new genomic location in IGV. This is especially useful when viewing coverage for tens of tracks. - Filter out ambiguous reads from alignment-to-counts base level histogram output. Pre-1.9.7.1 behaviour can be obtained by setting the argument --filter-ambiguous-reads to false. - alignment-to-counts: Also tried a new way to create base-level histograms from sorted alignment files. This turns out to be about 3 times slower than the current approach. We still keep the new approach because it should scale to any size alignment. Mode alignment-to-counts will use the new approach if an alignment is sorted and has more than 50 million aligned reads. - Filter out ambiguous reads from alignment-to-annotation-counts by default. Pre-1.9.7.1 behaviour can be obtained by setting the argument --filter-ambiguous-reads to false. - Add ability to switch off the recording of sampleIndex. This is useful when concat is just used to put pieces of a large alignment back together after splitting reads for parallel processing. - Do not print indices at the end of upgrade. This caused upgrade to fail on some alignments with an exception. - Extended IterateAlignments to create alignment reader with a configurable AlignmentReaderFactory. - Set the default normalization method for alignment-to-annotation-count to bullard normalization only. - Fix a bug in VCFParser that affected parsing tab delimited files. Some files would be parsed with a tab in the value of the last column, separating the values of the last two actual columns. 1.9.7 - Now using protobuf 2.4.1. Please upgrade your local version of protobuf if you are recompiling from sources. - AlignmentWriter now correctly records Goby version in header upon close(). This fixes a problem when alignments read from read-only files would fail upon trying a new upgrade. 
- Optimized the performance of VCFParser on files with large number of columns. The VCF format seems designed without performance in mind, so it is hard to come up with a reasonably fast implementation. The current implementation of the Goby VCF parser can only process about 8,000 lines of compressed VCF per second on a desktop machine. - AlignmentEntry schema change: a new field sample_index holds the index of the alignment from which the entry was read. This is useful when concatenating over multiple alignments and realigning reads that span indels, to reliably track the alignment origin of each entry. The concatenation readers have been modified to set sample_index accordingly. Please note that the activeIndex field of the sorted reader is not a reliable way to identify the alignment of origin when realignment is active. Please use the new sample_index field instead. - We have added the capability to perform on the fly realignment around indels. This feature is available in mode discover-sequence-variants and in concatenate-alignments. The feature is activated with the new --processor realign_near_indels option. When the option is provided, a compressed reference genome must also be given on the command line (with the --genome option). This will trigger realignment of reads in regions where candidate indels are found by the aligner. The algorithm is very fast, in fact much faster than previously described approaches and consumes a reasonable amount of memory (function of maximum depth of coverage in the region where candidate indels are observed, but typically <2GB). Realignment correctly removes artefactual SNPs that can be introduced when an aligner fails to align the read ends properly through a read deletion. Please note that this version realigns read deletions. Realignment of read insertions has not been implemented. - Make it possible to open an alignment if the header file is present, but the entries file is missing. 
This allows reading the header only, for instance when we need to load counts and have access to targetIds. - Add mode to convert annotations to counts archive format. - Add new coverage mode to calculate coverage stats over annotation regions. When annotation regions are defined with capture regions, this mode outputs enrichment efficiency and depth of coverage for specific proportions of captured sites. The mode uses just .header and .count files and traverses count transitions. The algorithm used to iterate through count transitions is very efficient (for instance it takes about 20 seconds to estimate coverage stats for an alignment with ~20M aligned reads). Count files are produced with GobyWeb together with the alignment or with the alignment-to-counts mode. - Add CountBinningAdaptor, useful to bin counts on the fly at any resolution for display in IGV. - Added ability to record total number of bases and sites seen in count archive. - Added a new mode (file-to-attributes) to generate a sample attribute file suitable for loading in IGV. Useful when files are named with the convention attr1-attr2-attr3.counts 1.9.6.1 - Patched VCF output for compatibility with VCF specification. Specifically, we now write . in the QUAL field and write genotype as the first field in the methylation output format. Additionally, we only write a VCF line if the site can be typed in at least one of the samples. These changes make Goby VCF output compatible with the IGV 2.0 VCFTrack. - Fix a bug in merge that could trigger an ArrayIndexOutOfBoundsException with some alignments. 1.9.6 - AlignmentReaderImpl now supports full random access to an alignment. Use reposition(ref,pos) followed by skipTo(ref,pos) to obtain the first entry matching at location (ref,pos). Prior to 1.9.6, the reposition method would not reposition to a location already visited, forcing clients to close the alignment reader and reopen it (this new behaviour should improve performance in IGV). 
- The indexing logic used in versions of Goby up to 1.9.5 (inclusive) had subtle flaws. This could cause the skipTo method to behave incorrectly for some alignments. For instance, if reads matched on target N at a position larger than the length of target N+1, these reads would not be returned by skipTo. Thanks to Alec Chapman for identifying these issues. We have corrected the problem and added additional unit tests to check the behavior of the implementation in various edge cases. A consequence of this change is that the new indexing logic requires recalculating the .index data structure for alignments sorted and indexed with a version of Goby prior to 1.9.6. We provide a new mode, goby upgrade, to perform these calculations and fix such alignments. To upgrade alignments off-line, simply do: goby 3g upgrade [files]. This command will upgrade each alignment corresponding to the filenames provided. It skips those alignments produced by versions of Goby that do not require upgrading. The upgrade process creates a backup of the files that are affected: .index and .header are backed up to .index.bak and .header.bak respectively. The upgrade process is relatively fast: in our tests we upgraded a 750Mb alignment file in 2'30". - Version 1.9.6 will try to upgrade alignments on the fly to the new version of the index data structures. - Detect when FastaToCompact is running in API mode versus command line. Do NOT do System.exit in API mode and instead throw exceptions. Also, API mode doesn't run conversions in parallel but instead runs them serially for easier exception catching. - VCFParser now splits headers by tab instead of whitespace so column names that contain spaces are read correctly. 1.9.5 - Determine alignment sortedness and index state from the header and by checking that the index file exists. This makes it possible to recover alignments when the index file was deleted. In such cases, the alignment can be sorted again, which is preferable to losing the alignment data. 
- New mode simulate-reads will generate reads artificially against a reference sequence. We use this mode to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces the expected results. - Show phred scores in DisplaySequenceVariants (tab + base) - Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale - Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better, and handles quality score conversions more flexibly. The old mode is still around, called sam-to-compact-old, for comparison. The new mode has slightly different command line parameters. - Added a discover-sequence-variants mode format 'methylation' to estimate methylation rates for RRBS and Methyl-Seq alignments. - Dramatically improved TMH loading times for large alignments. - Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates the code unnecessarily and is error prone (because we had two ways to store read length in the previous versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the header to the alignment entries transparently. Use this mode from a pre 1.9.4 release if you need to migrate a 1.6- alignment to work with Goby 1.9.5+. - Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH query index was smaller than the first query index in the alignment. - Changed discover-sequence-variant mode to filter out alignment entries whose read mapped to multiple locations in the reference (as determined by the aligner argument (i.e., -n for gsnap)). - Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl. - ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory. 
The factory makes it possible to plug in alignment readers that filter entries as they are read. The default factory returns all reads. However, if NonAmbiguousAlignmentReader factory is installed, the concatenate reader returns only entries for which the read did not match other locations in the genome. Other filtering behaviour can be implemented in a sub-class of AlignmentReader (see NonAmbiguousAlignmentReader for an example) and a factory created to return instances of this class. This mechanism is used to filter out entries whose reads match several locations on the reference sequence. - Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101. The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation can also be used to parse plain TSV files, or VCF files that do not include the fixed VCF columns. It therefore supports an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group attribute on column fields. This makes it possible to indicate that fields are part of the same group. Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple column fields as a group (for instance to hide or show an entire group of fields). - FDR mode now supports VCF input files and outputs. See the option --vcf to activate processing of VCF formatted files. - Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced. - Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. 
The issue primarily dealt with insertions, deletions, and left and/or right padding. - Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct read_index and ref_position. 1.9.4 - The C API (used by BWA, GSNAP) has been updated to more accurately write sequence variations (this version fixes problems in reporting of the read index). We have created examples of how sequence variations are encoded in Goby alignment files. These examples are available at http://tinyurl.com/goby-sequence-variations - Mode concatenate-alignments now propagates names and versions of the aligners that contributed input alignments. - Mode sort now propagates the name and version of the aligner that produced the alignment. - Mode compact-file-stats now reports the name and version of the aligner that produced a Goby alignment file. - Mode discover-sequence-variants has been extended to support multiple types of outputs (see --format flag). One output format prints genotypes (--format genotypes), while another estimates the proportion of the reference allele in each sample (--format allele_frequencies). - Added a mechanism to support base filters in discover-sequence-variants. To activate these filters, you must provide the --eval option with the "filter" option. Two filters are currently active when --eval filter is used: one filters variant bases by quality score (keeping only bases with q-phred>=30) and another is a simple and efficient strategy to remove bases that do not quite agree across all the observations. Future versions will make it possible to customize the set of filters and their options. - sequence-variation-stats2 now runs in parallel up to the available number of threads when multiple alignments are given as input. - display-sequence-variations and sequence-variation-stats modes: Fix problems in the logic to calculate read-index for large insertions/deletions. 
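The quality-score base filter mentioned under 1.9.4 can be illustrated with a minimal Python sketch. It is not the Goby implementation: the function name and data layout are hypothetical, and it assumes that bases matching the reference pass through untouched while variant bases are kept only when their Phred score is at least 30 (the threshold quoted above).

```python
MIN_PHRED = 30  # threshold quoted in the changelog entry for --eval filter


def filter_variant_bases(observations, reference_base, min_phred=MIN_PHRED):
    """Filter the bases observed at one reference position.

    observations: iterable of (base, phred_score) tuples.
    Assumption: reference-matching bases are always kept; a variant base
    is kept only if its Phred score is >= min_phred.
    """
    return [
        (base, phred)
        for base, phred in observations
        if base == reference_base or phred >= min_phred
    ]
```

The second filter described in the entry (removing bases that do not agree across observations) is not specified in enough detail here to sketch.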
1.9.3 - This release has a C API compatible with our development version of GSNAP. A version of GSNAP released after 2011-03-11 should compile with Goby 1.9.3. - Add new statistics for discover-sequence-variants. Notably, we now record the log odds ratio, the estimated standard error of the log odds ratio, as well as a Z-score for the log odds. Standard error and Z-score are only estimated if more than 10 counts exist in each cell of the contingency table. Also added the proportion of reference allele (refCount / (refCount+varCount)). - Fix reformat-compact-reads bug where quality scores were longer by 1 than the sequence. - Reduce the memory needed by compact-file-stats to determine the number of reads in a compact reads file. - Changed how the number of reads in an alignment file is determined by compact-file-stats. We now report the number stored in the alignment header. - Change how log2 fold change was estimated. We used to estimate as ((log2_rpkm_group_a+1)/ (log2_rpkm_group_b+1)). This can cause problems when log2 rpkm are negative in one group and positive in the other. We now add 1 to counts before calculating RPKMs and taking the log. Similar changes were done to the fold-change. RPKM columns now return RPKM of (count+1). - Mode reformat-compact-reads now takes an optional -f argument to filter reads. This option can be used to remove redundant reads from a compact-reads file (see tally-reads mode to produce the read filter). It is no longer necessary to do round-trips to fastq to remove redundant reads. 1.9.2 - Fixed a major bug in discover-sequence-variants that sometimes could cause confusion in the group of origin of a variation. This bug could affect between group p-values. A Junit test now checks for the error condition and is part of regression testing. - sam-to-extract mode: append ".compact-reads" to output filename when the extension is missing. - Added a mode to display aligned reads for a region of the reference sequences. 
The reads are written in fasta format, suitable for viewing with a sequence alignment viewer such as JalView, CINEMA, etc. The mode is called alignment-to-pileup. - ConcatenateAlignmentReader would consume excessive amounts of memory when several large alignments (e.g., with >100 million reads) were concatenated. The reader was trying to allocate very large queryLength arrays, even though each underlying reader indicated that its entries carried the queryLength. The fix consists in detecting that all the concatenated readers support queryLength in entries, and not allocating these arrays at all. This is a major bug fix that makes it possible to run more instances of goby modes on the same server (i.e., differential expression and sequence variant discovery modes have significantly improved memory usage). - Mode sam-extract-reads now supports an optional --quality-encoding argument. Default is BAM encoding. - QualityEncoding now supports BAM encoding (no offset or adjustment, the value of the character in ascii is the Phred score). - Fixed sam-extract-reads. Was not extracting sequences from BAM files. - compact-to-fasta mode: now supports reading an arbitrary slice of input. - sam-to-compact mode: draft support for importing SAM files produced by BSMAP. - Fixed a bug that prevented running sam-to-compact mode from command line. An assertion prevented the code from running from the command line. Clarified the text of the assertion error and read the required parameter from the command line argument so that the mode will run again on SAM files generated outside of Goby. - reformat-compact-reads must trim quality scores in the same way that it trims the sequence. Quality scores were not trimmed in previous versions. This is now fixed. - reformat-compact-reads now correctly processes sequence pairs. Sequence pairs and quality scores can now be trimmed in the same way as the primary sequence. 
- Expose sampleFraction via API and command line for read-quality-stats mode - Make fasta-to-compact mode more callable via API - reformat-compact-reads during 'mutate' will no longer complain when there is no sequence-pair that it cannot mutate (mutation will not be attempted nor complained about if sequence.length is zero). 1.9.1 - fasta-to-compact mode: fix bug that prevented checking that quality encodings are in the allowed range. Quality scores must now be converted within the correct score range before the compact-reads file can be written successfully. - Parallelize the estimation of statistics. This can speed up mode alignment-to-annotation-counts. - Introduced a field spliced_alignment_link and spliced_flags in AlignmentEntry to represent relation between parts of reads that span exon-exon junctions. - Introduced insert_size in Alignment entry to represent the size of the insert used when making the sequence library. - Introduced meta-data in compact-reads files. Meta-data provide a way to document how the sample was obtained. Suggested information to be recorded includes when the library was sequenced (useful to detect batch-effect, as suggested by a participant to the SEQC meeting at the NIH Bethesda campus), as well as sequencing instrument. Modes fasta-to-compact, compact-file-stats and reformat-compact-reads have been updated to define, transfer or display meta-data when appropriate. - Mode compact-alignment-stats now prints statistics about paired-end reads. - Removed spurious SAM header when writing alignments in plain text format. 1.9 - New fdr mode provides a tool to combine tab delimited files where some columns contain P-values and adjust selected P-values for multiple testing with the Benjamini Hochberg method. The tool is efficient in that it only keeps P-values that need to be adjusted in memory, but otherwise keeps other columns on disk. This strategy is expected to scale to hundreds of millions of lines of information. 
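For readers unfamiliar with the Benjamini Hochberg method named in the fdr mode entry, here is a minimal Python sketch of the standard step-up adjustment. It is not taken from Goby: the function name is hypothetical, and it holds every P-value in a list rather than reproducing the on-disk column strategy described above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order.

    Each q-value is the minimum of p * n / rank over all ranks at or
    above the rank of that P-value (ranks are 1-based, ascending by p).
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0  # q-values never exceed 1
    # Walk from the largest P-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, calling `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` adjusts four P-values at once; the smallest raw P-value is multiplied by 4 and the largest is left unchanged.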
- Add a way to open only a slice of an indexed alignment file by position. This feature makes it possible to retrieve all alignment entries that start between specific position boundaries. See new constructor in AlignmentReader and ConcatSortedAlignmentReader. - The mode discover-sequence-variant has been updated to take advantage of the alignment position slicing feature introduced in Goby 1.9. See the new arguments --start-position and --end-position. - Fix a bug in skipTo that caused some alignment entries to fail to be returned (skipTo previously ignored entries that occurred in the chunk just before where the index points). This behaviour is incorrect because the chunk just before where the index points may contain entries with positions equal to the skipTo requested position. The index contract is to return the chunk that starts with an entry with the requested location. Because chunks contain multiple entries with increasing positions, the chunk immediately before the indexed chunk must be scanned and filtered to remove entries with positions before the skipTo requested position. A new test was written to check for this issue (TestSkipTo.testFewSkips4). - Provide Building/Installation instructions for the Goby C++/C API. - Implemented a fast concatenation operation for read files. The new -q flag in ConcatenateCompactReadsMode activates the fast concatenation. Chunks of compressed data are appended without requiring decompression and compression of the entries. This results in much faster concatenations that are bounded only by available IO. - Add mapping_quality field to AlignmentEntry protobuf schema. - Add aligner name and version in AlignmentHeader protobuf schema. - Added C/C++ api methods to set aligner name and version, and alignment entry mapping quality. - Updated the C API to be more generic, less oriented toward any one particular 3rd party tool. The read-API is now more generic, the write-API hasn't changed. 
The C API files, including the .h header files, have been renamed. - In C_Alignments.c/.h & C_CompactHelpers.h added CSamHelper and samHelper_* methods to assist with conversion of BWA to support CompactAlignments as the data stored in BWA just prior to writing alignments is effectively already in SAM format. These methods make it possible to reconstruct the aligned query and reference so data can be written in compact alignment. - Goby C/C++ API now requires the pcre (regex) >=8.10 library. See http://www.pcre.org/ - Compact alignments now support paired-end alignments in Java / C++ / C APIs. - In alignment-to-text mode, output support in PLAIN and SAM for Paired End alignments - In the alignment .stats file, renamed the stat "number.aligned.reads" to the more accurate name of "number.alignment.entries" for both the Java API and the C++ api. 1.8 - C API introduced to support native Goby support in GSNAP. - We now distribute a subset of Goby as the Goby IO API. This subset is packaged in the goby-io.jar file and released under the LGPL3 license. This was done to make it possible to include Goby format input output code directly into other software licensed under the LGPL3. - Fixed a bug that prevented Goby opening large alignment files (>3Gb). - Fixed a bug in AlignmentIterator triggered when reading alignment files with targetIndices starting at numbers larger than zero. - Removed dependency on colt (because it is not a pure LGPL license by adding restriction in military applications) - SGE helper scripts bz2compact.sh and keep-unique-reads.sh help process hundreds of lanes in parallel on an SGE grid. bz2compact extracts fastq files compressed with BZip2 and converts them to compact-reads format. keep-unique-reads.sh determines the set of reads that are unique in each input .compact-reads and writes this information to a .uniqset-keep.filter - Mode concatenate-compact-reads now supports read index filters. 
This makes it possible to concatenate and keep only reads that are unique within each file. - Draft helper to iterate through individual reference positions of a sorted set of alignments (see IterateSortedAlignments). - Alternative implementation of sequence-variation-stats mode (called sequence-variation-stats2) that determines the number of reference bases matched at a given read index. This info is needed to call sequence variants, but slows down the stats. The initial implementation is preserved for compatibility. - New mode discover-sequence-variants will either (i) identify sequence variants within a group of samples or (ii) identify variants whose frequency is significantly enriched in one of two groups. This mode requires sorted/indexed alignments as input. - SamToCompact mode now populates the read quality scores for sequence variations (toQuality field). - Update picard/samtools to version 1.25. - In the mode "alignment-to-annotation-counts" the "--eval" option supports a new value "counts" which will output a format specifically designed for use with R's DESeq and notably for the R script geneDESeqAnalysis.R which is used with GobyWeb. - Fix bug in extract sequence variations for SAM format, where matches on the reverse strand got a read-index larger than one from the correct value. - By default, don't use "counts" in DiffExp as it is a specialized output for preparing for DESeq. - API interface for ReadsToWeightsMode. - LastToCompactMode wasn't writing target lengths. Fixed. - Read TMH in Python using Gzip. - Fixed Python utilities so -o actually writes to a file. - Added transcript-align.sh script to assist with aligning via transcripts. - In MessageChunksWriter, flush logic should occur on a COMPLETELY empty file, but otherwise it should only occur if entries have been added since the last flush(). In both C++ and Java. 
- DiffAlignmentMode can better compare differences when alignments were done by two different aligners and the Target Indexes are the same in label but not the same TargetIndex, by building a master TargetIndex and translation maps for the two different alignments. Targets are now shown by label name instead of TargetIndex. - CompactFileStats --verbose on a compact alignment shows the targetIndex -> targetIdentifier map and also displays the targetLength for that targetIndex. 1.7 - Extended fasta-to-compact and compact-to-fasta to handle paired end runs. See new command line arguments --paired-end and pair-indicator arguments in fasta-to-compact and --pair-output argument in compact-to-fasta. - Draft support for paired sequence runs. The compact file format is extended to store sequence, sequence length and quality scores for the paired run. This extension makes it possible to store both paired end runs in a single compact file. This should help keep the data together. - Implemented translation to and from Solexa quality score encoding in fasta-to-compact and compact-to-fasta. Thanks to Cock PJA et al NAR 2010 for the clear description of the Solexa base quality scores. - The sort mode now supports reading only a slice of an input alignment (see options --start-position and --end-position). - Refactored CompactAlignmentToAnnotationCountsMode to use IterateAlignments (provides large speed ups when working with sorted/indexed alignments and selecting a subset of reference sequences for DE). - IterateAlignments now takes advantage of the skipTo method when the alignment is sorted and indexed. This provides large performance improvements when one needs to access data for only a few reference sequences in an alignment. All the modes that use IterateAlignments benefit, including display-sequence-variations, and sequence-variation-stats. - Index alignments that are sorted upon writing. 
The skipTo method leverages the index to provide fast semi-random access to entries by genomic location. This feature is used by the IGV Goby plugin, which requires Goby 1.7+. - Concatenate alignment now produces sorted alignments if all the input alignments are sorted. - Added a mode to sort alignment by reference sequence and then by position on the reference sequence. - Support to estimate read weights described in Hansen KD et al NAR 2010. See http://campagnelab.org/software/goby/tutorials/estimate-heptamer-weights/ In contrast to the initial publication, Goby supports using the weights to reweight annotation counts and transcript counts. - Support to estimate GC content weights for reads and to reweight raw counts to remove the dependence of counts on GC read content. - Preliminary support for barcoded reads (barcodes in the sequence), see new mode decode-barcodes (and tutorial online at http://campagnelab.org/software/goby/tutorials/handling-barcoded-reads/). - alignment-to-*-counts: New --eval argument makes it possible to specify which statistics to evaluate when comparing samples. - alignment-to-*-counts: New eval option 'samples' will write a column per sample for RPKM, log2(RPKM) and raw counts. RPKM and log2(RPKM) are written once per sample and global normalization method. - Reduce memory requirements when concatenating many alignments. A change introduced in 1.6 caused more memory than needed to be allocated for each split of an alignment (as much as the number of reads in the file that was split). Each split now uses only as much memory as needed to keep query lengths for the split. - Dramatically improved performance for differential expression tests with millions of differentially expressed elements (e.g., exon+gene+other). The code previously incorrectly grew internal arrays from zero to the number of new DE elements described in the annotation file. 
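The index-based skipTo behaviour described in these notes (an index that points to the first chunk starting at or after the requested location, plus a filtered scan of the chunk just before it) can be illustrated with a small Python sketch. This is not Goby's implementation: the names build_chunks and skip_to, the in-memory chunk lists and the (target, position) tuples are all hypothetical and only stand in for compressed chunks of alignment entries.

```python
from bisect import bisect_left


def build_chunks(entries, chunk_size):
    """Split sorted (target, position) entries into chunks.

    The index holds the first entry of each chunk (hypothetical layout,
    chosen only for this illustration).
    """
    chunks = [entries[i:i + chunk_size] for i in range(0, len(entries), chunk_size)]
    index = [chunk[0] for chunk in chunks]
    return chunks, index


def skip_to(chunks, index, target, position):
    """Return every entry at or after (target, position), in sorted order."""
    wanted = (target, position)
    # First chunk whose first entry is at or after the requested location.
    indexed = bisect_left(index, wanted)
    # The chunk just before it may end with entries at the requested
    # location, so start one chunk earlier and filter.
    start = max(indexed - 1, 0)
    result = []
    for chunk in chunks[start:]:
        result.extend(entry for entry in chunk if entry >= wanted)
    return result


entries = [(0, 1), (0, 5), (0, 5), (0, 5), (0, 9), (1, 2)]
chunks, index = build_chunks(entries, chunk_size=2)
found = skip_to(chunks, index, target=0, position=5)
```

With a chunk size of 2, the first chunk ends with an entry at position 5. Starting the scan at the indexed chunk would miss that entry; starting one chunk earlier and filtering returns all three entries at position 5.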
Changes that impact the compact alignment format: - The compact file format is extended to store sequence, sequence length and quality scores for the paired run. This extension makes it possible to store both paired end runs in a single compact file. This should help keep the data together. - Moved query lengths from header to alignment entries. This scales much better when processing large alignment files (generated from more than a few hundred million reads). - The optional 'sorted' attribute in header indicates if an alignment has been sorted. 1.6 - First draft of the Goby Python API and demonstration tools (see directory python). - Fix bug where compact file stats mode reported that a compact alignment had query identifiers but actually did not. - Added within-group-variability mode. This mode estimates Fisher P-values between pairs of samples taken from a group of homogeneous samples. Summary statistics such as average p-value, or minimum p-value are reported for each gene in each pair considered. - Update JRI.jar to version 0.8-4 which now works properly with 64-bit Windows. - Update commons-lang to version 2.5. - Optimized DE type storage. - Fixed a race condition in CompactAlignmentToAnnotationCountsMode.java when running in parallel by moving .reserve() out of the for loop. - Renamed DifferentialExpression.ElementTypes enum to ElementType. - Fixed a bug in the DifferentialExpressionCalculator which reset ElementType for a value from the actual value to OTHER (it occurred in CompactAlignmentToAnnotationCountsMode). Now once ElementTypes is set for a label it cannot be changed. - CompactFileStatsMode now supports an optional -o to write the output to a file. If not specified the output will be written to stdout. - Reformat reads now preserves read indices from the input file. 
This is necessary when using concat alignment with --adjust-query-indices false. 1.5 - Added a mode to calculate counts and perform differential expression analysis for transcript runs (alignment-to-transcript-counts). Transcript runs are performed against a cDNA library. They find matches through exon-exon junctions represented in the input cDNA library. They are a faster alternative to mapping the genome and exon-exon boundaries separately. The disadvantage is that these searches will only map to transcripts represented in the input library. - Changes to fasta-to-compact mode: - Add parallel processing in fasta-to-compact mode. Use the --parallel flag to activate. - Will now only regenerate compact-reads that do not exist, or are older than the input file. - Added a mode to write a read set to text format (set-to-text). The output will show the multiplicity of each query index. ReadSets can be efficiently created with tally-reads as before. - Changes to CompactAlignmentToAnnotationCountsMode - Added new option --write-annotation-counts boolean, defaults to true. If set to false the annotation counts intermediate files will not be written. - Lines where "average count group *" values are ALL NaN or <= 0 will not be written. This makes it so lines that don't add anything to the output are just omitted. - Added new option --omit-non-informative-columns, defaults to false. If set to true, columns in which all of the data is non-informative (values are ALL NaN or <= 0) will be omitted. - Support for alternative global normalization methods. We currently provide an implementation of the upper quartile normalization method by Bullard et al (BUQ) and the normalization method provided in Goby 1.4 (CAC, normalize by the number of alignment records in a sample). See the --normalization-methods argument. 
New normalization methods can be used with Goby by creating an implementation of the NormalizationMethod interface, and adding a jar on the classpath that defines a ServiceProvider (see build.xml goby-jar target for an example of how this is done). When several normalization methods are given as an argument to --normalization-methods Goby will produce derived statistics for each normalization method and append them as new columns in the summary stats output. This makes it easy to compare alternative normalization methods on the same dataset. - Added support for sequence variations: - Changed the compact alignment format to support recording sequence variations. - The new mode display-sequence-variations provides text output of sequence variations in several formats. - The new mode sequence-variation-stats will print statistics about sequence variations found in a set of alignments. - Added support for quality scores: - Changed fasta-to-compact and compact-to-fasta to read and write with the Sanger or Illumina quality encoding. - Modified aligners to indicate which format they require (bwa needs fastq format, lastag fasta format, lastal fastq format). This will need extensive testing as some of these changes can affect gobyweb. We use the FASTQ-SANGER encoding to communicate with lastal. We don't yet support the Solexa quality score encoding (it is a bit obsolete anyway). Please note that the output format in compact-to-fasta now defaults to Fasta format. This format has no quality scores, and consequently, we now never write quality scores when Fasta is requested. The aligners that need quality scores must request FASTQ format explicitly. See also: http://en.wikipedia.org/wiki/FASTQ_format http://maq.sourceforge.net/fastq.shtml http://last.cbrc.jp/last/doc/last-manual.txt (look for FASTQ-SANGER) - Changes to the Compact format: - Store target/reference sequence lengths in the alignment header. 
This information is helpful when calculating statistics such as RPKMs (transcript-level searches). - Store constant query lengths as one integer. Goby 1.4.1 stored one length for each read. This can become very memory consuming when the number of reads is very large. This change saves memory and storage. 1.4.1 - Added a mode to write a read set to text format (set-to-text). The output will show the multiplicity of each query index. ReadSets can be efficiently created with tally-reads as before. 1.4 - Last aligner (http://last.cbrc.jp/) is now supported "out of the box". Tested against version last-96. Support for the enhanced version "lastag" still exists. - Alignment-to-annotation-counts mode now computes a p-value using R (if available on the host) - Update to protobuf 2.3.0 (http://code.google.com/p/protobuf/) - Default extension for files written in Wiggle Track Format is now ".wig" for easier integration with the Integrative Genomics Viewer (http://www.broadinstitute.org/igv/). Similarly, the default extension for BedGraph Track Format files is now ".bed". 1.3 - New "counts-to-bedgraph" mode which is similar to "counts-to-wiggle" but writes the data in "bedgraph" format, which is another format the Genome Browser accepts. - New mode "version" to write the jar's version number to stdout - counts-to-wiggle mode: - Write at most one entry per resolution-sized window of data (averaging the data in that window) - Don't write data past the end of the size of the chromosome (which is possible with resolution > 1) - compact-alignment-to-annotation-counts mode: - Fixed problem with BH FDR adjustment caused by NaN p-values. - ChiSquare test p-values are now correctly reported. - Adjusted P-values (Bonferroni and BH) are set to 1.0 if they would be larger than 1. - Added magnitude of fold change to group comparison tsv output. 1.2 - compact-alignment-to-annotation-counts mode: - Added chi square test statistic and associated FDR adjusted stat. 
Chi-square statistics support multi-group comparisons. - Added the --parallel option to speed up computations on multiple core machines. 1.1 - compact-alignment-to-annotation-counts mode: - Make it possible to process multiple alignment files in one run of the mode. - Added support for group comparisons. Group statistics are now computed and written to a summary file (see --comparison --stats and --groups options). The following statistics have been implemented: T-Test and fold-change across RPKMs in the comparison groups, Benjamini-Hochberg FDR adjustment for t-test P-value and Bonferroni correction for t-test P-value. Average RPKM in each group. - Fix a bug where data matching chromosome "chr1" was excluded from wiggle tracks created from Goby count data. (Mantis issue #1349) 1.0 - First public release.
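
The 1.1 and 1.3 notes above mention Bonferroni and Benjamini-Hochberg (BH) adjustment of P-values, capping adjusted values at 1.0, and a fix for NaN p-values. The Python sketch below illustrates the standard form of both adjustments. It is not Goby's code: the function names are hypothetical, and the choice to leave NaN entries as NaN and exclude them from the number of tests is an assumption made for this example.

```python
import math


def bonferroni(p_values):
    """Multiply each valid p-value by the number of valid tests, capped at 1.0."""
    # Assumption for this sketch: NaN p-values do not count as tests.
    m = sum(1 for p in p_values if not math.isnan(p))
    return [p if math.isnan(p) else min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Standard BH step-up adjustment; NaN inputs stay NaN."""
    valid = sorted((p, i) for i, p in enumerate(p_values) if not math.isnan(p))
    m = len(valid)
    adjusted = [math.nan] * len(p_values)
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values stay monotone and never exceed 1.0.
    running_min = 1.0
    for rank in range(m, 0, -1):
        p, original_position = valid[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[original_position] = running_min
    return adjusted


p_values = [0.01, 0.04, math.nan, 0.03, 0.5]
bonferroni_adjusted = bonferroni(p_values)
bh_adjusted = benjamini_hochberg(p_values)
```

In this example the Bonferroni value for 0.5 would be 2.0 without the cap and is reported as 1.0, and the NaN entry passes through both functions unchanged instead of disturbing the ranking.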