Fast Lossless Compression via Cascading Bloom Filters

Title: Fast Lossless Compression via Cascading Bloom Filters
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Rozov, R., Shamir, R., & Halperin, E.
Volume: 15
Other Numbers: 3799
Abstract

Background
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
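To make the reference-based idea concrete, here is a minimal sketch of encoding a read as an alignment position plus its differences from the reference. It assumes the position is already known (in practice an aligner would supply it) and that the read differs from the reference by substitutions only. The function names `encode_read` and `decode_read` are hypothetical and not taken from any particular tool.

```python
def encode_read(read, reference, position):
    """Encode a read as (position, length, substitutions).

    Assumption: `position` comes from a prior alignment step, and the
    read differs from the reference only by substitutions (no indels).
    """
    segment = reference[position:position + len(read)]
    # Record only the offsets where the read disagrees with the reference.
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(segment, read))
             if ref_base != base]
    return position, len(read), diffs


def decode_read(record, reference):
    """Rebuild the read from the reference and the stored differences."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)
```

A read that matches the reference exactly costs only its position and length, which is why this scheme compresses well once the expensive alignment has been done.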

Results

We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
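The encode and decode steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes all reads have the same length, every read occurs exactly in the reference, and only the set of distinct reads needs to be recovered. The abstract does not spell out how the cascade is built, so the two-level arrangement here is a guess: a second filter holds the first filter's false positives, and a small explicit exception set holds the reads that the second filter would wrongly reject. The `BloomFilter` class, `encode`, `decode`, and all parameter values are hypothetical.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def windows(reference, read_len):
    """Yield every read-length subsequence of the reference."""
    for i in range(len(reference) - read_len + 1):
        yield reference[i:i + read_len]


def encode(reads, reference, num_bits=1024, num_hashes=3):
    """Hash reads into a filter, then cascade to absorb false positives."""
    read_set = set(reads)
    read_len = len(reads[0])  # assumption: all reads share one length
    first = BloomFilter(num_bits, num_hashes)
    for read in read_set:
        first.add(read)
    # Reference windows the first filter accepts although they are not reads.
    false_pos = {w for w in windows(reference, read_len)
                 if w in first and w not in read_set}
    second = BloomFilter(num_bits, num_hashes)
    for w in false_pos:
        second.add(w)
    # Real reads that the second filter would mistake for false positives.
    exceptions = {read for read in read_set if read in second}
    return first, second, exceptions, read_len


def decode(first, second, exceptions, read_len, reference):
    """Recover the read set by querying the filters with reference windows."""
    recovered = set()
    for w in windows(reference, read_len):
        if w in first and (w not in second or w in exceptions):
            recovered.add(w)
    return recovered
```

Decoding needs only the filters, the exception set, and the reference, so no alignment is ever computed. In this sketch the losslessness does not depend on the hash quality: a Bloom filter has no false negatives, so every false positive of the first filter is caught by the second, and every read caught by mistake is restored from the exception set.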
 

Bibliographic Notes

Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2014), Pittsburgh, Pennsylvania, BMC Bioinformatics 2014, Vol. 15, Suppl 9:S7

Abbreviated Authors

R. Rozov, R. Shamir, and E. Halperin

ICSI Publication Type

Article in conference proceedings