KungFQ: A Simple and Powerful Approach to Compress fastq Files

Elena Grassi, Federico Di Gregorio, and Ivan Molineris

E. Grassi and I. Molineris are with the Department of Genetics, Biology and Biochemistry, Molecular Biotechnology Center, via Nizza 52, Torino 10126, Italy. E-mail: grassi.e@gmail.com, ivan.molineris@unito.it. F. Di Gregorio is with DNDG srl, via Maria Vittoria 2, Torino 10123, Italy. E-mail: fog@dndg.it.

Abstract—Nowadays storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not exploit their structure in a satisfying manner. A number of reference-based compression algorithms have been developed, but they are less adequate when approaching new species without fully sequenced genomes or nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, losing some of the original information (IDs and/or qualities) but resulting in smaller files; it is also possible to define a quality cutoff under which corresponding base calls are converted to N. We achieve 2.82 to 7.77 compression ratios on various fastq files without losing information and 5.37 to 8.77 losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm performance with known tools, usually obtaining higher compression levels.

Index Terms—Biology and genetics, algorithms for data and knowledge management

1 INTRODUCTION

Deep sequencing techniques have posed new problems to bioinformaticians: up until now, experimental improvements never created amounts of data requiring hardware upgrades beyond easily achieved ones—memory and storage capabilities were rising at an appropriate pace. In September 2010, the Sequence Read Archive [1] had more than 500 billion reads, corresponding to 60 trillion base pairs, and it nearly had to close because it was not able to keep up with the pace of sequencing output. Every researcher interested in working on data downloaded from SRA or derived from individual experiments must be able to manage gigabytes of files from each single experiment; since most common approaches in bioinformatics require using data derived from many different experiments, this easily becomes a problem.

Our interest, thus, lies in developing a tool capable of compressing fastq files, the most common format used by downstream analysis software (such as alignment tools like bowtie [2]), exploiting their peculiar characteristics and creating files that can be further compressed by standard software (like gzip and lzma), which is not designed specifically for this format and thus does not exploit all compression possibilities. Ideally, this tool should be simple, fast (particularly when decompressing, because all typical downstream analyses will start from already compressed files) and not too memory-hungry.

We decided to follow a simple approach to compress read IDs, sequences (in a base space format, which is the most used—moreover the algorithm could easily be adapted to color space) and quality data from fastq files. Dealing with IDs is a problem because there is no standard format and therefore it is hard to define a fixed algorithm to compress them—we decided to create ad hoc procedures to encode two of the most frequent formats and leave the rest to the chosen downstream compression algorithm. Since sometimes one is only interested in the read sequences, we added the possibility not to encode the IDs and/or even the quality values, and in the latter case to choose whether base calls corresponding to qualities lower than a defined cutoff should be converted to N.

The mainstream approach when compressing genomic data is to annotate the difference between the data that should be compressed and a reference genome, and it yields good results [3], [4], [5], [6]. This is efficient and feasible when dealing with species that have a valid genomic reference or with experiments aimed at obtaining the whole genome of the given sample, but it is less adequate when approaching new species or nongenomic data due to the lack of a standard reference to align reads; for instance, RNA sequencing data should be aligned against the entire "theoretically possible" transcriptome, and it is not practical to account for each possible alternative transcript splicing or unknown exon.

Since our approach is completely different from those based on a reference sequence, we compared our performance against the best available software with a comparable approach: DSRC [7], which outperforms other specific (G-SQZ [8]) and generic tools (gzip and bzip2) in terms of compression ratios and therefore represents the best alternative. We did not take into consideration SOLiDzipper [9] because it uses different input files (csfasta and QV, thus color space) and its main purpose is high speed in order to use NGS data with cloud computing; compression rates are therefore considered less important and are lower than G-SQZ ones. Similarly, we did not compare ourselves with DNACompress [10] as it is only able to compress DNA sequences and is an implementation of the Lempel-Ziv algorithm, which we use as a downstream compression tool.

DSRC is a C++ software that is able to compress fastq files, decompress them, and access single reads in the compressed ones. The whole approach is based on dividing the reads into blocks and superblocks and computing statistics over each superblock in order to summarize common parts of them, for example, storing only once common data shared by all the IDs of the reads contained in a superblock. For sequences, it allows characters different from A, C, T, G, and N, handling the rare occurrences of IUPAC codes by moving their encoding into the quality streams; when a superblock contains a lot of "extra symbols" the approach changes to Huffman encoding, while other superblocks are treated with LZ77-style encoding. Qualities are divided into three categories and then encoded according to the predefined category structure, applying Huffman or run length encoding.

2 METHODS

2.1 General Approach

The compression algorithm relies on treating IDs, sequences, and base qualities as three whole independent streams. For sequences and quality data the basic idea is to use a run length encoding: when there are more than n equal characters one can annotate the number of repetitions and the repeated character. The first bit is used as a flag to distinguish bytes encoding single characters from run lengths. Quality values and sequences are considered as two long sequences—single reads are reconstructed when decoding by knowing the read length, which is the same for every read in a fastq with recent technologies; this approach should ensure similar compression ratios with different read lengths deriving from different sequencing platforms. After recognizing the ID format, the fixed part is encoded only once in a header of the encoded file while the variable portion is encoded for every read. The whole encoded result is then saved as is or further compressed by gzip or lzma according to the options used. Fig. 1 is a schematic representation of the encoding phase.

Fig. 1. The encoding process: steps of the compression algorithm on portions of a sample read.
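To make the three-stream layout concrete, the following Python fragment is a minimal sketch (the actual tool is written in C#, and the helper names here are invented for the example): it splits a fastq into ID, sequence, and quality streams and re-cuts the reads using the constant read length.

# Illustrative sketch of the three-stream layout described above (not the
# actual KungFQ C# code): IDs, sequences, and qualities are handled as
# independent streams, and reads are re-cut using the constant read length.

def split_streams(fastq_lines):
    """Split fastq records into id / sequence / quality streams."""
    ids, seqs, quals = [], [], []
    for i in range(0, len(fastq_lines), 4):
        ids.append(fastq_lines[i].rstrip("\n"))
        seqs.append(fastq_lines[i + 1].strip())
        quals.append(fastq_lines[i + 3].strip())   # the "+" line is not stored
    read_length = len(seqs[0]) if seqs else 0
    # Sequences and qualities become two long strings; read boundaries are implicit.
    return ids, "".join(seqs), "".join(quals), read_length

def rebuild_reads(seq_stream, read_length):
    """Recover individual reads by slicing the long stream every read_length bases."""
    return [seq_stream[i:i + read_length]
            for i in range(0, len(seq_stream), read_length)]

if __name__ == "__main__":
    sample = [
        "@SRX000571_SRR002322.18437692 080317_CM-KID-LIV-2-REPEAT_0003:7:330:466:87 length=36\n",
        "CCT" + "GGGGG" + "ACGT" * 7 + "\n",
        "+\n",
        "I" * 36 + "\n",
    ]
    ids, seq_stream, qual_stream, rlen = split_streams(sample)
    assert rebuild_reads(seq_stream, rlen)[0] == sample[1].strip()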

2.2 IDs Encoding

The most frequent formats for fastq read IDs that we encountered are the ones used by SRA and the ENCODE Project [11], therefore when encoding a fastq we try to detect via regular expressions whether one of those formats is used or not. In the former case, we store the constant portion of the IDs only once and encode only the variable parts of every read, while in the latter we simply pass the IDs to the chosen downstream compressor.

Example for the SRA format:

@SRX000571_SRR002322.18437692 080317_CM-KID-LIV-2-REPEAT_0003:7:330:466:87 length=36

The first part up until the dot and "080317_CM-KID-LIV-2-REPEAT_0003" are constant while the other numbers are variable (the one just after the dot seems to be a simple counter, but in a single file it skipped a value so we had to encode it for every read) and "length=36" is optional, so one just needs to store in the header the information about its presence.
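As an illustration of this detection step, the sketch below splits an SRA-style ID into a constant template, stored once in the header, and per-read variable numbers. The paper does not give the actual regular expressions, so the pattern and function names are assumptions modeled on the example above.

import re

# Hypothetical pattern for SRA-style IDs like the example above; the actual
# regular expressions used by KungFQ are not reproduced in the paper.
SRA_ID = re.compile(
    r"^@(?P<run>\w+)\.(?P<counter>\d+) "
    r"(?P<machine>[\w-]+):(?P<a>\d+):(?P<b>\d+):(?P<c>\d+):(?P<d>\d+)"
    r"(?P<length> length=\d+)?$"
)

def encode_id(read_id):
    """Return (constant_part, variable_numbers) if the SRA format is detected,
    otherwise None so the raw ID can be passed to the downstream compressor."""
    m = SRA_ID.match(read_id)
    if m is None:
        return None
    constant = (m.group("run"), m.group("machine"), m.group("length") is not None)
    variable = tuple(int(m.group(k)) for k in ("counter", "a", "b", "c", "d"))
    return constant, variable

example = "@SRX000571_SRR002322.18437692 080317_CM-KID-LIV-2-REPEAT_0003:7:330:466:87 length=36"
print(encode_id(example))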
2.3 Sequence Encoding

Sequences have only five possible characters (A, C, G, T, and N) and therefore in a single byte we can store a single bit flag and up to three base calls (we have 5^3 = 125 possible triplets of nucleotides and 2^7 = 128 usable bit combinations) or a run length for repetitions longer than four bases; the bit flag is necessary to discriminate between these two cases. We encode in a single byte repetitions of up to 19 bases, as long as we use 1 bit for the flag, 3 bits to encode which character is repeated and the 4 bits left for the length: the biggest representable integer is 15 (the shortest run lengths are of four bases, therefore we always add 4 to the encoded lengths when decoding). Repetitions longer than 19 bases are encoded in two separate bytes.

An example of the process is depicted in Fig. 1: the "0" flag at the beginning of the first byte indicates that we are encoding a triplet, "CCT," which corresponds to the bit sequence "0100010"; the run of "G" stored in the second byte uses a "1" flag followed by the length of the run (5 - 4 = 1, therefore we have "0001") and then "G" encoded as "010."

Another possible approach, which in the end was not adopted because it yielded worse results with regard to final compression ratios, was to break the "byte modulus" and use only the needed 7 bits to encode a triplet and two of the three (128 - 125 = 3) remaining combinations as flags for run lengths and end of sequence. The end of sequence flag is needed because when decoding one does not know how many reads have to be decoded, and without a flag that says to stop it is possible to add a spurious read at the end (in the last encoded sequence block it is possible that the final 1, 2, or 3 bases have to be encoded by padding with some "N" to yield a complete triplet).

Analyses of run length and triplet statistics in our sample fastq files showed that run lengths were, on average, rarer than triplets (1 run length every 26 triplets, on average, in SRA data) and that 60 triplets represented 95 percent of the total occurrences, with the remaining 65 covering the rest. Thus, it seemed sensible to try an approach that uses less space to encode the 60 most frequent triplets, using only the 6 needed bits to encode them and sparing three 6 bit codes to indicate where a rare triplet or a run length followed. Rare triplets in this schema are represented by 7 bits and there are two classes of run lengths, one for repetitions from 4 bases up to 19 and the other for longer ones (between 20 and 259 bases), which are encoded in 11 bits.

This solution was developed but was not chosen for the final tool due to its limited gain in compression ratios with respect to the simpler one explained at the beginning and its dependence on the statistical characteristics of fastq files, which could change in different experimental settings (RNAseq, DNAseq, ChIP-seq, etc.).
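The adopted byte layout can be sketched as follows (Python, for illustration only; the real tool is written in C#). The triplet-to-code table, the greedy run detection, and the handling of a trailing partial triplet are assumptions, since the paper does not fix them exactly.

# Sketch of the adopted byte layout: flag bit 0 + one of 125 triplet codes, or
# flag bit 1 + 3-bit base code + 4-bit (run length - 4) for runs of 4..19 bases.

BASES = "ACGTN"
CODE = {b: i for i, b in enumerate(BASES)}

def encode_sequence(seq):
    out = bytearray()
    i = 0
    while i < len(seq):
        run = 1                                      # length of the run starting at i
        while i + run < len(seq) and seq[i + run] == seq[i]:
            run += 1
        if run >= 4:
            run = min(run, 19)                       # one byte covers runs of 4..19
            out.append(0x80 | (CODE[seq[i]] << 4) | (run - 4))
            i += run
        else:
            triplet = (seq[i:i + 3] + "NN")[:3]      # pad a trailing partial triplet
            idx = 25 * CODE[triplet[0]] + 5 * CODE[triplet[1]] + CODE[triplet[2]]
            out.append(idx)                          # flag bit 0, 7-bit triplet code
            i += 3
    return bytes(out)

def decode_sequence(data, length):
    chars = []
    for byte in data:
        if byte & 0x80:                              # run length byte
            chars.append(BASES[(byte >> 4) & 0x7] * ((byte & 0xF) + 4))
        else:                                        # triplet byte
            chars.append(BASES[byte // 25] + BASES[(byte // 5) % 5] + BASES[byte % 5])
    return "".join(chars)[:length]                   # drop padding added at encode time

seq = "CCT" + "GGGGG" + "ACGT" * 7
assert decode_sequence(encode_sequence(seq), len(seq)) == seq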
2.4 Quality Encoding

Quality data have a broader range of characters (between 33 and 126 depending on the different ASCII encoding used by different sequencing platforms) so we only adopted the run length approach: single characters are encoded in a byte while run lengths use two bytes (the first bit is always the flag and the remaining 7 bits encode the character, possibly followed by a byte storing the number of repetitions, up to 255 in this case).
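A minimal sketch of this quality layout follows (illustrative Python, not the C# implementation; the threshold at which a run is worth the two-byte form and the exact meaning of the count byte are assumptions).

# One byte per isolated character (flag bit 0 + 7-bit ASCII code, since 33..126
# fits in 7 bits), two bytes per run (flag bit 1 + character, then a count byte).

def encode_qualities(qual):
    out = bytearray()
    i = 0
    while i < len(qual):
        run = 1
        while i + run < len(qual) and qual[i + run] == qual[i] and run < 255:
            run += 1
        code = ord(qual[i])               # quality characters are between 33 and 126
        if run > 1:
            out.append(0x80 | code)       # flag bit set: a count byte follows
            out.append(run)
        else:
            out.append(code)
        i += run
    return bytes(out)

def decode_qualities(data):
    chars, i = [], 0
    while i < len(data):
        byte = data[i]
        char = chr(byte & 0x7F)
        if byte & 0x80:
            chars.append(char * data[i + 1])
            i += 2
        else:
            chars.append(char)
            i += 1
    return "".join(chars)

q = "IIIIIIIIIIIIIIIIIIII55IIIII#"
assert decode_qualities(encode_qualities(q)) == q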
2.5 Merging and Compressing the Encoded Streams

The compressed streams are created together when reading the input fastq file and are written as fixed size blocks interleaved on the output file, always using the first byte of the written blocks as a flag that tells if we have a sequence or a quality compressed block (repetitions of the same kind of block are possible due to more efficient encoding on sequences or quality data in different parts of the fastq)—the output stream is directly compressed with gzip using SharpZipLib [12] or with lzma using the LZMA SDK [13]. When using gzip a compression block size equal to our blocks is adopted in order to reach the maximum possible gain in compression.
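The merging step can be sketched roughly as below (Python; the flag values and the use of the standard library gzip and lzma modules are assumptions made for the example, since the real tool relies on SharpZipLib and the LZMA SDK from C#).

import gzip, lzma

SEQ_BLOCK, QUAL_BLOCK = 0, 1        # flag-byte values are illustrative assumptions

def merge_blocks(blocks, use_lzma=False):
    """blocks: iterable of (kind, payload) pairs in the order they were produced,
    each payload assumed to be an already encoded block of roughly 1 Mb at most.
    Every block is written with a leading flag byte, then the whole stream is
    handed to a standard downstream compressor (gzip or lzma)."""
    out = bytearray()
    for kind, payload in blocks:
        out.append(kind)
        out.extend(payload)
    return lzma.compress(bytes(out)) if use_lzma else gzip.compress(bytes(out))

# Example: two sequence blocks in a row followed by one quality block.
demo = [(SEQ_BLOCK, b"\x21\xa1"), (SEQ_BLOCK, b"\x05"), (QUAL_BLOCK, b"\xc9\x14")]
compressed = merge_blocks(demo, use_lzma=True)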
2.6 Lossy Compression

Lossy compression is performed by simply not encoding the ID or quality data blocks, and by modifying the stored sequence if a cutoff quality is given. To help with the choice of the cutoff value we added an option which creates a figure with a histogram depicting the occurrences of the different quality values, together with a text file with raw counts; see Fig. 2 for an example.

Fig. 2. Histogram representing the quality data of SRR002321 from SRX000571.
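The cutoff option behaves like the following sketch (illustrative Python; the strict "lower than" comparison mirrors the description above, and 42 is the cutoff used in the tests reported later).

from collections import Counter

def mask_low_quality(seq, qual, cutoff=42):
    """Replace base calls whose quality character is below the ASCII cutoff
    with N, as done when a quality cutoff is requested (42 is '*')."""
    return "".join("N" if ord(q) < cutoff else b for b, q in zip(seq, qual))

def quality_histogram(qual_stream):
    """Raw counts of quality characters, the data behind the histogram option."""
    return Counter(qual_stream)

assert mask_low_quality("ACGT", "I#I#") == "ANGN"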

3 RESULTS AND DISCUSSION

3.1 Comparisons with the Gold Standard

We compared the compression ratios of our tool (with the encoded stream compressed with lzma and without any lossy options) with those of DSRC—test data are from RNA sequencing experiments obtained from SRA, the ENCODE Project and some genomic data from the DDBJ Sequence Read Archive (38 files); three of these files were also used for comparison in [7] and [8]. Our compression ratios are always larger than the DSRC ones for ENCODE data, while only sometimes smaller for SRA data and genomic data; in these latter cases the origin of our less effective compression could be the additional number, part of the IDs, that we had to store for every read because in a single file it skipped a value, while in all other files it seemed a simple counter.

The average gain in encoded file size on these data is 66 Mb—our compressed files altogether use 2.5 Gb less than those compressed by DSRC and 170 Gb less than the original fastq files (total size: 228 Gb); Tables 1 and 3 show some summarized data with comparisons between specific (DSRC, G-SQZ, and BAM) and generic (gzip, bzip2, 7zip, and lzma) compression formats. When fastq files had a repetition of the ID on the third line of every entry (which is redundant, as in the fastq format this line just has to be composed of a single "+"), it was removed prior to evaluating the original fastq sizes to get fair compression ratios. On some files (SRA data and one of the genomic ones) G-SQZ failed, thus we were able to evaluate its ratios and running times only on a subset of the whole test set.

TABLE 1. Compression Ratios Comparisons between KungFq, DSRC, G-SQZ, and BAM

TABLE 2. Running Times (Encode and Decode, in Seconds) Comparisons between KungFq, DSRC, G-SQZ, and BAM

Since long range regularities of the data are more easily identified when working on separated streams with similar data (IDs, sequences, and quality values), to further investigate the compression gain derived from our encoding prior to lzma compression we listed the compression ratios obtained with lzma compression on the three separated streams ("split ratios" in Table 3): the mean ratio obtained with KungFq is still larger.

TABLE 3. Compression Ratios Comparisons between gzip, bzip2, 7zip, and lzma

As expected, results are notably better when performing lossy compression: we suppose that many end users will at least ignore the IDs, which yields a further space saving of about 12 percent on encoded files and brings the total space saving to 83 percent with respect to the original fastq size. The gain is even higher when quality data are not stored (for the tests with a cutoff we chose 42, the ASCII character '*', which is a fairly low quality for the Sanger encoding used in SRA data—ratios do not change whether one uses a cutoff or not when losing quality values): see Fig. 3 and Tables 5, 6, and 7. We compared the results of our tool with generic compression tools (compressing only the relevant portions of the fastq files: sequences and quality values, IDs and sequences, or only sequences) to point out that its performances are better even with the lossy options. All these lossy ratios are evaluated only on a subset of the fastq files, namely the SRA data.

Fig. 3. Comparison of compression ratios between lossy and non lossy compression.
TABLE 4. Running Times (Encode and Decode, in Seconds) Comparisons between gzip, bzip2, 7zip, and lzma

TABLE 5. Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing IDs

TABLE 6. Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing Quality Values

TABLE 7. Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing IDs and Quality Values

Our prototype in C# is somewhat slower than DSRC (almost 10 times when encoding and 3 when decoding, with DSRC times in the order of 10 minutes; see Table 2 for a list of encoding and decoding times for all the tested specific tools and Table 4 for the generic ones), which is written in C++, but performance gains are possible by rewriting it in C or by algorithmic improvements like, for example, encoding the ID, sequence, and quality data blocks in parallel. When using gzip instead of lzma encoding times are better but the compression ratio is lower, thus depending on which factor is more important one could choose the preferred option. Encoding and decoding times are listed as reported by the unix tool time ("elapsed" data were used). Supplementary material with encoding and decoding times for the lossy options is provided separately and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2012.123.

4 APPROACHES TO GO FURTHER

Lzma is already a really good compression algorithm, but our encoding managed to beat its performance, and using it as a downstream compressor allowed us to reach compression levels on average higher than DSRC. We tried and managed to achieve smaller encodings with two approaches that do not write whole bytes, but in this way lzma (and gzip) compression became considerably less efficient. To overcome this problem, we developed a modified version of the simpler of these alternatives: we "fill" the 7 bits used to encode triplets up to a whole byte with the 7 bits obtained from the eighth triplet in a row (to perform this test we used only the short run length class, which needs 7 bits to be encoded, so we always had 7 bit blocks); but even in this way the performances of lzma and gzip lowered too much, supposedly because this "padding" did not create a sufficiently uniform distribution of bytes. In the end the first approach yields the best compression ratios.

We also tried to use a recent compression software, lrzip [16], to compress our encoded output. This tool scans the file to be compressed as far as it can (depending on the physical memory available) and reorganizes it, in order to exploit long range repetitions of patterns, before compressing it with other standard algorithms, including lzma. It gave results similar to those of lzma (mean ratio 3.99), thus we performed the main comparison using lzma, also considering that lrzip is not available as a library, that its compression ratios depend on the amount of RAM available, and that its approach of loading into memory as much data as it can to prescan the file to be compressed is in opposition to our algorithm, which proceeds with small (1 Mb) encoded blocks. Nonetheless, if one wants to use lrzip there is an option which excludes the compression phase with lzma or gzip and just creates the encoded file, which can then be compressed with any other tool.

Our tool supports only fastq files with equal read lengths, like the other specific tools that we compare to (DSRC [7] and G-SQZ [8]). We suppose that the most frequent use of compression tools would precede eventual trimming of adapter sequences and low-quality base calls, as long as one would like to keep the original data. Nonetheless our approach could be modified to support variable read lengths: since we are using only 125 of the 128 possible 7 bit codes for our triplet encoding, we could use one of the spare ones to signal the end of a read, even if this would lead to a less efficient compression rate.

5 CONCLUSIONS

In this work, we propose a simple and linear algorithm that reaches good compression levels on base space fastq files—the tool should be useful for those who need to reduce the space required
to store large quantities of next-generation sequencing data. We think
that the definitive approach in fastq compression still does not
exist but in the present work we made a proof of concept that the
pivotal step of our algorithm (run length and triplets encoding)
may be an important piece of the puzzle.
The source code and the executables are freely available from
Source Forge (http://quicktsaf.sourceforge.net/).

ACKNOWLEDGMENTS
The authors would like to thank Paolo Provero for his useful
suggestions. This work was supported by the Italian Ministry of
University and Research FIRB piattaforme/reti rbpr05zk2z_006.

REFERENCES
[1] R. Leinonen, H. Sugawara, M. Shumway and Int’l Nucleotide Sequence
Database Collaboration “The Sequence Read Archive.” Nucleic Acids
Research, vol. 39, pp. D19-D21, Jan. 2011.
[2] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, “Ultrafast and Memory-
Efficient Alignment of Short DNA Sequences to the Human Genome,”
Genome Biology, vol. 10, no. 3, p. R25, 2009.
[3] C. Kozanitis, C. Saunders, S. Kruglyak, V. Bafna, and G. Varghese,
“Compressing Genomic Sequence Fragments Using Slimgene,” Research in
Computational Molecular Biology, vol. 6044, pp. 310-324, 2010.
[4] S. Christley, Y. Lu, C. Li, and X. Xie, “Human Genomes as Email
Attachments,” Bioinformatics, vol. 25, no. 2, pp. 274-275, Jan. 2009.
[5] K. Daily, P. Rigor, S. Christley, X. Xie, and P. Baldi, “Data Structures and
Compression Algorithms for High-Throughput Sequencing Technologies,”
BMC Bioinformatics, vol. 11, no. 1, article 514, 2010.
[6] M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, and E. Birney, “Efficient
Storage of High Throughput DNA Sequencing Data Using Reference-Based
Compression,” Genome Research, vol. 21, no. 5, pp. 734-740, May 2011.
[7] S. Deorowicz and S. Grabowski, “Compression of DNA Sequence Reads in
FASTQ Format,” Bioinformatics, vol. 27, no. 6, pp. 860-862, Mar. 2011.
[8] W. Tembe, J. Lowey, and E. Suh, “G-SQZ: Compact Encoding of Genomic
Sequence and Quality Data,” Bioinformatics, vol. 26, pp. 2192-2194, July
2010.
[9] Y.J. Jeon, S.H. Park, S.M. Ahn, and H.J. Hwang, “Solidzipper: A High Speed
Encoding Method for the Next-Generation Sequencing Data,” Evol Bioin-
form Online, vol. 7, pp. 1-6, 2011.
[10] X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: Fast and Effective
DNA Sequence Compression,” Bioinformatics, vol. 18, pp. 1696-1698, 2002.
[11] S.E. Celniker, L.A.L. Dillon, M.B. Gerstein, K.C. Gunsalus, S. Henikoff, G.H.
Karpen, M. Kellis, E.C. Lai, J.D. Lieb, D.M. MacAlpine, G. Micklem, F.
Piano, M. Snyder, L. Stein, K.P. White, and R.H. Waterston, “Unlocking the
Secrets of the Genome,” Nature, vol. 459, no. 7249, pp. 927-930, June 2009.
[12] M. Krueger, “Sharpziplib,” http://www.sharpdevelop.net/OpenSource/
SharpZipLib/, 2010.
[13] I. Pavlov, “LZMA SDK,” http://www.7-zip.org/sdk.html, 2011.
[14] Picard, http://picard.sourceforge.net, 2012.
[15] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth,
G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing
Subgroup, “The Sequence Alignment/Map Format and SAMtools,”
Bioinformatics, vol. 25, no. 16, pp. 2078-2079, Aug. 2009.
[16] C. Kolivas, “lrzip,” http://freshmeat.net/projects/lrzip, 2011.
