Biowulf at the NIH
quip on Helix & Biowulf

Program quip (http://homes.cs.washington.edu/~dcjones/quip/) compresses a file of next-generation sequencing data based on knowledge of the data format, which can be FASTQ, SAM, or BAM.

For FASTQ files quip can achieve a compression ratio of more than 6:1 (original-bytes:compressed-bytes), and for SAM or BAM files a compression ratio of more than 5:1.

Program Location

You can add program quip to your PATH environment variable most easily by using the module command, as in the example:

[user@biowulf]$ module avail quip            (see what versions are available)

------------------- /usr/local/Modules/3.2.9/modulefiles -------------------
>quip/1.1.5

[user@biowulf]$ module load quip             (load the default version)

[user@biowulf]$ module list                  (see what version is loaded)
Currently Loaded Modulefiles:
  1) quip/1.1.5

[user@biowulf]$ module unload quip           (unload this version)

[user@biowulf]$ module load quip/1.1.5       (load a specific version)

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) quip/1.1.5
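
In a batch script it can be useful to confirm that the module load really put quip on the PATH before any data is touched. The following is a minimal sketch, assuming a POSIX shell; the helper name require_cmd is hypothetical and is not part of quip or of the module system.

```shell
#!/bin/sh
# require_cmd NAME: succeed only if NAME resolves to a command on PATH.
# (require_cmd is a hypothetical helper, not part of quip or Modules.)
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH; try 'module load $1'" >&2
    return 1
}

# Typical use in a job script, after 'module load quip':
#   require_cmd quip || exit 1
```

The check only inspects the PATH, so it works the same way whether quip was made available by the module command or by any other means.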

Program Usage

With the quip module loaded, you can run the quip program, for example,

[user@biowulf]$ quip --help
Usage: quip [option]... [file.FORMAT]...
Compress, or decompress high-throughput sequencing data 
 
Options:
  -d, --decompress     decompress (default is compress)
  -c, --stdout         write on standard output
  -h, --help           print this message
  -V, --version        display program version

The principal uses of program quip are

Compression / Decompression of 'fastq' data files

[user@biowulf]$
[user@biowulf]$ module load quip
[user@biowulf]$
[user@biowulf]$ ls -l READS.fastq
-rw-r----- 1 owner owner 21512783306 Sep 11 11:00 READS.fastq
[user@biowulf]$
[user@biowulf]$ quip READS.fastq
[user@biowulf]$ ls -l READS.fastq.qp
-rw-------  1 owner owner  3217821916 Dec 23 12:23 READS.fastq.qp
[user@biowulf]$
[user@biowulf]$ quip -c -d READS.fastq.qp > READS.fastq.deqp  ## quip decompression to new file
[user@biowulf]$ ls -l READS.fastq.deqp
-rw------- 1 owner owner 18797488456 Dec 23 15:03 READS.fastq.deqp
[user@biowulf]$

Notes
  • Program quip wrote the compression of "READS.fastq" to the file "READS.fastq.qp"
  • Program quip does not over-write an existing file without asking your permission
  • The quip decompression (with option "-d"), is run with option "-c" to send the output to STDOUT which has been redirected to the file, "READS.fastq.deqp"
  • In this real example quip achieves a compression ratio of more than 6:1.
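
The compression ratios quoted in these notes can be recomputed from the byte counts in the "ls -l" listings. The following is a small sketch of that arithmetic; the function name compression_ratio is hypothetical, and awk is used because the byte counts are too large for 32-bit shell arithmetic.

```shell
#!/bin/sh
# compression_ratio ORIGINAL_BYTES COMPRESSED_BYTES
# Prints original/compressed, rounded to two decimal places.
# (compression_ratio is a hypothetical helper name.)
compression_ratio() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.2f\n", o / c }'
}

# Byte counts copied from the listings on this page.
compression_ratio 21512783306 3217821916   # READS.fastq vs READS.fastq.qp: prints 6.69
compression_ratio 554036876 100937045      # ALIGNS.sam vs ALIGNS.sam.qp:   prints 5.49
```

For files on disk, the two arguments could be obtained with "wc -c < FILE" rather than typed by hand.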

Compression / Decompression of 'sam' data files

[user@biowulf]$
[user@biowulf]$ module load quip
[user@biowulf]$
[user@biowulf]$ ls -l ALIGNS.sam
  -r--r--r--  1 owner owner   554036876 Jan 12 13:34 ALIGNS.sam
[user@biowulf]$
[user@biowulf]$ quip ALIGNS.sam
[user@biowulf]$ ls -l ALIGNS.sam.qp
  -rw-------  1 owner owner   100937045 Jan 12 13:35 ALIGNS.sam.qp
[user@biowulf]$
[user@biowulf]$ quip -c -d ALIGNS.sam.qp > ALIGNS.sam.deqp
[user@biowulf]$
[user@biowulf]$ ls -l ALIGNS.sam.deqp
  -rw-r--r--  1 owner owner   554036876 Jan 12 13:49 ALIGNS.sam.deqp
[user@biowulf]$

Notes
  • Program quip wrote the compression of "ALIGNS.sam" to the file "ALIGNS.sam.qp"
  • The quip decompression (with option "-d"), is run with option "-c" to send the output to STDOUT which is redirected to the file, "ALIGNS.sam.deqp"
  • In this real example, quip achieves a compression ratio of more than 5:1.

Performance

Performance of program quip

[user@biowulf]$
[user@biowulf]$ ls -l READS.fastq
-rw-r----- 1 owner owner 21512783306 Sep 11 11:00 READS.fastq
[user@biowulf]$
[user@biowulf]$ time quip READS.fastq 
real	10m17.069s
user	18m38.571s
sys	0m21.085s
[user@biowulf]$
[user@biowulf]$ time quip --output=FASTQ -c -d READS.fastq.qp >READS.fastq.deqp
real	13m8.382s
user	24m19.852s
sys	0m19.741s
[user@biowulf]$
[user@biowulf]$ time fastq-md5 READS.fastq
Reading FNAME.fastq
Writing FNAME.fastq.prep
real    8m12.971s
user    5m32.243s
sys     1m35.223s
[user@biowulf]$
[user@biowulf]$ time openssl md5 -c READS.fastq.prep READS.fastq.deqp
MD5(READS.fastq.prep)= a7:3e:7a:49:cd:22:68:cc:6c:7d:24:79:b1:d9:f9:b1
MD5(READS.fastq.deqp)= a7:3e:7a:49:cd:22:68:cc:6c:7d:24:79:b1:d9:f9:b1
real    20m40.992s
user    1m10.092s
sys     2m7.566s
[user@biowulf]$

Comparison with program gzip

[user@biowulf]$ bench gzip -c READS.fastq >READS.fastq.gzip
real 2817.19
user 2801.60
sys  13.63
cpu  99%
mem  2992kb
[user@biowulf]$ 
[user@biowulf]$ ls -l READS.fastq.gzip
 -rw-r--r--  1 owner owner  6628538744 Jan 19 16:55 READS.fastq.gzip
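
The quip and gzip timings on this page are printed in two different formats, so comparing them takes a small conversion. The sketch below assumes that the "real" value printed by bench is in seconds; the helper names to_seconds and time_ratio are hypothetical. The comparison is only indicative, since for quip the user time exceeds the real time, which suggests that it used more than one thread.

```shell
#!/bin/sh
# to_seconds STAMP: convert a stamp such as 10m17.069s (as printed by the
# shell's 'time') to seconds with three decimals.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# time_ratio SLOW_SECONDS FAST_SECONDS: print SLOW/FAST with one decimal.
time_ratio() {
    awk -v s="$1" -v f="$2" 'BEGIN { printf "%.1f\n", s / f }'
}

quip_real=$(to_seconds 10m17.069s)   # 'real' time of 'quip READS.fastq'
gzip_real=2817.19                    # 'real' value reported by bench for gzip
time_ratio "$gzip_real" "$quip_real" # prints 4.6
```

On these figures the gzip run took about 4.6 times as long in wall-clock time as the quip run on the same 21.5 GB FASTQ file.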

Verifying the correctness of a quip compression

Before replacing a FASTQ, BAM, or SAM data file with its quip-compression, it is prudent to verify that the compression contains all necessary information from the original data file. Because the quip compression strategy can change details of a file in ways that complicate verification, quip provides two utilities, program fastqmd5 and program bammd5, to assist with verification of quip-compression.

The next two sections explain why these programs are needed, and how to use them.

FASTQ files

PROBLEM:
The byte size of file "READS.fastq.deqp" may be less than that of the original data file, "READS.fastq".

When files "READS.fastq" and "READS.fastq.deqp" are not identical, we can't verify the correctness (of necessary information) by any straightforward comparison of files.

EXPLANATION:
A schematic of the FASTQ format might look like,

@ <ReadSequenceHeader>
<ReadSequence>
+ <QualitySequenceHeader>
<QualitySequence>

Nominally, the <QualitySequenceHeader> is identical to the <ReadSequenceHeader>,
but in some FASTQ usage each <QualitySequenceHeader> is left blank.

To compensate for this FASTQ input variability, quip removes every <QualitySequenceHeader> before compression.
Thus the 'fastq.deqp' file can be smaller than (i.e. not identical to) the original 'fastq' file.

Since the <QualitySequenceHeader> is redundant, its removal is not important for sequence analysis.
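
The effect described above can be illustrated with a short filter. This is only a sketch of the idea, assuming strict four-line FASTQ records (no wrapped sequence lines); the function name strip_quality_headers is hypothetical, and a checksum computed this way is not claimed to match the value printed by any quip utility.

```shell
#!/bin/sh
# strip_quality_headers: read FASTQ on stdin and write it to stdout with
# the third line of every four-line record reduced to a bare "+".
# Assumes strict four-line records. (Hypothetical helper name.)
strip_quality_headers() {
    awk 'NR % 4 == 3 { print "+"; next } { print }'
}

# Two records that differ only in the quality header become identical:
printf '@r1\nACGT\n+r1\nIIII\n' | strip_quality_headers
printf '@r1\nACGT\n+\nIIII\n'   | strip_quality_headers
```

Piping the filtered stream of the original file and of the decompressed file into "openssl md5" would then give comparable checksums even when the raw files differ in size.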

SOLUTION:
Program fastqmd5 makes this verification possible: it removes every <QualitySequenceHeader> record from a FASTQ format file before computing its md5 checksum, so two files that differ only in those headers produce the same checksum.

[user@biowulf]$ ls -l READS.fastq READS.fastq.deqp
-rw-r----- 1 owner owner 21512783306 Mar  6 12:28 READS.fastq
-rw-r----- 1 owner owner 18797488456 Mar  6 12:28 READS.fastq.deqp
[user@biowulf]$
[user@biowulf]$ module load quip
[user@biowulf]$ fastqmd5 READS.fastq READS.fastq.deqp
b7589f49dbbea3bd4cd4d00ea77667cf READS.fastq
b7589f49dbbea3bd4cd4d00ea77667cf READS.fastq.deqp
[user@biowulf]$
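
In a script, the comparison of the two checksum lines can be automated instead of being done by eye. The sketch below assumes the "CHECKSUM FILENAME" output layout shown in the transcript above; the function name all_checksums_equal is hypothetical.

```shell
#!/bin/sh
# all_checksums_equal: read "CHECKSUM FILENAME" lines on stdin; succeed only
# if at least one line was read and every checksum equals the first one.
# (Hypothetical helper name.)
all_checksums_equal() {
    awk 'NR == 1     { first = $1 }
         $1 != first { bad = 1 }
         END         { if (NR == 0 || bad) exit 1; exit 0 }'
}

# Typical use, with the quip module loaded:
#   if fastqmd5 READS.fastq READS.fastq.deqp | all_checksums_equal; then
#       echo "round trip verified"
#   else
#       echo "checksums differ; keep the original file" >&2
#   fi
```

Empty input counts as a failure, so a checksum program that crashes without printing anything is not mistaken for a successful verification.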

SAM and BAM files

PROBLEM:
Even when the byte size of file "ALIGNS.bam.deqp" ("ALIGNS.sam.deqp") is the same as that of the original data file, "ALIGNS.bam" ("ALIGNS.sam"), the files may not be identical.

So then, as in the case of FASTQ files, we can't verify the correctness (of necessary information) by any straightforward comparison of files.

EXPLANATION:
A SAM alignment record has a required section and an optional section. (See SAM Format for details.)

The optional section consists of none, one, or more fields in format

Tag:Type:Value

The Tag part of any occurring field is unique to that field in the parent record.

quip-compression preserves each occurring optional field but leaves optional fields within a SAM alignment record in an arbitrary order.

quip-compression has the same effect on an "ALIGNS.bam" file, the binary form of "ALIGNS.sam".
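
One way to picture an order-insensitive comparison is to put the optional fields of every record into a fixed order before comparing. The sketch below does this for SAM text, relying on the SAM layout of 11 tab-separated mandatory columns followed by the optional fields; the function name sort_optional_fields is hypothetical, and this illustrates the idea only, not the method used by any quip utility.

```shell
#!/bin/sh
# sort_optional_fields: read SAM on stdin; for each alignment record keep
# the 11 mandatory columns in place and emit the optional TAG:TYPE:VALUE
# columns in sorted order. Header lines (starting with "@") pass through.
# (Hypothetical helper name.)
sort_optional_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^@/  { print; next }
         {
             # insertion sort of columns 12..NF, compared as strings
             for (i = 13; i <= NF; i++) {
                 v = $i
                 for (j = i - 1; j >= 12 && ($j "") > (v ""); j--)
                     $(j + 1) = $j
                 $(j + 1) = v
             }
             print
         }'
}

# Typical use (assumes samtools is available):
#   samtools view -h ALIGNS.bam | sort_optional_fields | openssl md5
```

Two records that carry the same optional fields in a different order come out of the filter byte-for-byte identical, so an ordinary checksum of the filtered streams can be compared.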

SOLUTION:
Program bammd5 can verify correctness for a quip-compressed BAM file because it computes the md5 checksum in a way that is insensitive to the order of the optional fields within each alignment record, which is equivalent to removing any such difference in order from the SAM format file before computing the checksum.

[user@biowulf]$ ls -l ALIGNS.bam ALIGNS.bam.deqp
-rw-r----- 1 owner owner   130169057 Nov 16 10:38 ALIGNS.bam
-rw-r----- 1 owner owner   132181426 Mar  7 14:49 ALIGNS.bam.deqp
[user@biowulf]$
[user@biowulf]$ module load quip
[user@biowulf]$ bammd5 ALIGNS.bam ALIGNS.bam.deqp
0c2aa57fb8191dee12f2b7bda7efe38c ALIGNS.bam
0c2aa57fb8191dee12f2b7bda7efe38c ALIGNS.bam.deqp
[user@biowulf]$

Verification of correctness for a quip-compressed BAM file ("ALIGNS.bam" matches "ALIGNS.bam.deqp") by program bammd5 is equivalent to verification of correctness for the associated quip-compressed SAM file ("ALIGNS.sam" matches "ALIGNS.sam.deqp"), as shown in Figure 1 below:

Figure 1. quip action on SAM and BAM files

  ALIGNS.sam  --- quip --->  ALIGNS.sam.qp  --- quip -d --->  ALIGNS.sam.deqp
    |   ^                                                          ^
    |   |  samtools view                                           |  samtools view
    v   |                                                          |
  ALIGNS.bam  --- quip --->  ALIGNS.bam.qp  --- quip -d --->  ALIGNS.bam.deqp
      |                                                            |
      +------------------------>  bammd5  <------------------------+

Notes
  1. The left-hand column of Figure 1. depicts files "ALIGNS.sam" and "ALIGNS.bam" as being associated by a "samtools view" transformation.
    • It is easy to verify for any "ALIGNS.sam" file that
      when "ALIGNS.bam" is the file resulting from "samtools view -bS ALIGNS.sam",
      then the file resulting from "samtools view -h ALIGNS.bam" is identical to "ALIGNS.sam".
  2. The right-hand column depicts the results of compression-decompression by program quip.
    • It is easy to verify that the file resulting from "samtools view -h ALIGNS.bam.deqp" is identical to "ALIGNS.sam.deqp".
  3. The bottom row indicates that program bammd5 has been used to show that "ALIGNS.bam" and "ALIGNS.bam.deqp" are identical apart from the order of optional fields.
  4. So the equivalent files "ALIGNS.bam" and "ALIGNS.bam.deqp" are transformed by "samtools view" into equivalent files "ALIGNS.sam" and "ALIGNS.sam.deqp".

Known Bugs

quip failed (with a SEGMENTATION_FAULT) to compress a 7.9 GB BAM file. The job was run with 128 GB of memory. The BAM file was written by program tophat from a 21.5 GB FASTQ file produced by a state-of-the-art, paired-end, mRNA data acquisition.

The author of the quip program has not yet addressed this problem. Until the capability of program quip to compress SAM/BAM files is clarified, we recommend care in making plans to use quip for this purpose.

Note that whenever program quip does complete a compression of a SAM/BAM file, you can use the procedures described above to verify the correctness of the compression.

Note also that, since the structure of a FASTQ file is much simpler than that of a SAM/BAM file, the performance of quip with a SAM/BAM file is not likely to be related to its performance with a FASTQ file. For example, quip was able to compress/decompress the 21.5 GB FASTQ file which was the source of the BAM file mentioned in the above bug report.
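
Because a compression run can end in a crash, a job script should check the exit status of each run rather than assume success. The sketch below relies on the usual shell convention that an exit status above 128 means the command was killed by a signal (139 corresponds to signal 11, a segmentation fault); the helper name run_checked is hypothetical.

```shell
#!/bin/sh
# run_checked COMMAND [ARG...]: run the command, report a failure on stderr,
# and return the command's exit status. (Hypothetical helper name.)
run_checked() {
    status=0
    "$@" || status=$?
    if [ "$status" -gt 128 ]; then
        echo "error: '$*' was killed by signal $((status - 128))" >&2
    elif [ "$status" -ne 0 ]; then
        echo "error: '$*' exited with status $status" >&2
    fi
    return "$status"
}

# Typical use, with the quip module loaded:
#   run_checked quip ALIGNS.bam || exit 1
```

A successful exit status says only that the program finished; the verification procedures described above are still needed before the original file is deleted.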

Documentation

  • quip(1) - quip command description
    [user@biowulf]$ man quip
    
  • quip(5) — Structure of a quip compressed FASTQ file.
    [user@biowulf]$ man 5 quip
    
  • quip(7) — Discussion of quip compression strategy with references
    [user@biowulf]$ man 7 quip
    
  • quip home page

    http://homes.cs.washington.edu/~dcjones/quip/