In bioinformatics, bgzip files are important for random access to big files. Bgzip is a program derived from gzip that uses block compression and is fully backwards compatible with gzip. But I have had issues when using bgzip-compressed VCF files with Perl scripts that use IO::Uncompress::Gunzip (which I believe uses zlib under the hood). A similar problem happened to me recently with the snpEff program (Java). In both cases the data is decompressed but truncated after a few hundred lines or so. I could be totally wrong, but I was wondering whether zlib (or whatever gzip-compatible library they are using) is getting confused by the bgzip blocks and only processing one or a few of them, leaving the output incomplete.

Perl code that does not work:
#!/usr/bin/env perl
# Count lines in a gzip/bgzip-compressed file with IO::Uncompress::Gunzip
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new($infile)
    or die "IO::Uncompress::Gunzip failed: $GunzipError\n";

my $line_count = 0;
while (my $line = <$infh>) {
    $line_count++;
}
print "total lines read = $line_count\n";
$ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz
total lines read = 419
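
One way to check that the file really is a series of independent gzip blocks (and that the truncation is not simply a corrupted file) is to walk the BGZF block headers directly. Below is a minimal sketch, assuming the block layout described in the SAM specification: every block is a gzip member with FEXTRA set and a 'BC' extra subfield whose BSIZE value is the total block size minus one.

#!/usr/bin/env perl
# Sketch: count BGZF blocks by walking the block headers directly.
# Assumes the BGZF layout from the SAM spec; the final block counted is
# bgzip's 28-byte empty EOF marker.
use strict;
use warnings;

my $infile = shift or die "usage: $0 file.vcf.gz\n";
open(my $fh, '<:raw', $infile) or die "cannot open $infile: $!\n";

my $blocks = 0;
while (read($fh, my $header, 12) == 12) {
    my ($id1, $id2, $cm, $flg, $mtime, $xfl, $os, $xlen)
        = unpack('CCCCVCCv', $header);
    die "not a gzip member at block $blocks\n"
        unless $id1 == 0x1f && $id2 == 0x8b;
    die "block $blocks has no extra field (plain gzip, not bgzip?)\n"
        unless $flg & 0x04;

    read($fh, my $extra, $xlen) == $xlen or die "truncated extra field\n";

    # find the 'BC' subfield that stores the block size
    my $bsize;
    my $pos = 0;
    while ($pos + 4 <= $xlen) {
        my ($si1, $si2, $slen) = unpack('CCv', substr($extra, $pos, 4));
        if ($si1 == ord('B') && $si2 == ord('C')) {
            $bsize = unpack('v', substr($extra, $pos + 4, 2));
            last;
        }
        $pos += 4 + $slen;
    }
    die "no BC subfield: not a BGZF block?\n" unless defined $bsize;

    $blocks++;
    # BSIZE is the total block length minus one; skip to the next block
    seek($fh, $bsize + 1 - 12 - $xlen, 1) or die "seek failed: $!\n";
}
print "total BGZF blocks = $blocks\n";

A plain gzip file would die at the first block, while a bgzip-compressed VCF should report many blocks, each one a complete gzip member on its own.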
#!/usr/bin/env perl
# Workaround: pipe the file through the external gzip binary,
# which decompresses every block (bgzip -dc would work here too)
use strict;
use warnings;

my $infile = shift;
open(my $infh, '-|', 'gzip', '-dc', $infile)
    or die "cannot run gzip -dc on $infile: $!\n";

my $line_count = 0;
while (my $line = <$infh>) {
    $line_count++;
}
print "total lines read = $line_count\n";
$ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz
total lines read = 652829
[UPDATE 2014-02-28]: finally a clue came from Biostars, where Heng Li reminded me of a footnote in the SAM specs about a Java gzip library that only sees the first block of a bgzip file when decompressing. It seems that the Perl gzip implementation has the same problem.
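
If that is the cause, the fix on the Perl side should simply be the MultiStream option of IO::Uncompress::Gunzip, which tells it to keep reading across concatenated gzip members (which is exactly what a bgzip file is) instead of stopping after the first one. Here is a minimal sketch of the first script with that option added; I have not re-run it on the same file, so treat it as a starting point:

#!/usr/bin/env perl
# Same counting loop as before, but MultiStream => 1 makes
# IO::Uncompress::Gunzip read every concatenated gzip member (bgzip block)
# instead of stopping after the first one.
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new($infile, MultiStream => 1)
    or die "IO::Uncompress::Gunzip failed: $GunzipError\n";

my $line_count = 0;
while (my $line = <$infh>) {
    $line_count++;
}
print "total lines read = $line_count\n";

Run the same way as before, it should now report the full line count instead of stopping after the first block.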