Thursday 27 February 2014

perl gzip libraries (probably zlib issue) does not play well with bgzip files.


In bioinformatcis, bgzip files are important for random access to big files . Bgzip is a program modified from gzip program that uses block compression and is fully backwards compatible with gzip. But I have issues when using bgzip compressed vcf files with Perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). A similar problem happen to my recently with snpeff program (Java). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete. perl code that does not work:

#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) 
         or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";
This gives 419 lines
    $ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz
    total lines read = 419
but using open with gzip pipe works:
    #!/usr/bin/env perl
    use strict;
    # I can use bgzip intead gzip
    my $infile = shift;

    open(my $infh , 'gzip -dc '.$infile.' |'); 
    my $line_count = 0;
    while (my $line=<$infh>){

        $line_count++
    }
    print "total lines read = $line_count\n"; 
Gives the expected number of lines
    $ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz
    total lines read = 652829
I googled about and I was unable to find quickly any relevant entry, but this is something that I am sure other people would have already faced. Do someone have a clue about why is this happening? I am using ubuntu 12.04.4 with perl 5.16

[UPDATE 2014-02-28]: finally a clue come from biostars where Heng Li remind me a footnote in the SAM specs about a java library for gzip that only sees first block of bgzip when decompressing. Seems that Perl gzip implementations had the same problem.

No comments: