In bioinformatcis, bgzip files are important for random access to big files . Bgzip is a program modified from gzip program that uses block compression and is fully backwards compatible with gzip. But I have issues when using bgzip compressed vcf files with Perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). A similar problem happen to my recently with snpeff program (Java). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete. perl code that does not work:
#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;
my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) 
         or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){
    $line_count++
}
print "total lines read = $line_count\n";
    $ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz
    total lines read = 419
    #!/usr/bin/env perl
    use strict;
    # I can use bgzip intead gzip
    my $infile = shift;
    open(my $infh , 'gzip -dc '.$infile.' |'); 
    my $line_count = 0;
    while (my $line=<$infh>){
        $line_count++
    }
    print "total lines read = $line_count\n"; 
    $ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz
    total lines read = 652829
[UPDATE 2014-02-28]: finally a clue come from biostars where Heng Li remind me a footnote in the SAM specs about a java library for gzip that only sees first block of bgzip when decompressing. Seems that Perl gzip implementations had the same problem.
No comments:
Post a Comment