In bioinformatcis, bgzip files are important for random access to big files . Bgzip is a program modified from gzip program that uses block compression and is fully backwards compatible with gzip. But I have issues when using bgzip compressed vcf files with Perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). A similar problem happen to my recently with snpeff program (Java). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete. perl code that does not work:
#!/usr/bin/env perl use strict; use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ; my $infile = shift; my $infh = IO::Uncompress::Gunzip->new( $infile ) or die "IO::Uncompress::Gunzip failed: $GunzipError\n"; my $line_count = 0; while (my $line=<$infh>){ $line_count++ } print "total lines read = $line_count\n";
$ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz total lines read = 419
#!/usr/bin/env perl use strict; # I can use bgzip intead gzip my $infile = shift; open(my $infh , 'gzip -dc '.$infile.' |'); my $line_count = 0; while (my $line=<$infh>){ $line_count++ } print "total lines read = $line_count\n";
$ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz total lines read = 652829
[UPDATE 2014-02-28]: finally a clue come from biostars where Heng Li remind me a footnote in the SAM specs about a java library for gzip that only sees first block of bgzip when decompressing. Seems that Perl gzip implementations had the same problem.
No comments:
Post a Comment