Thursday 27 February 2014

perl gzip libraries (probably zlib issue) does not play well with bgzip files.


In bioinformatcis, bgzip files are important for random access to big files . Bgzip is a program modified from gzip program that uses block compression and is fully backwards compatible with gzip. But I have issues when using bgzip compressed vcf files with Perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). A similar problem happen to my recently with snpeff program (Java). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete. perl code that does not work:

#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) 
         or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";
This gives 419 lines
    $ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz
    total lines read = 419
but using open with gzip pipe works:
    #!/usr/bin/env perl
    use strict;
    # I can use bgzip intead gzip
    my $infile = shift;

    open(my $infh , 'gzip -dc '.$infile.' |'); 
    my $line_count = 0;
    while (my $line=<$infh>){

        $line_count++
    }
    print "total lines read = $line_count\n"; 
Gives the expected number of lines
    $ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz
    total lines read = 652829
I googled about and I was unable to find quickly any relevant entry, but this is something that I am sure other people would have already faced. Do someone have a clue about why is this happening? I am using ubuntu 12.04.4 with perl 5.16

[UPDATE 2014-02-28]: finally a clue come from biostars where Heng Li remind me a footnote in the SAM specs about a java library for gzip that only sees first block of bgzip when decompressing. Seems that Perl gzip implementations had the same problem.

Monday 24 February 2014

automatic encrypting decrypting with emacs

I have been for long time wondering if there was a way to automatically encrypt and decrypt files when accessing them, not leaving an unencrypted file around.

Today I have an epiphany: "let's see if emacs can do that". As an emacs power user I felt as an idiot. Emacs has in the core this ability since emacs 23!! and I never realized about it :-(.

Emacs has incorporated the EasyPG code into its core.
http://www.emacswiki.org/emacs/EasyPG

 Things can not be more easy. Put in your .emacs

(require 'epa-file)
(epa-file-enable)

;; as it is annoying to be asked if I want passphrase or publick/private key
;; I set passphrase as default
(setq epa-file-select-keys nil)

I also would like to have a gpg agent cache. In ubuntu 12.04 the package is called gnupg-agent

sudo apt-get install gnupg-agent
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  libassuan0 pinentry-gtk2
Suggested packages:
  pinentry-doc
The following NEW packages will be installed:
  gnupg-agent libassuan0 pinentry-gtk2
0 upgraded, 3 newly installed, 0 to remove and 1 not upgraded.  


Now you only need to open a file ended with .gpg and automatically emacs would ask you for the pasword for decryption and again for saving. No unencrypted temporary files are stored, or at least I am not aware of them.