Thursday, 27 February 2014

perl gzip libraries (probably zlib issue) does not play well with bgzip files.


In bioinformatcis, bgzip files are important for random access to big files . Bgzip is a program modified from gzip program that uses block compression and is fully backwards compatible with gzip. But I have issues when using bgzip compressed vcf files with Perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). A similar problem happen to my recently with snpeff program (Java). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete. perl code that does not work:

#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) 
         or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";
This gives 419 lines
    $ perl /home/pmg/tmp/test_zlib-bgzip.pl varsit.vcf.gz
    total lines read = 419
but using open with gzip pipe works:
    #!/usr/bin/env perl
    use strict;
    # I can use bgzip intead gzip
    my $infile = shift;

    open(my $infh , 'gzip -dc '.$infile.' |'); 
    my $line_count = 0;
    while (my $line=<$infh>){

        $line_count++
    }
    print "total lines read = $line_count\n"; 
Gives the expected number of lines
    $ perl /home/pmg/tmp/test_gzip-bgzip.pl varsit.vcf.gz
    total lines read = 652829
I googled about and I was unable to find quickly any relevant entry, but this is something that I am sure other people would have already faced. Do someone have a clue about why is this happening? I am using ubuntu 12.04.4 with perl 5.16

[UPDATE 2014-02-28]: finally a clue come from biostars where Heng Li remind me a footnote in the SAM specs about a java library for gzip that only sees first block of bgzip when decompressing. Seems that Perl gzip implementations had the same problem.

Monday, 24 February 2014

automatic encrypting decrypting with emacs

I have been for long time wondering if there was a way to automatically encrypt and decrypt files when accessing them, not leaving an unencrypted file around.

Today I have an epiphany: "let's see if emacs can do that". As an emacs power user I felt as an idiot. Emacs has in the core this ability since emacs 23!! and I never realized about it :-(.

Emacs has incorporated the EasyPG code into its core.
http://www.emacswiki.org/emacs/EasyPG

 Things can not be more easy. Put in your .emacs

(require 'epa-file)
(epa-file-enable)

;; as it is annoying to be asked if I want passphrase or publick/private key
;; I set passphrase as default
(setq epa-file-select-keys nil)

I also would like to have a gpg agent cache. In ubuntu 12.04 the package is called gnupg-agent

sudo apt-get install gnupg-agent
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  libassuan0 pinentry-gtk2
Suggested packages:
  pinentry-doc
The following NEW packages will be installed:
  gnupg-agent libassuan0 pinentry-gtk2
0 upgraded, 3 newly installed, 0 to remove and 1 not upgraded.  


Now you only need to open a file ended with .gpg and automatically emacs would ask you for the pasword for decryption and again for saving. No unencrypted temporary files are stored, or at least I am not aware of them.

Tuesday, 20 August 2013

Reducing batch effects in methylation analysis

In our sequencing facility we have processed 4 illumina 450K methylation chips for the same study at 2 diferent times. We took care of spreading case and controls among all the chips. We can see a clear time-of-processing batch effect:

The chips were run in two different days one month apart. first 24 samples and then the other 24 samples.



The negative controls of the second batch  have a wider range.

And wen looking at the MDS and PCA we can see that the main parameter for grouping samples is the run day (the two chips starting with 835... where done the same day and the other two another day)



I have used the nice rnbeads R package for reporting the batch effects.

Now I am exploring two methods of correcting the batch effect:
  1.   limma using the batch date as block effect
    • design <- model.matrix(~Block+Treatment)
  2. ComBat in the R SVA package for correcting this
I will post the results soon. Meanwhile any comment for dealing with batch effects in 450K chips is more than wellcome.
 

Tuesday, 2 July 2013

where is vnc viewer vinagre storing its history of connections?

not easy to find manually so I did

find ~/ -name vinagre

.local/share/vinagre

analysis[~] cat .local/share/vinagre/history
localhost:8900
localhost:8910
localhost:8920
localhost:8930
localhost:8907

unbuntu 12.10 install hangs in first boot 'waiting for battery status'. Seems it was a problem with my graphics card.

I have two new i7 servers and I installed ubuntu 12.10. Unfortunately after installation (without network) both of them did not started the Xs and stoped with a message ''waiting for battery status". After googling for answers for this issue, I saw that many of the solutions either fixed it feedling with the display manager or Xauthorities or Xorgs or simply being practical and starting a Xserver from ctr-alt-f1 but for some users none of these actions fixed the problem.

Seems that all solutions are related with the Xs or display. In my case both computers have an Asus GForce 210 graphical card but I have a third identical server without the card has no problem. So my first guess was a possible issue with the linux drivers for the card. So first solution was to go to the BIOS and change the GPU to iGPU (the integrated one). VoilĂ , the machine boots into graphical mode with no problem.

Once I obtained the IP for my mac address, I upgraded (on computer only upgrade, and the odter dist-upgrade to 13.04) and now I am able to boot correctly using my PCI GPU card from both computers.

So if you find this problem (that seems not so rare, but I have never seen before and I have intalled a lot of ubuntus in a lot of machines since the 'noname-distribution'), before doing other things try to upgrade your computer to the same or  a newer distro. If none of the web suggestions fix the problem you can always try to use your integrated GPU o change your graphics card.

Monday, 24 June 2013

fixing "Error while parsing file:freemind.main.XMLParseException: XML Parse Exception during parsing of a map element at line 1: Expected: ="

I am a fan of mindmap programs. I am all the time using both, FreeMind and Xmind.

Xmind is cute and has all the bells and whistles needed to enjoy the ride, but it has a resources hungry monster under the hood and is very slow when you have more than 6 or 7 big mindmaps with more than 6 or 7 tabs in each of them.

For brainstorming and quick and simple mindmaps FreeMind is my piece of choice. I love its simplicity and the low level of resources usage it has compared with Xmind.

Under this duality I thanks the import and export capabilities of XMind. It makes the life easier..... until java came along.

Xmind exports to freemind 0.8 format. But now the stable version of FreeMind is 0.9. FreeMind 0.9 due to a java bug does not always opens FreeMind 0.8 maps, and fails with :

Error while parsing file:freemind.main.XMLParseException: XML Parse Exception during parsing of a map element at line 1: Expected: =

This has been also reported here:
 http://sourceforge.net/p/freemind/discussion/330666/thread/d09d930c/?limit=50

And the solution is to install the latest FreeMind. In my case the 1.0 RC4

http://sourceforge.net/projects/freemind/files/freemind-unstable/1.0.0_RC4/freemind-bin-max-1.0.0_RC_4.zip/download

Monday, 13 May 2013

Issues Installing R package ncdf

Just for future reference:

Short story:
before installing ncdf in R intall in your linux libnetcdf-dev and netcdf-bin

Long story:

I found a problem installing the R package ncdf:

> install.package('ncdf')

        |checking whether we are using the GNU C compiler... yes
        |checking whether gcc -std=gnu99 accepts -g... yes
        |checking for gcc -std=gnu99 option to accept ISO C89... none needed
        |checking how to run the C preprocessor... gcc -std=gnu99 -E
        |checking for grep that handles long lines and -e... /bin/grep
        |checking for egrep... /bin/grep -E
        |checking for ANSI C header files... no
        |checking for sys/types.h... no
        |checking for sys/stat.h... no
        |checking for stdlib.h... no
        |checking for string.h... no
        |checking for memory.h... no
        |checking for strings.h... no
        |checking for inttypes.h... no
        |checking for stdint.h... no
        |checking for unistd.h... no
        |checking netcdf.h usability... no
        |checking netcdf.h presence... no
        |checking for netcdf.h... no
        |configure: error: netcdf header netcdf.h not found
        |ERROR: configuration failed for package ‘ncdf’
        |* removing ‘/usr/local/lib/R/site-library/ncdf’


Well I though, this is the classical "install the -devel package first" but alas, I have the netcdf.h in /usr/include !!! and I did already intalled the libnetcdf-dev package in my biolinux-7 (ubuntu 12.04)

I went to the tmp folder where the R packages was downloaded and tried to install it by hand. I read the INSTALL  file and tried adding  environmental variables:


The location of the netcdf library can be specified by the environment
variable NETCDF_LIB or the configure argument --with-netcdf-lib.

or adding FLAGS to ./configure

LDFLAGS=-L/usr/lib CPPFLAGS=-I/usr/include sh configure

Nothing was working.

Finally I remembered that at the beginning of the INSTALL file it is mentioned that the program nc-config  is a helper for the installer to find out the location of the dev files, but as it was not installed and it is said in this file that is optional I skipped this part during my first trials. After failed several times to find out how to pass the paths of the dev files I decided to give a try to the nc-config route. I searched for it and here http://cirrus.ucsd.edu/~pierce/ncdf/install_ncdf_v4.html I read:

The only issue is getting the libraries right. All the libraries used by the netcdf library, version 4, must be visible to the R software. To get a list of what libraries must be visible, run "nc-config --libs" (note: a recent version of the netcdf library, version 4, must be installed correctly for this command to work).


But I did not have nc-config

root@pmg-analysis:/tmp/RtmpiNeIMZ/downloaded_packages/ncdf# nc-config --libs The program 'nc-config' is currently not installed. You can install it by typing: apt-get install netcdf-bin

So I installed it
root@pmg-analysis:/tmp/RtmpiNeIMZ/downloaded_packages/ncdf# apt-get install netcdf-bin

And then knowing that the R package ncdf uses this program for finding the headers and libs for netcdf, I run the install.package in R:

>install.packages('ncdf')
Installing package into ‘/home/pmg/R/x86_64-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.univ-paris1.fr/src/contrib/ncdf_1.6.6.tar.gz'
Content type 'application/x-gzip' length 79403 bytes (77 Kb)
opened URL
==================================================
downloaded 77 Kb

* installing *source* package ‘ncdf’ ...
** package ‘ncdf’ successfully unpacked and MD5 sums checked
checking for nc-config... /usr/bin/nc-config
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf.c -o ncdf.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf2.c -o ncdf2.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf3.c -o ncdf3.o
gcc -std=gnu99 -shared -o ncdf.so ncdf.o ncdf2.o ncdf3.o -L/usr/lib -lnetcdf -L/usr/lib/R/lib -lR
installing to /home/pmg/R/x86_64-pc-linux-gnu-library/3.0/ncdf/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (ncdf)

Job done!!

Sunday, 10 February 2013

the perils of Excel: any one can do it



A nice story to read about the perils of using excel when you don't know what you are doing. Fools rush in where angels fear to trade.

http://baselinescenario.com/2013/02/09/the-importance-of-excel/
[...]The new model “operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” The internal Model Review Group identified this problem as well as a few others, but approved the model, while saying that it should be automated and another significant flaw should be fixed.** After the London Whale trade blew up, the Model Review Group discovered that the model had not been automated and found several other errors. Most spectacularly,
After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR . . .”
[...]
But while Excel the program is reasonably robust, the spreadsheets that people create with Excel are incredibly fragile. There is no way to trace where your data come from, there’s no audit trail (so you can overtype numbers and not know it), and there’s no easy way to test spreadsheets, for starters. The biggest problem is that anyone can create Excel spreadsheets—badly. Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way.***
[...]

The importance of logarithmic transformation in 'natural' data


Reading the Edward Tuft book about data analysis in politics and policy

http://www.edwardtufte.com/tufte/dapp/

Edward Tuft is one of the gurus of Data Analysis visualization [0], and in this chapter [1] he show in a very didactic and clear way the importance of logarithmic transformation for data of naturally occurring counts.
   [0] http://en.wikipedia.org/wiki/Edward_Tufte
   [1] http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003uF

The importance of logarithmic transformation


http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003uF

This is a very clear and didactic explanation of the importance of logarithmic transformation that anyone on doing data analysis in natural sciences or epidemiology must read.

And a very important point is to raise the point that regression analysis of a model DOES NOT TEST the relationship but SHOWS the proportionality GIVEN THE MODEL BEING TRUE 


The end part of this section has a bit more of mathematics that some biologist probably have already forgotten but it is worthy to read it anyway.

I truly recommend reading this even it is a very old book (ed. 1976).

Final note: Remember to add 1 to your data before log transform in order to avoid log(0). Don't do that if you have negative number ;-). Other option is to add a small quantity to all your 0s

Saturday, 9 February 2013

BioPerl is thinking about to be more practical and adaptative


There has been a good number of BioPerl threads in the mailing list [0] last week about how to make BioPerl more fitted to the current times.

  [0] http://thread.gmane.org/gmane.comp.lang.perl.bio.general

I like the phrase of George Hartzell about being able to move forward because we need to support Perl 5.8

But why should the all-volunteer BioPerl community be stuck supporting
code from 12 years ago because it's cost effective for someone else to
avoid spending *their* $/time/people to stay up to date.

And the links to the discussion:
Next BioPerl release : http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26348
dependencies on perl version : http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26344
BioPerl future :  http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26394
removing packages from bioperl-live: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26341