Saturday, 17 November 2012

Nice positional sequencing hack in an Illumina flow cell by the Shendure lab

Another interesting paper from the Shendure lab:

Proc Natl Acad Sci U S A. 2012 Nov 13;109(46):18749-54. doi: 10.1073/pnas.1202680109. Epub 2012 Oct 29.

Capturing native long-range contiguity by in situ library construction and optical sequencing.


Department of Genome Sciences, University of Washington, Seattle, WA 98195.


The relatively short read lengths associated with the most cost-effective DNA sequencing technologies have limited their use in de novo genome assembly, structural variation detection, and haplotype-resolved genome sequencing. Consequently, there is a strong need for methods that capture various scales of contiguity information at a throughput commensurate with the current scale of massively parallel sequencing. We propose in situ library construction and optical sequencing on the flow cells of currently available massively parallel sequencing platforms as an efficient means of capturing both contiguity information and primary sequence with a single technology. In this proof-of-concept study, we demonstrate basic feasibility by generating >30,000 Escherichia coli paired-end reads separated by 1, 2, or 3 kb using in situ library construction on standard Illumina flow cells. We also show that it is possible to stretch single molecules ranging from 3 to 8 kb on the surface of a flow cell before in situ library construction, thereby enabling the production of clusters whose physical relationship to one another on the flow cell is related to genomic distance.

Looking for a Perl tool like the IPython notebook

I have just recently discovered IPython notebooks[1]; they are fantastic and, like R's Sweave[2], essential for reproducible research.

I use Perl and PDL for bioinformatics research. PDL being a scientific tool, these kinds of reproducible-research utilities seem a perfect match for it.

Some time ago the bio(perl|python) communities started looking into a similar approach to Sweave, but I have not seen much progress there. A cheap alternative (if your OS is called Emacs ;-)) is to use org-mode babel[3]. I like babel because I almost always work over an ssh connection to my servers, inside a screen session with emacs -nw. Babel could be a substitute for Sweave, but the IPython notebook is a more advanced and interactive beast. You can see this fantastic presentation by Fernando Pérez about scientific Python[4] where he demonstrates its powers.
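For readers who have not tried babel: a minimal sketch of what an org-mode source block looks like (the :results option and the language of the block are just one possible setup):

```org
#+BEGIN_SRC perl :results output
use strict;
use warnings;
print "2 + 2 = ", 2 + 2, "\n";
#+END_SRC

#+RESULTS:
: 2 + 2 = 4
```

Pressing C-c C-c inside the block evaluates it and inserts the #+RESULTS: section in place, so the org file carries the code and its output together, which is the essence of the reproducible-research workflow.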

I don't yet know of any Perl tool similar to IPython notebooks, but if it exists I would like to find and use it.

Any comments and links on this topic would be more than welcome.

Key-points summary of the GATK licence change

Recently I posted about GATK's change of licence to a commercial one, dropping the MIT licence for GATK 2.0.

There is a page at gatkforums with the thread generated by the licence change. First of all, I am very glad to see scientists discussing with arguments and opinions instead of trolling and insulting. In an informatics forum this would have generated a rather impolite flame war. I am very pleased that the holocaust and Hitler have not been mentioned yet, and it seems they never will be ;-). Kudos to the community.

From this thread it seems that the licence change was done to prevent some companies from making money selling GATK analyses, while the lab "producing" GATK is, year after year, like the rest of us, desperately trying to secure economic resources to keep its projects going in this time of big cuts in science. That is understandable, but it raises another debate: when do tools like GATK, the SAMtools pipeline, the Tuxedo suite, etc. become mature enough to fly solo and pass from being a scientific project to production software? And when that happens, should they be funded by scientific funds or by industry funds?

The key of the debate here is more about the lab's lack of expertise in licensing and its commercial and restriction implications, as the licence terms were ambiguous and not clear enough in some passages.

[dePristo reply in the blog explaining the two licences]

Hi Pepetideo,

You can share your GATK results -- that was a language slip up on our part. You can see it's clear in the license and FAQ now.

As for why, it's twofold. One is to ensure that we can continue to develop and support the GATK into the future by creating a sustainable revenue source for the team. Two is that a commercial version will be able to support a large team providing tier-one support, such as long-term maintenance of specific GATK versions, which my research group simply cannot provide. Note that any commercial entity who wants to stay with GATK-Lite can go the full open source route, at the cost of foregoing premium support and access to the best possible tools.

We recognize that this is a change, and of course we are big supporters of open source software -- the vast majority of GATK2 is open source. We considered creating a Mozilla-style "GATK foundation", accepting micro donations, or even providing pay-for-services on top of the GATK, but ultimately the commercial/non-commercial divide seemed the option that provides the most value to the entire community.

    Purchasing a commercial GATK2 license will give you the right to run the GATK2 within the company and share / publish / etc your results. This is what I'd think of as a standard commercial license, and most places would fit in this bucket. The example here is buying Adobe Photoshop and using it in house to manage and edit photos.

    The more complex question is around third-party pipeline executors, which only take in data from others and who effectively sell the running of the GATK. Here I think there will be a separate license with specific terms, but it's something we'd like to enable. The analogy here is setting up a for-profit web portal for photo editing that backends to photoshop. A valuable activity but one not covered by the standard end-user license agreement.
[some answers to that]
August 2

From this side of the pond, the Wellcome Trust Sanger Institute does have a policy on the software we develop here: it has to be open sourced and under specified licenses. This is in harmony with our policy that research funded by Wellcome Trust money must be published in an open access journal.

I can't speak for the institute itself, but I have a sinking feeling this decision will spark a lot of debate. The concern this gives me, and what I intend to find out about, is how this will interact with any collaborations on, or contributions we wish to make to, GATK. UK charity law is quite tight on what kind of profit-making activities charities can take part in, so it may involve lawyer time.

I do understand the Broad's point of view, people are making money on software that the Broad has invested money in producing and the Broad is not getting a cut from it. Ideally the way they'd pay it forward would be to contribute testing time and improvements back, but in practice I imagine quite a few are taking a free ride. That said companies take a free ride on most of the research we do, it's just harder to make money from most of it though. This whole debate does bring the name Celera to mind though.
August 3

Companies pay taxes too. Some additionally indirectly support the development of tools such as the GATK through academic collaborations. I guess the way you are framing the question though, it comes down to this: taxpayers paid for the development of the GATK. Most taxpayers aren't doing genetic research. So what will benefit the taxpayers most? (Should companies be paying for the reference genome sequence? For SNP databases? Where do we draw the line?) The reason the government (taxpayers) invest in basic research is to stimulate the downstream discovery. Help us translate research into helping patients!

I even take exception to the phrase "If the time comes when they're asked to contribute back and they don't, then yes they are leeches." Who is asking to contribute back? Not the people who paid for the development in the first place (NIH, Eli & Edyth Broad, Harvard, MIT)! The GATK became widely used not only because it was good (it is), and not only because it was free (it was), but because of a huge investment from other projects (most notably the 1000 genomes project, but others as well) that got it free publicity and turned it into a de facto standard. It's hard to compete when the GATK team has earlier access to taxpayer-funded projects/data/sequencing, and guaranteed publications when these projects come out. Also, gaining market share and then raising prices sounds more capitalist than Marxist, to reference an earlier comment.

I don't mean to go on a tirade; I guess I just feel strongly about this. I definitely understand where you are coming from; I too have written popular open-source software that cost me personally plenty of time of support, and meanwhile surely helped some for-profit entities do research (I hope!) and one for-profit company in particular have success. Nonetheless, I knew it was my duty to share the software freely. Also please know that I say all this with the utmost respect for the whole GATK team (most of which I think I know--I don't think we've had the pleasure of meeting though, Geraldine). You guys are doing a great job, and it's wonderful that taxpayers have been able to fund the development of (documented, supported) academic software. I'd be happy to take everyone out for drinks after work some day and thank you personally; justifying paying for the software is difficult.

*Note: these views are my own, and do not necessarily reflect those of my company or colleagues.
August 3

Hi all,

I want to chime in with three clarifying points:

    We don't yet know the pricing scheme, but we are keenly aware of the complications of per-use licensing as TechnicalVault brings up

    Overall I want everyone to tone down the moral issue surrounding commercial licensing. The discussion of moral issues, extracting of rents, leeching off taxpayers, are all counterproductive to helping understand what we have decided and the best path forward. All of this is just business, after all.

    The NIH is clear that when funding basic research the supported IP is generally owned by the developing institution (I'm sure there are exceptions), and this is true for software in general. The only key software restriction I know of is that the software must be made available to federal employees upon request. The reason for this policy is obvious -- it would be extremely difficult to accept the trade-off in a grant with IP ownership if you are creating high-value IP. Even federal SBIR grants spell out clearly that the government does not own any IP associated with the supported work. The federal granting system is there to foster innovation, not to own innovation. It's a subtle difference, but important for anyone creating real IP value with federal money in the US.

    Many users of the GATK would like a much higher level of support than the Broad institute could possibly provide, as this is off track for the Broad's mission to transform medicine through genomics. We believe that having a commercial license for the GATK will allow us to actually deliver on this superior support and continue to grow the GATK as a reliable standard for NGS analysis in the commercial sphere and beyond. Without a commercial version we simply cannot follow through on this opportunity.

    We attempted to make the GATK easy for others to contribute code to, but our experiences in this area have been disappointing. Many people use the GATK for developing tools -- and we are committed to ensuring the programming framework and libraries remain MIT licensed -- but we have had little contribution to the master codebase from independent third parties over 3 years. Certainly some of our collaborators have contributed impressive tools and extensions, but again they aren't really independent. There's a good wikipedia article on the similar experiences of mySQL, and they release with a dual license similar to our approach. Still, I'd like to release the source code of all of the tools -- if we can find a way consistent with the commercial license -- for transparency to the community and to allow others to contribute, in so far as they like.

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
But people were still confused two months later
TechnicalVault Posts
October 17

Hi Mark, I have a couple of questions stemming from the FAQ posted by your new commercial partners:

    Regarding, "Why not stay with GATKLite?" According to the FAQ at your new partner's site "Broad has indicated that GATK-lite tools will soon be obsolete, and it plans to stop supporting the tools by the end of 2012." Can you confirm exactly what this means please? Is it all of GATK-lite which will be dropped or just tools which have been replaced by new ones from GATK2?

    "Use by a not-for-profit organization to generate revenue requires a commercial license", can you clarify what this means? For example providing sequencing services to other academic institutions generates revenue, however it is usually done at cost so does not generate profit.

    If a not-for-profit was interested in buying support, but not in buying a commercial license is there an option for this? Who would it be with?

    Finally who will be the final arbiter of usage terms? Does that remain with the Broad or have you signed enforcement over to your partners?

Thank you for all your hard work
Post edited by TechnicalVault on October 17
October 17

Hi Martin,

    1. CORRECTION: The wording in the FAQ was incorrect due to a miscommunication. We will in fact continue providing support for GATK-Lite tools indefinitely, and although we will eventually stop providing a separate build (jar file), the GATK-Lite codebase will remain publicly available on our open source GitHub repository. In addition, tools from GATK 2 will be migrated into the GATK-Lite codebase over time.
    2. and 3. Please direct these questions to our partner, Appistry. They will be able to tell you based on your specific circumstances. They have a contact form that you can use, and within a few days they will also have a discussion forum that you can use for this purpose.
    3. See above.
    4. I believe Mark will answer that for you, or direct you to Issi Rozen here at the Broad, if you want an answer from our side. Otherwise I expect the Appistry people should be able to answer that as well.

Friday, 16 November 2012

GATK 2.0 drops MIT licence

Sadly, Broad Institute's great contribution to next-generation sequencing (NGS), GATK, is not open source anymore.

GATK 2.0 is moving to a mixed open/closed-source model
The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but its unclear the timeframe for this.
Personally I dislike this move. If the reason is money, there are successful business models built around open source software without the need to leave the open source community.

Doing this sort of thing is a sacrilege for many members of our community who fight day and night for #OA. I can see RMS deeply depressed, crying in a corner, every time such a thing happens. Whatever the reason behind this change is, presumably it was a long-meditated one. So I would expect a better explanation of why the change was needed and why remaining open source was impeding that goal. What is more important, and what annoys me most because it denotes poor strategic planning for such a big change, is this: "We plan to release the source code for these tools, but its unclear the timeframe for this". Hmmm, it seems to me that the commitment to open source has vanished here. Even if you believe in OS, sometimes you are forced to make concessions to closed software, but I would expect a clear note on when and how the source code would become open again.

How come we lead a war for open access journals and against patents, yet want to make money off our software? This seems like deadly friendly fire to me, at least until I read a good, detailed explanation of the reasons behind it.

It is a good thing to have samtools around as a backup.

== follow up ==

Thursday, 15 November 2012

bioperl popularity (measured by searches) going down day by day.

According to the following figure, bioperl is losing its appeal day by day. One critique of this plot is that big bio projects such as Ensembl, biopieces, etc. are still in Perl but are not shown here. The pity is that R and biopython have very good tools/pipelines for next-generation sequencing, while bioperl and the other Perl bio projects don't.

Wednesday, 7 November 2012

how to use R introspection to obtain a data.frame from an object

A very helpful post from Charlotte Wickham on the ggplot2 mailing list about how to find out which methods to use to extract a data.frame of the data from an object.

Lecomte Jean-Baptiste
to ggplot2
Dear all,

I'm trying to plot with ggplot2 the result of the function variofit of the geoR package.
It's quite simple with the base plot:

vario100 <- variog(s100, max.dist=1)
ini.vals <- expand.grid(seq(0,1,l=5), seq(0,1,l=5))
ols <- variofit(vario100, ini=ini.vals, fix.nug=TRUE, wei="equal")
wls <- variofit(vario100, ini=ini.vals, fix.nug=TRUE)
plot(vario100)     # plot the empirical variogram first
lines(ols, lty=2)

I can plot the points of the empirical variogram, but I can't plot the line representing the fitted variogram with ggplot2




I have made a quick search on both the ggplot2 and R-sig-Geo user lists without finding any solution.
I would appreciate any advice.

Jean-Baptiste Lecomte

Charlotte Wickham

to Lecomteggplot2
Hi Jean-Baptiste,

There are a couple of problems with geom_line(wls). Firstly, ols and wls are special objects that ggplot knows nothing about, and secondly, you are passing them in as the first argument, where geom_line expects a mapping. A general solution for this type of problem is to write a function that takes the special type of object you are interested in as an argument and outputs a data.frame that you can plot with ggplot.

You know lines(wls) must at some point calculate enough info to draw the line you require, it's just a matter of finding out where.

The first step is to figure out what is happening in lines(ols):
> class(ols)
[1] "variomodel" "variofit"  

Tells us ols has class "variomodel", so our best bet is to look at lines.variomodel:

> ?lines.variomodel
> lines.variomodel
function (x, ...) 
UseMethod("lines.variomodel")
<environment: namespace:geoR>

Not very helpful... It looks like lines.variomodel is a generic function; we need to figure out which method is being called.

> methods("lines.variomodel")
[1] lines.variomodel.default*     lines.variomodel.grf*        
[3] lines.variomodel.krige.bayes* lines.variomodel.likGRF*     
[5] lines.variomodel.variofit*   

   Non-visible functions are asterisked

Looks like lines.variomodel.variofit is our candidate:

> getAnywhere(lines.variomodel.variofit)

will spit out the function, and it looks like most of it is calculating the fitted line to be used in the call to the function curve. One way to get ggplot to plot the fitted line would be to reuse that code but, instead of calling curve, output a data.frame that ggplot can use. I put my attempt here:

Source that in and try:
qplot(u,v,data=df_vario) +
 geom_line(aes(x, fit), data = fitted_variofit(ols)) +
 geom_line(aes(x, fit), data = fitted_variofit(wls), linetype = "dashed")

Hope that helps,


Friday, 19 October 2012

icon fonts, a nice way of dealing with icons/images as text

Icon Fonts

A nice way of plotting icons or small graphic elements as text

Tuesday, 9 October 2012

DBD::mysql 4.022 not passing test 80procs.t and how to FIX it

I failed to install DBD::mysql. This time I had the dev files and everything.
FAIL Installing DBD::mysql failed. See /home/pmg/.cpanm/build.log for details.
$ cpanm DBD::mysql
[...]
DBD::mysql::db do failed: alter routine command denied to user ''@'localhost' for routine 'test.testproc' at t/80procs.t line 41.
Looking at this line of code:
my $drop_proc= "DROP PROCEDURE IF EXISTS testproc";

ok $dbh->do($drop_proc);
And looking at my mysql db table, I can see that there are no privileges to alter or execute procedures:
mysql> select * from db\G
                 Host: localhost
                    Db: test
           Select_priv: Y
           Insert_priv: Y
           Update_priv: Y
           Delete_priv: Y
           Create_priv: Y
             Drop_priv: Y
            Grant_priv: N
       References_priv: Y
            Index_priv: Y
            Alter_priv: Y
 Create_tmp_table_priv: Y
      Lock_tables_priv: Y
      Create_view_priv: Y
        Show_view_priv: Y
   Create_routine_priv: Y
    Alter_routine_priv: N  # <<====
          Execute_priv: N  # <<====
            Event_priv: Y
          Trigger_priv: Y

        mysql> show grants for ''@localhost;
        | Grants for @localhost                                                                                                                                                                                       |
        | GRANT USAGE ON *.* TO ''@'localhost'                                                                                                                                                                        |
        4 rows in set (0.00 sec)

FIXING the problem: add permission for anyone from localhost to the test database

 mysql> grant ALL on test.* to ''@'localhost';
  mysql> show grants for ''@localhost;
  | Grants for @localhost                                                                                                                                                                                       |
  | GRANT USAGE ON *.* TO ''@'localhost'                                                                                                                                                                        |
  | GRANT ALL PRIVILEGES ON `test`.* TO ''@'localhost'                                                                                                                                                          |
  4 rows in set (0.00 sec)
  mysql> select * from db\G
  *************************** 1. row ***************************
                   Host: localhost
                     Db: test
            Select_priv: Y
            Insert_priv: Y
            Update_priv: Y
            Delete_priv: Y
            Create_priv: Y
              Drop_priv: Y
             Grant_priv: N
        References_priv: Y
             Index_priv: Y
             Alter_priv: Y
  Create_tmp_table_priv: Y
       Lock_tables_priv: Y
       Create_view_priv: Y
         Show_view_priv: Y
    Create_routine_priv: Y
     Alter_routine_priv: Y  ## <<<<==== OK
           Execute_priv: Y  ## <<<<==== OK
             Event_priv: Y
           Trigger_priv: Y

After this change I installed DBD::mysql and all tests passed.

Sunday, 23 September 2012

windows audiodg.exe memory leak

Windows 7 still has a memory leak in audiodg.exe. Unfortunately, I discovered it four months and one 4GB memory expansion too late. You can work around it by restarting the Windows Audio service from time to time.

I am a Linux person, but my computer was too new for Linux, so I ended up with Windows 7 Professional (even for Windows I needed to install a lot of Samsung drivers, and there are still things that do not work properly). I usually do not shut down my laptop for months (suspend to RAM works OK), but with this computer I needed to restart it once a week. When it had only 4GB it quickly ran out of memory, and Windows, in its big desire to be helpful in its own way, was closing programs at will with a polite notice ('sorry, your computer has low memory, I am going to close some programs') but without actually letting you decide. I bought another 4GB memory card, and then it took 10 days to reach 31GB of virtual memory on top of the 8GB of real memory and start complaining.

I thought this was due to my abuse of Firefox tabs (usually 150-200 open tabs) and the cache problem, plus a lot of other programs in use like Thunderbird, a Linux virtual machine with 2GB, Word, XMind, Excel, R...

Several times I had seen one process (audiodg.exe) with a lot of memory, but I always left it aside. Finally today, with all 31GB of virtual memory full and some Windows warnings, I closed Firefox and Chrome and did not see any relevant change in the memory consumed: still 29GB after 1 hour. Looking at the processes, I saw that audiodg.exe had 25GB of virtual memory allocated and 2GB of real memory in use!

A quick Google search shows that it has been well known since 2009 that audiodg.exe has a memory leak, and there is a hotfix for it (unfortunately, people say it does not work with Windows 7). Some folks solved the problem by restarting the audio service. I went to Administrative Tools / Component Services and searched for audiodg. It was not there, but there was a Windows Audio service. I restarted it, and the real memory used went from 8 to 3GB and the virtual memory from 29 to 4GB!

My computer just after restarting Windows Audio

Sunday, 16 September 2012

VCFtools 0.1.9's Perl scripts do not work with perl 5.14 or newer. Use the VCFtools svn version or this patch.

Error: a bare qw() in a for loop (for $x qw() {}) is deprecated; the list needs to be surrounded by parentheses. See my previous blog entry about this feature.

$  diff 
<         for my $key (qw(fmtA fmtG infoA infoG)) { if ( !exists($$out{$key}) ) { $$out{$key}=[] } }
>         for my $key qw(fmtA fmtG infoA infoG) { if ( !exists($$out{$key}) ) { $$out{$key}=[] } }
Now all tests pass. [update] the svn version (r779) is fixed:
for my $key (qw(fmtA fmtG infoA infoG)) { if ( !exists($$out{$key}) ) { $$out{$key}=[] } }

perl 5.14 and up will break some well-known programs! Think twice (test thrice) before using it!

It seems that at some point someone decided that a qw() list in a for loop should be surrounded by parentheses: (qw()).
Lines like
$  perl -e 'foreach $x qw(1 2 3){print $x}'
Use of qw(...) as parentheses is deprecated at -e line 1.
are not permitted any longer. People have rebelled against this and patches were proposed, but perl 5.16 still does not permit it. Some people say this was a 'feature' of the parser, not intended behaviour (see the comments in Reini's post).
In my case I switched to perl 5.14 and VCFtools did not work any more. At least I discovered it by running the test suite: 20 tests failed, all with the same error:
not ok 30 - Testing vcf-consensus .. perl -I../perl/ -MVcf ../perl/vcf-consensus  consensus.vcf.gz < consensus.fa
#   Failed test 'Testing vcf-consensus .. perl -I../perl/ -MVcf ../perl/vcf-consensus  consensus.vcf.gz < consensus.fa'
#   at test.t line 418.
#     Structures begin differing at:
#          $got->[0] = 'Use of qw(...) as parentheses is deprecated at ../perl// line 1622.
#     '
#     $expected->[0] = '>1:1-500
I am lucky to be using perlbrew, so I can switch quickly from one version of perl to another (not so lucky that I need to reinstall a lot of CPAN modules).

Test::Most not automatically installed with cpanm: some inner dependencies failed

After installing VCFtools, I went to test the Perl scripts with
perl vcftools/perl/test.t
But it failed because I needed Test::Most. I tried to install it, but it failed:
$  cpanm Test::Most
--> Working on Test::Most
Fetching ... OK
Configuring Test-Most-0.31 ... OK
==> Found dependencies: Exception::Class
--> Working on Exception::Class
Fetching ... OK
Configuring Exception-Class-1.33 ... OK
Building and testing Exception-Class-1.33 ... FAIL
! Installing Exception::Class failed. See /home/pmg/.cpanm/build.log for details.
! Bailing out the installation for Test-Most-0.31. Retry with --prompt or --force.
Looking at .cpanm/build.log I saw that Class/Data/ was missing. The dependency was not automatically picked up and installed, so I installed it by hand. Then I tried Test::Most again, and again Exception-Class failed; this time it needed Devel::StackTrace. I installed this one by hand too, and then everything went OK.
cpanm Devel::StackTrace
cpanm Test::Most
These missing dependencies are not uncommon in CPAN, but they are a bit annoying, mainly when your cpanm install list becomes large, the failed module is at the beginning, and your last line says "11 packages installed" (all of them dependencies of the package you wanted, but not the package itself). In situations like this, it is easy to miss that your module was not installed.

Wednesday, 11 April 2012

LD calculation

LD is a very blurry concept, and the ways of measuring it are very diverse. In the following paper you will be able to gain a deeper view of the problem:

Slatkin M. Linkage disequilibrium--understanding the evolutionary past and
mapping the medical future. Nat Rev Genet. 2008 Jun;9(6):477-85. Review. PubMed
PMID: 18427557.

Just in case you don't have access, here is the opening paragraph rant:
Linkage disequilibrium (LD) is one of those unfortunate terms that does not reveal its meaning. As every instructor of population genetics knows, the term is a barrier not an aid to understanding. LD means simply a nonrandom association of alleles at two or more loci, and detecting LD does not ensure either linkage or a lack of equilibrium. The term was first used in 1960 by Lewontin and Kojima [1] and it persists because LD was initially the concern of population geneticists who were not picky about terminology as long as the mathematical definition was clear. At first, there were few data with which to study LD, and its importance to evolutionary biology and human genetics was unrecognized outside of population genetics. However, interest in LD grew rapidly in the 1980s once the usefulness of LD for gene mapping became evident and large-scale surveys of closely linked loci became feasible. By then, the term was too well established to be replaced.

and some formulas, from Wikipedia:
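These are the standard definitions (a sketch from the usual population-genetics notation, which the original figure presumably showed): for two biallelic loci with alleles A/a and B/b, allele frequencies $p_A, p_a, p_B, p_b$ and haplotype frequency $p_{AB}$,

\[
D = p_{AB} - p_A\,p_B
\]
\[
D' = \frac{D}{D_{\max}}, \qquad
D_{\max} =
\begin{cases}
\min(p_A\,p_b,\; p_a\,p_B) & \text{if } D > 0\\
\min(p_A\,p_B,\; p_a\,p_b) & \text{if } D < 0
\end{cases}
\]
\[
r^2 = \frac{D^2}{p_A\,p_a\,p_B\,p_b}
\]

$D$ depends on allele frequencies, which is why the normalized $D'$ and the squared correlation $r^2$ are the measures usually reported.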

Tuesday, 3 April 2012

Calculate years between two dates with Perl

Given two dates, calculate the number of years between them.

This seems a trivial problem, BUT it is not: you could take the year of each date, subtract them, and if > 0, then if the month of the second date is smaller than the first, add -1, and if it is equal, check the day. What could possibly go wrong?

First and most important, there are zillions of date formats, some very difficult to parse with a regexp. Then comes the problem that you would like to see the difference in years forward... and backwards! And probably you would like to show the years with a decimal point, so you need to count days... and so on; at some point your script will need to report days and months too. That is life!

This is a task for the industrial-grade DateTime family of modules on CPAN.

DateTime::Format::xxx is a family of modules specialised in parsing and formatting any date format you can imagine. Then you can calculate things with Date::Calc or DateTime::Duration.

#!/usr/bin/env perl

=head1 [progam_name]

 description: Calculate years between dates (two dates or one date and current time)

=head2 First version

  a file with three columns (id, start date, end date)

  - parse the date

      | strptime($strptime_pattern, $string)
      | Given a pattern and a string this function will return a new DateTime object.
      | %F
      | Equivalent to %Y-%m-%d. (This is the ISO style date)

  - check that first date < last date

  - output years as a new column

=cut

use feature ':5.10';
use strict;
use warnings;
use DateTime::Format::Strptime;
use File::Slurp;

my $prog  = $0;
my $usage = <<EOQ;
Usage for $0:

  $prog [-test -help -verbose] file_with_dates_in_column_2_and_3
EOQ

my $has_header = 1;

my $file = shift or die $usage;

my $fmt = DateTime::Format::Strptime->new(
    pattern => '%F',    # ISO date, equivalent to %Y-%m-%d
    locale  => 'en_US',
);

# take care of windows line endings on a linux machine (need both chomp and s/\r$//)
my $dates = [map { chomp; s/\r$//; [split /\t/] } read_file($file)];

my $header = shift @$dates if $has_header;

my $line = 1;
foreach my $date_aoa (@$dates) {

    # get the DateTime objects
    my $start = $fmt->parse_datetime($date_aoa->[1]);
    my $end   = $fmt->parse_datetime($date_aoa->[2]);

    unless ($start && $end) {
        die "Error parsing id '$date_aoa->[0]' at line $line\n";
    }

    # subtracting two DateTime objects gives a DateTime::Duration object
    my $dur = $end - $start;

    $date_aoa->[3] = $dur->years;
    $line++;
}

print_result($dates);

sub print_result {
    my ($dates) = @_;
    say join("\n", map { join("\t", @$_) } @$dates);
}

Some Links:

The question and its answer are also interesting

The Many Dates and Times of Perl

Thursday, 1 March 2012

figshare: sharing your scientific results

Figshare is intended to be the place for all researchers to share their supplementary data or prepublication data so it can be reviewed. It would also be a way of sharing data between groups working on large interdisciplinary and international multigroup projects.

The only caveats that I see are:

- Only 1 GB of free space. This is nothing when talking about images and data for a lab. It is OK to pay for your own backup, but if this initiative is successful and people start using it to back their papers, you would like to have all the data there for a long time. Who is going to pay to keep the research data when you change labs or are forced to leave remunerated science?

- I hope they have easy ways of exporting your data and the generated metadata, so all the effort and time used in creating this scientific corpus on the web can be kept safe and replicated if the site disappears.

This site has the support of Nature’s sister company, Digital Science.

blog entries about figshare:

Tuesday, 28 February 2012

New edition of Chromatic's "Modern Perl" book

A new version of Chromatic's book "Modern Perl", covering Perl 5.14.

Free PDF version.

Thursday, 23 February 2012

Programming Perl, 4th edition, published

$20 for the PDF at O'Reilly
25 GBP for the paperback at Amazon (but not available until March)

Adopted as the undisputed Perl bible soon after the first edition appeared in 1991, Programming Perl is still the go-to guide for this highly practical language. Perl began life as a super-fueled text processing utility, but quickly evolved into a general purpose programming language that’s helped hundreds of thousands of programmers, system administrators, and enthusiasts, like you, get your job done.

In this much-anticipated update to "the Camel," three renowned Perl authors cover the language up to its current version, Perl 5.14, with a preview of features in the upcoming 5.16. In a world where Unicode is increasingly essential for text processing, Perl offers the best and least painful support of any major language, smoothly integrating Unicode everywhere—including in Perl’s most popular feature: regular expressions.

Important features covered by this update include:

New keywords and syntax
I/O layers and encodings
New backslash escapes
Unicode 6.0
Unicode grapheme clusters and properties
Named captures in regexes
Recursive and grammatical patterns
Expanded coverage of CPAN
Current best practices

Prof. dr. N.G. de Bruijn has passed away

In his obituary at the Korteweg-de Vries Institute for Mathematics they mention that his de Bruijn sequences have practical applications in magic tricks, but they do not mention de novo assembly :-(

Prof. dr. N.G. de Bruijn, 1918-2012

N.G. (Dick) de Bruijn passed away on 17 February 2012. He was an emeritus professor of the Technical University Eindhoven, where he was a professor of mathematics from 1960 until 1984. He came to Eindhoven from the University of Amsterdam, where he was a professor of mathematics during 1952-1960. He published in many different areas, in particular in analysis, number theory, combinatorics and logic. Famous is his book Asymptotic methods in analysis (1958). In later years he developed Automath, a computer program for automatic theorem proving. The De Bruijn sequences, named after him, are widely discussed in literature and have even applications in magic (see the recent book Magical mathematics by Persi Diaconis and Ron Graham).

For more information, see (pdf file from Kleine TUE Encyclopedie) and (MacTutor History of Mathematics biography). See also on the video of the lecture by de Bruijn during the symposium on the occasion of his ninetieth birthday.

Fermi, a new de novo assembler using FMD-index from Heng Li

This is the README:
Fermi is a de novo assembler with a particular focus on assembling Illumina
short sequence reads from a mammal-sized genome. In addition to the role of a
typical assembler, fermi also aims to preserve heterozygotes which are often
collapsed by other assemblers. Its ultimate goal is to find a minimal set of
unitigs to represent all the information in raw reads.

Fermi follows the overlap-layout-consensus paradigm and uses the FM-DNA-index
(FMD-index) as the key data structure. It is inspired by the string graph
assembler (Simpson and Durbin, 2010 and 2012) and has a similar workflow.

As a typical de novo assembler, fermi tends to produce contigs with slightly
longer N50. However, the major weakness of fermi is the high misassembly rate.
Although fermi provides a tool to fix misassemblies by using paired-end reads
to achieve an accuracy comparable to other assemblers, this is not a favorable

Fermi is designed to be used on a multi-core Linux machine with large shared
memory. The easiest way to run fermi is to use the script. It
generates a Makefile. The actual assembly is done by invoking make. Premature
assembly processes can be resumed. Here is an example:

  -dAPe ./fermi -p NA12878 -t16 -f18 reads*.fq.gz > NA12878.mak
  make -f NA12878.mak -j16

This asks fermi to use 16 CPUs to assemble paired-end reads and will produce
raw unitigs in file "NA12878.ext.mag.gz". Graph cleaning and misassembly fixing
need to be applied manually. Future version of fermi will integrate these
steps into the script. It takes about a wall-clock week and a peak
memory of 85GB to assemble 35X human reads with 16 CPUs.

If you have any questions, ask me at .

Wednesday, 22 February 2012

Fixing disabled Atheros wifi after Windows 7 hibernation

While I am waiting for my new computer, a coworker lent me an Acer Aspire 5535 (4 GB RAM, AMD Turion X2 64). This is a cheap but nice machine when using Windows 7 (without fancy decorations or animations) and a virtualized Ubuntu-LXDE under VirtualBox.

I enjoyed the machine until I hibernated it. When it started up again the joy disappeared: the wifi was not working (the wifi button had some sporadic flashes but no network at all). I restarted the machine and the wifi came back without a problem. The coworker knew the problem and had lived with it for more than 3 months (a Windows update created it). The owner's solution (a typical Windows person) when there is no wifi after hibernation: restart. Problem, what problem? "Restart Windows and it will work fine," she said to me. But I cannot live without proper hibernation. It is not only the booting time; there is also the programs' state, the 60 Firefox tabs that I don't want to reload again, the PDFs that I am reading or using as reference, etc.

PROBLEM: when Windows 7 suspends to RAM or hibernates, it shuts down the wifi card, but then it is unable to activate it again after restoring.

As a Linux person I am used to ifdown;ifup for resurrecting a collapsed wifi card/driver. In the old Windows XP I believe I used some kind of network restart, but I was unable to find it in Windows 7. After a lot of googling and reading a lot of unhelpful advice, 90% of it giving "reboot the computer" as the solution (have any of them understood that if you hibernate it is because you don't want to shut down!?), I found a solution that avoids the reboot:

# hit win-R
# type 
# at the bottom of the list you will find "WLAN AutoConfig"
# click on it and select 'stop'; wait, then click again and select 'start'

This will activate the wifi again.
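The same stop/start cycle can be scripted from an elevated command prompt (cmd run as administrator); Wlansvc is the service name behind "WLAN AutoConfig" on Windows 7:

```shell
:: Restart the WLAN AutoConfig service from an elevated cmd prompt
net stop Wlansvc
net start Wlansvc
```

Put those two lines in a .bat file and the fix is one double-click after each resume.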

Also, in order to prevent this from happening again, I am disabling the automatic turning off of the wifi device by the system. I went to Control Panel -> Network and Sharing Center -> Manage wireless networks -> adapter properties -> Configure -> Power Management and unticked the option "Allow the computer to turn off this device to save power".

Tuesday, 21 February 2012

minimal server with VirtualBox: remember to install linux-headers

I have a minimal Ubuntu Linux server inside a Windows machine, created with VirtualBox. But I was unable to install the Guest Additions:

The headers for the current running kernel were not found. If the following module compilation fails then this could be the reason

As I did not know exactly the names of the packages to install, I did some googling to see if more people had had the same problem, and I found that a quick solution is to select the installation of the generic vbox guest additions in your package manager and see which dependencies you need to install.

$ sudo aptitude install virtualbox-ose-dkms virtualbox-ose-guest-dkms virtualbox-ose-guest-utils virtualbox-ose-guest-x11

The following NEW packages will be installed:
  dkms{a} linux-headers-2.6.32-38{a} linux-headers-2.6.32-38-generic{a} linux-headers-generic{a} 
  virtualbox-ose-dkms virtualbox-ose-guest-dkms virtualbox-ose-guest-utils virtualbox-ose-guest-x11 

$ sudo aptitude install dkms linux-headers-2.6.32-38 linux-headers-2.6.32-38-generic linux-headers-generic

# now go to the GuestAdditions CD and 
ubuntu:/media/VBOXADDITIONS_4.1.8_75467$ sudo bash 

Monday, 20 February 2012

2012 will be a busy year for the NGS business

First was the Ion Proton, then Illumina introduced the HiSeq 2500 last month.

Now it is the turn of Oxford Nanopore with the GridION and MinION:

DNA sequencing: Oxford Nanopore

Monday, 6 February 2012

Converting your Perl module documentation into Confluence

Short story

perldoc -u | pod2wiki -s confluence > myConfluenceDoc

The key point is to use -u to print out the original POD source (without any formatting).

Probably it is possible to use perldoc -owiki, but I do not know yet how to pass the -s confluence option. I will look at that later.

Long story

Today I was documenting some of my Perl modules in our Confluence wiki when I decided to paste the POD. First I tried the HTML output, copying it between {html}{html} directives, but as the .css was not created, the blocks were not boxed, and I wanted them boxed.
HTML output
perldoc -ohtml
Then I produced the output as text and used a one-liner to add some Confluence markup,
but this was not complete and needed some hand curation.
So I searched CPAN and found that Confluence is supported by Pod::Simple::Wiki, which also gives you a script for that: pod2wiki.
pod2wiki accepts POD as input from STDIN, nice, so I printed the POD and piped it to pod2wiki: perldoc | pod2wiki -s confluence
NO OUTPUT!!
Today I was a bit slow, and it took me some time to realize that perldoc does not print raw POD but man-formatted POD. I did perldoc perldoc to find out how to print the raw POD of the document: option -u
Confluence output
perldoc -u | pod2wiki -s confluence > myConfluenceDoc