Saturday, 17 November 2012

Nice positional sequencing hack in an Illumina flowcell by Shendure's lab

Another interesting paper from Shendure's lab

Proc Natl Acad Sci U S A. 2012 Nov 13;109(46):18749-54. doi: 10.1073/pnas.1202680109. Epub 2012 Oct 29.

Capturing native long-range contiguity by in situ library construction and optical sequencing.


Department of Genome Sciences, University of Washington, Seattle, WA 98195.


The relatively short read lengths associated with the most cost-effective DNA sequencing technologies have limited their use in de novo genome assembly, structural variation detection, and haplotype-resolved genome sequencing. Consequently, there is a strong need for methods that capture various scales of contiguity information at a throughput commensurate with the current scale of massively parallel sequencing. We propose in situ library construction and optical sequencing on the flow cells of currently available massively parallel sequencing platforms as an efficient means of capturing both contiguity information and primary sequence with a single technology. In this proof-of-concept study, we demonstrate basic feasibility by generating >30,000 Escherichia coli paired-end reads separated by 1, 2, or 3 kb using in situ library construction on standard Illumina flow cells. We also show that it is possible to stretch single molecules ranging from 3 to 8 kb on the surface of a flow cell before in situ library construction, thereby enabling the production of clusters whose physical relationship to one another on the flow cell is related to genomic distance.

Looking for a Perl tool like IPython notebook.

I have just recently discovered IPython notebooks[1]. They are fantastic and, like R's Sweave[2], they are essential for reproducible research.

I use Perl and PDL for bioinformatics research. Since PDL is a scientific tool, this kind of reproducible-research utility seems a perfect match.

Some time ago the bio(perl|python) communities started looking at an approach similar to Sweave, but I have not seen much progress on that. A cheap alternative (if your OS is called emacs ;-)) is org-mode Babel[3]. I like Babel because I almost always work over an ssh connection to my servers, inside a screen session running emacs -nw. Babel could be a substitute for Sweave, but the IPython notebook is a more advanced and interactive beast. You can see this fantastic presentation by Fernando Pérez about scientific Python[4], where he demonstrates its powers.

I do not yet know of any Perl tool similar to the IPython notebook, but if one exists I would like to find it and use it.

Any comments on this topic, and links, would be more than welcome.

Key points summary in the GATK licence change

Recently I posted about GATK's change of licence: GATK 2.0 drops the MIT licence in favour of a commercial one.

There is a page at gatkforums with the thread generated by the licence change. First of all, I am very glad to see scientists discussing with arguments and opinions instead of trolling and insulting. In a computing forum this would have generated a rather impolite flame war. I am very pleased that the Holocaust and Hitler have not been mentioned yet, and it seems they never will be ;-). Kudos to the community.

From this thread it seems that the licence change was made to prevent some companies from making money by selling GATK analyses while the lab "producing" GATK is, like the rest of us, desperately trying year after year to secure the funding needed to keep its projects going in this time of big cuts in science. That is understandable, but it raises another debate: when do tools like GATK, the SAMtools pipeline, the Tuxedo suite, etc. become mature enough to fly solo and pass from scientific project to production software? When this happens, should they be funded by scientific funds or by industry funds?

The key to the debate here is rather the lab's lack of expertise in licensing and its commercial and restrictive implications, as the licence terms were ambiguous and not clear enough in some passages.

[dePristo reply in the blog explaining the two licences]

Hi Pepetideo,

You can share your GATK results -- that was a language slip up on our part. You can see it's clear in the license and FAQ now.

As for why, it's twofold. One is to ensure that we can continue to develop and support the GATK into the future by creating a sustainable revenue source for the team. Two is that a commercial version will be able to support a large team providing tier-one support, such as long-term maintenance of specific GATK versions, which my research group simply cannot provide. Note that any commercial entity who wants to stay with GATK-Lite can go the full open source route, at the cost of foregoing premium support and access to the best possible tools.

We recognize that this is a change, and of course we are big supporters of open source software -- the vast majority of the GATK2 is open source. We considered creating a "GATK foundation" Mozilla style, accepting micro donations, or even providing pay-for-services on top of the GATK but ultimately the commercial/non-commercial divide seemed the option that provides the most value to the entire community.

    Purchasing a commercial GATK2 license will give you the right to run the GATK2 within the company and share / publish / etc your results. This is what I'd think of as a standard commercial license, and most places would fit in this bucket. The example here is buying Adobe Photoshop and using it in house to manage and edit photos.

    The more complex question is around third-party pipeline executors, which only take in data from others and who effectively sell the running of the GATK. Here I think there will be a separate license with specific terms, but it's something we'd like to enable. The analogy here is setting up a for-profit web portal for photo editing that backends to photoshop. A valuable activity but one not covered by the standard end-user license agreement.
[some answers to that]
August 2

From this side of the pond the Wellcome Trust Sanger Institute does have a policy on the software we develop here, it has to be open sourced and under specified licenses. This is in harmony with our policy that research funded by Wellcome Trust money must be published in an open access journal.

I can't speak for the institute itself but I have a sinking feeling this decision will spark a lot of debate. The concern this gives me and what I intend to find out about, is how this will interact with any collaborations on or contributions we wish to make to GATK. UK charity law is quite tight on what kind of profit making activities charities can take part in so it may involve lawyer time.

I do understand the Broad's point of view, people are making money on software that the Broad has invested money in producing and the Broad is not getting a cut from it. Ideally the way they'd pay it forward would be to contribute testing time and improvements back, but in practice I imagine quite a few are taking a free ride. That said companies take a free ride on most of the research we do, it's just harder to make money from most of it though. This whole debate does bring the name Celera to mind though.
August 3

Companies pay taxes too. Some additionally indirectly support the development of tools such as the GATK through academic collaborations. I guess the way you are framing the question though, it comes down to this: taxpayers paid for the development of the GATK. Most taxpayers aren't doing genetic research. So what will benefit the taxpayers most? (Should companies be paying for the reference genome sequence? For SNP databases? Where do we draw the line?) The reason the government (taxpayers) invest in basic research is to stimulate the downstream discovery. Help us translate research into helping patients!

I even take exception to the phrase "If the time comes when they're asked to contribute back and they don't, then yes they are leeches." Who is asking to contribute back? Not the people who paid for the development in the first place (NIH, Eli & Edythe Broad, Harvard, MIT)! The GATK became widely used not only because it was good (it is), and not only because it was free (it was), but because of a huge investment from other projects (most notably the 1000 genomes project, but others as well) that got it free publicity and turned it into a de facto standard. It's hard to compete when the GATK team has earlier access to taxpayer-funded projects/data/sequencing, and guaranteed publications when these projects come out. Also, gaining market share and then raising prices sounds more capitalist than Marxist, to reference an earlier comment.

I don't mean to go on a tirade; I guess I just feel strongly about this. I definitely understand where you are coming from; I too have written popular open-source software that cost me personally plenty of time of support, and meanwhile surely helped some for-profit entities do research (I hope!) and one for-profit company in particular have success. Nonetheless, I knew it was my duty to share the software freely. Also please know that I say all this with the utmost respect for the whole GATK team (most of which I think I know--I don't think we've had the pleasure of meeting though, Geraldine). You guys are doing a great job, and it's wonderful that taxpayers have been able to fund the development of (documented, supported) academic software. I'd be happy to take everyone out for drinks after work some day and thank you personally; justifying paying for the software is difficult.

*Note: these views are my own, and do not necessarily reflect those of my company or colleagues.
August 3

Hi all,

I want to chime in with three clarifying points:

    We don't yet know the pricing scheme, but we are keenly aware of the complications of per-use licensing as TechnicalVault brings up

    Overall I want everyone to tone down the moral issue surrounding commercial licensing. The discussion of moral issues, extracting of rents, leeching off taxpayers, are all counterproductive to helping understand what we have decided and the best path forward. All of this is just business, after all.

    The NIH is clear that when funding basic research the supported IP is generally owned by the developing institution (I'm sure there are exceptions), and this is true for software in general. The only key software restriction I know of is that the software must be made available to federal employees upon request. The reason for this policy is obvious -- it would be extremely difficult to accept the trade-off in a grant with IP ownership if you are creating high-value IP. Even federal SBIR grants spell out clearly that the government does not own any IP associated with the support. The federal granting system is to foster innovation, not to own innovation. It's a subtle difference but important for anyone creating real IP value with federal money in the US.

    Many users of the GATK would like a much higher level of support than the Broad institute could possibly provide, as this is off track for the Broad's mission to transform medicine through genomics. We believe that having a commercial license for the GATK will allow us to actually deliver on this superior support and continue to grow the GATK as a reliable standard for NGS analysis in the commercial sphere and beyond. Without a commercial version we simply cannot follow through on this opportunity.

    We attempted to make the GATK easy for others to contribute code to, but our experiences in this area have been disappointing. Many people use the GATK for developing tools -- and we are committed to ensuring the programming framework and libraries remain MIT licensed -- we have had little contribution over 3 years to the master codebase from independent third-parties. Certainly some of our collaborators have contributed impressive tools and extensions, but again they aren't really independent. There's a good wikipedia article on the experiences of mySQL similar to this, and they release with a dual-license similar to our approach. Still though I'd like to release the source code to all of the tools -- if we can find a way consistent with the commercial license -- for transparency to the community and to allow others to contribute, in so far as they like.

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
But people were still confused two months later
TechnicalVault
October 17

Hi Mark, I have a couple of questions stemming from the FAQ posted by your new commercial partners:

    Regarding, "Why not stay with GATKLite?" According to the FAQ at your new partner's site "Broad has indicated that GATK-lite tools will soon be obsolete, and it plans to stop supporting the tools by the end of 2012." Can you confirm exactly what this means please? Is it all of GATK-lite which will be dropped or just tools which have been replaced by new ones from GATK2?

    "Use by a not-for-profit organization to generate revenue requires a commercial license", can you clarify what this means? For example providing sequencing services to other academic institutions generates revenue, however it is usually done at cost so does not generate profit.

    If a not-for-profit was interested in buying support, but not in buying a commercial license is there an option for this? Who would it be with?

    Finally who will be the final arbiter of usage terms? Does that remain with the Broad or have you signed enforcement over to your partners?

Thank you for all your hard work
October 17

Hi Martin,

    1. CORRECTION: The wording in the FAQ was incorrect due to a miscommunication. We will in fact continue providing support for GATK-Lite tools indefinitely, and although we will eventually stop providing a separate build (jar file), the GATK-Lite codebase will remain publicly available on our open source GitHub repository. In addition, tools from GATK 2 will be migrated into the GATK-Lite codebase over time.
    2. and 3. Please direct these questions to our partner, Appistry. They will be able to tell you based on your specific circumstances. They have a contact form that you can use, and within a few days they will also have a discussion forum that you can use for this purpose.
    3. See above.
    4. I believe Mark will answer that for you, or direct you to Issi Rozen here at the Broad, if you want an answer from our side. Otherwise I expect the Appistry people should be able to answer that as well.

Friday, 16 November 2012

GATK 2.0 drops MIT licence

Sadly, the Broad Institute's great contribution to next-generation sequencing (NGS), GATK, is not open source anymore.

GATK 2.0 is moving to a mixed open/closed-source model
The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but its unclear the timeframe for this.
Personally I dislike this move. If the reason is money, there are successful business models built around open source software that do not require leaving the open source community.

Doing this sort of thing is sacrilege for many members of our community who fight day and night for #OA. I can see RMS deeply depressed and crying in a corner every time such a thing happens. Whatever the reason behind this change, it has presumably been long considered. So I would expect a better explanation of why the change was needed and why remaining open source was impeding that goal. More importantly, the thing that annoys me most, and that denotes poor strategic planning for such a big change, is: "We plan to release the source code for these tools, but its unclear the timeframe for this". Hmm, it seems to me that the commitment to open source has vanished here. Even if you believe in open source, sometimes you are forced to make concessions to closed software, but I would expect a clear statement of when and how the source code will be open again.

How can we wage a war for open access journals and against patents and yet want to make money from our software? This looks like deadly friendly fire to me, at least until I read a good, detailed explanation of the reasons behind it.

It is a good thing to have samtools around as a backup.

== follow up ==

Thursday, 15 November 2012

bioperl popularity (measured by searches) going down day by day.

According to the following figure, bioperl is losing its appeal day by day. One criticism of this plot is that big bio projects such as Ensembl, biopieces, etc. are still written in Perl but are not shown here. The pity is that R and Biopython have very good tools/pipelines for next-generation sequencing, while bioperl and the other Perl bio projects do not.

Wednesday, 7 November 2012

how to use R introspection for obtaining a data.frame from an object

A very helpful post by Charlotte Wickham on the ggplot2 mailing list about how to find out which methods to use to extract a data.frame with the data from an object.

Lecomte Jean-Baptiste
Dear all,

I'm trying to plot with ggplot2 the result of the function variofit of the geoR package.
It's quite simple with the base plot:

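library(geoR)  # provides variog(), variofit() and the s100 example data
vario100 <- variog(s100, max.dist=1)  # empirical variogram
ini.vals <- expand.grid(seq(0,1,l=5), seq(0,1,l=5))  # grid of initial values
ols <- variofit(vario100, ini=ini.vals, fix.nug=TRUE, wei="equal")  # ordinary least squares fit
wls <- variofit(vario100, ini=ini.vals, fix.nug=TRUE)  # weighted least squares fit
plot(vario100)  # base-graphics plot of the empirical variogram
lines(wls)
lines(ols, lty=2)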

I can plot the points of the empirical variogram, but I can't plot the line representing the fitted variogram with ggplot2




I have made a quick search on both the ggplot2 and R-sig-Geo user lists without finding any solution.
I would appreciate any advice.

Jean-Baptiste Lecomte

Charlotte Wickham
Hi Jean-Baptiste,

There are a couple of problems with geom_line(wls).  Firstly ols and wls are special objects that ggplot knows nothing about, and secondly you are passing them in as the first argument where geom_line expects a mapping.  A general solution for this type of problem is to write a function that takes the special type of object you are interested in as an argument, and outputs a data.frame you can use ggplot to plot.
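
As a minimal sketch of that general solution, the snippet below uses a made-up `toyfit` class (not anything from geoR; `make_toyfit()` and `fitted_toyfit()` are hypothetical names chosen for illustration). The converter evaluates the fitted curve on a grid and returns a plain data.frame:

```r
# Hypothetical example: 'toyfit', make_toyfit() and fitted_toyfit() are
# illustrative names, not part of geoR or ggplot2.

# A toy "special object": a saturating curve described by two parameters.
make_toyfit <- function(sill, range, max.dist) {
  structure(list(sill = sill, range = range, max.dist = max.dist),
            class = "toyfit")
}

# Converter: takes the special object and returns a data.frame of curve points.
fitted_toyfit <- function(obj, n = 101) {
  stopifnot(inherits(obj, "toyfit"))
  x <- seq(0, obj$max.dist, length.out = n)
  fit <- obj$sill * (1 - exp(-x / obj$range))
  data.frame(x = x, fit = fit)
}

df <- fitted_toyfit(make_toyfit(sill = 1, range = 0.25, max.dist = 1))
head(df)
```

With ggplot2 loaded, a layer such as `geom_line(aes(x, fit), data = df)` could then be added to a plot like any other data.frame-backed layer.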

You know lines(wls) must at some point calculate enough info to draw the line you require, it's just a matter of finding out where.

The first step is to figure out what is happening in lines(ols):
> class(ols)
[1] "variomodel" "variofit"  

Tells us ols has class "variomodel", so our best bet is to look at lines.variomodel:

> ?lines.variomodel
> lines.variomodel
function (x, ...) 
<environment: namespace:geoR>

Not very helpful... looks like lines.variomodel is a generic function, so we need to figure out which method is being called.

> methods("lines.variomodel")
[1] lines.variomodel.default*     lines.variomodel.grf*        
[3] lines.variomodel.krige.bayes* lines.variomodel.likGRF*     
[5] lines.variomodel.variofit*   

   Non-visible functions are asterisked

Looks like lines.variomodel.variofit is our candidate:

will spit out the function, and it looks like most of it is calculating the fitted line to be used in the call to the function curve.  One way to get ggplot to plot the fitted line would be to reuse that code but instead of calling curve, outputting a data.frame that ggplot can use.  I put my attempt here:

Source that in and try:
qplot(u,v,data=df_vario) +
 geom_line(aes(x, fit), data = fitted_variofit(ols)) +
 geom_line(aes(x, fit), data = fitted_variofit(wls), linetype = "dashed")

Hope that helps,
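
The same detective work can be tried on any S3 object without installing geoR. The sketch below is only an illustration under that assumption: it uses an `lm` fit on the built-in `cars` data as a stand-in, and the variable names (`fit`, `print_lm`, `found`) are my own choices.

```r
# Base-R sketch of the introspection steps, using an lm fit as a stand-in
# for the geoR object.
fit <- lm(dist ~ speed, data = cars)

class(fit)                       # which class drives S3 dispatch?
methods(class = class(fit)[1])   # S3 methods registered for that class

# getS3method() and getAnywhere() retrieve a method's definition whether or
# not it is exported from its package's namespace.
print_lm <- getS3method("print", "lm")
found <- getAnywhere("print.lm")

is.function(print_lm)
found$where                      # where the definition was located
```

Printing `print_lm` (or `found`) shows the source of the method, which is the place to look for the calculation that could be reused to build a data.frame.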