Monday, 13 May 2013

Issues Installing R package ncdf

Just for future reference:

Short story:
Before installing ncdf in R, install libnetcdf-dev and netcdf-bin on your Linux system.

Long story:

I found a problem installing the R package ncdf:

> install.packages('ncdf')

        |checking whether we are using the GNU C compiler... yes
        |checking whether gcc -std=gnu99 accepts -g... yes
        |checking for gcc -std=gnu99 option to accept ISO C89... none needed
        |checking how to run the C preprocessor... gcc -std=gnu99 -E
        |checking for grep that handles long lines and -e... /bin/grep
        |checking for egrep... /bin/grep -E
        |checking for ANSI C header files... no
        |checking for sys/types.h... no
        |checking for sys/stat.h... no
        |checking for stdlib.h... no
        |checking for string.h... no
        |checking for memory.h... no
        |checking for strings.h... no
        |checking for inttypes.h... no
        |checking for stdint.h... no
        |checking for unistd.h... no
        |checking netcdf.h usability... no
        |checking netcdf.h presence... no
        |checking for netcdf.h... no
        |configure: error: netcdf header netcdf.h not found
        |ERROR: configuration failed for package ‘ncdf’
        |* removing ‘/usr/local/lib/R/site-library/ncdf’


Well, I thought, this is the classic "install the -devel package first" problem. But alas, I already had netcdf.h in /usr/include, and I had already installed the libnetcdf-dev package on my biolinux-7 (Ubuntu 12.04).

I went to the tmp folder where the R package had been downloaded and tried to install it by hand. I read the INSTALL file and tried setting environment variables:


The location of the netcdf library can be specified by the environment
variable NETCDF_LIB or the configure argument --with-netcdf-lib.

or passing flags to ./configure:

LDFLAGS=-L/usr/lib CPPFLAGS=-I/usr/include sh configure

Nothing was working.

Finally I remembered that the beginning of the INSTALL file mentions the program nc-config, a helper the installer uses to find the location of the dev files. I did not have it installed and the file says it is optional, so I had skipped that part during my first attempts. After failing several times to pass the paths of the dev files by hand, I decided to try the nc-config route. I searched for it, and at http://cirrus.ucsd.edu/~pierce/ncdf/install_ncdf_v4.html I read:

The only issue is getting the libraries right. All the libraries used by the netcdf library, version 4, must be visible to the R software. To get a list of what libraries must be visible, run "nc-config --libs" (note: a recent version of the netcdf library, version 4, must be installed correctly for this command to work).


But I did not have nc-config:

root@pmg-analysis:/tmp/RtmpiNeIMZ/downloaded_packages/ncdf# nc-config --libs
The program 'nc-config' is currently not installed. You can install it by typing:
apt-get install netcdf-bin

So I installed it:
root@pmg-analysis:/tmp/RtmpiNeIMZ/downloaded_packages/ncdf# apt-get install netcdf-bin

And then, knowing that the R package ncdf uses this program to find the netcdf headers and libs, I ran install.packages in R:

> install.packages('ncdf')
Installing package into ‘/home/pmg/R/x86_64-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.univ-paris1.fr/src/contrib/ncdf_1.6.6.tar.gz'
Content type 'application/x-gzip' length 79403 bytes (77 Kb)
opened URL
==================================================
downloaded 77 Kb

* installing *source* package ‘ncdf’ ...
** package ‘ncdf’ successfully unpacked and MD5 sums checked
checking for nc-config... /usr/bin/nc-config
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf.c -o ncdf.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf2.c -o ncdf2.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/include     -fpic  -O3 -pipe  -g  -c ncdf3.c -o ncdf3.o
gcc -std=gnu99 -shared -o ncdf.so ncdf.o ncdf2.o ncdf3.o -L/usr/lib -lnetcdf -L/usr/lib/R/lib -lR
installing to /home/pmg/R/x86_64-pc-linux-gnu-library/3.0/ncdf/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (ncdf)

Job done!!
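
For next time, both prerequisites can be checked before starting R. This is only a minimal sketch, assuming a Debian/Ubuntu system like the one above. The function names have_cmd and check_ncdf_prereqs are my own (they are not part of ncdf), and the default header directory /usr/include is simply where netcdf.h lived on this machine.

```shell
#!/bin/sh
# Hypothetical helper (not part of ncdf): succeed if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Check the two things the ncdf configure script looked for in the logs above:
# the nc-config helper and the netcdf.h header.
# Optional argument: directory expected to hold netcdf.h
# (defaults to /usr/include, an assumption taken from this post).
check_ncdf_prereqs() {
    header_dir="${1:-/usr/include}"
    rc=0
    if ! have_cmd nc-config; then
        echo "nc-config not found: try 'apt-get install netcdf-bin'"
        rc=1
    fi
    if [ ! -f "$header_dir/netcdf.h" ]; then
        echo "netcdf.h not found in $header_dir: try 'apt-get install libnetcdf-dev'"
        rc=1
    fi
    return "$rc"
}
```

Run check_ncdf_prereqs in a shell first. If it prints nothing and returns 0, install.packages('ncdf') should at least get past the configure step that failed above.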

Sunday, 10 February 2013

the perils of Excel: anyone can do it



A nice story about the perils of using Excel when you don't know what you are doing. Fools rush in where angels fear to trade.

http://baselinescenario.com/2013/02/09/the-importance-of-excel/
[...]The new model “operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” The internal Model Review Group identified this problem as well as a few others, but approved the model, while saying that it should be automated and another significant flaw should be fixed.** After the London Whale trade blew up, the Model Review Group discovered that the model had not been automated and found several other errors. Most spectacularly,
“After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR . . .”
[...]
But while Excel the program is reasonably robust, the spreadsheets that people create with Excel are incredibly fragile. There is no way to trace where your data come from, there’s no audit trail (so you can overtype numbers and not know it), and there’s no easy way to test spreadsheets, for starters. The biggest problem is that anyone can create Excel spreadsheets—badly. Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way.***
[...]
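
The arithmetic behind that "factor of two" is easy to check. Here is a minimal sketch with awk, using made-up rates of 100 and 110. The function name rel_change is mine, and it assumes the two rates differ (otherwise the ratio is 0/0).

```shell
#!/bin/sh
# Relative change of a rate, computed two ways.
# usage: rel_change OLD NEW   (hypothetical helper; OLD and NEW must differ)
rel_change() {
    awk -v old="$1" -v new="$2" 'BEGIN {
        diff   = new - old
        by_sum = diff / (old + new)          # what the spreadsheet did
        by_avg = diff / ((old + new) / 2)    # what the modeler intended
        printf "by_sum=%.4f by_avg=%.4f ratio=%.1f\n", by_sum, by_avg, by_avg / by_sum
    }'
}
```

For example, rel_change 100 110 prints by_sum=0.0476 by_avg=0.0952 ratio=2.0. Because the average is half the sum, dividing by the sum always gives half of the intended value.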

The importance of logarithmic transformation in 'natural' data


Reading Edward Tufte's book about data analysis in politics and policy:

http://www.edwardtufte.com/tufte/dapp/

Edward Tufte is one of the gurus of data visualization [0], and in this chapter [1] he shows, in a very clear and didactic way, the importance of logarithmic transformation for data of naturally occurring counts.
   [0] http://en.wikipedia.org/wiki/Edward_Tufte
   [1] http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003uF

The importance of logarithmic transformation


http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003uF

This is a very clear and didactic explanation of the importance of logarithmic transformation that anyone doing data analysis in the natural sciences or epidemiology must read.

A very important point it raises is that regression analysis of a model DOES NOT TEST the relationship but SHOWS the proportionality GIVEN THE MODEL BEING TRUE.


The last part of this section has a bit more mathematics, which some biologists have probably already forgotten, but it is worth reading anyway.

I truly recommend reading this even though it is a very old book (ed. 1976).

Final note: remember to add 1 to your data before the log transform in order to avoid log(0). Don't do that if you have negative numbers ;-). Another option is to add a small quantity to all your 0s.
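
A minimal sketch of the log(x + 1) transform with awk, for a column of counts on standard input. The function name log1p_counts is mine. awk's log() is the natural logarithm, and negative input is rejected rather than transformed.

```shell
#!/bin/sh
# Natural-log transform of counts after adding 1, so that a count of 0 maps to 0.
# Reads one number per line on stdin; stops with an error on a negative value.
# (Hypothetical helper name, for illustration only.)
log1p_counts() {
    awk '{
        if ($1 < 0) {
            print "negative value: " $1 > "/dev/stderr"
            exit 1
        }
        printf "%.4f\n", log($1 + 1)
    }'
}
```

For example, printf '0\n1\n9\n' | log1p_counts prints 0.0000, 0.6931 and 2.3026, one per line.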

Saturday, 9 February 2013

BioPerl is thinking about becoming more practical and adaptive


There have been a good number of threads on the BioPerl mailing list [0] in the last week about how to make BioPerl better suited to current times.

  [0] http://thread.gmane.org/gmane.comp.lang.perl.bio.general

I like George Hartzell's remark about not being able to move forward because we need to support Perl 5.8:

But why should the all-volunteer BioPerl community be stuck supporting
code from 12 years ago because it's cost effective for someone else to
avoid spending *their* $/time/people to stay up to date.

And the links to the discussion:
Next BioPerl release : http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26348
dependencies on perl version : http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26344
BioPerl future :  http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26394
removing packages from bioperl-live: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/26341

Friday, 25 January 2013

my emacs24 needs (require 'appt) in .emacs

I have updated to Emacs 24, but my .emacs file, which was working OK with 23, does not work anymore:
Warning (initialization): An error occurred while loading `/home/pmg/.emacs':

Symbol's function definition is void: appt-make-list
I had this entry in my .emacs:
(add-hook 'diary-hook 'appt-make-list)
And from this mailing list entry I found the solution: I need to add (require 'appt).
;; from the emacs wiki -> calendar and diary
;; M-x calendar -> d (show day) i d (insert task)
(require 'appt)
(add-hook 'diary-hook 'appt-make-list)
(diary)

And now everything works OK. Why didn't I need it before? I don't have a clue.

Saturday, 17 November 2012

Nice positional sequencing hack in an illumina flowcell by Shendure's lab

Another interesting paper from Shendure's lab:



Proc Natl Acad Sci U S A. 2012 Nov 13;109(46):18749-54. doi: 10.1073/pnas.1202680109. Epub 2012 Oct 29.

Capturing native long-range contiguity by in situ library construction and optical sequencing.

Source

Department of Genome Sciences, University of Washington, Seattle, WA 98195.

Abstract

The relatively short read lengths associated with the most cost-effective DNA sequencing technologies have limited their use in de novo genome assembly, structural variation detection, and haplotype-resolved genome sequencing. Consequently, there is a strong need for methods that capture various scales of contiguity information at a throughput commensurate with the current scale of massively parallel sequencing. We propose in situ library construction and optical sequencing on the flow cells of currently available massively parallel sequencing platforms as an efficient means of capturing both contiguity information and primary sequence with a single technology. In this proof-of-concept study, we demonstrate basic feasibility by generating >30,000 Escherichia coli paired-end reads separated by 1, 2, or 3 kb using in situ library construction on standard Illumina flow cells. We also show that it is possible to stretch single molecules ranging from 3 to 8 kb on the surface of a flow cell before in situ library construction, thereby enabling the production of clusters whose physical relationship to one another on the flow cell is related to genomic distance.
PMID: 23112150 [PubMed - in process]


Looking for a Perl tool like IPython notebook.



I have only recently discovered IPython notebooks [1]. They are fantastic and, like R Sweave [2], essential for reproducible research.

I use Perl and PDL for bioinformatics research. Since PDL is a scientific tool, this kind of reproducible research utility seems a perfect match for it.

Some time ago the bio(perl|python) communities started looking into an approach similar to Sweave, but I have not seen much progress on that. A cheap alternative (if your OS is called emacs ;-)) is to use org-mode babel [3]. I like babel because I almost always work over an ssh connection to my servers, inside a screen session running emacs -nw. Babel could be a substitute for Sweave, but the IPython notebook is a more advanced and interactive beast. You can see this fantastic presentation by Fernando Pérez about scientific Python [4], where he demonstrates its power.

I don't yet know of any Perl tool similar to IPython notebooks, but if one exists I would like to find it and use it.

Any comments and links on this topic would be more than welcome.

Key points summary in the GATK licence change


Recently I posted about GATK changing its licence to a commercial one, dropping the MIT licence for GATK 2.0.

There is a page at gatkforums with the thread generated by the licence change. First of all, I am very glad to see scientists discussing with arguments and opinions instead of trolling and insulting. In a computing forum this would have generated a rather impolite flame war. I am very pleased that the Holocaust and Hitler have not been mentioned yet, and it seems they never will be ;-). Kudos to the community.

From this thread it seems that the licence change was made to prevent some companies from making money selling GATK analyses while the lab "producing" GATK is, year after year and like the rest of us, desperately trying to secure funding to keep its projects going in this time of big cuts to science. That is understandable, but it raises another debate: when are tools like GATK, the SAMtools pipeline, the Tuxedo suite, etc. mature enough to fly solo and move from being a scientific project to being a software product? When this happens, should they be funded by scientific funds or by industry funds?

The key to the debate here is more the lab's lack of expertise in licensing and in its commercial and restriction implications, as the licence terms were ambiguous and not clear enough in some passages.


[dePristo reply in the blog explaining the two licences]


Hi Pepetideo,

You can share your GATK results -- that was a language slip up on our part. You can see it's clear in the license and FAQ now.

As for why, its two fold. One is to ensure that we can continue to develop and support the GATK into the future by creating a sustainable revenue source for the team. Two is that a commercial version will be able to support a large team providing tier-one support, such as long-term maintenance of specific GATK versions, which my research group simply cannot provide. Note that any commercial entity who wants to stay with GATK-Lite can go the full open source route, at the cost of foregoing premium support and access to the best possible tools.

We recognize that this is a change, and of course we are big supporters of open source software -- the vast majority of the GATK2 is open source. We considered creating a "GATK foundation" mozilla style, accepting micro donations, or even providing pay-for-services on top of the GATK but ultimately the commercial/non-commercial divide seemed the option that provides the most value to the entirely community.


    Purchasing a commercial GATK2 license will give you the right to run the GATK2 within the company and share / publish / etc your results. This is what I'd think of as a standard commercial license, and most places would fit in this bucket. The example here is buying Adobe Photoshop and using it in house to manage and edit photos.

    The more complex question is around third-party pipeline executors, which only take in data from others and who effectively sell the running of the GATK. Here I think there will be a separate license with specific terms, but it's something we'd like to enable. The analogy here is setting up a for-profit web portal for photo editing that backends to photoshop. A valuable activity but one not covered by the standard end-user license agreement.
[some answers to that]
TechnicalVault
August 2

From this side of the pond the Wellcome Trust Sanger Institute does have a policy on the software we develop here, it has to be open sourced and under specified licenses. This is in harmony with our policy that research funded by Wellcome Trust money must be published in an open access journal.

I can't speak for the institute itself but I have a sinking feeling this decision will spark a lot of debate. The concern this gives me and what I intend to find out about, is how this will interact with any collaborations on or contributions we with to make to GATK. UK charity law is quite tight on what kind of profit making activities charities can take part in so it may involve lawyer time.

I do understand the Broad's point of view, people are making money on software that the Broad has invested money in producing and the Broad is not getting a cut from it. Ideally the way they'd pay it forward would be to contribute testing time and improvements back, but in practice I imagine quite a few are taking a free ride. That said companies take a free ride on most of the research we do, it's just harder to make money from most of it though. This whole debate does bring the name Celera to mind though.
joshkorn
August 3

Companies pay taxes too. Some additionally indirectly support the development of tools such as the GATK through academic collaborations. I guess the way you are framing the question though, it comes down to this: taxpayers paid for the development of the GATK. Most taxpayers aren't doing genetic research. So what will benefit the taxpayers most? (Should companies be paying for the reference genome sequence? For SNP databases? Where do we draw the line?) The reason the government (taxpayers) invest in basic research is to stimulate the downstream discovery. Help us translate research into helping patients!

I even take exception to the phrase "If the time comes when they're asked to contribute back and they don't, then yes they are leeches." Who is asking to contribute back? Not the people who paid for the development in the first place (NIH, Eli & Edyth Broad, Harvard, MIT)! The GATK became widely used not only because it was good (it is), and not only because it was free (it was), but because of a huge investment from other projects (most notably the 1000 genomes project, but others as well) that got it free publicity and turned it into a de facto standard. It's hard to compete when the GATK team has earlier access to taxpayer-funded projects/data/sequencing, and guaranteed publications when these projects come out. Also, gaining market share and then raising prices sounds more capitalist than Marxist, to reference an earlier comment.

I don't mean to go on a tirade; I guess I just feel strongly about this. I definitely understand where you are coming from; I too have written popular open-source software that cost me personally plenty of time of support, and meanwhile surely helped some for-profit entities do research (I hope!) and one for-profit company in particular have success. Nonetheless, I knew it was my duty to share the software freely. Also please know that I say all this with the utmost respect for the whole GATK team (most of which I think I know--I don't think we've had the pleasure of meeting though, Geraldine). You guys are doing a great job, and it's wonderful that taxpayers have been able to fund the development of (documented, supported) academic software. I'd be happy to take everyone out for drinks after work some day and thank you personally; justifying paying for the software is difficult.

*Note: these views are my own, and do not necessarily reflect those of my company or colleagues.
Mark_DePristo
August 3

Hi all,

I want to chime in with three clarifying points:

    We don't yet know the pricing scheme, but we are keenly aware of the complications of per-use licensing as TechnicalVault brings up

    Overall I want everyone to tone down the moral issue surrounding commercial licensing. The discussion of moral issues, extracting of rents, leeching off taxpayers, are all counterproductive to helping understand what we have decided and the best path forward. All of this is just business, after all.

    The NIH is clear that when funding basic research that the support IP is generally owned the developing institution (I'm sure there are exceptions), and this is true for software in general. The only key software restriction I know of is that the software must be made available to federal employees upon request. The reason for this policy is obvious -- it would be extremely difficult to accept the trade-off in a grants with IP ownership if you are creating high-value IP. Even federal SBIR grants spell out clearly that the government does not own any IP associated with the support. The federal granting system is to foster innovation, not to own innovation. It's a subtle difference but important for anyone creating real IP value with federal money in the US

    Many users of the GATK would like a much higher level of support than the Broad institute could possibly provide, as this is off track for the Broad's mission to transform medicine through genomics. We believe that having a commercial license for the GATK will allow us to actually deliver on this superior support and continue to grow the GATK as a reliable standard for NGS analysis in the commercial sphere and beyond. Without a commercial version we simply cannot follow through on this opportunity.

    We attempted to make the GATK easy for others to contribute code to, but our experiences in this area have been disappointing. Many people use the GATK for developing tools -- and we are committed to ensuring the programming framework and libraries remain MIT licensed -- we have had little contribution over 3 years to the master codebase from independent third-parties. Certainly some of our collaborators have contributed impressive tools and extensions, but again they aren't really independent. There's a good wikipedia article on the experiences of mySQL similar to this, and they release with a dual-license similar to our approach. Still though I'd like to release the source code to all of the tools -- if we can find a way consistent with the commercial license -- for transparency to the community and to allow others to contribute, in so far as they like.

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
But people were still confused two months later:
TechnicalVault
October 17

Hi Mark, I have a couple of questions stemming from the FAQ posted by your new commercial partners:

    1. Regarding, "Why not stay with GATKLite?" According to the FAQ at your new partner's site "Broad has indicated that GATK-lite tools will soon be obsolete, and it plans to stop supporting the tools by the end of 2012." Can you confirm exactly what this means please? Is it all of GATK-lite which will be dropped or just tools which have been replaced by new ones from GATK2?

    2. "Use by a not-for-profit organization to generate revenue requires a commercial license", can you clarify what this means? For example providing sequencing services to other academic institutions generates revenue, however it is usually done at cost so does not generate profit.

    3. If a not-for-profit was interested in buying support, but not in buying a commercial license is there an option for this? Who would it be with?

    4. Finally who will be the final arbiter of usage terms? Does that remain with the Broad or have you signed enforcement over to your partners?

Thank you for all your hard work
 
Geraldine_VdAuwera
October 17

Hi Martin,

    1. CORRECTION: The wording in the FAQ was incorrect due to a miscommunication. We will in fact continue providing support for GATK-Lite tools indefinitely, and although we will eventually stop providing a separate build (jar file), the GATK-Lite codebase will remain publicly available on our open source GitHub repository. In addition, tools from GATK 2 will be migrated into the GATK-Lite codebase over time.
    2. and 3. Please direct these questions to our partner, Appistry. They will be able to tell you based on your specific circumstances. They have a contact form that you can use, and within a few days they will also have a discussion forum that you can use for this purpose.
    3. See above.
    4. I believe Mark will answer that for you, or direct you to Issi Rozen here at the Broad, if you want an answer from our side. Otherwise I expect the Appistry people should be able to answer that as well.




Friday, 16 November 2012

GATK 2.0 drops MIT licence


Sadly, the Broad Institute's great contribution to next-generation sequencing (NGS), GATK, is not open source anymore.

GATK 2.0 is moving to a mixed open/closed-source model
The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but its unclear the timeframe for this.
Personally I dislike this move. If the reason is money, there are successful business models built around open source software that do not require leaving the open source community.

Doing this sort of thing is a sacrilege for many members of our community who fight day and night for #OA. I can see RMS deeply depressed and crying in a corner every time such a thing happens. Whatever the reason behind this change, presumably it was long considered. So I would expect a better explanation of why the change was needed and why remaining open source was impeding that goal. More importantly, the thing that annoys me most, and that denotes a poor strategic plan for such a big change, is this: "We plan to release the source code for these tools, but its unclear the timeframe for this". Hmm, to me it seems that the commitment to open source has vanished here. Even if you believe in OS, sometimes you are forced to make concessions to closed software, but I would expect a clear statement of when and how your source code will be open again.

How can we wage a war for open access journals and against patents, yet want to make money off our software? This looks like deadly friendly fire to me, at least until I read a good, detailed explanation of the reasons behind it.

It is a good thing to have samtools around as a backup.

== follow up ==
http://pablomarin-garcia.blogspot.co.uk/2012/11/key-points-summary-in-gatk-licence.html

Thursday, 15 November 2012

bioperl popularity (measured by searches) going down day by day.


According to the following figure, bioperl is losing its appeal day by day. One criticism of this plot is that big bio projects such as Ensembl, biopieces, etc. are still written in Perl but are not shown here. The pity is that R and Biopython have very good tools/pipelines for next-generation sequencing, and bioperl and other Perl bio projects don't.