Tuesday 18 May 2010

[Perl] Removing spaces from strings

Thanks to Fred Moyer in the PerlMongers at LinkedIn I learned about the existence of the module:
String::Strip

It uses XS to remove spaces from a string and claims to be 35% faster.

You can also trim your text with Text::Trim.

I prefer to use modules for typical patterns, but here a simple s/^\s+|\s+$//g will do the trick, and I have it in my Emacs macros as a "sub trim {...}" ;-). This may be nanoseconds slower than other approaches, but trimming is not my bottleneck; database access is :-(.
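A minimal sketch of what such a "sub trim" could look like, assuming it should return a trimmed copy and leave its argument untouched (the exact body of my macro is not shown here, so treat this as an illustration):

```perl
use strict;
use warnings;

# Illustrative helper: returns a trimmed copy of its argument.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
    return $text;
}

print '[', trim("  hello world \t"), "]\n";
```

Working on a copy means the caller's variable is never modified by accident.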
[update 2010-06-13]
Tom Christiansen in "Perl Cookbook, 2nd Edition"

--Recipe 1.19. Trimming Blanks from the Ends of a String--

shows another approach:
If the function isn't passed any arguments at all, it could act like chop and chomp by defaulting to $_. Incorporating all of these embellishments produces this function:
# 1. trim leading and trailing white space
# 2. collapse internal whitespace to single space each
# 3. take input from $_ if no arguments given
# 4. join return list into single scalar with intervening spaces 
#     if return is scalar context

sub trim {
    my @out = @_ ? @_ : $_;
    $_ = join(' ', split(' ')) for @out;
    return wantarray ? @out : "@out";
}
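To see how the context handling plays out, here is a short usage sketch. The function body is copied from the recipe; the sample strings and variable names are my own.

```perl
use strict;
use warnings;

# The Cookbook function, copied unchanged from the recipe.
sub trim {
    my @out = @_ ? @_ : $_;
    $_ = join(' ', split(' ')) for @out;
    return wantarray ? @out : "@out";
}

# Scalar context, one argument: ends trimmed, inner runs collapsed.
my $one = trim("  too   many   spaces  ");

# List context: one cleaned string per argument.
my @many = trim("  a  b ", "\tc\n");

# Scalar context, several arguments: results joined with single spaces.
my $flat = trim("  a  b ", "\tc\n");

# No arguments: falls back to $_ (aliased here by the statement modifier).
my $dflt;
$dflt = trim() for "   default   topic  ";

print "[$one] [@many] [$flat] [$dflt]\n";
```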

PS:
I will need to update my trim sub at some point (following PBP) to something like
s/\A\s+|\s+\z//gms
But I would need to check that it does the right thing and removes spaces before and after "\n".
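A quick way to check this, assuming the PBP-style pattern is meant to use the \A and \z anchors: \A and \z match only at the very start and end of the string, whatever the /m flag says, while ^ and $ under /m match at every line boundary. The sample string and variable names below are made up for the test.

```perl
use strict;
use warnings;

my $text = "  a  \n  b  \n";

# \A and \z: only the two ends of the whole string are trimmed;
# the spaces around the inner newline stay where they are.
(my $ends_only = $text) =~ s/\A\s+|\s+\z//gms;

# ^ and $ with /m: every line is trimmed on both sides.
(my $each_line = $text) =~ s/^\s+|\s+$//gm;

# Show the newlines explicitly so the difference is visible.
for my $result ($ends_only, $each_line) {
    (my $shown = $result) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
```

So the \A...\z version would not remove spaces next to an inner "\n"; if that is wanted, the ^...$ version with /m is the one to use.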

* Posted by David Bouman in perlmongers at linkedIn:
sub trim { return unpack 'A*', reverse unpack 'A*', reverse shift }

* Posted by Tobie Van Der Merwe in perlmongers at linkedIn:

Tobie solves the problem by doing more work than needed: instead of eliminating the whitespace directly, he captures the text in between. This is probably not elegant and a bit complicated (see Gábor's reply), but on the other hand he has shown one good Perl attitude and one not so good.

The good one: always give a test case and the code to prove your point. He probably did not know a better regex, but he had a good attitude.

The not so good one: if you write a complex regex with some tricky parts (non-greedy quantifiers), you should use /x and add comments.
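As a sketch of what I mean by /x and comments, this is the same capture-the-middle pattern spread over several lines. The sub name is hypothetical, and I have dropped the redundant square brackets around \s.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the capture-the-middle regex.
sub trim_commented {
    my ($text) = @_;
    $text =~ s{
        ^ \s*     # leading whitespace, if any
        (.*?)     # the text to keep, matched non-greedily
        \s* $     # trailing whitespace, if any, up to the end
    }{$1}x;
    return $text;
}

print '[', trim_commented(' 12345AA 22277 '), "]\n";
```

With the non-greedy quantifier on its own commented line, it is much harder to drop the ? by accident.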

[Tobie] I think it can be solved with a regex -

# Short program to test left and right
# trim regex - s/^[\s]*(.*?)[\s]*$/$1/

my %tests = ( 1 => 'This is great',
2 => 'This is great ',
3 => ' This is great',
4 => ' This is great ',
5 => ' 12345AA 22277 ',
6 => ' !^%&^"$ ^777% ');

foreach my $num (sort keys (%tests)) {
print "BEFORE[" . $tests{$num} . "]\n";
$tests{$num} =~ s/^[\s]*(.*?)[\s]*$/$1/;
print "AFTER [" . $tests{$num} . "]\n";
}


* Gábor Szabó reply:

Looking at s/^[\s]*(.*?)[\s]*$/$1/
besides the fact that the square brackets [] around the \s are not necessary and only make noise, it is one of the examples I am using to show that you do NOT have to do everything with one regex.

This solution is both complex and error prone - sanjeev indeed missed out on the ? that turns the otherwise greedy quantifier into minimal matching. You can of course use it, as TMTOWTDI, but then don't be surprised if people think Perl is cryptic.

Lastly, really, is trimming whitespace such an important task that it justifies 60 posts?
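Gábor's point about not doing everything with one regex can be sketched as two plain substitutions, one per end. The sub name is my own; the reply itself does not include code.

```perl
use strict;
use warnings;

# Hypothetical helper: two simple passes instead of one clever regex.
sub trim_two_pass {
    my ($text) = @_;
    $text =~ s/^\s+//;    # remove leading whitespace
    $text =~ s/\s+$//;    # remove trailing whitespace
    return $text;
}

print '[', trim_two_pass(' This is great '), "]\n";
```

There is no capture group and no non-greedy quantifier here, so there is nothing subtle to get wrong.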

Some Perl modules for web testing and scraping

perl modules for web testing


Some modules, with links, for web testing and web automation. It has been a long time since I did a lot of web scraping (parsing websites with locus-specific genetic mutations), and now I need to do it again. So I am trying to find out which are the latest modules for web scraping and refresh my memory.

This is my first list of modules to explore. I will post my progress on this later.

* HTTP::WebTest
: http://search.cpan.org/~ilyam/HTTP-WebTest-2.04/lib/HTTP/WebTest.pm

* HTTP::Recorder
: http://search.cpan.org/~leira/HTTP-Recorder-0.05/lib/HTTP/Recorder.pm
: http://www.perl.com/lpt/a/845
: http://www.perl.com/pub/a/2004/06/04/recorder.html

* WWW::Mechanize
: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/lib/WWW/Mechanize.pm
: http://www.perl.com/pub/a/2003/01/22/mechanize.html
:: There is a Test::WWW::Mechanize

* Web::Scraper
: http://search.cpan.org/~miyagawa/Web-Scraper-0.32/
: http://use.perl.org/articles/07/10/04/2021216.shtml
: http://teusje.wordpress.com/2010/05/02/web-scraping-with-perl/

* WWW::Scripter
: http://search.cpan.org/~sprout/WWW-Scripter-0.010/lib/WWW/Scripter.pod

pQuery, a port of jQuery to Perl, deserves a special mention:
* pQuery
: http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm

[from the pQuery's CPAN page]

The power of jQuery is that single method calls can apply to many DOM objects. pQuery does the exact same thing but can take this one step further. A single PQUERY object can contain several DOMs!

Consider this example:

> perl -MpQuery -le 'PQUERY(\
map "http://search.cpan.org/~$_/", qw(ingy gugod miyagawa))\
->find("table")->eq(1)->find("tr")\
->EACH(sub{\
    printf("%40s - %s Perl distributions\n", $_->url, $_->length - 1)\
})'
[out]
http://search.cpan.org/~ingy/ - 88 Perl distributions
http://search.cpan.org/~gugod/ - 86 Perl distributions
http://search.cpan.org/~miyagawa/ - 138 Perl distributions

The power lies in PQUERY, a special constructor that creates a wrapper object for many pQuery objects, and applies all methods called on it to all the pQuery objects it contains.

===============

-- summary from  http://stackoverflow.com/questions/713827/how-can-i-screen-scrape-with-perl


----------------
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:
## print the text of every h2 heading in the page

use pQuery;

pQuery("http://google.com/search?q=pquery")->find("h2")->each(
    sub {
       my $i = shift;
       print $i + 1, ") ", pQuery($_)->text, "\n";
    });

There's also HTML::DOM.


===============
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Keep cookies between requests, as a browser would
my $cookie_jar = HTTP::Cookies->new;
my $browser    = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
    # Play with your source here
    my $source = $resp->content;
    $source =~ s/^\s+//;   # this is just an example,
    print $source;         # not a solution to your problem.
}
==========
use HTML::TableExtract;
my $te = HTML::TableExtract->new();
$te->parse($html_string);    # $html_string holds the HTML to examine

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}


========

[UPDATES]


Thanks to @kiran I learned about Selenium

[here from the CPAN:]

NAME

WWW::Selenium - Perl Client for the Selenium Remote Control test tool

SYNOPSIS

use WWW::Selenium;
    
    my $sel = WWW::Selenium->new( host => "localhost", 
                                  port => 4444, 
                                  browser => "*iexplore", 
                                  browser_url => "http://www.google.com",
                                );
    
    $sel->start;
    $sel->open("http://www.google.com");
    $sel->type("q", "hello world");
    $sel->click("btnG");
    $sel->wait_for_page_to_load(5000);
    print $sel->get_title;
    $sel->stop;

DESCRIPTION

Selenium Remote Control (SRC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser. SRC provides a Selenium Server, which can automatically start/stop/control any supported browser. It works by using Selenium Core, a pure-HTML+JS library that performs automated tasks in JavaScript; the Selenium Server communicates directly with the browser using AJAX (XmlHttpRequest).
http://www.openqa.org/selenium-rc/
This module sends commands directly to the Server using simple HTTP GET/POST requests. Using this module together with the Selenium Server, you can automatically control any supported browser.
To use this module, you need to have already downloaded and started the Selenium Server. (The Selenium Server is a Java application.)

[2010-05-20] @anonymous pointed me to HTML::Query


NAME

HTML::Query - jQuery-like selection queries for HTML::Element

SYNOPSIS

Creating an HTML::Query object using the Query() constructor subroutine:
use HTML::Query 'Query';
    
    # using named parameters 
    $q = Query( text  => $text  );          # HTML text
    $q = Query( file  => $file  );          # HTML file
    $q = Query( tree  => $tree  );          # HTML::Element object
    $q = Query( query => $query );          # HTML::Query object
    $q = Query(                             
        text  => $text1,                    # or any combination
        text  => $text2,                    # of the above
        file  => $file1,
        file  => $file2,
        tree  => $tree,
        query => $query,
    );

    # passing elements as positional arguments 
    $q = Query( $tree );                    # HTML::Element object(s)
    $q = Query( $tree1, $tree2, $tree3, ... );  
    
    # or from one or more existing queries
    $q = Query( $query1 );                  # HTML::Query object(s)
    $q = Query( $query1, $query2, $query3, ... );
    
    # or a mixture
    $q = Query( $tree1, $query1, $tree2, $query2 );

    # the final argument (in all cases) can be a selector
    my $spec = 'ul.menu li a';  # <ul class="menu">..<li>..<a>
    
    $q = Query( $tree, $spec );
    $q = Query( $query, $spec );
    $q = Query( $tree1, $tree2, $query1, $query2, $spec );
    $q = Query( text  => $text,  $spec );
    $q = Query( file  => $file,  $spec );
    $q = Query( tree  => $tree,  $spec );
    $q = Query( query => $query, $spec );
    $q = Query( 
        text => $text,
        file => $file,
        # ...etc...
        $spec 
    );

DESCRIPTION

The HTML::Query module is an add-on for the HTML::Tree module set. It provides a simple way to select one or more elements from a tree using a query syntax inspired by jQuery. This selector syntax will be reassuringly familiar to anyone who has ever written a CSS selector.
HTML::Query is not an attempt to provide a complete (or even near-complete) implementation of jQuery in Perl (see Ingy's pQuery module for a more ambitious attempt at that). Rather, it borrows some of the tried and tested selector syntax from jQuery (and CSS) that can easily be mapped onto the look_down() method provided by the HTML::Element module.

Tuesday 11 May 2010

European Nucleotide Archive

The European Bioinformatics Institute (EBI) at Cambridge, UK, has recently launched the European Nucleotide Archive (ENA). This archive brings together all previous nucleotide services at EBI (EMBL-Bank, ERA, etc.).



Press release:
http://www.ebi.ac.uk/Information/News/pdf/Press10May10.pdf


Please update your links!!!!

Official Email:

Hello all,

The EBI has launched the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/). The press release is
available here (http://www.ebi.ac.uk/Information/News/pdf/Press10May10.pdf). The European Nucleotide Archive is the new collective name for the
archival nucleotide sequence databases and services that have been
operated from this campus over many years, including annotated and
assembled sequence (EMBL-Bank) and raw data (Trace Archive and Sequence
Read Archive). The new service includes graphical browsing, programmatic
services, next generation sequencing support, text search and a new
rapid sequence similarity search.

We are grateful for feedback and can provide help in using ENA - please
contact us at datasubs@ebi.ac.uk.


Referring to ENA
----------------
From now on, our services should be referred to as the 'European Nucleotide Archive' or 'ENA'.

We prefer that 'ENA' be used when referring to all data types that we
cover (eg. 'raw data and assemblies were submitted to ENA', 'annotation
downloaded from ENA'), but we accept that some usage of existing
component database names, such as EMBL-Bank, will be necessary at least
for some time.

Accession numbers are unique across all data classes within ENA. When
pointing to records, it is only therefore necessary to cite the
namespace and the accession number: 'ENA:<accession>', eg. 'ENA:BN000065'.


Pointing to ENA records
-----------------------
URLs to resolve all ENA records by accession take the form:
http://www.ebi.ac.uk/ena/data/view/<accession>,
egs.
http://www.ebi.ac.uk/ena/data/view/BN000065
http://www.ebi.ac.uk/ena/data/view/ERA000092
http://www.ebi.ac.uk/ena/data/view/TI1288391363

Many further options are supported for HTTP access and are described in
http://www.ebi.ac.uk/ena/about/page.php?page=browser.

We will continue to support existing URL syntax for our component
databases for some time, but would ask that you update any links that you provide as soon as possible.


Thanks for using ENA,

Guy Cochrane.
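
For updating links in scripts, the URL and citation forms described in the email can be sketched in a few lines of Perl. The sub names and the sanity check on the accession are my own assumptions, not part of any ENA client library.

```perl
use strict;
use warnings;

# Base of the record view URL, as given in the email above.
my $ENA_VIEW = 'http://www.ebi.ac.uk/ena/data/view/';

# Hypothetical helper: build the view URL for one accession.
sub ena_url {
    my ($accession) = @_;
    # Assumed sanity check: letters, digits and underscores only.
    die "Unexpected accession '$accession'\n" unless $accession =~ /\A\w+\z/;
    return $ENA_VIEW . $accession;
}

# Hypothetical helper: namespace plus accession, for citing a record.
sub ena_cite {
    my ($accession) = @_;
    return "ENA:$accession";
}

print ena_cite($_), ' -> ', ena_url($_), "\n"
    for qw(BN000065 ERA000092 TI1288391363);
```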

Sunday 9 May 2010

Neandertal genome

The Neandertal genome draft has been published in Science:


http://www.sciencemag.org/cgi/content/full/328/5979/710


http://en.wikipedia.org/wiki/Neanderthal_genome_project


Science. 2010 May 7;328(5979):710-22.

A draft sequence of the Neandertal genome.

Department of Evolutionary Genetics, Max-Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany. green@eva.mpg.de

Abstract

Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
PMID: 20448178 [PubMed - in process]