Tuesday, 18 May 2010

Some Perl modules for web testing and scraping


Here are some modules, with links, for web testing and web automation. It has been a long time since I did much web scraping (parsing websites with locus-specific genetic mutations), and now I need to do it again. So I am trying to find out which are the latest modules for web scraping and to refresh my memory.

This is my first list of modules to explore. I will post my progress on this later.

* HTTP::WebTest
: http://search.cpan.org/~ilyam/HTTP-WebTest-2.04/lib/HTTP/WebTest.pm

* HTTP::Recorder
: http://search.cpan.org/~leira/HTTP-Recorder-0.05/lib/HTTP/Recorder.pm
: http://www.perl.com/lpt/a/845
: http://www.perl.com/pub/a/2004/06/04/recorder.html

* WWW::Mechanize
: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/lib/WWW/Mechanize.pm
: http://www.perl.com/pub/a/2003/01/22/mechanize.html
:: There is a Test::WWW::Mechanize

* Web::Scraper
: http://search.cpan.org/~miyagawa/Web-Scraper-0.32/
: http://use.perl.org/articles/07/10/04/2021216.shtml
: http://teusje.wordpress.com/2010/05/02/web-scraping-with-perl/

* WWW::Scripter
: http://search.cpan.org/~sprout/WWW-Scripter-0.010/lib/WWW/Scripter.pod

pQuery, a port of jQuery to Perl, deserves a special mention:
* pQuery
: http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm

[from pQuery's CPAN page]

The power of jQuery is that single method calls can apply to many DOM objects. pQuery does the exact same thing but can take this one step further. A single PQUERY object can contain several DOMs!

Consider this example:

> perl -MpQuery -le 'PQUERY(\
map "http://search.cpan.org/~$_/", qw(ingy gugod miyagawa))\
->find("table")->eq(1)->find("tr")\
->EACH(sub{\
    printf("%40s - %s Perl distributions\n", $_->url, $_->length - 1)\
})'
[out]
http://search.cpan.org/~ingy/ - 88 Perl distributions
http://search.cpan.org/~gugod/ - 86 Perl distributions
http://search.cpan.org/~miyagawa/ - 138 Perl distributions
The power lies in PQUERY, a special constructor that creates a wrapper object for many pQuery objects, and applies all methods called on it to all the pQuery objects it contains.

===============

-- summary from  http://stackoverflow.com/questions/713827/how-can-i-screen-scrape-with-perl


----------------
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:
## print every <h2> tag in page

use pQuery;

pQuery("http://google.com/search?q=pquery")->find("h2")->each(
    sub {
       my $i = shift;
       print $i + 1, ") ", pQuery($_)->text, "\n";
    });
There's also HTML::DOM.


===============
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Keep cookies between requests, as a browser would
my $cookie_jar = HTTP::Cookies->new;
my $browser    = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
   # Play with your source here
   my $source = $resp->content;
   $source =~ s/^\s+//; # this is just an example (trims leading whitespace),
   print $source;       # not a solution to your problem.
}
==========
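use strict;
use warnings;
use HTML::TableExtract;

# $html_string must hold the HTML to parse (e.g. $resp->content);
# this tiny table is only a placeholder.
my $html_string = '<table><tr><td>a</td><td>b</td></tr></table>';

my $te = HTML::TableExtract->new();
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}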


========

[UPDATES]


Thanks to @kiran I learned about Selenium.

[from CPAN:]

NAME

WWW::Selenium - Perl Client for the Selenium Remote Control test tool

SYNOPSIS

use WWW::Selenium;
    
    my $sel = WWW::Selenium->new( host => "localhost", 
                                  port => 4444, 
                                  browser => "*iexplore", 
                                  browser_url => "http://www.google.com",
                                );
    
    $sel->start;
    $sel->open("http://www.google.com");
    $sel->type("q", "hello world");
    $sel->click("btnG");
    $sel->wait_for_page_to_load(5000);
    print $sel->get_title;
    $sel->stop;

DESCRIPTION

Selenium Remote Control (SRC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser. SRC provides a Selenium Server, which can automatically start/stop/control any supported browser. It works by using Selenium Core, a pure-HTML+JS library that performs automated tasks in JavaScript; the Selenium Server communicates directly with the browser using AJAX (XmlHttpRequest).
http://www.openqa.org/selenium-rc/
This module sends commands directly to the Server using simple HTTP GET/POST requests. Using this module together with the Selenium Server, you can automatically control any supported browser.
To use this module, you need to have already downloaded and started the Selenium Server. (The Selenium Server is a Java application.)

[2010-05-20] @anonymous pointed me to HTML::Query:


NAME

HTML::Query - jQuery-like selection queries for HTML::Element

SYNOPSIS

Creating an HTML::Query object using the Query() constructor subroutine:
use HTML::Query 'Query';
    
    # using named parameters 
    $q = Query( text  => $text  );          # HTML text
    $q = Query( file  => $file  );          # HTML file
    $q = Query( tree  => $tree  );          # HTML::Element object
    $q = Query( query => $query );          # HTML::Query object
    $q = Query(                             
        text  => $text1,                    # or any combination
        text  => $text2,                    # of the above
        file  => $file1,
        file  => $file2,
        tree  => $tree,
        query => $query,
    );

    # passing elements as positional arguments 
    $q = Query( $tree );                    # HTML::Element object(s)
    $q = Query( $tree1, $tree2, $tree3, ... );  
    
    # or from one or more existing queries
    $q = Query( $query1 );                  # HTML::Query object(s)
    $q = Query( $query1, $query2, $query3, ... );
    
    # or a mixture
    $q = Query( $tree1, $query1, $tree2, $query2 );

    # the final argument (in all cases) can be a selector
    my $spec = 'ul.menu li a';  # <ul class="menu">..<li>..<a>
    
    $q = Query( $tree, $spec );
    $q = Query( $query, $spec );
    $q = Query( $tree1, $tree2, $query1, $query2, $spec );
    $q = Query( text  => $text,  $spec );
    $q = Query( file  => $file,  $spec );
    $q = Query( tree  => $tree,  $spec );
    $q = Query( query => $query, $spec );
    $q = Query( 
        text => $text,
        file => $file,
        # ...etc...
        $spec 
    );

DESCRIPTION

The HTML::Query module is an add-on for the HTML::Tree module set. It provides a simple way to select one or more elements from a tree using a query syntax inspired by jQuery. This selector syntax will be reassuringly familiar to anyone who has ever written a CSS selector.
HTML::Query is not an attempt to provide a complete (or even near-complete) implementation of jQuery in Perl (see Ingy's pQuery module for a more ambitious attempt at that). Rather, it borrows some of the tried and tested selector syntax from jQuery (and CSS) that can easily be mapped onto the look_down() method provided by the HTML::Element module.
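To make the look_down() mapping mentioned above more concrete, here is a minimal sketch of what a selector such as 'ul.menu li a' amounts to when written by hand with HTML::TreeBuilder (part of the HTML::Tree set). The sample HTML and the helper name menu_links are my own assumptions for illustration; this is not how HTML::Query is implemented internally.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup, invented for this sketch.
my $html = <<'HTML';
<ul class="menu">
  <li><a href="/home">Home</a></li>
  <li><a href="/about">About</a></li>
</ul>
<ul class="other">
  <li><a href="/skip">Skip</a></li>
</ul>
HTML

# Hypothetical helper: a hand-written walk roughly equivalent to the
# selector 'ul.menu li a'. With nested lists it could return the same
# link more than once.
sub menu_links {
    my ($tree) = @_;
    my @links;
    # <ul> elements whose class attribute contains the word "menu"
    for my $ul ($tree->look_down(_tag => 'ul', class => qr/(?:^|\s)menu(?:\s|$)/)) {
        for my $li ($ul->look_down(_tag => 'li')) {
            push @links, $li->look_down(_tag => 'a');
        }
    }
    return @links;
}

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @found = map { $_->attr('href') } menu_links($tree);
print "$_\n" for @found;
$tree->delete;    # free the parse tree
```

The three nested loops are the boilerplate that a one-line selector is meant to replace.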

5 comments:

avilella said...

I recently found a package that checks when a website's content has changed. Some of the stuff in this blog post seems useful for that as well, which is great for tweaks/extensions:

urlwatch --urls myfilewithurls.txt | less

Pablo Marin-Garcia said...

@avilella:
Thanks, I will take a look at it.

One of the modules that I wrote for MUTRES did exactly this. It downloaded the page and, if the page had changed (by its md5 value), extracted the tables that I wanted, calculated the md5 of their text (not the HTML), and stored both md5 values in a db. Every week a cron process launched the script, tested whether the page or the wanted content had changed, and sent emails accordingly.
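The two-level check described above could be sketched roughly as follows. This is not the MUTRES code: the function names are hypothetical, the tag stripping is a crude regex standing in for real table extraction, and the download, database and email steps are left out. It uses only core modules.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: crude tag stripping, so only the visible text is hashed.
sub text_only {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # drop tags
    $text =~ s/\s+/ /g;                   # collapse whitespace
    $text =~ s/^ | $//g;                  # trim both ends
    return $text;
}

# One md5 for the raw page and one for its text content.
sub fingerprints {
    my ($html) = @_;
    return {
        page => md5_hex(encode_utf8($html)),
        text => md5_hex(encode_utf8(text_only($html))),
    };
}

# Names of the fingerprints that differ between two runs.
sub what_changed {
    my ($old, $new) = @_;
    return grep { $old->{$_} ne $new->{$_} } sort keys %$new;
}

# Same content, different markup: only the page md5 changes.
my $last_week = fingerprints('<table><tr><td>BRCA1</td></tr></table>');
my $this_week = fingerprints('<table border="1"><tr><td>BRCA1</td></tr></table>');
my @changed   = what_changed($last_week, $this_week);
print @changed ? "changed: @changed\n" : "no change\n";   # prints "changed: page"
```

Keeping the two hashes separate is what lets the script tell a cosmetic page change from a change in the data itself.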

Anonymous said...

http://search.cpan.org/~abw/HTML-Query-0.02/lib/HTML/Query.pm

Is also worth a look.

Kiran said...

For Web Testing , Check out Selenium

snju said...

Has anyone tried web testing on a Linux server blade using Perl/Python, where you cannot use Selenium/Mechanize?

I am trying to automate logging into a website where the login form is generated by JavaScript!