PLABO: some Perl modules for web testing and scraping

Perl modules for web testing

Here are some modules, with links, for web testing and web automation. It has been a long time since I did a lot of web scraping (parsing websites with locus-specific genetic mutations), and now I need to do it again. So I am trying to find out which are the latest modules for web scraping and to refresh my memory.

This is my first list of modules to explore. I will post my progress on this later.

* HTTP::WebTest
: http://search.cpan.org/~ilyam/HTTP-WebTest-2.04/lib/HTTP/WebTest.pm

* HTTP::Recorder
: http://search.cpan.org/~leira/HTTP-Recorder-0.05/lib/HTTP/Recorder.pm
: http://www.perl.com/lpt/a/845
: http://www.perl.com/pub/a/2004/06/04/recorder.html

* WWW::Mechanize
: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/lib/WWW/Mechanize.pm
: http://www.perl.com/pub/a/2003/01/22/mechanize.html
:: There is also Test::WWW::Mechanize, a testing-oriented subclass.

* Web::Scraper
: http://search.cpan.org/~miyagawa/Web-Scraper-0.32/
: http://use.perl.org/articles/07/10/04/2021216.shtml
: http://teusje.wordpress.com/2010/05/02/web-scraping-with-perl/

* WWW::Scripter
: http://search.cpan.org/~sprout/WWW-Scripter-0.010/lib/WWW/Scripter.pod

pQuery, a port of jQuery to Perl, deserves a special mention:
* pQuery
: http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm

[from pQuery's CPAN page]

The power of jQuery is that single method calls can apply to many DOM objects. pQuery does the exact same thing but can take this one step further. A single PQUERY object can contain several DOMs!

Consider this example:

> perl -MpQuery -le 'PQUERY(\
map "http://search.cpan.org/~$_/", qw(ingy gugod miyagawa))\
->find("table")->eq(1)->find("tr")\
->EACH(sub{\
    printf("%40s - %s Perl distributions\n", $_->url, $_->length - 1)\
})'

[out]
http://search.cpan.org/~ingy/ - 88 Perl distributions
http://search.cpan.org/~gugod/ - 86 Perl distributions
http://search.cpan.org/~miyagawa/ - 138 Perl distributions

The power lies in PQUERY, a special constructor that creates a wrapper object for many pQuery objects, and applies all methods called on it to all the pQuery objects it contains.

===============

-- summary from http://stackoverflow.com/questions/713827/how-can-i-screen-scrape-with-perl

----------------
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:

## print every <h2> tag in page

use pQuery;

pQuery("http://google.com/search?q=pquery")->find("h2")->each(
    sub {
       my $i = shift;
       print $i + 1, ") ", pQuery($_)->text, "\n";
    });

There's also HTML::DOM.

===============

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# A user agent that keeps cookies between requests
my $cookie_jar = HTTP::Cookies->new;
my $browser    = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
    # Play with your source here
    my $source = $resp->content;
    $source =~ s/^\s+//;   # this is just an example (trims leading whitespace),
    print $source;         # not a solution to your problem.
}
else {
    warn $resp->status_line, "\n";
}

==========

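use strict;
use warnings;
use HTML::TableExtract;

# Sample input; in practice this is the HTML of the page you fetched
my $html_string = '<table><tr><td>a</td><td>b</td></tr></table>';

my $te = HTML::TableExtract->new();
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}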

========

[UPDATES]

Thanks to @kiran I learned about Selenium.

[from CPAN:]

NAME

WWW::Selenium - Perl Client for the Selenium Remote Control test tool

SYNOPSIS

    use WWW::Selenium;

    my $sel = WWW::Selenium->new( host => "localhost",
                                  port => 4444,
                                  browser => "*iexplore",
                                  browser_url => "http://www.google.com",
                                );

    $sel->start;
    $sel->open("http://www.google.com");
    $sel->type("q", "hello world");
    $sel->click("btnG");
    $sel->wait_for_page_to_load(5000);
    print $sel->get_title;
    $sel->stop;

DESCRIPTION

Selenium Remote Control (SRC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser. SRC provides a Selenium Server, which can automatically start/stop/control any supported browser. It works by using Selenium Core, a pure-HTML+JS library that performs automated tasks in JavaScript; the Selenium Server communicates directly with the browser using AJAX (XmlHttpRequest).
http://www.openqa.org/selenium-rc/
This module sends commands directly to the Server using simple HTTP GET/POST requests. Using this module together with the Selenium Server, you can automatically control any supported browser.
To use this module, you need to have already downloaded and started the Selenium Server. (The Selenium Server is a Java application.)

[2010-05-20] @anonymous pointed me to HTML::Query

NAME

HTML::Query - jQuery-like selection queries for HTML::Element

SYNOPSIS

Creating an HTML::Query object using the Query() constructor subroutine:

    use HTML::Query 'Query';

    # using named parameters
    $q = Query( text  => $text  );          # HTML text
    $q = Query( file  => $file  );          # HTML file
    $q = Query( tree  => $tree  );          # HTML::Element object
    $q = Query( query => $query );          # HTML::Query object
    $q = Query(
        text  => $text1,                    # or any combination
        text  => $text2,                    # of the above
        file  => $file1,
        file  => $file2,
        tree  => $tree,
        query => $query,
    );

    # passing elements as positional arguments 
    $q = Query( $tree );                    # HTML::Element object(s)
    $q = Query( $tree1, $tree2, $tree3, ... );  
    
    # or from one or more existing queries
    $q = Query( $query1 );                  # HTML::Query object(s)
    $q = Query( $query1, $query2, $query3, ... );
    
    # or a mixture
    $q = Query( $tree1, $query1, $tree2, $query2 );

    # the final argument (in all cases) can be a selector
    my $spec = 'ul.menu li a';  # <ul class="menu">..<li>..<a>
    
    $q = Query( $tree, $spec );
    $q = Query( $query, $spec );
    $q = Query( $tree1, $tree2, $query1, $query2, $spec );
    $q = Query( text  => $text,  $spec );
    $q = Query( file  => $file,  $spec );
    $q = Query( tree  => $tree,  $spec );
    $q = Query( query => $query, $spec );
    $q = Query( 
        text => $text,
        file => $file,
        # ...etc...
        $spec 
    );

DESCRIPTION

The HTML::Query module is an add-on for the HTML::Tree module set. It provides a simple way to select one or more elements from a tree using a query syntax inspired by jQuery. This selector syntax will be reassuringly familiar to anyone who has ever written a CSS selector.
HTML::Query is not an attempt to provide a complete (or even near-complete) implementation of jQuery in Perl (see Ingy's pQuery module for a more ambitious attempt at that). Rather, it borrows some of the tried and tested selector syntax from jQuery (and CSS) that can easily be mapped onto the look_down() method provided by the HTML::Element module.

5 comments:

avilella said...: I recently found a package that checks when a website's content has changed. Some of the stuff in this blog post seems useful for that as well, which is great for tweaks/extensions:

urlwatch --urls myfilewithurls.txt | less

18 May 2010 at 05:13

Pablo Marin-Garcia said...: @avilella:
Thanks, I will take a look at it.

One of the modules that I wrote for MUTRES did exactly this: it downloaded the page and, if the page had changed (by MD5 value), extracted the tables that I wanted, calculated the MD5 of their text (not the HTML), and stored both MD5 values in a database. Every week a cron job launched the script, tested whether the page or the wanted content had changed, and sent emails accordingly.

18 May 2010 at 09:52

Anonymous said...: http://search.cpan.org/~abw/HTML-Query-0.02/lib/HTML/Query.pm

Is also worth a look.

18 May 2010 at 11:06

Unknown said...: For web testing, check out Selenium.

18 May 2010 at 11:27

snju said...: Has anyone tried web testing on a Linux server blade using Perl/Python, where you cannot use Selenium/Mechanize?

I am trying to automate logging into a website where the login form is generated by JavaScript!

13 November 2013 at 20:49
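
The MD5-based change detection that Pablo Marin-Garcia describes in the comments above can be sketched roughly as follows. This is only a minimal illustration, not the MUTRES code: the helper names (text_only, fingerprint, what_changed) and the sample pages are made up, tags are stripped with a crude regex instead of extracting the wanted tables with a parser, and the download, database and email steps are left out. It uses only the core Digest::MD5 module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: crude tag stripping, good enough for a sketch.
# A real script would extract the wanted tables with an HTML parser.
sub text_only {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;  # replace every tag with a space
    $text =~ s/\s+/ /g;                   # collapse runs of whitespace
    $text =~ s/^ | $//g;                  # trim both ends
    return $text;
}

# Two digests per page: one for the raw HTML, one for the visible text.
sub fingerprint {
    my ($html) = @_;
    my %fp = (
        page => md5_hex($html),
        text => md5_hex(text_only($html)),
    );
    return \%fp;
}

# Compare a stored fingerprint with a fresh one.
sub what_changed {
    my ($old, $new) = @_;
    return 'content' if $old->{text} ne $new->{text};
    return 'markup'  if $old->{page} ne $new->{page};
    return 'nothing';
}

# Made-up sample pages standing in for three weekly downloads.
my $week1 = '<table><tr><td>a</td><td>1</td></tr></table>';
my $week2 = '<table class="x"><tr><td>a</td><td>1</td></tr></table>';
my $week3 = '<table class="x"><tr><td>a</td><td>2</td></tr></table>';

print what_changed(fingerprint($week1), fingerprint($week2)), "\n";  # markup
print what_changed(fingerprint($week2), fingerprint($week3)), "\n";  # content
```

Keeping both digests is what lets a weekly job tell a cosmetic change in the HTML apart from a change in the data you actually care about, so it can send an email only for the latter if that is what you want.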

Tuesday, 18 May 2010
