(you can even do that with Wikipedia itself if you want)
After dumping everything with
./maintenance/dumpBackup.ph
I found a CPAN module for processing the XML file:
http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump
The latest version of Parse::MediaWikiDump is available at http://www.cpan.org/modules/by-authors/id/T/TR/TRIDDLE/
Examples
Find uncategorized articles in the main namespace
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;

# the dump file is the first command-line argument
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

# read the dump one page at a time
while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';
    # categories is undefined when the page has no category
    print $page->title, "\n" unless defined($page->categories);
}
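List redirects in the main namespace

The same loop can be reused for other reports. This is only a sketch, not taken from the module's own examples: it assumes that $page->redirect returns the title of the redirect target and is undefined for ordinary pages, so check the documentation of your installed version before relying on it.

```perl
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;

# the dump file is the first command-line argument
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # assumption: redirect is undef unless the page is a redirect
    my $target = $page->redirect;
    next unless defined $target;

    # print "source -> target" for every redirect found
    print $page->title, " -> ", $target, "\n";
}
```

Run it the same way as the first example, with the dump file as the only argument.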