Friday, 1 June 2007

dumping wikipedia

I have finally found out how to dump all my MediaWiki pages and process them with a Perl script!
(You can even do the same with Wikipedia itself if you want to.)

After dumping everything with


./maintenance/dumpBackup.php


I found a CPAN module for processing the XML file:

http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump

The latest version of Parse::MediaWikiDump is available at http://www.cpan.org/modules/by-authors/id/T/TR/TRIDDLE/

Examples

Find uncategorized articles in the main namespace


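#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

# The dump file is the first command-line argument.
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file\n";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

# $pages->page returns the next page, or undef at the end of the dump.
while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # categories is undefined when the page has no categories
    print $page->title, "\n" unless defined($page->categories);
}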
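
To tie the two steps together, here is a rough shell sketch of how the dump and the script could be chained. It is only an illustration under a few assumptions: the function name dump_and_scan, the file name dump.xml and the script name find_uncategorized.pl (the Perl example above saved to a file) are placeholders of mine, and I assume php is on the PATH. dumpBackup.php writes the XML to standard output, so it is redirected into a file.

```shell
# Sketch only: dump_and_scan, dump.xml and find_uncategorized.pl are
# placeholder names, not part of MediaWiki or Parse::MediaWikiDump.
dump_and_scan() {
    wiki_root="$1"               # directory of the MediaWiki installation
    dump_file="${2:-dump.xml}"   # where to write the XML dump

    # Bail out early if this does not look like a MediaWiki installation.
    if [ ! -f "$wiki_root/maintenance/dumpBackup.php" ]; then
        echo "no maintenance/dumpBackup.php under $wiki_root" >&2
        return 1
    fi

    # --full exports every revision; --current exports only the latest one.
    php "$wiki_root/maintenance/dumpBackup.php" --full > "$dump_file" || return 1

    # Feed the dump to the Perl script from this post.
    perl find_uncategorized.pl "$dump_file"
}
```

You would then call it with something like dump_and_scan /path/to/your/wiki, which should print one uncategorized title per line. For a big wiki, --current is probably the better choice, since the script only looks at the latest text of each page anyway.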
