Friday, 1 June 2007

dumping wikipedia

I have finally found out how to dump all my MediaWiki pages and process them with a Perl script!
(You can even do the same with Wikipedia itself if you like.)

After dumping all with


I have found a CPAN module for processing the XML file:

The latest version of Parse::MediaWikiDump is available at


Find uncategorized articles in the main namespace:

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while (defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # categories is undefined when the page has no categories
    print $page->title, "\n" unless defined($page->categories);
}
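
As a variation on the script above, here is a minimal sketch that counts how many main-namespace pages fall into each category. It assumes that $page->categories returns a reference to an array of category names (or undef when there are none), which is how I read the module's documentation; the %count hash and the tab-separated output format are my own choices for illustration, not something the module provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file\n";
my $pages = Parse::MediaWikiDump::Pages->new($file);

my %count;    # category name => number of main-namespace pages in it

while (defined(my $page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # assumed: an array reference, or undef when the page has no categories
    my $categories = $page->categories;
    next unless defined $categories;

    $count{$_}++ for @$categories;
}

# most populated categories first, ties broken alphabetically
for my $name (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$count{$name}\t$name\n";
}
```

Run it the same way as the first script, passing the dump file as the only argument.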
