PLABO: May 2011

From time to time you find a large or complex regular expression that was not written with the /x modifier, so you are left with a one-liner without comments and, after some time trying to decode it, a big headache.
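As a reminder of what /x buys you, here is a minimal sketch. The date pattern is a made-up example, not the regexp discussed in this post; both forms should match exactly the same strings, but only the second one documents itself.

```perl
use strict;
use warnings;

# Hypothetical example pattern: an ISO-like date such as 2011-05-29.
my $compact = qr/^(\d{4})-(\d{2})-(\d{2})$/;

# Same pattern with /x: whitespace is ignored and '#' starts a comment.
my $commented = qr/
    ^
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    $
/x;

for my $date ('2011-05-29', '29-05-2011') {
    my $plain  = $date =~ $compact   ? 'match' : 'no match';
    my $spaced = $date =~ $commented ? 'match' : 'no match';
    print "$date: compact=$plain commented=$spaced\n";
}
```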

Recently I found an RE of this kind, and a module (YAPE::Regex::Explain) that helps you decompose it into its elements.

regexp:

m{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$}

This regexp parses URIs like:

mysql://anonymous@my.self.com:1234/dbname/tablename

(NOTE: these URIs are better parsed with URI::Split, but that is another story)
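Before decomposing the regexp, it helps to see what it captures from the sample URI. The sketch below uses only core Perl; the helper name parse_db_uri and the field names are my own labels for capture groups 2 to 8, not something the original code defines.

```perl
use strict;
use warnings;

# The one-line regexp from this post, unchanged.
my $uri_re = qr{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$};

# Hypothetical helper: returns a hash ref of named fields, or nothing
# when the URI does not match.
sub parse_db_uri {
    my ($uri) = @_;
    my @c = $uri =~ $uri_re or return;
    my %f;
    # $c[0] is group 1 (everything up to the db name); groups 2..8 follow.
    @f{qw(scheme user password host port db table)} = @c[1 .. 7];
    return \%f;
}

my $f = parse_db_uri('mysql://anonymous@my.self.com:1234/dbname/tablename');
for my $name (qw(scheme user password host port db table)) {
    printf "%-8s %s\n", $name, defined $f->{$name} ? $f->{$name} : '(undef)';
}
```

For the sample URI this gives scheme mysql, user anonymous, no password, host my.self.com, port 1234, db dbname and table tablename.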

And how to decompose it:

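#!/usr/bin/env perl

use strict;
use warnings;
use YAPE::Regex::Explain;

# The regexp under study, compiled with /x (hence the (?x-ims: prefix
# in the explanation below).
# Note: newer Perls stringify qr// as (?^x:...), which YAPE::Regex may
# not understand; in that case pass the pattern as a plain string.
my $REx = qr{
^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$}x;

explain_RE($REx);

# Print the node-by-node explanation of a regexp.
sub explain_RE {
    my $REx = shift;
    my $exp = YAPE::Regex::Explain->new($REx)->explain;
    print $exp;
}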

result:

The regular expression:

(?x-ims:
^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$)
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?x-ims:                 group, but do not capture (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (                        group and capture to \3:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \3
----------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
----------------------------------------------------------------------
        \:                       ':'
----------------------------------------------------------------------
        (                        group and capture to \4:
----------------------------------------------------------------------
          [^/\@]*                  any character except: '/', '\@' (0
                                   or more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
        )                        end of \4
----------------------------------------------------------------------
      )?                       end of grouping
----------------------------------------------------------------------
      \@                       '@'
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (                        group and capture to \5:
----------------------------------------------------------------------
        [\w\-\.]+                any character of: word characters
                                 (a-z, A-Z, 0-9, _), '\-', '\.' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of \5
----------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
----------------------------------------------------------------------
        \:                       ':'
----------------------------------------------------------------------
        (                        group and capture to \6:
----------------------------------------------------------------------
          \d+                      digits (0-9) (1 or more times
                                   (matching the most amount
                                   possible))
----------------------------------------------------------------------
        )                        end of \6
----------------------------------------------------------------------
      )?                       end of grouping
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
    /                        '/'
----------------------------------------------------------------------
    (                        group and capture to \7:
----------------------------------------------------------------------
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \7
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    /                        '/'
----------------------------------------------------------------------
    (                        group and capture to \8:
----------------------------------------------------------------------
      \w+                      word characters (a-z, A-Z, 0-9, _) (1
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \8
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      \?                       '?'
----------------------------------------------------------------------
      (                        group and capture to \9:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \9
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (                        group and capture to \10:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \10
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \11:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      ;                        ';'
----------------------------------------------------------------------
      (                        group and capture to \12:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \12
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (                        group and capture to \13:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \13
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \11
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

And my manual explanation of the URI parsing:

              ^(
                 (\w*)  # scheme
                 ://
                 (?:
                   (\w+)  # user
                   (?:
                     \:
                     ([^/\@]*)  # password
                   )?
                   \@
                 )?  # user and password are optional
                 (?:
                   ([\w\-\.]+)  # host
                   (?:
                     \:
                     (\d+)  # port
                   )?  # port is optional
                 )?  # host and port are optional
                 /  # becomes a third '/' if there is no user, password, host or port
                 (\w*)  # db name (only up to the first '/', if any); will not work with full paths for sqlite
               )
               (?:
                 /   # if there is a table
                 (\w+)  # table
                 (?:
                   \?  # parameter
                   (\w+)
                   =
                   (\w+)
                 )?  # the parameter is optional, but it always requires a table name
               )?  # table and parameter are optional
               (
                 (?:
                   ;
                   (\w+)
                   =
                   (\w+)
                 )*  # rest of the parameters, if any
               )
               $
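The edge cases mentioned in the comments above can be checked quickly. This is only a sketch: the sample URIs are made up for illustration, and the helper name captures is mine. It also shows a detail that is easy to miss: because groups 12 and 13 sit inside a repeated group, only the last ';key=value' pair survives in them, while group 11 keeps the whole parameter string.

```perl
use strict;
use warnings;

# The one-line regexp from this post, unchanged.
my $uri_re = qr{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$};

# Hypothetical helper: returns the 13 capture groups, or an empty list
# when the URI does not match.
sub captures {
    my ($uri) = @_;
    my @c = $uri =~ $uri_re;
    return @c;
}

# Sample URIs are invented for this check.
my @samples = (
    'mysql:///dbname',                        # no user, password, host or port
    'mysql://host/db/table?key=val;a=b;c=d',  # one '?' parameter, two ';' parameters
    'sqlite:///path/to/db.sqlite',            # full path to a file
);

for my $uri (@samples) {
    my @c = captures($uri);
    if (@c) {
        # Indices are group number minus one: host, db, table, \11, \12, \13.
        my @shown = map { defined $_ ? "'$_'" : 'undef' } @c[4, 6, 7, 10, 11, 12];
        print "$uri\n  host, db, table, params, last key, last value: @shown\n";
    }
    else {
        print "$uri\n  no match\n";
    }
}
```

The first URI matches with an undefined host, the second keeps ';a=b;c=d' in group 11 but only c and d in groups 12 and 13, and the sqlite path does not match at all.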

This regular expression was probably an easy case, but while searching for more examples of YAPE::Regex::Explain I found two interesting threads about Perl obfuscation with regular expressions, one on StackOverflow and one on PerlMonks.

Sunday, 29 May 2011

Deobfuscating large or complex regular expressions
