Sunday, 29 May 2011

De-obfuscating large or complex regular expressions

From time to time you come across a large or complex regular expression that was not written with the /x modifier, so you are left with a one-liner without comments and, after some time trying to decode it, a big headache.

Recently I found a regular expression of this kind, and a module (YAPE::Regex::Explain) that helps you break it down into its elements.

regexp:

m{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$}

This regexp parses URIs like:

mysql://anonymous@my.self.com:1234/dbname/tablename

(NOTE: URIs like this are better parsed with URI::Split, but that is another story.)
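Before taking the expression apart, it can help to see what it actually captures. The following is a minimal sketch using only core Perl; the variable names ($re, $uri, @captures) are my own, and it simply applies the expression above to the sample URI and lists the 13 capture groups.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The one-line expression from above, stored as a compiled regex.
my $re = qr{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$};

my $uri = 'mysql://anonymous@my.self.com:1234/dbname/tablename';

# In list context a successful match returns all capture groups ($1..$13).
my @captures = $uri =~ $re
    or die "URI did not match\n";

# Print every group; groups that took no part in the match are undef.
for my $i (0 .. $#captures) {
    printf "\$%-2d = %s\n", $i + 1, $captures[$i] // '(undef)';
}
```

For this URI, groups 3, 5, 6, 7 and 8 hold the user, host, port, database and table, while the password and parameter groups stay undefined because the URI has neither.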

And this is how to decompose it:

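#!/usr/bin/env perl

use feature ':5.10';
use strict;
use warnings;
use YAPE::Regex::Explain;

# The expression to explain. The /x flag changes nothing here (the pattern
# contains no whitespace or '#'); it is only used so that the regex is
# compiled the same way as in the output below.
my $REx = qr{^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$}x;

explain_RE($REx);

# Print the node-by-node explanation produced by YAPE::Regex::Explain.
sub explain_RE {
    my $REx = shift;
    my $exp = YAPE::Regex::Explain->new($REx)->explain;
    print $exp;
}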

result:

The regular expression:

(?x-ims:
^((\w*)://(?:(\w+)(?:\:([^/\@]*))?\@)?(?:([\w\-\.]+)(?:\:(\d+))?)?/(\w*))(?:/(\w+)(?:\?(\w+)=(\w+))?)?((?:;(\w+)=(\w+))*)$)
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?x-ims:                 group, but do not capture (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (                        group and capture to \3:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \3
----------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
----------------------------------------------------------------------
        \:                       ':'
----------------------------------------------------------------------
        (                        group and capture to \4:
----------------------------------------------------------------------
          [^/\@]*                  any character except: '/', '\@' (0
                                   or more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
        )                        end of \4
----------------------------------------------------------------------
      )?                       end of grouping
----------------------------------------------------------------------
      \@                       '@'
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (                        group and capture to \5:
----------------------------------------------------------------------
        [\w\-\.]+                any character of: word characters
                                 (a-z, A-Z, 0-9, _), '\-', '\.' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of \5
----------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
----------------------------------------------------------------------
        \:                       ':'
----------------------------------------------------------------------
        (                        group and capture to \6:
----------------------------------------------------------------------
          \d+                      digits (0-9) (1 or more times
                                   (matching the most amount
                                   possible))
----------------------------------------------------------------------
        )                        end of \6
----------------------------------------------------------------------
      )?                       end of grouping
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
    /                        '/'
----------------------------------------------------------------------
    (                        group and capture to \7:
----------------------------------------------------------------------
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \7
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    /                        '/'
----------------------------------------------------------------------
    (                        group and capture to \8:
----------------------------------------------------------------------
      \w+                      word characters (a-z, A-Z, 0-9, _) (1
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \8
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      \?                       '?'
----------------------------------------------------------------------
      (                        group and capture to \9:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \9
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (                        group and capture to \10:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \10
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \11:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      ;                        ';'
----------------------------------------------------------------------
      (                        group and capture to \12:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \12
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (                        group and capture to \13:
----------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )                        end of \13
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \11
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------


and my manual explanation for the URI parsing:

^(
  (\w*)  # scheme (e.g. mysql)
  ://
  (?:
    (\w+)  # user
    (?:
      \:
      ([^/\@]*)  # password
    )?
    \@
  )?  # user and password are optional
  (?:
    ([\w\-\.]+)  # host
    (?:
      \:
      (\d+)  # port
    )?  # port is optional
  )?  # host and port are optional
  /  # becomes a third '/' if there is no user, password, host or port
  (\w*)  # get the db (only up to the first '/', if any). Will not work with full paths for sqlite.
)
(?:
  /  # if there is a table
  (\w+)  # get the table
  (?:
    \?  # parameter
    (\w+)
    =
    (\w+)
  )?  # the parameter is optional, but it always requires a table name
)?  # table and parameter are optional
(
  (?:
    ;
    (\w+)
    =
    (\w+)
  )*  # rest of the parameters, if any
)
$
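Once the structure is clear, the same expression could be rewritten with the /x modifier and named captures (available since Perl 5.10) so that it documents itself. This is only a sketch under my own assumptions: the group names and the parse_db_uri helper are hypothetical labels, not part of the original code, and I have dropped the outer capture group and the inner captures of the trailing parameters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';

# Same structure as the breakdown above, laid out with /x.
# The capture names are illustrative labels chosen for this sketch.
my $db_uri_re = qr{
    ^
    (?<scheme> \w* ) ://
    (?:
        (?<user> \w+ )
        (?: : (?<pass> [^/\@]* ) )?
        \@
    )?
    (?:
        (?<host> [\w\-.]+ )
        (?: : (?<port> \d+ ) )?
    )?
    /
    (?<db> \w* )
    (?:
        / (?<table> \w+ )
        (?: \? (?<key> \w+ ) = (?<value> \w+ ) )?
    )?
    (?<params> (?: ; \w+ = \w+ )* )
    $
}x;

# Return a hash reference of the named parts, or nothing if there is no match.
sub parse_db_uri {
    my ($uri) = @_;
    return unless $uri =~ $db_uri_re;
    my %parts = %+;    # %+ holds only the named groups that actually matched
    return \%parts;
}

my $parts = parse_db_uri('mysql://anonymous@my.self.com:1234/dbname/tablename');
say "$_ = $parts->{$_}" for sort keys %$parts;
```

With named groups the caller asks for $parts->{host} instead of remembering that the host is group 5, and adding a new group no longer renumbers the rest.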
            


This regular expression was probably an easy one, but while searching for more examples of YAPE::Regex::Explain I found two interesting links about Perl obfuscation with regular expressions: threads at Stack Overflow and PerlMonks.

2 comments:

xenoterracide said...

Neat module. I'm not normally a grammar Nazi but... it's de-obfuscating (not deofuscating) and StackOverflow (not stake overflow)

Pablo Marin-Garcia said...

Thanks @xenoterracide, corrected