Sunday, 29 May 2011

Deobfuscating large or complex regular expressions

From time to time you come across a large or complex regular expression that was not written with the //x modifier, so you are left with a one-liner without comments and, after some time trying to decode it, a big headache.
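As a small illustration of what //x buys you, here is a minimal sketch with a made-up pattern. The key=value format and the variable names are my own assumptions for the example and have nothing to do with the RE discussed in this post. Both patterns match the same strings; the second one simply uses //x so that whitespace and comments are allowed inside the pattern.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical one-liner: matches "key=number" strings such as "port=5432".
my $compact = qr/^(\w+)=(\d+)$/;

# The same pattern written with the x modifier: whitespace and
# comments inside the pattern are ignored by the regex engine.
my $readable = qr/
    ^
    (\w+)   # key
    =
    (\d+)   # numeric value
    $
/x;

# Both versions capture the same two pieces.
for my $re ($compact, $readable) {
    if ('port=5432' =~ $re) {
        print "key=$1 value=$2\n";
    }
}
```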

Recently I found an RE of this kind, and a module (YAPE::Regex::Explain) that helps you decompose an RE into its elements.



This regexp parses URIs like:


(NOTE: these URIs are better parsed with URI::Split, but that is another story)

And here is how to decompose it:

#!/usr/bin/env perl

use feature ':5.10';
use strict;
use URI::Split qw(uri_join uri_split);
use YAPE::Regex::Explain;
use Data::Dumper;


# Print a node-by-node explanation of the regular expression passed in.
sub explain_RE {
    my $REx = shift;
    my $exp = YAPE::Regex::Explain->new($REx)->explain;
    print $exp;
}


The regular expression:

matches as follows:

NODE                     EXPLANATION
(?x-ims:                 group, but do not capture (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n):
  ^                        the beginning of the string
  (                        group and capture to \1:
    (                        group and capture to \2:
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
    )                        end of \2
    ://                      '://'
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      (                        group and capture to \3:
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
      )                        end of \3
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
        \:                       ':'
        (                        group and capture to \4:
          [^/\@]*                  any character except: '/', '\@' (0
                                   or more times (matching the most
                                   amount possible))
        )                        end of \4
      )?                       end of grouping
      \@                       '@'
    )?                       end of grouping
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      (                        group and capture to \5:
        [\w\-\.]+                any character of: word characters
                                 (a-z, A-Z, 0-9, _), '\-', '\.' (1 or
                                 more times (matching the most amount
                                 possible))
      )                        end of \5
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
        \:                       ':'
        (                        group and capture to \6:
          \d+                      digits (0-9) (1 or more times
                                   (matching the most amount
                                   possible))
        )                        end of \6
      )?                       end of grouping
    )?                       end of grouping
    /                        '/'
    (                        group and capture to \7:
      \w*                      word characters (a-z, A-Z, 0-9, _) (0
                               or more times (matching the most
                               amount possible))
    )                        end of \7
  )                        end of \1
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
    /                        '/'
    (                        group and capture to \8:
      \w+                      word characters (a-z, A-Z, 0-9, _) (1
                               or more times (matching the most
                               amount possible))
    )                        end of \8
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      \?                       '?'
      (                        group and capture to \9:
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
      )                        end of \9
      =                        '='
      (                        group and capture to \10:
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
      )                        end of \10
    )?                       end of grouping
  )?                       end of grouping
  (                        group and capture to \11:
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
      ;                        ';'
      (                        group and capture to \12:
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
      )                        end of \12
      =                        '='
      (                        group and capture to \13:
        \w+                      word characters (a-z, A-Z, 0-9, _)
                                 (1 or more times (matching the most
                                 amount possible))
      )                        end of \13
    )*                       end of grouping
  )                        end of \11
  $                        before an optional \n, and the end of the
                           string
)                        end of grouping

And my manual explanation of the URI parsing:

                   (\w+)  # user
                     ([^/\@]*)  # password
                 )?  # user and password may be absent
                   ([\w\-\.]+)  # host
                     (\d+)  # port
                   )?  # port optional
                 )?  # host and port optional
                 /  # becomes the third '/' if there is no user, password, host and port
                 (\w*)  # get the db (only up to the first '/', if any). Will not work with full paths for sqlite.
                 /   # if there is a table
                 (\w+)  # get table
                   \?  # parameters
                 )?  # the parameter is optional, but when present there is always a table name
               )?  # optional table and parameter
                 )*  # rest of parameters, if any
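
To check the explanation against real input, here is a runnable sketch. The pattern is rebuilt node by node from the YAPE::Regex::Explain listing above, so it should be equivalent to the RE being explained, although the original may have differed in whitespace and comments. The two sample URIs, the parse_uri helper and the labels are hypothetical and only there for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pattern reconstructed from the node listing; the comments are mine.
my $uri_re = qr{
    ^
    (                             # group 1: everything up to and including the db
      (\w*) ://                   # group 2: scheme
      (?:
        (\w+)                     # group 3: user
        (?: \: ([^/\@]*) )?       # group 4: password
        \@
      )?                          # user and password are optional
      (?:
        ([\w\-\.]+)               # group 5: host
        (?: \: (\d+) )?           # group 6: port, optional
      )?                          # host and port are optional
      /
      (\w*)                       # group 7: db
    )
    (?:
      / (\w+)                     # group 8: table
      (?: \? (\w+) = (\w+) )?     # groups 9 and 10: first parameter
    )?
    (                             # group 11: rest of the parameters
      (?: ; (\w+) = (\w+) )*      # groups 12 and 13: last pair seen
    )
    $
}x;

# Return the 13 captures as a list, or an empty list when there is no match.
sub parse_uri {
    my ($uri) = @_;
    my @captures = $uri =~ $uri_re;
    return @captures;
}

# Hypothetical URI, made up to exercise every group.
my $uri = 'mysql://user:secret@db.example.org:3306/mydb/mytable?id=42;a=1;b=2';
my @c   = parse_uri($uri);

my @labels = qw(all scheme user password host port db
                table param value rest last_key last_value);
for my $i (0 .. $#labels) {
    printf "%-11s %s\n", $labels[$i], $c[$i] // '(undef)';
}
```

Matching in list context returns all thirteen captures at once, so groups that did not take part in the match (for example the user in 'sqlite:///test') come back as undef instead of keeping a stale value from an earlier match.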

This regular expression was probably an easy one, but while searching for more examples of YAPE::Regex::Explain I found two interesting threads about Perl obfuscation with REs, one on StackOverflow and one on PerlMonks.


Unknown said...

Neat module. I'm not normally a grammar Nazi but... it's de-obfuscating (not deofuscating) and StackOverflow (not stake overflow)

Pablo Marin-Garcia said...

Thanks @xenoterracide, corrected