Advanced regular expression, ReDOS and related discussions

Regular expression, invented by the American mathematician Stephen Cole Kleene, is a search pattern used to match, or find and replace, strings.

While its name is the same as regular expression in formal theory, (they even have the same Wikipedia article), they are different things, representing the different capabilities of text recognization. Read my article about formal language theory, to know more about their distinction.

My post will, in a more concise way, cover the most (advanced) information from regular-expression.info which discusses various implementations by PCRE/Perl/Python/Java engines. While I will add caveats focused on the Regular Expression (RegExp) defined in ECMA (i.e., the RegExp engine implemented in Javascript). At the end of the post, I will discuss the ReDOS attack and how to protect against it.

In javascript, there are packages such as xregexp, babel-plugin-transform-modern-regexp that add several additional capabilities such as named capture, inline comments, leading mode modifier, free-spacing, ...  to the original RegExp.

regex101.com is a useful website to test your regular expression only.

regex101: build, test, and debug regex
Regular expression tester with syntax highlighting, explanation, cheat sheet for PHP/PCRE, Python, GO, JavaScript, Java. Features a regex quiz & library.

Repetition

The question mark (X?) is called quantifier or one repetition, is equivalent to X{0,1}.

The asterisk, or star, (X*)  is equivalent to X{0,}.

The plus (X+) is equivalent to X{1,}.

Greediness

All repetitions (quantifier/one repetition, asterisk/star, plus, curly braces) are greedy. To turn them to non-greedy/ungreedy/reluctant, append a question mark after the repetition. For example: X+?, X??, X*?, X{1,10}?.

Possessive quantifier

Similar to non-greedy modifier (X+?), the possessive quantifier is the plus after repetition ( X++, X?+, X*+, X{1,10}+).

The possessive quantifiers prevent the matching group from backtracking.

Possessive quantifiers can be achieved using atomic grouping. For example: X*+ is equivalent to (?>X*).

Javascript does not support possessive quantifiers.

Atomic grouping

  • Isn't supported by Javascript.
  • Syntax: (?>X) (e.g.: (?>integer|insert|in))
  • Look-around groups ( (?!X), (?=X), (?<!X), (?<=X) ) are atomic.
  • Is non-capturing.
  • Do not back-tracking. And thus, prevent ReDOS from happening.

Non-capturing group

A non-capturing group is a group that doesn't create a numbered capturing group that is used for back referencing, or, in replacement.

A normal group (X) can be made non-capturing by adding a question-colon (?:X).

Regexp engines that support the named group also have an option to make all unnamed groups non-capturing. However, the default behaviors among flavors are inconsistent. This suggests we explicit this option (if possible) avoid confusion.

When a capturing group is repeated, the last iteration will be captured.

Javascript does support non-capturing groups. /a(?:c)(b)\1/.test('acbb') returns true.

Look-around group

Is zero-length matching, which can be look-ahead or look-behind.

  • Positive look-ahead: (?=X).
  • Negative look-ahead: (?!X).
  • Positive look-behind: (?<=X).
  • Negative look-behind: (?<!X).

Look-around groups are all atomic, and themselves non-capturing.

Note that most engines match a look-behind group by firstly determine how many characters to step back and apply the pattern forwardly from there. This implies the impossibility of features that can be used in look-behind groups. For example repetition, back-references, or alternation with different length members cannot be used.

Javascript does support look-around.

Named group

Invented in Python, used to make capturing groups easier to be read.

  • Python syntax: (?P<name>X) for naming, (?P=name) for referencing.
  • .Net syntax: (?<name>X) or (?'name'X) for naming, \k<name> or \k'name' for referencing.
  • Perl 5.10, PCRE 7.2: supports both Python and .Net syntax, and add \k{name}, \g{name} for referencing.

RegExp engines which support named groups have different default behaviors on numbering named group. Therefore, it is recommended to explicitly configure the behavior (if possible) if named groups and number groups are mixed together.

Javascript does support named group from ECMAScript 2018 with the .Net and JGSoft syntax (?<name>X) and \k<name>. In Javascript, adding a name to a group does not change the numbering. e.g.: /a(?c)(b)\1/.test('acbc') returns true. In replacement, use the syntax $<name>.

Back-references

\1, \2, ... can be used to back-reference a numbered capturing group.

Depending on the regex engine, (?P=name), \k<name>, \k'name', \k{name}, or, \g{name} can be used to back-reference a named group.

Back-references can be used to repeat the matched group in the same regexp pattern, or, used in replacement.

In many engines, a back-reference to a non-participating capturing group always fails to match. For example: (a)?b\1 fails to match b. However, in Javascript, a back-reference to a non-participating group successfully match nothing. In other words, in Javascript, (a)?b\1 does match b.

In most engines, back-references to non-existent capturing groups cause an error. In Javascript, \1 to \7 are treated as octal escapes if the associated number group does not exist. In a similar situation, \8 and \9 throw an error.

Javascript does not allow forward reference (a reference to a group that appears later).

Many regex engines support nested references. In Javascript, nest references do not cause errors and are treated as back-references to non-participating groups.

Some flavors also support relative back-references, such as \k<-1>, \k<-3>, \k<-5>, \g<-1>, \g<-3>, ...

Javascript only supports absolute or named references. In replacement, use the syntax $1, $2, ...

Recursion

Recursion has a similar syntax with back-reference.
(?R), (?0), or \g<0> are used to recursively repeat the whole regex. In text replacement, they refer to the whole match.

Javascript does not support regexp recursion.

Subroutines

Subroutine is the syntax to repeat the pattern, similar to recursion. It is not to repeat the matched string, like back-reference.

Syntax:

  • Numbered group: (?1) / \g<1> .
  • Next group: (?+1) / \g<+1> / \g'+1' .
  • Preceding group: (?-1) / \g<-1> / \g'-1' .
  • Named group: (?&name) / (?P>name) / \g<name> / \g'name' .

Javascript does not support subroutines.

Conditioning

(?ifthen|else) can be used to condition the matching.

Basically, the if part can be a look-around group, (?(?=regex)then|else) or (?(?!regex)then|else).

Note: the conditional check (the if part) does not create any group, and thus is non-capturing.

The if part can be a numbered group, a named group, or a relative group. This highly depends on the favor of the regexp engines. For example: (?(test)then|else), (?(<test>)then|else), (?('test')then|else), (?(-3)then|else), (?(1)then|else).

Javascript does not support conditional groups..

Free-spacing and comments

In free-spacing mode (specified by x flag), whitespace (spaces, tabs, line breaks) between expression tokens are ignored.

Note that:  a single token cannot be broken, for example: (?> is not the same as (? >, because ?> is a single token, while (?> and ( ?> are the same.

Character classes are single tokens. Thus, [abc] and [ a b c ] are not the same.

In free-spacing mode, # starts a comment and ends until the end of line character.

Javascript does not support free-spacing but XRegExp does.

(?x)
(
    [0-9]{4} # year
    -
    [0-9]{2} # month
    -
    [0-9]{2} # day
)

Without free-spacing, the inline comment is also available with the (?#comment) syntax.

x flag is not yet standardized by ES. babel-plugin-transform-modern-regexp and XRegExp supports this flag.

Mode modifiers

The modes (configuration) can be specified outside the expression in programming languages, or, specified using mode modifier with the syntax (?flag). Multiple flags can be combined together, such as (?ismx).

Usually, mode modifiers are applied to the right side of the modifiers. In Python, once a mode modifier is included, it will be applied to the whole expression.

(?-flag) (for example (?-ism)) is used to turn the mode off. Mode on/off modifiers can be mixed, like (?i-sm).

Instead of turning the mode on/off, mode span syntax (?flag:regex) (for e.g.: (?i:a-case-sensitive-match)) can be used to apply modifiers onto a group.

  • i: case insensitive.
  • x: free-spacing mode.
  • m: multiline mode.
  • n: make unnamed groups non-capturing.
  • U: non-greedy mode.

Javascript does not support inline mode modifiers. The mode can be specified using flags which is applied to the whole expression.


ReDOS attack

Regular expression Denial of Service (ReDoS), or, Catastrophic Backtracking, or, exponential time regex is a method to exploit the vulnerability where evil regular expressions are used. An evil regular expression is regexp whose running time is exponential related to the input size (for a crafted input), causing the CPU busy for a long time, or, causing stack overflow due to extremely deep recursions.

If the regular expression only uses the syntax defined by regular expression in formal theory, its matching domain is equal to regular language, thus, can be built using Deterministic Finite Automaton. However, most regular expression engines have a larger capability compared to the one with the same name used defined in formal language. Moreover, not all implementations are written in an efficient way. Perl has a relatively good implementation (example).

An example from OWASP, a naive implementation of regexp, i.e. an implementation which just do backtracking,  will match 65536 possible paths of the regexp ^(a+)+$ for aaaaaaaaaaaaaaaaX before failing the match.

Even with a pure implementation of regexp, catastrophic backtracking can be prevented by

  • Avoid using nested quantifiers. e.g.: (a*)*.
  • Use atomic group.
  • Use possessive quantifier.
  • If an alternated group is used with repetition, the start of each alternative inside the group is not optional and mutually exclusive with the start of all the other alternatives, and mutually exclusive with the token that follows it inside its alternative inside the group.

Within Javascript, safe-regex is a package used to test if a regex is possible to run in exponential time by checking the nesting depth. The check isn't always correct all the time but it covers the major cases.

detect-unsafe-regex rule of eslint-plugin-security makes use of this package and gives a warning if an unsafe regex is used.

The computational complexity of regexp

The problem of determining whether a regular expression recognizes or rejects a string is called "The regular expression inclusion problem".

Suppose we are dealing with the original version of regular expression in formal theory (without back-reference, conditioning, or, recursion), and we try to match the whole string instead of matching a sub-string.

The regexp has a length m after the expansion (of size repetition, such as a{3} becomes aaa), the haystack string has a length n.

According to Wikipedia, there are at least these 3 approaches.

  • Convert the regular expression to an NFA (time/space O(m)). Build a DFA from the NFA (size/space O(2m)). Time and space complexity for building the DFA will be O(2m) while the matching time complexity is O(n).
  • Simulate NFA directly without building DFA. Space complexity is O(m), while the matching time complexity is up to O(m*n).
  • Naive algorithm with backtracking. This approach is most flexible but vulnerable to ReDoS attack because it is easy to craft an input string which causes the running time exponential (O(2n)) related to the input size if the regexp is not well constructed.

From the same Wikipedia article, when back-reference is in-used, a naive approach of duplicating an NFA for each back-reference has a complexity of O(n2k+2) time and O(n2k+1) space for k is the number of back-references in the regexp.