Regex Padding

I was recently working on Aloha’s support for multilabel models and I came across a regular expression issue that was confounding. I want to quickly share the problem and an easy solution.

Problem

When trying to find regular expression matches of something delimited by whitespace, it’s not sufficient to simply add surrounding whitespace to a Pattern. For instance, consider the following example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

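// Naive approach: require a literal space on both sides of the flag.
final Pattern re1 = Pattern.compile(" -q\\s*(\\S{2}) ");
final String subject1 = " -qab -q cd ";

final Matcher m1 = re1.matcher(subject1);
m1.find();        // true
m1.group(1);      // "ab"
m1.find();        // false: the space before the second "-q" was already consumed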

At first glance, the regular expression in re1 seems reasonable: it looks like it should find both values of the capture group, "ab" and "cd". But it fails to find the second value. The first match is " -qab ", including its trailing space. Because the search for the next match starts at the index just past the end of the previous match, that space has already been consumed and is no longer available to act as the leading delimiter of " -q cd ".
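To see the consumed delimiter concretely, the following sketch collects the start and end offsets of every match for the same pattern and subject as re1 and subject1. The class name ConsumedDelimiterDemo and the helper matchSpans are hypothetical names introduced for this illustration; only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConsumedDelimiterDemo {
    /** Returns "start-end" offsets (end exclusive) for each successive match. */
    static List<String> matchSpans(String regex, String subject) {
        final Matcher m = Pattern.compile(regex).matcher(subject);
        final List<String> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Same pattern and subject as re1 / subject1.
        System.out.println(matchSpans(" -q\\s*(\\S{2}) ", " -qab -q cd ")); // [0-6]
    }
}
```

The single match covers offsets 0 to 6 (end exclusive), so it includes the space at index 5, which is the only space that could have introduced the second -q.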

Easy Solution

The solution is to consume the delimiting whitespace on only one end of the pattern. In the following example, the (arbitrary) choice is to consume the whitespace at the beginning of the pattern. The other end uses a zero-width positive lookahead: the (?=\s|$) at the end of the pattern, re2. This says “the next character, if one exists, must be whitespace (but don’t consume it).” At the beginning of the pattern, (^|\s) says “the match must start either at the beginning of the input or with a whitespace character (and DO consume it).”

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final Pattern re2 = Pattern.compile("(^|\\s)-q\\s*(\\S{2})(?=\\s|$)");
final String subject2 = "-qab -q cd";

final Matcher m2 = re2.matcher(subject2);
m2.find();        // true
m2.group(2);      // "ab"
m2.find();        // true
m2.group(2);      // "cd"

An additional capture group is added in front of the main pattern to consume the whitespace character, so the relevant information is now in group 2; that is the only other change needed. Notice that the second match is found this time. Also notice that subject2 isn’t whitespace-padded, since the regular expression takes care of the edge cases at the start and end of the input.
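If renumbering the groups is inconvenient, one option (not used in the examples here) is to make the leading delimiter group non-capturing with (?:...), so that the value stays in group 1. A minimal sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingPadDemo {
    // Same as re2, except that the leading delimiter group does not capture.
    static final Pattern RE = Pattern.compile("(?:^|\\s)-q\\s*(\\S{2})(?=\\s|$)");

    /** Collects group 1 of every match. */
    static List<String> values(String subject) {
        final Matcher m = RE.matcher(subject);
        final List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(values("-qab -q cd")); // [ab, cd]
    }
}
```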

Scala Too

Since Scala’s scala.util.matching.Regex is built on top of java.util.regex, its functions run into the same problem:

val re1 = """ -q\s*(\S{2}) """.r
val subject1 = " -qab -q cd "

// List("ab")
val matches1 = re1.findAllMatchIn(subject1).map(_.group(1)).toList

and with the fix:

val re2 = """(^|\s)-q\s*(\S{2})(?=\s|$)""".r
val subject2 = "-qab -q cd"

// List("ab", "cd")
val matches2 = re2.findAllMatchIn(subject2).map(_.group(2)).toList

Deleting the Matches

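The matches can also be deleted safely, without substituting whitespace in their place. Each match consumes only its leading whitespace, so the whitespace that follows it stays in the string and keeps the neighboring tokens apart. For instance: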

val re2 = """(^|\s)-q\s*(\S{2})(?=\s|$)""".r
val subject2 = "-qab -q cd"
val res = re2.replaceAllIn(subject2, "")
assert( res == "" )
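The example above deletes everything, so it does not show what happens to surrounding tokens. The following sketch compares the padded pattern with the naive space-on-both-sides pattern on a subject that has other tokens around the flags. The subject string, the class name DeleteMatchesDemo, and the helper delete are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

final class DeleteMatchesDemo {
    /** Removes every match of regex from subject. */
    static String delete(String regex, String subject) {
        return Pattern.compile(regex).matcher(subject).replaceAll("");
    }

    public static void main(String[] args) {
        final String subject = "x -qab y -q cd z";

        // Padded pattern: only the leading whitespace of each match is removed.
        System.out.println(delete("(^|\\s)-q\\s*(\\S{2})(?=\\s|$)", subject)); // x y z

        // Naive pattern: both delimiters are removed, gluing the neighbors together.
        System.out.println(delete(" -q\\s*(\\S{2}) ", subject)); // xyz
    }
}
```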

Summary

In summary, consider whitespace-padding your regular expressions with a helper like:

// JAVA
public static String pad(String s) {
  return "(^|\\s)" + s + "(?=\\s|$)";
}

// SCALA
def pad(s: String) = """(^|\s)""" + s + """(?=\s|$)"""
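One caveat when using pad: if s contains a top-level alternation such as -a|-b, plain concatenation attaches the leading delimiter to the first alternative only and the lookahead to the last one only. A possible variant, sketched here under the hypothetical name padGrouped, wraps s in a non-capturing group first so that the numbering of any groups inside s is unaffected:

```java
import java.util.regex.Pattern;

final class PadDemo {
    /** The helper from the summary, unchanged. */
    static String pad(String s) {
        return "(^|\\s)" + s + "(?=\\s|$)";
    }

    /** Variant that groups s so both delimiters apply to every alternative. */
    static String padGrouped(String s) {
        return "(^|\\s)(?:" + s + ")(?=\\s|$)";
    }

    static boolean found(String regex, String subject) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        // "-ax" is not a whitespace-delimited "-a", yet the ungrouped pattern
        // matches it through its first alternative, "(^|\s)-a".
        System.out.println(found(pad("-a|-b"), "-ax"));        // true
        System.out.println(found(padGrouped("-a|-b"), "-ax")); // false
        System.out.println(found(padGrouped("-a|-b"), "x -b")); // true
    }
}
```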