Regex Padding
I was recently working on Aloha’s support for multilabel models and I came across a regular expression issue that was confounding. I want to quickly share the problem and an easy solution.
Problem
When trying to find regular expression matches of something delimited by whitespace, it’s not sufficient to simply add surrounding whitespace to a Pattern. For instance, consider the following example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive attempt: a literal space on each side of the pattern.
final Pattern re1 = Pattern.compile(" -q\\s*(\\S{2}) ");
final String subject1 = " -qab -q cd ";
final Matcher m1 = re1.matcher(subject1);
m1.find();   // true
m1.group(1); // "ab"
m1.find();   // false: "cd" is never found
At first glance, the regular expression in re1 seems reasonable: it looks like it should find both values of the first capture group, "ab" and "cd". But it fails to find the second value. The trailing space in the first match, " -qab ", is consumed by that match and is no longer available to delimit the second one, because the search for the second match resumes at the first character after the end of the first match.
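To make the consumed delimiter visible, the following sketch prints where the first match ends and what input is left for the next search. The class and method names (ConsumedDelimiterDemo, firstMatchEnd) are made up for illustration; only the pattern and subject come from the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedDelimiterDemo {
    // The naive pattern from above: a literal space on both ends.
    static final Pattern NAIVE = Pattern.compile(" -q\\s*(\\S{2}) ");

    /** Returns the index at which the first match ends, or -1 if nothing matches. */
    static int firstMatchEnd(String subject) {
        Matcher m = NAIVE.matcher(subject);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String subject = " -qab -q cd ";
        int end = firstMatchEnd(subject);
        // The first match covers [0, end) and includes the trailing space.
        System.out.println("first match ends at index " + end);
        // The next find() resumes here, where no leading space remains.
        System.out.println("remaining input: \"" + subject.substring(end) + "\"");
    }
}
```

The remaining input begins with "-q" rather than " -q", so the leading space required by the pattern can no longer be matched.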
Easy Solution
The solution involves consuming the delimiting whitespace on just one end of the pattern. In the following example, an arbitrary decision was made to consume the whitespace at the beginning of the pattern and leave it unconsumed at the end. To leave it unconsumed, use a zero-width positive lookahead: the (?=\s|$) at the end of the pattern re2. This says “the next character, if one exists, must be whitespace (but don’t consume it).” At the beginning of the pattern, (^|\s) says “if a preceding character is present, it must be whitespace (and DO consume it).”
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Consume whitespace (or start of input) in front; only look ahead at the end.
final Pattern re2 = Pattern.compile("(^|\\s)-q\\s*(\\S{2})(?=\\s|$)");
final String subject2 = "-qab -q cd";
final Matcher m2 = re2.matcher(subject2);
m2.find();   // true
m2.group(2); // "ab"
m2.find();   // true
m2.group(2); // "cd"
An additional capture group is added in front of the main pattern to consume the whitespace character, so the group containing the relevant information is now group 2, but this is the only thing necessary to change. Notice the second match is found this time. Also notice that subject2 isn’t whitespace-padded, since the regular expression takes care of the edge cases.
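The edge cases can be checked with a small sketch like the one below. It reuses the re2 pattern unchanged; the class name, the values helper, and the second test string are illustrative assumptions rather than part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PaddedPatternCheck {
    // Same pattern as re2 above.
    static final Pattern RE2 = Pattern.compile("(^|\\s)-q\\s*(\\S{2})(?=\\s|$)");

    /** Collects group 2 of every match, in order of appearance. */
    static List<String> values(String subject) {
        List<String> out = new ArrayList<>();
        Matcher m = RE2.matcher(subject);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Flags at the very start and very end of the input are both found.
        System.out.println(values("-qab -q cd"));
        // "x-qab" has no delimiter in front and "-qcde" has none after "cd",
        // so only the last flag qualifies.
        System.out.println(values("x-qab -qcde -q ef"));
    }
}
```

The second call shows that the padding still does its job as a delimiter: flags embedded in a longer token are rejected.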
Scala Too
Since Scala runs on the JVM and scala.util.matching.Regex is built on java.util.regex, the functions in Regex run into the same problem:
val re1 = """ -q\s*(\S{2}) """.r
val subject1 = " -qab -q cd "
// List("ab")
val matches1 = re1.findAllMatchIn(subject1).map(_.group(1)).toList
and with the fix:
val re2 = """(^|\s)-q\s*(\S{2})(?=\s|$)""".r
val subject2 = "-qab -q cd"
// List("ab", "cd")
val matches2 = re2.findAllMatchIn(subject2).map(_.group(2)).toList
Deleting the Matches
The matches can also safely be deleted without replacing with whitespace. For instance:
val re2 = """(^|\s)-q\s*(\S{2})(?=\s|$)""".r
val subject2 = "-qab -q cd"
val res = re2.replaceAllIn(subject2, "")
assert( res == "" )
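For comparison, here is a rough Java sketch of the same deletion on a subject that also contains other tokens, next to the naive space-padded pattern. The subject string and the names DeleteFlags, deletePadded and deleteNaive are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class DeleteFlags {
    // re2 from above: consume the delimiter in front, only look ahead at the end.
    static final Pattern PADDED = Pattern.compile("(^|\\s)-q\\s*(\\S{2})(?=\\s|$)");
    // re1 from above: a literal space on both ends.
    static final Pattern NAIVE = Pattern.compile(" -q\\s*(\\S{2}) ");

    static String deletePadded(String s) {
        return PADDED.matcher(s).replaceAll("");
    }

    static String deleteNaive(String s) {
        return NAIVE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String subject = "foo -qab bar -q cd baz";
        // Each match removes exactly one delimiter, so neighbours stay separated.
        System.out.println(deletePadded(subject)); // foo bar baz
        // Each match removes both delimiters, gluing the neighbours together.
        System.out.println(deleteNaive(subject));  // foobarbaz
    }
}
```

One caveat of this sketch: when a flag sits at the very start of the input, the whitespace that follows it is left in place, so the result may begin with a space.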
Summary
So, in summary, consider whitespace padding your regular expressions with (^|\s) at the beginning and (?=\s|$) at the end, rather than with literal whitespace on both sides.
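One possible way to package this padding is a small helper that wraps any regex fragment. This is a sketch, not something from the original examples: the pad method is hypothetical, and it swaps the leading capture group for a non-capturing one, (?:^|\s), so that groups inside the wrapped fragment keep their original numbering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimited {
    /**
     * Hypothetical helper: wraps a regex fragment so it only matches when it is
     * delimited by whitespace or by the ends of the input. Both added groups are
     * non-capturing, so capture groups in `inner` keep their numbering.
     */
    static Pattern pad(String inner) {
        // The extra (?:...) around `inner` keeps a top-level alternation intact.
        return Pattern.compile("(?:^|\\s)(?:" + inner + ")(?=\\s|$)");
    }

    public static void main(String[] args) {
        Matcher m = pad("-q\\s*(\\S{2})").matcher("-qab -q cd");
        while (m.find()) {
            // The value is back in group 1 because the prefix does not capture.
            System.out.println(m.group(1));
        }
    }
}
```

With this variation the calling code can keep using group 1, at the cost of a slightly longer pattern.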