May 2nd, 2014 by Lincoln Baxter III

How to interrupt a long-running “infinite” Java regular expression

If you’ve ever done a lot of work with Regular Expressions, you’re probably familiar with the concept of catastrophic backtracking, which means that the engine is forced to calculate permutations of exponential proportions. For instance, click here run this example and see how long it takes (should be about 5-10 seconds to time out):
LongRunningRegexExample.java
public class LongRunningRegexExample
{
   public static void main(String[] args) throws InterruptedException
   {
      final Pattern pattern = Pattern.compile("(0*)*A");
      final String input = "00000000000000000000000000";

      long startTime = System.currentTimeMillis();
      Matcher matcher = pattern.matcher(input);
      matcher.find(); // runs for a long time!
      System.out.println("Regex took:" + (System.currentTimeMillis() - startTime) + "ms");
   }
}
However, a small change results in near instantaneous response. Why?

To quote Andreas Haufler from Dzone, in his article How to Kill Java with a Regular Expression, from where this very concise example came:

What at first glance looks like an infinite loop turns out to be catastrophic backtracking.

What this basically means is that the matcher detects that no “A” was found at the end of the input. Now the outer quantifier goes one step back, the inner one forward, and again, no result.

Therefore, the matcher goes back step by step and tries all combinations to find a match. It will eventually return (without a match), but the complexity (and therefore the runtime) of this is exponential (adding one character to the input doubles the runtime).

So just how might we protect ourselves from such a devastating effect, should we encounter a scenario that could cause the Java process to run amuck for days, or even years?

You might be able to simply take care to avoid this situation in your patterns, if you have that luxury, but if you are hosting an application that accepts regular expressions as input – for example an online java visual regex tester, then you’ll need to protect from this or be vulnerable to denial of service attacks.

The answer, thankfully, comes fairly simply but requires the introduction of a thread, and a very special character sequence implementation that I found on StackOverflow.

InterruptibleRegexExample.java
public class InterruptibleRegexExample
{
   public static void main(String[] args) throws InterruptedException
   {
      final Pattern pattern = Pattern.compile("(0*)*A");
      final String input = "00000000000000000000000000";

      Runnable runnable = new Runnable() {
         @Override
         public void run()
         {
            long startTime = System.currentTimeMillis();
            Matcher interruptableMatcher = pattern.matcher(new InterruptibleCharSequence(input));
            interruptableMatcher.find(); // runs for a long time!
            System.out.println("Regex took:" + (System.currentTimeMillis() - startTime) + "ms");
         }
      };

      Thread thread = new Thread(runnable);
      thread.start();

      Thread.sleep(500);
      thread.interrupt();
   }
}
This results in a nice, lovely, interrupted regular expression:
Exception in thread "Thread-2" java.lang.RuntimeException: Die! ... Why won't you DIE!
	at org.ocpsoft.tutorial.regex.server.InterruptRegex$InterruptibleCharSequence.charAt(InterruptRegex.java:72)
	at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
	at java.util.regex.Pattern$Curly.match0(Pattern.java:3760)
	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
You’ll of course need the InterruptibleCharSequence, which will impact performance to some extent, but it’s better than waiting a year:
InterruptibleCharSequence.java
/**
 * {@link CharSequence} that notices {@link Thread} interrupts
 * 
 * @author gojomo - http://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match
 */
private static class InterruptibleCharSequence implements CharSequence {
   CharSequence inner;

   public InterruptibleCharSequence(CharSequence inner) {
      super();
      this.inner = inner;
   }

   @Override
   public char charAt(int index) {
      if (Thread.currentThread().isInterrupted()) {
         throw new RuntimeException("Interrupted!");
      }
      return inner.charAt(index);
   }

   @Override
   public int length() {
      return inner.length();
   }

   @Override
   public CharSequence subSequence(int start, int end) {
      return new InterruptibleCharSequence(inner.subSequence(start, end));
   }

   @Override
   public String toString() {
      return inner.toString();
   }
}

Conclusion

I had fun learning and writing about this, and I hope that this blog summarizes both the problem and solution (in Java) for anyone else who might run into it. Happy text processing!

Posted in Regex

7 Comments

  1. David M. Lloyd says:

    Nice trick, Lincoln, I’ll have to remember this one. I’d like to point out though that embedding an InterruptedException as an exception cause is kind of a weird to do and might possibly result in confusion later on. Better practice would be to make a custom runtime exception subtype called "MatchCancelledException" or something like that, and just leave the interrupt flag set.

    1. Yep, that’s a good point. I hadn’t thought of that. In this simple case it’s not a problem, but it certainly could be in other situations. Thanks, David!

  2. Tobias Budde says:

    There have been multiple requests to make the JDK Matcher class interruptable by default.
    All requests have been rejected as "Won’t fix". Have a look at this one:
    https://bugs.openjdk.java.net/browse/JDK-7178072

    I like your sotution to the problem a lot. Thanks!

    1. That’s unfortunate, but thank you! I wish I could say that I came up with the solution myself – however, there’s nothing to prevent you from creating your own `InterruptableMatcher` utility class/wrapper! I think that would be a nice feature for a "java fixes" library.

  3. Welyab says:

    What is the need of InterruptibleCharSequence class?

  4. Stefan Reich says:

    Interesting. I used this for my project (JavaX). I am using a slightly different approach for interruption though – basically just querying a global boolean. I suppose that will be faster than Thread.isInterrupted().

    Cheers,
    Stefan

Reply to Lincoln Baxter III




Please note: In order to submit code or special characters, wrap it in

[code lang="xml"][/code]
(for your language) - or your tags will be eaten.

Please note: Comment moderation is enabled and may delay your comment from appearing. There is no need to resubmit your comment.