July 23rd, 2013 by Lincoln Baxter III

Guide to Regular Expressions in Java (Part 2)

Often unknown, or heralded as confusing, regular expressions have defined the standard for powerful text manipulation and search. Without them, many of the applications we know today would not function. This two-part series explores the basics of regular expressions in Java, and provides tutorial examples in the hopes of spreading love for our pattern-matching friends. (Read part one.)

Part 2: Look-ahead & Configuration flags

Have you ever wanted to find something in a string, but only when it came before another pattern in that string? Or maybe you wanted to find a piece of text that was not followed by another piece of text? With standard string searching, you would have to write a somewhat complex function to perform the exact operation you wanted. With regular expressions, this can all be done on one line. This chapter also covers the configuration flags that modify some of the behaviors of the regular expression pattern language.

This article is part two in the series: “Guide to Regular Expressions in Java.” Read part one for more information on basic matching, grouping, extracting, and substitution.

1. Look-ahead & Look-behind

Look-ahead and look-behind operations use syntax that could be confused with grouping (see Ch. 1 – Basic Grouping), but these constructs do not capture values: nothing they match is stored for later retrieval, the look-around parentheses themselves do not affect group numbering, and they consume no input. A look-ahead tests the input to the right of the current match position; a look-behind tests the input immediately to the left of that position. E.g.: The statement “my dog is (?!(green|red))\\w+” asserts that neither ‘green’ nor ‘red’ is the word to the look-ahead’s direct right. In other words: My dog is not green or red, but my dog is blue.
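
The statement above can be tried directly. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the whole sentence is tested with matches().

```java
import java.util.regex.Pattern;

class DogColorDemo {
	// Pattern taken from the text; class and method names are illustrative only.
	private static final Pattern NOT_GREEN_OR_RED =
			Pattern.compile("my dog is (?!(green|red))\\w+");

	static boolean describesAllowedColor(String sentence) {
		return NOT_GREEN_OR_RED.matcher(sentence).matches();
	}

	public static void main(String[] args) {
		System.out.println(describesAllowedColor("my dog is blue"));   // true
		System.out.println(describesAllowedColor("my dog is green"));  // false
		System.out.println(describesAllowedColor("my dog is red"));    // false
	}
}
```

The look-ahead only vetoes the position; it is the \\w+ after it that actually consumes the color word.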

Look-ahead/behind constructs (non-capturing)

(?:X) 			X, as a non-capturing group
(?=X) 			X, via zero-width positive look-ahead
(?!X) 			X, via zero-width negative look-ahead
(?<=X) 			X, via zero-width positive look-behind
(?<!X) 			X, via zero-width negative look-behind
(?>X) 			X, as an independent, non-capturing group
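
As a rough illustration of two rows in this table, the sketch below uses a positive look-behind to pick out digits that directly follow a dollar sign, and checks that look-around and non-capturing constructs add nothing to the group count. The input string and the helper names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindDemo {
	// Returns the first run of digits directly preceded by '$', or null if none.
	static String firstDollarAmount(String text) {
		Matcher m = Pattern.compile("(?<=\\$)\\d+").matcher(text);
		return m.find() ? m.group() : null;
	}

	// Reports how many capturing groups a pattern declares.
	static int groupCount(String regex) {
		return Pattern.compile(regex).matcher("").groupCount();
	}

	public static void main(String[] args) {
		System.out.println(firstDollarAmount("3 items cost $45 total")); // 45
		System.out.println(groupCount("(?:a)(?=b)(?<!c)"));              // 0
		System.out.println(groupCount("(a)(?=b)"));                      // 1
	}
}
```

Note that the '$' itself is not part of the returned match, because the look-behind is zero-width.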
So what does this all mean? What does a look-ahead really do for me? Say, for example, we wanted to know whether our input string contains the word “incident”, but only when the word “theft” is not found anywhere in it. We can use a negative look-ahead at the start of the pattern to ensure that there are no occurrences.
"(?!.*theft).*incident.*"
When the whole input is tested against it, as String.matches(…) does, this expression exhibits the following behavior:
"There was a crime incident"			matches
"The incident involved a theft"			does not match
"The theft was a serious incident"		does not match
A more complex example is password validation. Let’s say we want to ensure that a password is made up of at least 8, but at most 12, alphanumeric characters, and contains at least two digits, in any position. We will need a look-ahead expression to enforce the requirement of two digits. This look-ahead requires any number of characters to be followed by a single digit, followed by any number of characters, and another single digit. E.g.: …4…2, or …42, or 42…, or 4…2.

1.1. Example

Sample code
import java.util.ArrayList;
import java.util.List;

public class LookaheadDemo {
	public static void main(String[] args) {
		List<String> input = new ArrayList<String>();
		input.add("password");
		input.add("p4ssword");
		input.add("p4ssw0rd");
		input.add("p45sword");

		for (String password : input) {
			if (password.matches("^(?=.*[0-9].*[0-9])[0-9a-zA-Z]{8,12}$")) {
				System.out.println(password + ": matches");
			} else {
				System.out.println(password + ": does not match");
			}
		}
	}
}

This produces the following output:

password: does not match
p4ssword: does not match
p4ssw0rd: matches
p45sword: matches
Try this example online with our Visual Java Regex Tester

Dissecting the pattern:

"^(?=.*[0-9].*[0-9])[0-9a-zA-Z]{8,12}$"
^				match the beginning of the line
(?=.*[0-9].*[0-9]) 		a look-ahead expression, requires 2 digits to be present
.*				match n characters, where n >= 0
[0-9]				match a digit from 0 to 9
[0-9a-zA-Z]			match any numbers or letters
{8,12}				match 8 to 12 of the preceding element, here the character class [0-9a-zA-Z]
$				match the end of the line

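Multiple look-ahead operations placed at the same position are each tested from that same position, because none of them consumes input; the order in which they are written does not change the result. They must all be satisfied, and if they logically contradict each other, the pattern will never match.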
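
To make this concrete, here is a small sketch that stacks two look-aheads. The extra upper-case requirement is a hypothetical rule added only for illustration and is not part of the example above; the second method shows a deliberately contradictory pair.

```java
class MultiLookaheadDemo {
	// Hypothetical rule: 8-12 alphanumerics, at least two digits,
	// and at least one upper-case letter.
	static boolean isValid(String password) {
		return password.matches("^(?=.*[0-9].*[0-9])(?=.*[A-Z])[0-9a-zA-Z]{8,12}$");
	}

	// Contradictory look-aheads: the next character cannot be a digit
	// and not a digit at the same time, so nothing ever matches.
	static boolean contradictory(String s) {
		return s.matches("(?=[0-9])(?![0-9]).*");
	}

	public static void main(String[] args) {
		System.out.println(isValid("p4ssW0rd"));      // true
		System.out.println(isValid("p4ssw0rd"));      // false (no upper-case letter)
		System.out.println(isValid("P4ssword"));      // false (only one digit)
		System.out.println(contradictory("1abc"));    // false
		System.out.println(contradictory("abc"));     // false
	}
}
```

Swapping the two look-aheads in isValid should give the same results, since both are tested at position 0.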



2. Configuring the Matching Engine

Pattern configuration flags for Java look very similar to look-ahead operations. Flags are used to configure case sensitivity, multi-line matching, and more. Several flags can be combined in one statement, such as (?si), or given as individual statements, such as (?s)(?i). Again, these expressions do not match literal text, and do not capture values.
2.1. Configuration flags
(?idmsux-idmsux)  	Turns match flags on - off for the remainder of the expression
(?idmsux-idmsux:X)   	X, as a non-capturing group with the given flags on - off
2.2. Case insensitivity mode
(?i)  			Toggle case-insensitive matching (default: off, (?-i))
2.3. UNIX lines mode
(?d)  			Enables UNIX line mode (default: off, (?-d)) 
			In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $
2.4. Multi-line mode
(?m)  			Toggle multi-line matching (default: off, (?-m))
			In this mode, the ^ and $ expressions match at the beginning and end of each line,
			respectively. By default, they match only at the beginning and end of the entire
			input sequence/string.
2.5. Dot-all mode
(?s)  			Toggle dot ‘.’ matches any character (default: off, (?-s))
			Normally, the dot character will match everything except newline characters.
2.6. Unicode-case mode
(?u)  			Toggle Unicode standard case matching (default: off, (?-u))
			By default, case-insensitive matching assumes that only characters 
			in the US-ASCII charset are being matched.
2.7. Comments mode
(?x)  			Allow comments in pattern (default: off, (?-x))
			In this mode, whitespace is ignored, and embedded comments starting with '#'
			are ignored until the end of a line.
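
A short sketch of two of these modes in action may help. The helper name and the sample input are assumptions for illustration; the last statement shows that the same switch can also be passed to Pattern.compile as a constant such as Pattern.MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
	// Counts how many times the regex is found in the input.
	static int count(String regex, String input) {
		Matcher m = Pattern.compile(regex).matcher(input);
		int n = 0;
		while (m.find()) {
			n++;
		}
		return n;
	}

	public static void main(String[] args) {
		String twoLines = "one\ntwo";

		// Without (?m), ^ matches only at the start of the input.
		System.out.println(count("^\\w+", twoLines));        // 1
		// With (?m), ^ also matches after each line terminator.
		System.out.println(count("(?m)^\\w+", twoLines));    // 2

		// Without (?s), '.' refuses to match the newline.
		System.out.println(twoLines.matches("one.two"));     // false
		System.out.println(twoLines.matches("(?s)one.two")); // true

		// The same flag given as a compile-time constant.
		Matcher m = Pattern.compile("^two", Pattern.MULTILINE).matcher(twoLines);
		System.out.println(m.find());                        // true
	}
}
```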

2.8. Examples

2.8.1. Global toggle
In order to toggle flags for the entire expression, the statement must be at the head of the expression.
"(?idx)^I\s lost\s my\s .+     #this comment and all spaces will be ignored"
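The above expression will ignore case, recognize only '\n' as a line terminator, and ignore the unescaped whitespace and the trailing comment in the pattern. Note that (?d) is UNIX lines mode; the flag that lets the dot ‘.’ match newlines is (?s).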
Try this example online with our Visual Java Regex Tester
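
Here is a sketch of how the commented pattern might be used from Java, assuming a made-up input sentence. The backslashes are doubled because the pattern sits inside a Java string literal, and a second constant drops the x flag to show what comments mode changes.

```java
class GlobalToggleDemo {
	// Comments mode on: unescaped spaces and the trailing '#...' text are ignored.
	static final String COMMENTED =
			"(?idx)^I\\s lost\\s my\\s .+     #this comment and all spaces will be ignored";

	// Comments mode off: the spaces and the '#...' text become literal characters.
	static final String UNCOMMENTED =
			"(?id)^I\\s lost\\s my\\s .+     #this comment and all spaces will be ignored";

	public static void main(String[] args) {
		System.out.println("i LOST my keys".matches(COMMENTED));   // true
		System.out.println("i LOST my keys".matches(UNCOMMENTED)); // false
	}
}
```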
2.8.2. Local toggle
In order to toggle flags for a single non-capturing group, the group must adhere to the following syntax:
"(?idx:Cars)[a-z]+"
The above expression will ignore case within the group, but adhere to case beyond.
Try this example online with our Visual Java Regex Tester
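
The scoping can be checked with a few inputs. This is a minimal sketch with an illustrative class name and made-up test strings.

```java
class LocalToggleDemo {
	// The i, d and x flags apply only inside the group; [a-z]+ stays case-sensitive.
	static boolean accepts(String input) {
		return input.matches("(?idx:Cars)[a-z]+");
	}

	public static void main(String[] args) {
		System.out.println(accepts("CARSfast"));  // true
		System.out.println(accepts("carsfast"));  // true
		System.out.println(accepts("CARSFAST"));  // false
	}
}
```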
2.8.3. Applied in Java
Sample code
public class ConfigurationDemo {
	public static void main(String[] args) {
		String input = "My dog is Blue.\n" +
				"He is not red or green.";

		boolean controlResult = input.matches("(?=.*Green.*).*Blue.*");
		boolean caseInsensitiveResult = input.matches("(?i)(?=.*Green.*).*Blue.*");
		boolean dotallResult = input.matches("(?s)(?=.*Green.*).*Blue.*");
		boolean configuredResult = input.matches("(?si)(?=.*Green.*).*Blue.*");
		
		System.out.println("Control result was: " + controlResult);
		System.out.println("Case ins. result was: " + caseInsensitiveResult);
		System.out.println("Dot-all result was: " + dotallResult);
		System.out.println("Configured result was: " + configuredResult);
	}
}

This produces the following output:

Control result was: false
Case ins. result was: false
Dot-all result was: false
Configured result was: true

Dissecting the pattern:

"(?si)(?=.*Green.*).*Blue.*"
(?si)			turn on case insensitivity and dotall modes
(?=.*Green.*) 		‘Green’ must be found somewhere to the right of this look-ahead
.*Blue.*		‘Blue’ must be found somewhere in the input
We had to enable dot-all and case-insensitive modes for our pattern to match. The look-ahead in this example is very similar to the pattern itself, and because we don’t care in which order we find these two items, “.*Blue.*” could be rewritten as a second look-ahead, “(?=.*Blue.*)”; when the pattern is tested with String.matches(…), a trailing “.*” must then remain so that the whole input is still consumed. If we did care about the order, we would need to be more precise. To ensure that ‘Green’ comes after ‘Blue’, we would move the look-ahead so that it is tested directly after ‘Blue’, as seen below. Written as “(?si).*Blue.*(?=.*Green.*)”, with the look-ahead last, the pattern works with Matcher.find() but can never satisfy String.matches(…), which requires the whole input to be consumed.
"(?si).*Blue(?=.*Green).*"

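The ordering behavior can be checked with a small sketch. The class name, method name and second input string are assumptions for illustration; the last two statements show how a pattern that ends in a look-ahead behaves under matches() compared with find().

```java
import java.util.regex.Pattern;

class OrderingDemo {
	// Look-ahead tested right after 'Blue'; the trailing .* consumes the rest.
	static boolean blueThenGreen(String input) {
		return input.matches("(?si).*Blue(?=.*Green).*");
	}

	public static void main(String[] args) {
		String blueFirst = "My dog is Blue.\nHe is not red or green.";
		String greenFirst = "He is not red or green.\nMy dog is Blue.";

		System.out.println(blueThenGreen(blueFirst));   // true
		System.out.println(blueThenGreen(greenFirst));  // false

		// Ending in a look-ahead: nothing consumes the input after it, so
		// matches() rejects this pattern while find() accepts it.
		Pattern trailing = Pattern.compile("(?si).*Blue.*(?=.*Green.*)");
		System.out.println(trailing.matcher(blueFirst).matches()); // false
		System.out.println(trailing.matcher(blueFirst).find());    // true
	}
}
```
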
3. Conclusion

Regular expressions provide an extremely flexible and powerful text processing system. Try to imagine doing this work using String.substring(…) or String.indexOf(…), with loops, nested loops, and dozens of if statements. I don’t even want to try… so play around! Think about using regular expressions next time you find yourself doing text or pattern manipulation with looping and other painful methods. Let us know how you do.

This article is part two in the series: “Guide to Regular Expressions in Java.” Read part one for more information on basic matching, grouping, extracting, and substitution.

Lincoln Baxter, III

About the author:

Lincoln Baxter, III is a Principal Software Engineer at Red Hat, working on JBoss open-source projects; most notably as creator & project lead of JBoss Forge, and author of Errai UI. This blog represents his personal thoughts and perspectives, not necessarily those of his employer.

He is a founder of OCPsoft, the author of PrettyFaces and Rewrite, the leading URL-rewriting extensions for Servlet, Java EE, and Java web frameworks; he is also the author of PrettyTime, social-style date and timestamp formatting for Java. When he is not swimming, running, or playing Ultimate Frisbee, Lincoln is focused on promoting open-source software and making web-applications more accessible for small businesses and individuals.

Posted in OpenSource

12 Comments

  2. Examples don't work says:

    This regular expression “.*(?!.*theft)incident.*”
    matches “The theft was a serious incident” what would be the correct regex to have this example work

  3. Oops, you’re correct: this should be the solution (I haven’t checked.)

    “(?!.*theft).*incident.*”

  4. Darek says:

    In the statement “my dog is (?!(green|red))\w+”, why \w+ is used? As I assume, within (?…) I include regexp to match the searched string from the current position in searched string to end of searched string?

    1. It is to make sure that there is actually a word there. The negative lookahead makes sure that the designated words are *not* found, but ‘\w+’ makes sure that there is *some word* there.

  5. Gaurav Jain says:

    I want to split the string : "FNAME,\"STREET,PLACE\",ADDR2,XYZ"

    Desired O/p : (Comma seperated values but ignore the , when it is between "")
    FNAME
    "STREET,PLACE"
    ADDR2
    XYZ

    I seach on internet and find : ",(?=([^\"]*\"[^\"]*\")*[^\"]*$" this regex working fine for me but I am not able to understand this. Also please suggest if there is any better way of doing this.

    Thanks in Advance

  6. Ganes says:

    [ Data ganesh|10
    Output name="ganesh" age ="10"

    I tried this "^(.*)|(.*)$" with replaceAll("name='\1' age='\2'").

    Can you help? (going nuts)

  7. You’re using alternation when you mean to be using a literal character, and you’re also not using the correct substitution symbol. Read the first part of this article series – http://ocpsoft.com/opensource/guide-to-regular-expressions-in-java-part-1/

    Here is your answer

  8. David says:

    In the 1.1 example, why is the initial .* in the look-ahead needed? (And why no trailing .* to match?)

  9. AJ says:

    When using your example,

    (?!.*theft).*incident.*

    the following two strings returned as matches: "theft there was an incident" and "there was a theft incident". I fixed the code as such

    ^(?!.*theft.*$).*incident.*

    and that handled all of the cases correctly. I’m not exactly sure why but it worked. lol
