OCPsoft

Guide to Regular Expressions in Java (Part 1)

February 22nd, 2012 by Lincoln Baxter III
Often unknown, or heralded as confusing, regular expressions (regex) have defined the standard for powerful text manipulation and search. Without them, many of the applications we know today would not function. This two-part series explores the basics of regular expressions in Java, and provides tutorial examples in the hopes of spreading love for our pattern-matching friends. (Read part two.)

Part 1: What are Regular Expressions?

Regular expressions are a language of string patterns built into most modern programming languages, including Java (1.4 onward). They can be used for searching, extracting, and modifying text. This chapter will cover basic syntax and use.

This article is part one in the series: “Guide to Regular Expressions in Java.” Read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.

1. Syntax

Regular expressions, by definition, are string patterns that describe text. These descriptions can then be used in nearly infinite ways. The basic language constructs include character classes, quantifiers, and meta-characters.

1.1. Character Classes

Character classes define the content of the pattern, i.e. what should the pattern look for?

.  	Dot, any character (may or may not match line terminators, read on)
\d  	A digit: [0-9]
\D  	A non-digit: [^0-9]
\s  	A whitespace character: [ \t\n\x0B\f\r]
\S  	A non-whitespace character: [^\s]
\w  	A word character: [a-zA-Z_0-9]
\W  	A non-word character: [^\w]

Notice, however, that in a Java string literal you will need to “double escape” these backslashes, because the backslash is also the escape character of the Java language itself.

String pattern = "\\d \\D \\W \\w \\S \\s";
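As a minimal sketch of the double escaping described above, the following checks single characters against a few of the predefined character classes. The class and method names are illustrative and are not part of the original article.

```java
// Illustrative sketch; CharacterClassDemo and its methods are hypothetical names.
final class CharacterClassDemo {

    // The regex \d is written as "\\d" in Java source code.
    static boolean isSingleDigit(String s) {
        return s.matches("\\d");
    }

    // The regex \w covers letters, digits, and the underscore.
    static boolean isSingleWordChar(String s) {
        return s.matches("\\w");
    }

    // The regex \s covers spaces, tabs, and line breaks.
    static boolean isSingleWhitespace(String s) {
        return s.matches("\\s");
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));       // true
        System.out.println(isSingleDigit("x"));       // false
        System.out.println(isSingleWordChar("_"));    // true
        System.out.println(isSingleWhitespace("\t")); // true
    }
}
```

Forgetting the second backslash, as in "\d", is a compile error, because \d is not a valid Java string escape.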

1.2. Quantifiers

Quantifiers specify how many times part of a pattern should match or repeat. A quantifier binds to the expression or group to its immediate left.

*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
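The quantifiers in the table above can be tried out with a short sketch like the following. The helper name matchesWhole is hypothetical; it simply wraps String.matches, which requires the whole input to match.

```java
// Illustrative sketch; QuantifierDemo and matchesWhole are hypothetical names.
final class QuantifierDemo {

    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a*", ""));            // true: zero or more
        System.out.println(matchesWhole("a+", ""));            // false: at least one needed
        System.out.println(matchesWhole("colou?r", "color"));  // true: the 'u' is optional
        System.out.println(matchesWhole("\\d{3}", "12"));      // false: exactly three digits
        System.out.println(matchesWhole("\\d{2,}", "12345"));  // true: at least two digits
        System.out.println(matchesWhole("\\d{2,4}", "12345")); // false: at most four digits

        // A quantifier binds only to what is immediately to its left.
        System.out.println(matchesWhole("ab+", "abab"));       // false: '+' applies to 'b' only
        System.out.println(matchesWhole("(ab)+", "abab"));     // true: '+' applies to the group
    }
}
```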

1.3. Meta-characters

Meta-characters are used to group, divide, and perform special operations in patterns.

\   	Quote the next meta-character
^   	Match the beginning of the input (or of each line, in multiline mode)
.   	Match any character (except line terminators, by default)
$   	Match the end of the input (or before a line terminator at the end)
|   	Alternation (‘or’ statement)
()  	Grouping
[]  	Custom character class
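To make the meta-characters above more concrete, here is a small sketch that searches for a pattern anywhere in the input using Matcher.find(). The helper name found is an assumption made for this example.

```java
import java.util.regex.Pattern;

// Illustrative sketch; MetaCharacterDemo and found are hypothetical names.
final class MetaCharacterDemo {

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^cat", "cat food"));  // true: 'cat' is at the start
        System.out.println(found("^cat", "my cat"));    // false: 'cat' is not at the start
        System.out.println(found("cat$", "my cat"));    // true: 'cat' is at the end
        System.out.println(found("3.14", "3x14"));      // true: '.' matches any character
        System.out.println(found("3\\.14", "3x14"));    // false: '\.' is a literal dot
        System.out.println(found("gr[ae]y", "grey"));   // true: custom class of 'a' or 'e'
        System.out.println(found("cat|dog", "hot dog")); // true: alternation
    }
}
```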

2. Examples

2.1. Basic Expressions

Any string that contains no meta-characters is itself a regular expression. For example, the string “I lost my wallet” is a regular expression that will match the text “I lost my wallet” and will ignore everything else.

What if we want to be able to find more things that we lost? We can replace wallet with an expression that will match any word.

"I lost my \\w+"

As you can see, this pattern uses both a character class and a quantifier. “\w” says match a word character, and “+” says match one or more. So when combined, the pattern says “match one or more word characters.”

Now the pattern will match any word in place of “wallet”, e.g. “I lost my sablefish” or “I lost my parrot”. It will not match “I lost my: trooper”, however, because the pattern expects a space directly after “my”; as soon as the expression finds the ":" character instead, it stops matching.

If we want the expression to be able to handle this situation, then we need to make a small change.

"I lost my:? \\w+"

Now the expression will allow an optional ":" directly after the word ‘my’.
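The two patterns from this section can be compared side by side with a sketch along these lines. The class and constant names are made up for illustration; String.matches requires the whole string to match the pattern.

```java
// Illustrative sketch; LostItemDemo and its constants are hypothetical names.
final class LostItemDemo {

    static final String WORD_ONLY = "I lost my \\w+";
    static final String OPTIONAL_COLON = "I lost my:? \\w+";

    public static void main(String[] args) {
        String[] inputs = { "I lost my sablefish", "I lost my parrot", "I lost my: trooper" };
        for (String text : inputs) {
            // Print the result of each pattern for the same input.
            System.out.println(text
                    + " | without colon: " + text.matches(WORD_ONLY)
                    + " | optional colon: " + text.matches(OPTIONAL_COLON));
        }
    }
}
```

Only the third input behaves differently: it is rejected by the first pattern and accepted by the second.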

2.2. Basic Grouping

An important feature of regular expressions is the ability to group sections of a pattern, and provide alternate matches.

|   	Alternation (‘or’ statement)
()  	Grouping

These two meta-characters are core parts of flexible regular expressions. For instance, in the first example we lost our wallet. What if we knew exactly which types of objects we had lost, and we wanted to find those objects but nothing else?

We can use a group (), with an ‘or’ meta-character in order to specify a list of expressions to allow in our match.

"I lost my:? (wallet|cell phone|car|marbles)"

The new expression will now match the beginning of the string “I lost my”, an optional ":", a space, and then any one of the expressions in the group, which are separated by the alternation character "|". Any one of ‘wallet’, ‘cell phone’, ‘car’, or our ‘marbles’ would be a match.

"I lost my wallet"		matches
"I lost my wallets"		matches		the ‘s’ is not needed, is ignored when searching (a whole-string match would fail)
"I lost my: car"		matches
"I lost my- car"		doesn’t match	‘-’ is not allowed in our pattern
"I lost my: cell"		doesn’t match	all of ‘cell phone’ is needed
"I lost my: cell phone"		matches
"I lost my cell phone"		matches
"I lost my marbles"		matches

As you can see, the combinations for matches quickly become very large. This is not the complete set, as there are several more phrases that would match our simple pattern.
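Whether a phrase from the table counts as a match depends on how the pattern is applied. The sketch below, with hypothetical helper names, contrasts searching within the text (Matcher.find()) with requiring the whole text to match (Matcher.matches()).

```java
import java.util.regex.Pattern;

// Illustrative sketch; GroupingDemo, foundIn, and matchesAll are hypothetical names.
final class GroupingDemo {

    private static final Pattern LOST =
            Pattern.compile("I lost my:? (wallet|cell phone|car|marbles)");

    // True if the pattern occurs anywhere in the text.
    static boolean foundIn(String text) {
        return LOST.matcher(text).find();
    }

    // True only if the pattern accounts for the entire text.
    static boolean matchesAll(String text) {
        return LOST.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(foundIn("I lost my wallets"));     // true: trailing 's' is ignored
        System.out.println(matchesAll("I lost my wallets"));  // false: trailing 's' is left over
        System.out.println(foundIn("I lost my- car"));        // false: '-' is not in the pattern
        System.out.println(foundIn("I lost my: cell"));       // false: all of 'cell phone' is needed
        System.out.println(matchesAll("I lost my: cell phone")); // true
    }
}
```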

2.3. Matching/Validating

Regular expressions make it possible to test whether text matches a certain pattern, returning a Boolean value that indicates whether the pattern was found. This can be used to validate input such as phone numbers, social security numbers, email addresses, and web form data, to scrub data, and much more. E.g. if a string matches a pattern that describes an SSN, then the string has the form of an SSN.

Sample code
import java.util.ArrayList;
import java.util.List;

public class ValidateDemo {
	public static void main(String[] args) {
		List<String> input = new ArrayList<String>();
		input.add("123-45-6789");
		input.add("9876-5-4321");
		input.add("987-65-4321 (attack)");
		input.add("987-65-4321 ");
		input.add("192-83-7465");


		for (String ssn : input) {
			if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$")) {
				System.out.println("Found good SSN: " + ssn);
			}
		}
	}
}

This produces the following output:

Found good SSN: 123-45-6789
Found good SSN: 192-83-7465

Dissecting the pattern:
"^(\\d{3}-?\\d{2}-?\\d{4})$"
^		match the beginning of the line
() 		group everything within the parentheses as group 1
\d{n}		match n digits, where n is a number equal to or greater than zero
-?		optionally match a dash
$		match the end of the line
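When the same validation runs many times, one option is to compile the pattern once and reuse it. The following is a sketch under that assumption; the SsnValidator class is hypothetical. Because Matcher.matches() already requires the whole input to match, the "^" and "$" anchors are left out here.

```java
import java.util.regex.Pattern;

// Illustrative sketch; SsnValidator is a hypothetical helper class.
final class SsnValidator {

    // Compiled once and reused for every call.
    private static final Pattern SSN = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");

    // True if the candidate consists of nothing but an SSN-shaped value.
    static boolean isValid(String candidate) {
        return candidate != null && SSN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-45-6789"));          // true
        System.out.println(isValid("123456789"));            // true: dashes are optional
        System.out.println(isValid("9876-5-4321"));          // false
        System.out.println(isValid("987-65-4321 (attack)")); // false
    }
}
```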

2.4. Extracting/Capturing

Specific values can be selected out of a large complex body of text. These values can be used in the application.

Sample code
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

public class ExtractDemo {
	public static void main(String[] args) {
		String input = "I have a cat, but I like my dog better.";

		Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
		Matcher m = p.matcher(input);

		List<String> animals = new ArrayList<String>();
		while (m.find()) {
			System.out.println("Found a " + m.group() + ".");
			animals.add(m.group());
		}
	}
}

This produces the following output:

Found a cat.
Found a dog.

Dissecting the pattern:
"(mouse|cat|dog|wolf|bear|human)"
()		group everything within the parentheses as group 1
mouse		match the text ‘mouse’
|		alternation: match any one of the sections of this group
cat		match the text ‘cat’
 
//...and so on
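A group is most useful when the pattern contains context that should not be part of the extracted value. As a rough sketch (the class name, method name, and the extra "my " prefix are assumptions made for this example), group(1) returns only the text captured inside the parentheses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; ExtractGroupDemo and ownedAnimals are hypothetical names.
final class ExtractGroupDemo {

    // "my " gives context; only the animal inside the parentheses is captured.
    private static final Pattern OWNED =
            Pattern.compile("my (mouse|cat|dog|wolf|bear|human)");

    static List<String> ownedAnimals(String input) {
        List<String> animals = new ArrayList<String>();
        Matcher m = OWNED.matcher(input);
        while (m.find()) {
            animals.add(m.group(1)); // group(1) excludes the "my " prefix
        }
        return animals;
    }

    public static void main(String[] args) {
        // Prints [dog]: "a cat" is not preceded by "my ".
        System.out.println(ownedAnimals("I have a cat, but I like my dog better."));
    }
}
```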

2.5. Modifying/Substitution

Values in text can be replaced with new values. For example, you could replace all instances of the text ‘clientId=’ followed by a number with a mask that hides the original number (see below).

For sanitizing log files, URI strings and parameters, and form data, this can be a useful method of filtering sensitive information. A simple, reusable utility class can be used to encapsulate this into a more streamlined method.

Sample code
import java.util.regex.*;

public class ReplaceDemo {
	public static void main(String[] args) {
		String input = 
                  "User clientId=23421. Some more text clientId=33432. This clientNum=100";

		Pattern p = Pattern.compile("(clientId=)(\\d+)");
		Matcher m = p.matcher(input);

		StringBuffer result = new StringBuffer();
		while (m.find()) {
			System.out.println("Masking: " + m.group(2));
			m.appendReplacement(result, m.group(1) + "***masked***");
		}
		m.appendTail(result);
		System.out.println(result);
	}
}

This produces the following output:

Masking: 23421
Masking: 33432
User clientId=***masked***. Some more text clientId=***masked***. This clientNum=100

Dissecting the pattern:
"(clientId=)(\\d+)"
(clientId=) 	group everything within the parentheses as group 1
clientId=	match the text ‘clientId=’
(\\d+)		group everything within the parentheses as group 2
\\d+		match one or more digits
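The reusable utility class mentioned earlier in this section could look something like the following sketch. The LogSanitizer name and its method are hypothetical. It uses Matcher.replaceAll, where "$1" in the replacement string refers back to capturing group 1.

```java
import java.util.regex.Pattern;

// Illustrative sketch; LogSanitizer and maskClientIds are hypothetical names.
final class LogSanitizer {

    // Group 1 keeps the 'clientId=' label; the digits that follow are replaced.
    private static final Pattern CLIENT_ID = Pattern.compile("(clientId=)\\d+");

    static String maskClientIds(String text) {
        // "$1" re-inserts the text captured by group 1.
        return CLIENT_ID.matcher(text).replaceAll("$1***masked***");
    }

    public static void main(String[] args) {
        String input =
                "User clientId=23421. Some more text clientId=33432. This clientNum=100";
        System.out.println(maskClientIds(input));
    }
}
```

Compared with the appendReplacement loop, this version is shorter but gives no hook for per-match logic such as logging each masked value.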

Notice how groups begin numbering at 1 and increment by one for each new group. Groups may also contain groups, in which case they are numbered by the position of their opening parenthesis, from left to right: the outer group is group one, and the next inner group is group two. When referencing group 0, you will be given the entire chunk of text that matched the regex.

(  ( ) (  ( ) ( )))		//and so on
 1  2   3  4   5		//0 = everything the pattern matched
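The numbering in the diagram above can be checked with a short sketch. The pattern and input here are chosen only for illustration, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; GroupNumberDemo and groups are hypothetical names.
final class GroupNumberDemo {

    // Returns group 0 through group N for an input that matches the regex entirely.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Input does not match: " + input);
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same nesting as the diagram: ( ( ) ( ( ) ( ) ) )
        String[] g = groups("((\\d+)-((\\d+)-(\\d+)))", "123-45-6789");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + " = " + g[i]);
        }
        // 0 = 123-45-6789
        // 1 = 123-45-6789
        // 2 = 123
        // 3 = 45-6789
        // 4 = 45
        // 5 = 6789
    }
}
```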

3. Conclusion & Next Steps

Wrapping up, regular expressions are not difficult to master – in fact, they are quite easy. My strategy, whenever building a new regular expression, is to start with the simplest, most general match possible. From there, I continuously add more and more complexity until I have matched, substituted, or inserted exactly what I need.

Don’t be afraid to “express” yourself! When you’ve got the hang of these techniques, or need something a little fancier, read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.

14 Comments

  2. Gene De Lisa says:

    Good job. Readable. Understandable. Clear examples. No unnecessary chest beating.

    Keep it up.

  4. Nitin Gautam says:

    Thanks

  5. param says:

    I have to write a regular expression in java for the following test case:

    /**
    * Test for category 2000, state 11 in a state mask

    */

    public void testStateMask1() {
        String regex = RegexTrainer.stateMask1;
        try {
          assertTrue(regex != null && regex.length() > 0);
          assertTrue("Didn't match 2000011",
                  Pattern.matches(regex, "2000011"));
          assertTrue("Didn't match 19000012000011",
                  Pattern.matches(regex, "19000012000011"));
          assertTrue("Didn't match 190000120000112100001",
                  Pattern.matches(regex, "190000120000112100001"));
          assertFalse("Matched 2000010", Pattern.matches(regex, "2000010"));
          assertFalse("Matched 010000112000011300001",
                  Pattern.matches(regex, "010000112000011300001"));
        } catch(Exception e) {
          logger.error(e);
          fail();
        }
      }
    

    I have written the following regular expression, but it's working only for the assertTrue checks:

    /**

    * Test for category 2000, state 11 in a state mask

    */

    public static String stateMask1 = "^((\\d{7}){1,3})$";

  6. Hi there,

    Your regex matches a group of seven (7) digits, between 1 and 3 times. This means that it matches seven, fourteen, or twenty-one digits in a row, so all of the match operations you have listed above will succeed.

    It does nothing other than that. I’m not exactly sure what you need it to do based on your description of the problem.

    I hope this helps,
    ~Lincoln

  8. Ygor Fonseca says:

    Hi, I would like to make a pattern that matches this:

    /event/anything/eventId/

    Ex: /event/coldplay-18-03-2012-new-york/123/

    1. Sounds like you want to do some URL-rewriting? There are a few libraries out there to do this:

      http://ocpsoft.com/prettyfaces/
      http://ocpsoft.com/rewrite/

      But if you just want a regular expression to match this type of URL, the following should do the trick:

      String pattern = "/event/[^/]+/\\d+/";
      
  9. MrBCut says:

    Hi LBIII

    Thanks so much for this tut. Best one out there! Regex can be a tricky concept to get (and I bet explain), so I am very much appreciative for your help! I appreciate your efforts wholeheartedly!

    Bookmarked! lol

    1. You’re welcome! Glad you found it useful! Would love it if someone wants to expand on this!

  10. Paul says:

    Hi,

    Really cool stuff.

    Just to mention it: I think there is a difference between the #matches and #find methods, since #matches will always try to match the whole input, while #find will look for a match in any part of the input. You could say that #matches is like #find with “^” and “$” added at the beginning and the end of the pattern.

    So, in the sample 2.3 :

    if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$"))

    the “^” and “$” characters are not required. And in the 2.2 sample, the input “I lost my wallets” will only match with the #find method, not with #matches (all other examples are fine with both methods).

    Anyone correct me if I’m wrong.
