Guide to Regular Expressions in Java (Part 1)
Lincoln Baxter III
Often unknown, or heralded as confusing, regular expressions (regex) have defined the standard for powerful text manipulation and search. Without them, many of the applications we know today would not function. This two-part series explores the basics of regular expressions in Java, and provides tutorial examples in the hopes of spreading love for our pattern-matching friends. (Read part two.)
Part 1: What are Regular Expressions?
Regular expressions are a language of string patterns built into most modern programming languages, including Java (1.4 onward). They can be used for searching, extracting, and modifying text. This chapter will cover basic syntax and use.
This article is part one in the series: “Guide to Regular Expressions in Java.” Read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.
1. Syntax
Regular expressions, by definition, are string patterns that describe text. These descriptions can then be used to search, validate, extract, and replace text. The basic language constructs include character classes, quantifiers, and meta-characters.
1.1. Character Classes
Character classes are used to define the content of the pattern; that is, which characters should the pattern look for?
.    Dot, any character (may or may not match line terminators, read on)
\d   A digit: [0-9]
\D   A non-digit: [^0-9]
\s   A whitespace character: [ \t\n\x0B\f\r]
\S   A non-whitespace character: [^\s]
\w   A word character: [a-zA-Z_0-9]
\W   A non-word character: [^\w]
Notice, however, that in a Java string literal you will need to “double escape” these backslashes, because the backslash is also Java’s own escape character.
String pattern = "\\d \\D \\W \\w \\S \\s";
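To see these classes in action, here is a minimal sketch. The class name CharacterClassDemo and the helper isSingle are made up for illustration; Pattern.matches is the standard java.util.regex call, which tests the whole input against the pattern.

```java
import java.util.regex.Pattern;

public class CharacterClassDemo {

    // Hypothetical helper: true when the entire input is matched by the pattern.
    public static boolean isSingle(String characterClass, String input) {
        return Pattern.matches(characterClass, input);
    }

    public static void main(String[] args) {
        System.out.println(isSingle("\\d", "7"));  // true: a digit
        System.out.println(isSingle("\\D", "7"));  // false: \D rejects digits
        System.out.println(isSingle("\\s", "\t")); // true: tab is whitespace
        System.out.println(isSingle("\\w", "_"));  // true: underscore is a word character
        System.out.println(isSingle("\\W", "-"));  // true: dash is not a word character
        System.out.println(isSingle(".", "x"));    // true: dot matches any single character

        // The Java literal "\\d" contains two characters: a backslash and 'd'.
        System.out.println("\\d".length());        // 2
    }
}
```

The last line is the point of the “double escape”: the Java compiler consumes one backslash, and the regex engine receives the remaining \d.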
1.2. Quantifiers
Quantifiers can be used to specify the number or length that part of a pattern should match or repeat. A quantifier will bind to the expression group to its immediate left.
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
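The following sketch tries each quantifier against a short input. QuantifierDemo and matchesWhole are illustrative names, not part of any library; String.matches is the standard method that tests a regex against the entire string.

```java
public class QuantifierDemo {

    // Hypothetical helper: true when the regex matches the entire input.
    public static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("ab*c", "ac"));        // true: zero b's allowed
        System.out.println(matchesWhole("ab+c", "ac"));        // false: at least one b needed
        System.out.println(matchesWhole("ab?c", "abbc"));      // false: at most one b
        System.out.println(matchesWhole("\\d{3}", "123"));     // true: exactly three digits
        System.out.println(matchesWhole("\\d{2,}", "1"));      // false: fewer than two digits
        System.out.println(matchesWhole("\\d{2,4}", "12345")); // false: more than four digits

        // A quantifier binds to whatever is immediately to its left:
        System.out.println(matchesWhole("ab+", "abab"));       // false: + applies to 'b' only
        System.out.println(matchesWhole("(ab)+", "abab"));     // true: + applies to the group
    }
}
```

The last two lines show why grouping matters: without the parentheses, the quantifier repeats only the single character before it.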
1.3. Meta-characters
Meta-characters are used to group, divide, and perform special operations in patterns.
\    Quote the next meta-character
^    Match the beginning of the line
.    Match any character (except newline)
$    Match the end of the line (or before newline at the end)
|    Alternation (‘or’ statement)
()   Grouping
[]   Custom character class
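Here is a small sketch that exercises several of these meta-characters. The class MetaCharacterDemo and its helper contains are assumptions made for this example; Pattern.compile, Matcher.find, and the patterns themselves are standard java.util.regex.

```java
import java.util.regex.Pattern;

public class MetaCharacterDemo {

    // Hypothetical helper: true when the regex is found anywhere in the input.
    public static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("a.c", "abc"));        // true: dot matches 'b'
        System.out.println(contains("a\\.c", "abc"));      // false: quoted dot is literal
        System.out.println(contains("a\\.c", "a.c"));      // true
        System.out.println(contains("^cat", "catalog"));   // true: 'cat' at the start
        System.out.println(contains("^cat", "bobcat"));    // false
        System.out.println(contains("cat$", "bobcat"));    // true: 'cat' at the end
        System.out.println(contains("cat|dog", "hotdog")); // true: alternation
        System.out.println(contains("gr[ae]y", "grey"));   // true: custom character class
        System.out.println(contains("gr[ae]y", "groy"));   // false
    }
}
```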
2. Examples
2.1. Basic Expressions
Any string of ordinary (non-meta) characters is itself a regular expression. For example, the string, “I lost my wallet”, is a regular expression that will match the text, “I lost my wallet”, and will ignore everything else.
What if we want to be able to find more things that we lost? We can replace wallet with an expression that will match any word.
"I lost my \\w+"
As you can see, this pattern uses both a character class and a quantifier. “\w” says match a word character, and “+” says match one or more. So when combined, the pattern says “match one or more word characters.”
Now the pattern will match any word in place of “wallet”, e.g. “I lost my sablefish” or “I lost my parrot”. It will not match “I lost my: trooper”, however, because the pattern expects a space directly after “my” and finds the ":" character instead, so the match fails at that point.
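As a quick check of this behavior, the sketch below searches a few sentences with the pattern. LostItemDemo and findLost are hypothetical names chosen for the example; the regex is the one from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LostItemDemo {

    private static final Pattern LOST = Pattern.compile("I lost my \\w+");

    // Hypothetical helper: returns the matched text, or null when nothing matches.
    public static String findLost(String input) {
        Matcher m = LOST.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findLost("I lost my wallet"));              // I lost my wallet
        System.out.println(findLost("I lost my sablefish yesterday")); // I lost my sablefish
        System.out.println(findLost("I lost my: trooper"));            // null
    }
}
```

Note that \w+ stops at the first non-word character, so only “sablefish” is captured from the second sentence, and the colon in the third prevents any match at all.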
If we want the expression to be able to handle this situation, then we need to make a small change.
"I lost my:? \\w+"
Now the expression will allow an optional ":" directly after the word ‘my’.
2.2. Basic Grouping
An important feature of regular expressions is the ability to group sections of a pattern, and provide alternate matches.
|    Alternation (‘or’ statement)
()   Grouping
These two meta-characters are core parts of flexible regular expressions. For instance, in the first example we lost our wallet. What if we knew exactly which types of objects we had lost, and we wanted to find those objects but nothing else?
We can use a group (), with an ‘or’ meta-character, in order to specify a list of expressions to allow in our match.
"I lost my:? (wallet|cell phone|car|marbles)"
The new expression will now match the beginning of the string “I lost my”, an optional ":", a space, and then any one of the expressions in the group, separated by alternators, "|". Any one of the following would be a match: ‘wallet’, ‘cell phone’, ‘car’, or our ‘marbles’.
"I lost my wallet"        matches
"I lost my wallets"       matches when searching within the text; the ‘s’ is not needed and is ignored
"I lost my: car"          matches
"I lost my- car"          doesn’t match; ‘-’ is not allowed in our pattern
"I lost my: cell"         doesn’t match; all of ‘cell phone’ is needed
"I lost my: cell phone"   matches
"I lost my cell phone"    matches
"I lost my marbles"       matches
As you can see, the combinations for matches quickly become very large. This is not the complete set, as there are several more phrases that would match our simple pattern.
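The sketch below runs the grouped pattern over the same phrases and reports two results for each: whether the pattern is found somewhere in the text (Matcher.find) and whether it covers the entire text (Matcher.matches). GroupingDemo, foundIn, and matchesAll are names invented for this example.

```java
import java.util.regex.Pattern;

public class GroupingDemo {

    private static final Pattern LOST =
            Pattern.compile("I lost my:? (wallet|cell phone|car|marbles)");

    // Hypothetical helper: true when the pattern occurs anywhere in the input.
    public static boolean foundIn(String input) {
        return LOST.matcher(input).find();
    }

    // Hypothetical helper: true only when the pattern covers the whole input.
    public static boolean matchesAll(String input) {
        return LOST.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "I lost my wallet", "I lost my wallets", "I lost my: car",
            "I lost my- car", "I lost my: cell", "I lost my: cell phone",
            "I lost my cell phone", "I lost my marbles"
        };
        for (String input : inputs) {
            System.out.println("\"" + input + "\" find=" + foundIn(input)
                    + " matches=" + matchesAll(input));
        }
    }
}
```

The one phrase where the two methods disagree is “I lost my wallets”: find() succeeds because the pattern occurs inside the text, while matches() fails because the trailing ‘s’ is left over.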
2.3. Matching/Validating
Regular expressions make it possible to test whether text matches a certain pattern, returning a Boolean value that says whether the pattern was found. This can be used to validate input such as phone numbers, social security numbers, email addresses, and web form data, to scrub data, and much more. E.g., if an entire String matches a pattern that describes an SSN, then the string is formatted like an SSN.
import java.util.ArrayList;
import java.util.List;
public class ValidateDemo {
    public static void main(String[] args) {
        List<String> input = new ArrayList<String>();
        input.add("123-45-6789");
        input.add("9876-5-4321");
        input.add("987-65-4321 (attack)");
        input.add("987-65-4321 ");
        input.add("192-83-7465");

        for (String ssn : input) {
            // String.matches() tests the pattern against the entire string
            if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$")) {
                System.out.println("Found good SSN: " + ssn);
            }
        }
    }
}
This produces the following output:
Found good SSN: 123-45-6789
Found good SSN: 192-83-7465
Dissecting the pattern:
"^(\\d{3}-?\\d{2}-?\\d{4})$"
^       match the beginning of the line
()      group everything within the parentheses as group 1
\d{n}   match n digits, where n is a number equal to or greater than zero
-?      optionally match a dash
$       match the end of the line
2.4. Extracting/Capturing
Specific values can be selected out of a large complex body of text. These values can be used in the application.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;
public class ExtractDemo {
    public static void main(String[] args) {
        String input = "I have a cat, but I like my dog better.";

        Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
        Matcher m = p.matcher(input);

        List<String> animals = new ArrayList<String>();
        // Each call to find() advances to the next match in the input
        while (m.find()) {
            System.out.println("Found a " + m.group() + ".");
            animals.add(m.group());
        }
    }
}
This produces the following output:
Found a cat.
Found a dog.
Dissecting the pattern:
"(mouse|cat|dog|wolf|bear|human)"
()      group everything within the parentheses as group 1
mouse   match the text ‘mouse’
|       alternation: match any one of the sections of this group
cat     match the text ‘cat’
//...and so on
2.5. Modifying/Substitution
Values in text can be replaced with new values. For example, you could replace every number that follows the text ‘clientId=’ with a mask to hide the original value. (See below.)
For sanitizing log files, URI strings and parameters, and form data, this can be a useful method of filtering sensitive information. A simple, reusable utility class can be used to encapsulate this into a more streamlined method.
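One possible shape for such a utility is sketched here. The class name LogMasker and the method maskClientIds are assumptions for illustration only; the sketch uses the standard Matcher.replaceAll with a "$1" group reference, which is a more compact alternative to the explicit find loop in the demo that follows.

```java
import java.util.regex.Pattern;

public final class LogMasker {

    // Group 1 keeps the 'clientId=' label; the digits after it are replaced.
    private static final Pattern CLIENT_ID = Pattern.compile("(clientId=)\\d+");

    // Hypothetical utility method: masks every number that follows 'clientId='.
    public static String maskClientIds(String text) {
        // "$1" in the replacement string refers back to capture group 1
        return CLIENT_ID.matcher(text).replaceAll("$1***masked***");
    }

    public static void main(String[] args) {
        // Prints: User clientId=***masked***. This clientNum=100
        System.out.println(maskClientIds("User clientId=23421. This clientNum=100"));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which is a reasonable default when the same expression is applied to many log lines.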
import java.util.regex.*;
public class ReplaceDemo {
    public static void main(String[] args) {
        String input =
                "User clientId=23421. Some more text clientId=33432. This clientNum=100";

        Pattern p = Pattern.compile("(clientId=)(\\d+)");
        Matcher m = p.matcher(input);

        StringBuffer result = new StringBuffer();
        while (m.find()) {
            System.out.println("Masking: " + m.group(2));
            // Copy the text before this match, then the replacement for the match
            m.appendReplacement(result, m.group(1) + "***masked***");
        }
        // Copy whatever remains after the last match
        m.appendTail(result);
        System.out.println(result);
    }
}
This produces the following output:
Masking: 23421
Masking: 33432
User clientId=***masked***. Some more text clientId=***masked***. This clientNum=100
Dissecting the pattern:
"(clientId=)(\\d+)"
(clientId=)   group everything within the parentheses as group 1
clientId=     match the text ‘clientId=’
(\\d+)        group everything within the parentheses as group 2
\\d+          match one or more digits
Notice how groups begin numbering at 1, and increment by one for each new group. Groups may also contain other groups; in that case, groups are numbered by the position of their opening parenthesis, from left to right, so the outer group is group one and the first group inside it is group two. When referencing group 0, you will be given the entire chunk of text that matched the regex.
( ( ) ( ( ) ( ) ) )   //and so on
1 2   3 4   5         //0 = everything the pattern matched
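To make the numbering concrete, here is a small sketch that prints every group of a nested pattern with the same shape as the diagram. GroupNumberDemo and its groups helper are hypothetical; Matcher.groupCount and Matcher.group(int) are the standard calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberDemo {

    // Hypothetical helper: returns group 0 through groupCount() for the first match,
    // or an empty list when the pattern is not found.
    public static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<String>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = groups("((a)((b)(c)))", "abc");
        for (int i = 0; i < found.size(); i++) {
            System.out.println("group " + i + " = " + found.get(i));
        }
    }
}
```

Run against "abc", groups 0 and 1 both hold "abc", group 2 holds "a", group 3 holds "bc", group 4 holds "b", and group 5 holds "c", following the order of the opening parentheses.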
3. Conclusion & Next Steps
Wrapping up, regular expressions are not difficult to master – in fact, they are quite easy. My strategy, whenever building a new regular expression, is to start with the simplest, most general match possible. From there, I continuously add more and more complexity until I have matched, substituted, or inserted exactly what I need.
Don’t be afraid to “express” yourself! When you’ve got the hang of these techniques, or need something a little fancier, read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.
About the author: Lincoln Baxter, III is a Senior Software Engineer at Red Hat, working on JBoss open-source projects; most notably as project lead for JBoss Forge. This blog represents his personal thoughts and perspectives, not necessarily those of his employer. He is a founder of OCPsoft, the author of PrettyFaces and Rewrite, the leading URL-rewriting extensions for Servlet, Java EE, and Java web frameworks; he is also a member of the JavaServer™ Faces Expert Group. When he is not swimming, running, or playing Ultimate Frisbee, Lincoln is focused on promoting open-source software and making web applications more accessible for small businesses and individuals. His latest project is SocialPM, an open-source, agile project management tool.
Good job. Readable. Understandable. Clear examples. No unnecessary chest beating.
Keep it up.
Thanks
I have to write a regular expression in java for the following test case:
/**
* Test for category 2000, state 11 in a state mask
*/
public void testStateMask1() {
    String regex = RegexTrainer.stateMask1;
    try {
        assertTrue(regex != null && regex.length() > 0);
        assertTrue("Didn't match 2000011", Pattern.matches(regex, "2000011"));
        assertTrue("Didn't match 19000012000011", Pattern.matches(regex, "19000012000011"));
        assertTrue("Didn't match 190000120000112100001", Pattern.matches(regex, "190000120000112100001"));
        assertFalse("Matched 2000010", Pattern.matches(regex, "2000010"));
        assertFalse("Matched 010000112000011300001", Pattern.matches(regex, "010000112000011300001"));
    } catch(Exception e) {
        logger.error(e);
        fail();
    }
}
I have written the following regular expression, but it is working only for the assertTrue cases:
/**
* Test for category 2000, state 11 in a state mask
*/
public static String stateMask1 = "^((\\d{7}){1,3})$";
Hi there,
Your regex matches a group of Seven(7) digits, between 1 and 3 times. This means that it matches Seven, Fourteen, or Twenty-one digits in a row, so all of the match operations you have listed above will succeed.
It does nothing other than that. I’m not exactly sure what you need it to do based on your description of the problem.
I hope this helps,
~Lincoln
Hi, I would like to make a pattern that matches this:
/event/anything/eventId/
Ex: /event/coldplay-18-03-2012-new-york/123/
Sounds like you want to do some URL-rewriting? There are a few libraries out there to do this:
http://ocpsoft.com/prettyfaces/
http://ocpsoft.com/rewrite/
But if you just want a regular expression to match this type of URL, the following should do the trick:
Hi LBIII
Thanks so much for this tut. Best one out there! Regex can be a tricky concept to get (and I bet explain), so I am very much appreciative for your help! I appreciate your efforts wholeheartedly!
Bookmarked! lol
You’re welcome! Glad you found it useful! Would love it if someone wants to expand on this!
Hi,
Really cool stuff.
Just to say it, but I think that there is a difference between the #matches and #find methods: #matches will always try to match the whole input, while #find will look for a match in any part of the input. You could say that #matches is like #find, but with “^” and “$” added at the beginning and the end of the pattern.
So, in sample 2.3:
if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$"))
the “^” and “$” chars are not required. And in sample 2.2, the input “I lost my wallets” will only match with the #find method, not with the #matches one (all other examples are fine with both methods).
Anyone correct me if I’m wrong.