Often overlooked or dismissed as confusing, regular expressions (regex) have set the standard for powerful text search and manipulation. Without them, many of the applications we know today would not function. This two-part series explores the basics of regular expressions in Java, and provides tutorial examples in the hopes of spreading love for our pattern-matching friends. (Read part two.)
Part 1: What are Regular Expressions?
Regular expressions are a language of string patterns built into most modern programming languages, including Java 1.4 onward. They can be used for searching, extracting, and modifying text. This chapter will cover basic syntax and use.
1. Syntax
Regular expressions, by definition, are string patterns that describe text. These descriptions can then be used in nearly infinite ways. The basic language constructs include character classes, quantifiers, and meta-characters.
1.1. Character Classes
Character classes are used to define the content of the pattern. E.g. what should the pattern look for?
. Dot, any character (may or may not match line terminators, read on)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
Notice, however, that in a Java string literal you will need to “double escape” these backslashes.
String pattern = "\\d \\D \\W \\w \\S \\s";
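To make the double escaping concrete, here is a minimal sketch. The class name `EscapeDemo` and the sample inputs are made up for illustration. The Java compiler turns `"\\d+"` in source code into the three characters `\d+`, and that is what the regex engine actually sees.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // In Java source, "\\d+" is the three-character regex \d+
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // True only if the whole string consists of one or more digits
    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("\\d+".length());      // 3: backslash, 'd', '+'
        System.out.println(isAllDigits("2024"));  // true
        System.out.println(isAllDigits("20x4"));  // false
    }
}
```

Forgetting the second backslash (`"\d+"`) is a compile error, because `\d` is not a valid escape sequence in a Java string literal.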
1.2. Quantifiers
Quantifiers specify how many times part of a pattern should match or repeat. A quantifier binds to the expression or group to its immediate left.
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
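The quantifiers above can be tried out with a short sketch. The class name `QuantifierDemo` and the helper `matches` are hypothetical; the helper simply wraps `Pattern.matches`, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Pattern.matches compiles the regex and tests it against the entire input
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("ab*", "a"));        // true: zero 'b' characters is allowed
        System.out.println(matches("ab+", "a"));        // false: at least one 'b' is required
        System.out.println(matches("ab?", "abb"));      // false: at most one 'b' is allowed
        System.out.println(matches("(ab){2}", "abab")); // true: the group repeats exactly twice
        System.out.println(matches("a{2,}", "aaaa"));   // true: at least two
        System.out.println(matches("a{2,3}", "aaaa"));  // false: four is more than three
    }
}
```

Note how `*` in `ab*` applies only to the `b`, while `{2}` in `(ab){2}` applies to the whole group.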
1.3. Meta-characters
Meta-characters are used to group, divide, and perform special operations in patterns.
\ Escape the next meta-character (it becomes a normal/literal character)
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation (‘or’ statement)
() Grouping
[] Custom character class
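As a rough illustration of these meta-characters, the sketch below searches a few made-up inputs. The class name `MetaCharDemo` and the helper `found` are assumptions for this example; `found` uses `Matcher.find()`, so the pattern may match anywhere in the input unless it is anchored.

```java
import java.util.regex.Pattern;

class MetaCharDemo {
    // True if the regex matches somewhere in the input
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("3.5", "375"));       // true: '.' matches any character
        System.out.println(found("3\\.5", "375"));     // false: '\.' matches only a literal dot
        System.out.println(found("^cat", "concat"));   // false: 'cat' is not at the beginning
        System.out.println(found("cat$", "concat"));   // true: 'cat' is at the end
        System.out.println(found("gr(a|e)y", "grey")); // true: alternation inside a group
        System.out.println(found("gr[ae]y", "groy"));  // false: 'o' is not in the custom class
    }
}
```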
Visual Regex Tester
To get a more visual look into how regular expressions work, try our
visual java regex tester.
2. Examples
2.1. Basic Expressions
Every string is a regular expression. For example, the string “I lost my wallet” is a regular expression that will match the text “I lost my wallet”, and will ignore everything else.
What if we want to be able to find more things that we lost? We can replace wallet with a character class expression that will match any word.
"I lost my \\w+"
As you can see, this pattern uses both a character class and a quantifier. “\w” says match a word character, and “+” says match one or more. So when combined, the pattern says “match one or more word characters.”
Now the pattern will match any word in place of “wallet”, e.g. “I lost my sablefish” or “I lost my parrot”, but it will not match “I lost my: trooper”, because as soon as the expression finds the ":" character, which is not a word character, it will stop matching.
If we want the expression to be able to handle this situation, then we need to make a small change.
"I lost my:? \\w+"
Now the expression will allow an optional ":" directly after the word ‘my’.
2.2. Basic Grouping
An important feature of regular expressions is the ability to group sections of a pattern, and provide alternate matches.
| Alternation (‘or’ statement)
() Grouping
These two meta-characters are core parts of flexible regular expressions. For instance, in the first example we lost our wallet. What if we knew exactly which types of objects we had lost, and we wanted to find those objects but nothing else?
We can use a group (), with an ‘or’ meta-character, in order to specify a list of expressions to allow in our match.
"I lost my:? (wallet|car|cell phone|marbles)"
The new expression will now match the beginning of the string “I lost my”, an optional ":", and then any one of the expressions in the group, separated by alternators "|". Any one of the following would be a match: ‘wallet’, ‘car’, ‘cell phone’, or our ‘marbles’.
"I lost my wallet"         matches
"I lost my wallets"        matches          the ‘s’ is not needed, so it is ignored
"I lost my: car"           matches
"I lost my- car"           doesn’t match    ‘-’ is not allowed in our pattern
"I lost my: cell"          doesn’t match    all of ‘cell phone’ is needed
"I lost my: cell phone"    matches
"I lost my cell phone"     matches
"I lost my marbles"        matches
As you can see, the combinations for matches quickly become very large. This is not the complete set, as there are several more phrases that would match our simple pattern.
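To see which alternative actually matched, here is a small sketch. The class name `GroupingDemo` and the method `lostItem` are made up for this example. It uses `Matcher.find()`, which looks for the pattern anywhere in the input, and `m.group(1)`, which returns the text captured by the first pair of parentheses (capturing is covered in more detail in sections 2.4 and 2.5).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    static final Pattern LOST =
        Pattern.compile("I lost my:? (wallet|car|cell phone|marbles)");

    // Returns the alternative that matched, or null if the pattern is not found
    static String lostItem(String input) {
        Matcher m = LOST.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lostItem("I lost my: cell phone")); // cell phone
        System.out.println(lostItem("I lost my wallets"));     // wallet (the trailing 's' is ignored)
        System.out.println(lostItem("I lost my- car"));        // null
        System.out.println(lostItem("I lost my: cell"));       // null
    }
}
```

Because `find()` does not require the whole input to match, the trailing ‘s’ in “wallets” is simply left over. A whole-input check such as `String.matches()` would reject that input.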
Quiz: Can you figure out all possible matches for this pattern? (See the answers.)
"I lost my:? (wallet|car|cell phone|marbles)"
Answer: This is a trick question! Because this regular expression is unanchored (it has no beginning `^` and no ending `$` meta-characters to terminate the match), the pattern we’ve created will actually match any string containing one of the results below. In short, there are nearly infinite possible matches. However, if we did want to limit our pattern to just these results, we could add the required anchors to our pattern, like so:
"^I lost my:? (wallet|car|cell phone|marbles)$"
"I lost my wallet"
"I lost my: wallet"
"I lost my car"
"I lost my: car"
"I lost my cell phone"
"I lost my: cell phone"
"I lost my marbles"
"I lost my: marbles"
2.3. Matching/Validating
Regular expressions make it possible to test whether text matches a certain pattern, returning a Boolean value that says whether the pattern was found. This can be used to validate input such as phone numbers, social security numbers, email addresses, and web form data, to scrub data, and much more. E.g. if a string matches a pattern that describes an SSN, then the string is an SSN.
import java.util.ArrayList;
import java.util.List;

public class ValidateDemo {
    public static void main(String[] args) {
        List<String> input = new ArrayList<String>();
        input.add("123-45-6789");
        input.add("9876-5-4321");
        input.add("987-65-4321 (attack)");
        input.add("987-65-4321 ");
        input.add("192-83-7465");

        for (String ssn : input) {
            if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$")) {
                System.out.println("Found good SSN: " + ssn);
            }
        }
    }
}
This produces the following output:
Found good SSN: 123-45-6789
Found good SSN: 192-83-7465
Dissecting the pattern:
"^(\\d{3}-?\\d{2}-?\\d{4})$"
^        match the beginning of the line
()       group everything within the parentheses as group 1
\d{n}    match exactly n digits, where n is a number equal to or greater than zero
-?       optionally match a dash
$        match the end of the line
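If the same check runs many times, one option is to compile the pattern once and reuse it, since `String.matches()` compiles its pattern on every call. The sketch below is one way to do that; the class name `SsnValidator` and the method `isValid` are hypothetical.

```java
import java.util.regex.Pattern;

class SsnValidator {
    // Compiled once and reused for every check
    private static final Pattern SSN = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");

    // Matcher.matches() requires the entire input to match the pattern
    static boolean isValid(String candidate) {
        return SSN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-45-6789"));          // true
        System.out.println(isValid("123456789"));            // true: both dashes are optional
        System.out.println(isValid("9876-5-4321"));          // false
        System.out.println(isValid("987-65-4321 (attack)")); // false
        System.out.println(isValid("987-65-4321 "));         // false: trailing space
    }
}
```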
2.4. Extracting/Capturing
Specific values can be selected out of a large complex body of text. These values can be used in the application.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

public class ExtractDemo {
    public static void main(String[] args) {
        String input = "I have a cat, but I like my dog better.";

        Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
        Matcher m = p.matcher(input);

        List<String> animals = new ArrayList<String>();
        while (m.find()) {
            System.out.println("Found a " + m.group() + ".");
            animals.add(m.group());
        }
    }
}
This produces the following output:
Found a cat.
Found a dog.
Dissecting the pattern:
"(mouse|cat|dog|wolf|bear|human)"
()       group everything within the parentheses as group 1
mouse    match the text ‘mouse’
|        alternation: match any one of the sections of this group
cat      match the text ‘cat’
//...and so on
2.5. Modifying/Substitution
Values in text can be replaced with new values. For example, you could replace all instances of the word ‘clientId=’, followed by a number, with a mask to hide the original text (see below).
For sanitizing log files, URI strings and parameters, and form data, this can be a useful method of filtering sensitive information. A simple, reusable utility class can be used to encapsulate this into a more streamlined method.
import java.util.regex.*;

public class ReplaceDemo {
    public static void main(String[] args) {
        String input =
            "User clientId=23421. Some more text clientId=33432. This clientNum=100";

        Pattern p = Pattern.compile("(clientId=)(\\d+)");
        Matcher m = p.matcher(input);

        StringBuffer result = new StringBuffer();
        while (m.find()) {
            System.out.println("Masking: " + m.group(2));
            m.appendReplacement(result, m.group(1) + "***masked***");
        }
        m.appendTail(result);
        System.out.println(result);
    }
}
This produces the following output:
Masking: 23421
Masking: 33432
User clientId=***masked***. Some more text clientId=***masked***. This clientNum=100
Dissecting the pattern:
(clientId=)  group everything within the parentheses as group 1
clientId=    match the text ‘clientId=’
(\\d+)       group everything within the parentheses as group 2
\\d+         match one or more digits
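One way to wrap this kind of masking in a reusable utility is sketched below. The class name `LogMasker` and the method `mask` are assumptions, not part of the original example. It uses `Matcher.replaceAll`, where `$1` in the replacement string refers back to capture group 1.

```java
import java.util.regex.Pattern;

class LogMasker {
    // Group 1 captures the literal prefix so it can be kept in the output
    private static final Pattern CLIENT_ID = Pattern.compile("(clientId=)\\d+");

    // Replaces the digits after every "clientId=" with a fixed mask
    static String mask(String text) {
        return CLIENT_ID.matcher(text).replaceAll("$1***masked***");
    }

    public static void main(String[] args) {
        System.out.println(mask("User clientId=23421. This clientNum=100"));
        // prints: User clientId=***masked***. This clientNum=100
    }
}
```

Compared with the `appendReplacement` loop, `replaceAll` is shorter, but the loop is still the better fit when each match needs its own logic, such as logging the value being masked.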
Notice how groups begin numbering at 1, and increment by one for each new group. Groups may also contain other groups; in that case, groups are numbered by the position of their opening parenthesis, counting from left to right, so the outer group comes first and the next inner group follows it. Group 0 always refers to the entire chunk of text that matched the regex.
( ( ) ( ( ) ( ))) ( )   //and so on
1 2   3 4   5     6     //0 = everything the pattern matched
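Group numbering can be checked with a small sketch. The pattern, the class name `GroupNumberDemo`, and the sample input are made up for illustration; the point is only that each group’s number follows the order of its opening parenthesis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Opening parentheses, left to right: 1 = ((\d+)-(\d+)), 2 = first (\d+),
    // 3 = second (\d+), 4 = (\w+)
    static final Pattern P = Pattern.compile("((\\d+)-(\\d+)) (\\w+)");

    // Returns groups 0..groupCount() in order, or an empty list if there is no match
    static List<String> groups(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("10-20 apples"));
        // prints: [10-20 apples, 10-20, 10, 20, apples]
    }
}
```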
3. Conclusion & Next Steps
Wrapping up, regular expressions are not difficult to master – in fact, they are quite easy. My strategy, whenever building a new regular expression, is to start with the simplest, most general match possible. From there, I continuously add more and more complexity until I have matched, substituted, or inserted exactly what I need.
Don’t be afraid to “express” yourself! When you’ve got the hang of these techniques, or need something a little fancier, read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.
About the author:
Lincoln Baxter, III is a Senior Software Engineer at Red Hat, working on JBoss open-source projects; most notably as project lead for JBoss Forge. This blog represents his personal thoughts and perspectives, not necessarily those of his employer.
He is a founder of OCPsoft, the author of PrettyFaces and Rewrite, the leading URL-rewriting extensions for Servlet, Java EE, and Java web frameworks; he is also the author of PrettyTime, social-style date and timestamp formatting for Java. When he is not swimming, running, or playing Ultimate Frisbee, Lincoln is focused on promoting open-source software and making web applications more accessible for small businesses and individuals.
Good job. Readable. Understandable. Clear examples. No unnecessary chest beating.
Keep it up.
Thanks
I have to write a regular expression in java for the following test case:
/**
 * Test for category 2000, state 11 in a state mask
 */
public void testStateMask1() {
    String regex = RegexTrainer.stateMask1;
    try {
        assertTrue(regex != null && regex.length() > 0);
        assertTrue("Didn't match 2000011", Pattern.matches(regex, "2000011"));
        assertTrue("Didn't match 19000012000011", Pattern.matches(regex, "19000012000011"));
        assertTrue("Didn't match 190000120000112100001", Pattern.matches(regex, "190000120000112100001"));
        assertFalse("Matched 2000010", Pattern.matches(regex, "2000010"));
        assertFalse("Matched 010000112000011300001", Pattern.matches(regex, "010000112000011300001"));
    } catch (Exception e) {
        logger.error(e);
        fail();
    }
}
I have written the following regular expression, but it’s working only for assertTrue
Hi there,
Your regex matches a group of Seven(7) digits, between 1 and 3 times. This means that it matches Seven, Fourteen, or Twenty-one digits in a row, so all of the match operations you have listed above will succeed.
It does nothing other than that. I’m not exactly sure what you need it to do based on your description of the problem.
I hope this helps,
~Lincoln
Hi, i would like to make a pattern that attend this:
/event/anything/eventId/
Ex: /event/coldplay-18-03-2012-new-york/123/
Sounds like you want to do some URL-rewriting? There are a few libraries out there to do this:
http://ocpsoft.com/prettyfaces/
http://ocpsoft.com/rewrite/
But if you just want a regular expression to match this type of URL, the following should do the trick:
Hi LBIII
Thanks so much for this tut. Best one out there! Regex can be a tricky concept to get (and I bet explain), so I am very much appreciative for your help! I appreciate your efforts wholeheartedly!
Bookmarked! lol
You’re welcome! Glad you found it useful!
Hi,
Really cool stuff.
Just to say it, but I think that there is a difference between the #matches and #find methods: #matches will always try to match the whole input, while #find will match any part of the input. You could say that #matches is like #find with “^” and “$” added at the beginning and the end of the pattern.
So, in sample 2.3:
if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$"))
the “^” and “$” characters are not required. And in sample 2.2, the input “I lost my wallets” will only match with the #find method, not with #matches (all other examples are fine with both methods).
Anyone correct me if I’m wrong.
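The difference described in this comment can be checked with a short sketch. The class name `MatchesVsFindDemo` and its two helper methods are made up for illustration; the pattern is the one from section 2.2.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    static final Pattern LOST =
        Pattern.compile("I lost my:? (wallet|car|cell phone|marbles)");

    // matches(): the pattern must account for the entire input
    static boolean wholeInput(String s) {
        return LOST.matcher(s).matches();
    }

    // find(): the pattern may match any part of the input
    static boolean anywhere(String s) {
        return LOST.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(anywhere("I lost my wallets"));   // true
        System.out.println(wholeInput("I lost my wallets")); // false: the 's' is left over
        System.out.println(wholeInput("I lost my wallet"));  // true
    }
}
```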
Hello Lincoln Baxter III,
Very nice example, simple to understand.
Great work, keep it up.
Hi,
I want to write a password pattern to match 3 out of below 4 criteria’s a :-
1) Password must contain atleast 1 numeric value
2) Password must contain atleast 1 lower-case letter
3) Password must contain atleast 1 upper-case letter
4) Password must contain atleast 1 of these special characters !”#$%&’()*+,./;:=?_@>-
Please help me in implementing the same
While it would seem tempting to implement this using a single regular expression (which is certainly possible), I would recommend splitting this up into 4 individual checks, with unit tests for each check.
In this situation, clarity should be preferred over brevity, and the regular expression you want to construct will be a bit opaque if you attempt a one-liner. Performance is not really an issue for something like this (unless you have some strange requirements or expectations).
This is really pretty easy, so I’ll give you this code under one condition – you have to post a link on a blog back to this article!
public boolean passwordValidates(String pass) {
    int count = 0;
    if (pass.matches(".*\\d.*")) count++;
    if (pass.matches(".*[a-z].*")) count++;
    if (pass.matches(".*[A-Z].*")) count++;
    if (pass.matches(".*[!\"#$%&'()*+,./;:=?_@>-].*")) count++;
    return count >= 3;
}
Thanks for the above code. There is one scenario where the above code will break: if 3 criteria are matched and a special character is entered that is not in the allowed list, this code will return true. For example, in aA123@~, 3 criteria are matched but ‘~’ is not in the allowed list. As per the requirement, if 3 criteria are matched and any special character outside the allowed list is entered, the password should not be allowed. Do you have a snippet for this scenario?
Basically, I’m looking for a regular expression for finding the blacklisted special characters.
I just tried this one and it works fine (not tested in all cases). If you have a better approach, please suggest it. I will be sure to post a link back.
public boolean passwordValidates(String pass) {
    int count = 0;
    boolean pattern = true;
    if (pass.matches(".*\\d.*")) count++;
    if (pass.matches(".*[a-z].*")) count++;
    if (pass.matches(".*[A-Z].*")) count++;
    if (pass.matches(".*[!\"#$%&'()*+,./;:=?_@>-].*")) count++;
    if (count == 3) {
        pattern = pass.matches("^[a-zA-Z0-9\\s!\"#$%&'()*+,./;:=?_@>-]{8}$");
        return pattern;
    }
    // Check if there's a special character
    if (pass.matches(".*[!\"#$%&'()*+,./;:=?_@>-].*")) {
        pattern = pass.matches("^[a-zA-Z0-9\\s!\"#$%&'()*+,./;:=?_@>-]{8}$");
        count++;
    }
    return (count >= 3 && pattern);
}
Thanks, Great examples!! : )
PS: would you be able to add ‘boundary matchers’ (?) to your syntax section and some supporting examples to suit by any chance, I’m reasonably new at this and was reading about them here (url below) but your example format is more comprehensive and much easier to understand (it doesn’t leave anything out : )
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
Keep up the good work!
Hey Steve
You bet! I’ll put something up over the weekend – it should be no problem.
Thanks for the motivation!
~Lincoln
Taking a little longer than I thought. Had a fun weekend though!
I would suggest that references be provided to native regular expression man pages. I would think that the way to really understand regexes would be to understand native regexes as they would be used in sed or egrep or other standard Unix utilities, and then understand what the Java library limitations are, if any. I note that just about the first thing you do in part one is talk about the escaping of the \ backslash in pattern definition strings. I don’t know if there would be a way to do this, given that operator overloads appear not to be possible in Java, but I think a useful capability would be a Java library/module (like PrettyTime) that linguistically overloads, say, the tick (single quote) or slash character, so that regular expressions could be defined and be easier to read in Java, and be consistent with other non-Java examples (cf. the Perl slash (/pattern/) regex delimiter, which is the same as that used by standard Unix utilities like sed). I would think this would make generic regex man pages much more useful to the Java user, increase the readability of regexes in Java, and as a result maybe increase the general understanding and sophistication of regex usage by Java programmers.
Unfortunately, operator overloads are not possible in Java, and there is no way to override the default behavior of the escape character in string literals, but I agree, it would be nice
Since this is a Java-targeted article, I did link to http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html – the official regex docs. I think linking to man pages here would end up being confusing because the syntax is subtly different regarding escaping and configuration.
In general, Java regexps are a full implementation of Unix regex, but not as comprehensive as say, PCRE in a few ways. This can, however, be made up for via programmatic usage of the Pattern and Matcher classes.
Great article. By the way String.matches() is another quick way to apply regular expression in Java.
Hi,
I have a string like "John’s classes are nice || john teach nicely".
when this string value passes to Jsp, where I use single quote ‘, breaks my page.
I want to write an expression which add \ before ‘ i.e. "John\’s classes are nice \|\| john teach nicely"
Can you please help me in this.
I am new to the RegEx. Is there anyway to write a pattern which says that it is a start of a group.
987 x. 5,6(anything) (xyz). can we write a pattern which says that start of group is (x.)
I need a regular expression for finding a missing parenthesis within text. I have the following that finds a missing end parenthesis: \((?!.+?\)), but not matter what I try, I can’t figure out how to find the missing beginning one. Got any ideas?
Honestly, regular expressions are probably not the best solution here. You’re probably better off using a cursor-based matching algorithm.
Iterate over the array and keep a count of how many parens you encounter. It’s hard for a regex to do this because the regex engine doesn’t have the concept of keeping count of occurrences (at least that you can access.)
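The counting approach suggested here could be sketched as follows. The class name `ParenCheck` and the method `firstUnbalanced` are hypothetical, and the sketch assumes that only round parentheses matter. It remembers the index of every ‘(’ that is still open, so it can report either a ‘)’ with no opener or a ‘(’ that is never closed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ParenCheck {
    // Returns the index of the first unbalanced parenthesis, or -1 if all are balanced
    static int firstUnbalanced(String text) {
        Deque<Integer> open = new ArrayDeque<>(); // indexes of '(' not yet closed
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                open.push(i);
            } else if (c == ')') {
                if (open.isEmpty()) {
                    return i;                     // ')' with no matching '('
                }
                open.pop();
            }
        }
        // Anything left over is a '(' that was never closed; report the earliest one
        return open.isEmpty() ? -1 : open.peekLast();
    }

    public static void main(String[] args) {
        System.out.println(firstUnbalanced("a (b) c")); // -1
        System.out.println(firstUnbalanced("a b) c"));  // 3
        System.out.println(firstUnbalanced("(a) (b"));  // 4
    }
}
```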
Thanks and I understand it’s not the best. But, is there a possible reverse expression that could be worked out (reverse of the one I gave in my original question)? I would think you could do a negative look behind. I just couldn’t figure out where to put the parentheses in the equation. Is it possible?
Hi Lincoln,
Great Information.
I want to do something that is the reverse of this. I have to make my password rule configurable by the user through a property file with a RegEx value, and then I have to validate the password value against the configured RegEx. I have succeeded in validating it, but now I also have to show the user what the correct password format is, so that he can correct it accordingly. How can I parse the regex and find that it looks for n special characters, n upper-case letters, and n numeric characters?
Thanks,
-Sachin.
I don’t think you really want to try to parse the regex. For passwords, I generally find that using several separate regexes in independent if() blocks -instead of one big regex- allows for more control of the parsing and error reporting.
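As a rough sketch of the “several separate regexes” idea, each rule below gets its own check and its own message. The class name `PasswordRules`, the three rules, and the message texts are all made up for illustration, not a recommended password policy.

```java
import java.util.ArrayList;
import java.util.List;

class PasswordRules {
    // Returns one message per failed rule; an empty list means every rule passed
    static List<String> problems(String pass) {
        List<String> problems = new ArrayList<>();
        if (!pass.matches(".*\\d.*")) {
            problems.add("needs at least one digit");
        }
        if (!pass.matches(".*[a-z].*")) {
            problems.add("needs at least one lower-case letter");
        }
        if (!pass.matches(".*[A-Z].*")) {
            problems.add("needs at least one upper-case letter");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(problems("abc")); // two messages: digit and upper-case letter
        System.out.println(problems("aB3")); // []
    }
}
```

Because each rule is independent, the error message for a rule can live next to its regex, and nothing has to be reverse-engineered from one large pattern.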
Thanks for the reply. It make sense to split it in multiple groups to have better control on error reporting. But still i would give a try, may be i can put a constraint on configuration, that RegEx need to have in specific sequence of groups & then i can split this string & find the information inside these groups to form the dynamic error message.
If it won’t work then i will go with your suggestion.
Thanks,
-Sachin.
Hi, can you give me a java code for my assignment. for example the input is Regular expression Alphabet then the output will be the set of string generated by regular expression over the alphabet.
e.g. input alphabet {a,b} regular expression : b+ output : b, bb,bbb, bbbb….
or the other e.g
input: (a || b) b+
output: {a, b)
if i want to detect some characters those are inside round bracket,excludes round bracket..
How to perform this operation?