Regular expressions are used to specify string patterns. You can use regular expressions whenever you need to locate strings that match a particular pattern. For example, one of our sample programs locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">. Of course, when specifying a pattern, the ... notation is not precise enough. You need to specify precisely what sequence of characters is a legal match. There is a special syntax that you need to use whenever you describe a pattern. Here is a simple example. The regular expression

[Jj]ava.+

matches any string of the following form:

  • The first letter is a J or j.
  • The next three letters are ava.
  • The remainder of the string consists of one or more arbitrary characters.

For example, the string "javanese" matches this particular regular expression, but the string "Core Java" does not. As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs is sufficient.

For example, here is a somewhat complex but potentially useful regular expression. It describes decimal or hexadecimal integers:

[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+
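To see what this expression accepts, you can try it against a few sample strings. The following is only a minimal sketch that is not part of the original example set. It relies on the Pattern and Matcher classes that are described later in this section, and the class name IntegerPatternDemo, the helper isInteger, and the sample strings are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class IntegerPatternDemo
{
   // The pattern from the text: a decimal integer with an optional sign,
   // or a hexadecimal integer with a 0x or 0X prefix.
   private static final Pattern INTEGER
      = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

   // Returns true if the entire string is a decimal or hexadecimal integer.
   static boolean isInteger(String s)
   {
      return INTEGER.matcher(s).matches();
   }

   public static void main(String[] args)
   {
      String[] samples = { "42", "-17", "0x1F", "0XCAFE", "12ab", "+0x1F", "" };
      for (String s : samples)
         System.out.println("\"" + s + "\" -> " + isInteger(s));
   }
}
```

Note that the optional sign belongs only to the decimal alternative, so a string such as "+0x1F" is rejected.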

Unfortunately, the expression syntax is not completely standardized between the various programs and libraries that use regular expressions. While there is consensus on the basic constructs, there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-4 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly and Associates, 1997).

The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string denoting the regular expression. Then get a Matcher object from the pattern, and call its matches method:

Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) . . .
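As a complete illustration of these three steps, here is a small sketch that applies the [Jj]ava.+ expression from the beginning of this section. The class name MatchesDemo and the helper matchesAll are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesDemo
{
   // Tests whether the entire input matches the regular expression.
   static boolean matchesAll(String patternString, String input)
   {
      Pattern pattern = Pattern.compile(patternString);
      Matcher matcher = pattern.matcher(input);
      return matcher.matches();
   }

   public static void main(String[] args)
   {
      System.out.println(matchesAll("[Jj]ava.+", "javanese"));  // true
      System.out.println(matchesAll("[Jj]ava.+", "Core Java")); // false: does not start with J or j
      System.out.println(matchesAll("[Jj]ava.+", "Java"));      // false: .+ needs at least one more character
   }
}
```

Keep in mind that matches succeeds only if the whole input fits the pattern, which is why "Core Java" fails even though it contains "Java".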

Table 12-4. Regular expression syntax

Construct                         Explanation

Characters

c                                 The character c
\unnnn, \xnn, \0n, \0nn, \0nnn    The character with the given hex or octal value
\t, \n, \r, \f, \a, \e            The control characters tab, newline, return, form feed, alert, and escape
\cc                               The control character corresponding to the character c

Character Classes

[C1C2. . .]                       Any of the characters represented by C1, C2, . . . The Ci are characters, character ranges (c1-c2), or character classes
[^. . .]                          Complement of character class
[ . . . && . . .]                 Intersection of two character classes

Predefined Character Classes

.                                 Any character except line terminators (or any character if the DOTALL flag is set)
\d                                A digit [0-9]
\D                                A non-digit [^0-9]
\s                                A whitespace character [ \t\n\r\f\x0B]
\S                                A non-whitespace character
\w                                A word character [a-zA-Z0-9_]
\W                                A non-word character
\p{name}                          A named character class; see Table 12-5
\P{name}                          The complement of a named character class

Boundary Matchers

^ $                               Beginning, end of input (or beginning, end of line in multiline mode)
\b                                A word boundary
\B                                A non-word boundary
\A                                Beginning of input
\z                                End of input
\Z                                End of input except final line terminator
\G                                End of previous match

Quantifiers

X?                                Optional X
X*                                X, 0 or more times
X+                                X, 1 or more times
X{n} X{n,} X{n,m}                 X n times, at least n times, between n and m times

Quantifier Suffixes

?                                 Turn default (greedy) match into reluctant match
+                                 Turn default (greedy) match into possessive match

Set Operations

XY                                Any string from X, followed by any string from Y
X|Y                               Any string from X or Y

Grouping

(X)                               Capture the string matching X as a group
\n                                The match of the nth group

Escapes

\c                                The character c (must not be an alphabetic character)
\Q . . . \E                       Quote . . . verbatim
(? . . . )                        Special construct; see the API notes of the Pattern class
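The quantifier suffixes are easiest to understand by comparing the three variants on the same input. The following sketch is an illustration only; the class name QuantifierDemo, the helper firstMatch, and the sample input are assumptions, and the find method that it calls is covered later in this section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo
{
   // Returns the first substring of input that matches regex, or null if there is none.
   static String firstMatch(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      return matcher.find() ? matcher.group() : null;
   }

   public static void main(String[] args)
   {
      String input = "<b>bold</b>";
      // greedy: .+ grabs as much as possible, then backs off to the last >
      System.out.println(firstMatch("<.+>", input));   // <b>bold</b>
      // reluctant: .+? grabs as little as possible
      System.out.println(firstMatch("<.+?>", input));  // <b>
      // possessive: .++ grabs everything and never gives anything back
      System.out.println(firstMatch("<.++>", input));  // null
   }
}
```

The possessive variant finds nothing here because .++ consumes the final > as well and does not backtrack to release it.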

Table 12-5. Predefined character class names

Name                      Explanation

Lower                     ASCII lowercase [a-z]
Upper                     ASCII uppercase [A-Z]
Alpha                     ASCII alphabetic [A-Za-z]
Digit                     ASCII digits [0-9]
Alnum                     ASCII alphabetic or digit [A-Za-z0-9]
XDigit                    Hex digits [0-9A-Fa-f]
Print or Graph            Printable ASCII character [\x21-\x7E]
Punct                     ASCII non-alpha or digit [\p{Print}&&\P{Alnum}]
ASCII                     All ASCII [\x00-\x7F]
Cntrl                     ASCII Control character [\p{ASCII}&&\P{Print}]
Blank                     Space or tab [ \t]
Space                     Whitespace [ \t\n\r\f\x0B]
InBlock                   Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. The block names are defined by the Unicode Standard.
Category or IsCategory    Category is the name of a Unicode character category such as L (letter) or Sc (currency symbol). The category names are defined by the Unicode Standard.

The input of the matcher is an object of any class that implements the CharSequence interface, such as a String, StringBuffer, or a CharBuffer from the java.nio package.

When compiling the pattern, you can set one or more flags, for example:
Pattern pattern = Pattern.compile(patternString,
   Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The following six flags are supported:

  • CASE_INSENSITIVE: Match characters independent of the letter case. By default, this flag takes only US ASCII characters into account.
  • UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.
  • MULTILINE: ^ and $ match the beginning and end of a line, not the entire input.
  • UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.
  • DOTALL: When using this flag, the . symbol matches all characters, including line terminators.
  • CANON_EQ: Takes canonical equivalence of Unicode characters into account. For example, u followed by ¨ (diaeresis) matches ü.
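The effect of some of these flags can be observed with a short experiment. The following is a sketch under the assumption that you search with the find method (described later in this section); the class name FlagDemo, the helper found, and the sample text are invented for the illustration.

```java
import java.util.regex.Pattern;

public class FlagDemo
{
   // Compiles regex with the given flags and reports whether it occurs anywhere in input.
   static boolean found(String regex, int flags, String input)
   {
      return Pattern.compile(regex, flags).matcher(input).find();
   }

   public static void main(String[] args)
   {
      String text = "first line\nJAVA rocks";

      // no flags: case sensitive, and ^ matches only at the start of the input
      System.out.println(found("^java", 0, text));                        // false
      // ignoring case is not enough because JAVA is not at the start of the input
      System.out.println(found("^java", Pattern.CASE_INSENSITIVE, text)); // false
      // with MULTILINE, ^ also matches at the start of the second line
      System.out.println(found("^java",
         Pattern.CASE_INSENSITIVE | Pattern.MULTILINE, text));            // true

      // by default, . does not match the newline between the two lines
      System.out.println(found("line.JAVA", 0, text));                    // false
      // with DOTALL, it does
      System.out.println(found("line.JAVA", Pattern.DOTALL, text));       // true
   }
}
```

The flags are bit masks, so several of them can be combined with the | operator.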

If the regular expression contains groups, then the Matcher object can reveal the group boundaries. The methods

int start(int groupIndex)
int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group. You can simply extract the matched string by calling

String group(int groupIndex)

Group 0 is the entire match (which is the entire input when you call the matches method); the group index for the first actual group is 1. Call the groupCount method to get the total group count. Nested groups are ordered by the opening parentheses. For example, given the pattern

((1?[0-9]):([0-5][0-9]))[ap]m

and the input

11:59am

the matcher reports the following groups:

Group index   Start   End   String
0             0       7     11:59am
1             0       5     11:59
2             0       2     11
3             3       5     59

The following program prompts for a pattern, then for strings to match. It prints out whether or not the input matches the pattern. If the input matches, and the pattern contains groups, then the program prints the group boundaries as parentheses, such as
((11):(59))am

Example RegExTest.java

import java.util.regex.*;
import javax.swing.*;

/**
   This program tests regular expression matching.
   Enter a pattern and strings to match, or hit Cancel
   to exit. If the pattern contains groups, the group
   boundaries are displayed in the match.
*/
public class RegExTest
{
   public static void main(String[] args)
   {
      String patternString = JOptionPane.showInputDialog(
         "Enter pattern:");
      // showInputDialog returns null if the user cancels the dialog
      if (patternString == null) System.exit(0);

      Pattern pattern = null;
      try
      {
         pattern = Pattern.compile(patternString);
      }
      catch (PatternSyntaxException exception)
      {
         System.out.println("Pattern syntax error");
         System.exit(1);
      }

      while (true)
      {
         String input = JOptionPane.showInputDialog(
            "Enter string to match:");
         if (input == null) System.exit(0);

         Matcher matcher = pattern.matcher(input);
         if (matcher.matches())
         {
            System.out.println("Match");
            int g = matcher.groupCount();
            if (g > 0)
            {
               for (int i = 0; i < input.length(); i++)
               {
                  // print ( for every group that starts at position i
                  for (int j = 1; j <= g; j++)
                     if (i == matcher.start(j))
                        System.out.print('(');
                  System.out.print(input.charAt(i));
                  // print ) for every group that ends after position i
                  for (int j = 1; j <= g; j++)
                     if (i + 1 == matcher.end(j))
                        System.out.print(')');
               }
               System.out.println();
            }
         }
         else
            System.out.println("No match");
      }
   }
}
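If you want to check the group table for the 11:59am example without the dialog boxes, a non-interactive variant may be more convenient. The following sketch is not one of the book's sample programs; the class name GroupDemo and the helper describeGroups are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo
{
   // Returns one line per group, holding the group index, start,
   // past-the-end index, and matched string, or "No match".
   static String describeGroups(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      if (!matcher.matches()) return "No match";
      StringBuilder result = new StringBuilder();
      // group 0 is the entire match, so the loop includes index groupCount()
      for (int i = 0; i <= matcher.groupCount(); i++)
         result.append(i).append(' ')
            .append(matcher.start(i)).append(' ')
            .append(matcher.end(i)).append(' ')
            .append(matcher.group(i)).append('\n');
      return result.toString();
   }

   public static void main(String[] args)
   {
      System.out.print(describeGroups(
         "((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"));
   }
}
```

For the pattern and input used in this section, the four output lines should agree with the rows of the group table.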

Usually, you don't want to match the entire input against a regular expression, but you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.

while (matcher.find())
{
 int start = matcher.start();
 int end = matcher.end();
 String match = input.substring(start, end);
 . . .
}
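Here is the same loop in a self-contained form that collects all matches into a list. This is a minimal sketch; the class name FindDemo, the helper findAll, and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo
{
   // Collects every substring of input that matches regex, in order of occurrence.
   static List<String> findAll(String regex, String input)
   {
      List<String> matches = new ArrayList<>();
      Matcher matcher = Pattern.compile(regex).matcher(input);
      while (matcher.find())
      {
         int start = matcher.start();
         int end = matcher.end();
         matches.add(input.substring(start, end));
      }
      return matches;
   }

   public static void main(String[] args)
   {
      // prints [555, 1234, 9]
      System.out.println(findAll("[0-9]+", "Call 555 1234 before 9 am"));
   }
}
```

Each call to find resumes the search where the previous match ended, so the matches never overlap.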

The following sample program puts this mechanism to work. It locates all hypertext references in a web page and prints them out. To run the program, supply a URL on the command line, such as

java HrefMatch http://www.horstmann.com

Example HrefMatch.java

import java.io.*;
import java.net.*;
import java.util.regex.*;

/**
   This program displays all URLs in a web page by
   matching a regular expression that describes the
   <a href=...> HTML tag. Start the program as
   java HrefMatch URL
*/
public class HrefMatch
{
   public static void main(String[] args)
   {
      try
      {
         // get URL string from command line or use default
         String urlString;
         if (args.length > 0) urlString = args[0];
         else urlString = "http://java.oracle.com";

         // open reader for URL
         InputStreamReader in = new InputStreamReader(
            new URL(urlString).openStream());

         // read contents into string buffer
         StringBuffer input = new StringBuffer();
         int ch;
         while ((ch = in.read()) != -1) input.append((char) ch);

         // search for all occurrences of pattern; the href value is
         // either a quoted string or a run of characters up to
         // the next white space or >
         String patternString
            = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
         Pattern pattern = Pattern.compile(patternString,
            Pattern.CASE_INSENSITIVE);
         Matcher matcher = pattern.matcher(input);

         while (matcher.find())
         {
            int start = matcher.start();
            int end = matcher.end();
            String match = input.substring(start, end);
            System.out.println(match);
         }
      }
      catch (IOException exception)
      {
         exception.printStackTrace();
      }
      catch (PatternSyntaxException exception)
      {
         exception.printStackTrace();
      }
   }
}
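HrefMatch prints the complete tags. If you only need the link targets, one possible approach is to ask the matcher for group 1 instead of computing a substring. The following sketch works on an in-memory string so that it runs without a network connection. The class name HrefGroupDemo, the helper hrefs, and the sample HTML are assumptions, and the pattern is a variant of the HrefMatch pattern in which the unquoted alternative may span several characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefGroupDemo
{
   // Group 1 captures the href value, either quoted or unquoted.
   private static final Pattern HREF = Pattern.compile(
      "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>",
      Pattern.CASE_INSENSITIVE);

   // Returns the href values of all matching tags; quotes are kept as they appear.
   static List<String> hrefs(CharSequence html)
   {
      List<String> result = new ArrayList<>();
      Matcher matcher = HREF.matcher(html);
      while (matcher.find())
         result.add(matcher.group(1));
      return result;
   }

   public static void main(String[] args)
   {
      // made-up HTML fragment for the demonstration
      String html = "<p><a href=\"http://example.com/a\">A</a> "
         + "<A HREF=page.html>B</A></p>";
      for (String href : hrefs(html))
         System.out.println(href);
   }
}
```

With the sample fragment, the first value is printed with its surrounding quotation marks, and the second is printed as page.html. Like the pattern in HrefMatch, this one recognizes only tags in which href is the sole attribute.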

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.

Pattern pattern = Pattern.compile("[0-9]+");
Matcher matcher = pattern.matcher(input);
String output = matcher.replaceAll("#");
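The following sketch shows these replacement methods on a concrete input, including the group references and the \$ escape that the next paragraph describes. The class name ReplaceDemo, the two helper methods, and the sample input are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class ReplaceDemo
{
   // Replaces every match of regex in input.
   static String replaceAll(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceAll(replacement);
   }

   // Replaces only the first match of regex in input.
   static String replaceFirst(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
   }

   public static void main(String[] args)
   {
      String input = "Order 66 shipped on 12/25";

      System.out.println(replaceAll("[0-9]+", input, "#"));
      // Order # shipped on #/#
      System.out.println(replaceFirst("[0-9]+", input, "#"));
      // Order # shipped on 12/25

      // $1 and $2 refer to the first and second group of the pattern
      System.out.println(replaceAll("([0-9]+)/([0-9]+)", input, "$2/$1"));
      // Order 66 shipped on 25/12

      // \$ in the replacement (written "\\$" in Java source) is a literal $
      System.out.println(replaceAll("[0-9]+", input, "\\$"));
      // Order $ shipped on $/$
   }
}
```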

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a $ character in the replacement text. There is also a replaceFirst method that replaces only the first occurrence of the pattern.

Finally, the Pattern class has a split method that works like a string tokenizer on steroids. It splits an input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional white space.

Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
String[] tokens = pattern.split(input);
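To make the behavior of split concrete, here is a sketch that applies this delimiter pattern to a sample input and also tries the two-argument version with a limit. The class name SplitDemo, the helper tokens, and the sample inputs are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitDemo
{
   // Delimiters are punctuation marks surrounded by optional white space.
   private static final Pattern DELIMITER
      = Pattern.compile("\\s*\\p{Punct}\\s*");

   // Splits input at the delimiters, producing at most limit tokens if limit > 0.
   static List<String> tokens(String input, int limit)
   {
      return Arrays.asList(DELIMITER.split(input, limit));
   }

   public static void main(String[] args)
   {
      // limit 0 behaves like the one-argument split method
      System.out.println(tokens("one, two ;three.four", 0));
      // [one, two, three, four]

      // a limit of 2 leaves the rest of the input unsplit in the last token
      System.out.println(tokens("one, two ;three.four", 2));
      // [one, two ;three.four]

      // trailing empty strings are dropped when the limit is 0 ...
      System.out.println(tokens("a,b,,", 0).size());   // 2
      // ... and kept when the limit is negative
      System.out.println(tokens("a,b,,", -1).size());  // 4
   }
}
```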

java.util.regex.Pattern 1.4

  • static Pattern compile(String expression)
  • static Pattern compile(String expression, int flags)

    compile the regular expression string into a pattern object for fast processing of matches.

    Parameters:
      expression    the regular expression
      flags         one or more of the flags CASE_INSENSITIVE, UNICODE_CASE, MULTILINE, UNIX_LINES, DOTALL, and CANON_EQ

  • Matcher matcher(CharSequence input)

    returns a matcher object that you can use to locate the matches of the pattern in the input.

  • String[] split(CharSequence input)
  • String[] split(CharSequence input, int limit)

    split the input string into tokens, where the pattern specifies the form of the delimiters. Returns an array of tokens. The delimiters are not part of the tokens.

    Parameters:
      input         The string to be split into tokens.
      limit         The maximum number of strings to produce. If limit - 1 matching delimiters have been found, then the last entry of the returned array contains the remaining unsplit input. If limit is ≤ 0, then the entire input is split. If limit is 0, then trailing empty strings are not placed in the returned array.

java.util.regex.Matcher 1.4

  • boolean matches()

    returns true if the input matches the pattern.

  • boolean lookingAt()

    returns true if the beginning of the input matches the pattern.

  • boolean find()
  • boolean find(int start)

    attempts to find the next match and returns true if another match is found.

    Parameters:
      start         the index at which to start searching

  • int start()
  • int end()

    return the start and past-the-end position of the current match.

  • String group()

    returns the current match.

  • int groupCount()

    returns the number of groups in the input pattern.

  • int start(int groupIndex)
  • int end(int groupIndex)

    return the start and past-the-end position of a given group in the current match.

    Parameters:
      groupIndex    the group index (starting with 1), or 0 to indicate the entire match

  • String group(int groupIndex)

    returns the string matching a given group.

    Parameters:
      groupIndex    the group index (starting with 1), or 0 to indicate the entire match

  • String replaceAll(String replacement)
  • String replaceFirst(String replacement)

    return a string obtained from the matcher input by replacing all matches, or the first match, with the replacement string.

    Parameters:
      replacement   The replacement string. It can contain references to a pattern group as $n. Use \$ to include a $ symbol.

  • Matcher reset()
  • Matcher reset(CharSequence input)

    reset the matcher state. The second method makes the matcher work on a different input. Both methods return this.

You have now reached the end of the first volume of Core Java. This volume covered the fundamentals of the Java programming language and the parts of the standard library that you need for most coding projects. We hope that you enjoyed your tour through the Java fundamentals and that you found useful information along the way. For advanced topics, such as networking, multithreading, security, and internationalization, please turn to the second volume.