Regular Expressions - Streams and Files

Regular expressions are used to specify string patterns. You can use regular expressions whenever you need to locate strings that match a particular pattern. For example, one of our sample programs locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">. Of course, when specifying a pattern, the ... notation is not precise enough. You need to specify precisely what sequence of characters is a legal match. There is a special syntax that you need to use whenever you describe a pattern. Here is a simple example. The regular expression

[Jj]ava.+

matches any string of the following form:

The first letter is a J or j.
The next three letters are ava.
The remainder of the string consists of one or more arbitrary characters.

For example, the string "javanese" matches the particular regular expression, but the string "Core Java" does not. As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs is sufficient.
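The two claims above can be checked with a few lines of code. This is only a minimal sketch: the class name JavaPatternDemo and the helper matchesJava are made up for illustration, and Pattern.matches is used as a shortcut for testing a whole string against a pattern.

```java
import java.util.regex.Pattern;

class JavaPatternDemo {
    // Hypothetical helper: true if the entire input matches [Jj]ava.+
    static boolean matchesJava(String input) {
        return Pattern.matches("[Jj]ava.+", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesJava("javanese"));   // true
        System.out.println(matchesJava("Core Java"));  // false: does not start with J or j
        System.out.println(matchesJava("Java"));       // false: .+ needs at least one more character
    }
}
```

Note that the bare string "Java" is rejected as well, because .+ requires at least one character after ava.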

A character class is a set of character alternatives, enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Here the - denotes a range (all characters whose Unicode value falls between the two bounds), and ^ denotes the complement (all characters except the ones specified).
There are many predefined character classes such as \d (digits) or \p{Sc} (Unicode currency symbol). See Tables 12-4 and 12-5.
Most characters match themselves, such as the ava characters in the example above.
The . symbol matches any character (except possibly line terminators, depending on flag settings).
Use \ as an escape character, for example \. matches a period and \\ matches a backslash.
^ and $ match the beginning and end of the input, respectively (or of each line in multiline mode).
If X and Y are regular expressions, then XY means "any match for X followed by a match for Y". X | Y means "any match for X or Y".
You can apply quantifiers X+ (1 or more), X* (0 or more), and X? (0 or 1) to an expression X.
By default, a quantifier is greedy: it matches the largest possible repetition that makes the overall match succeed. You can modify that behavior with suffixes ? (reluctant or stingy match: match the smallest repetition count) and + (possessive match: match the largest count even if that makes the overall match fail). For example, the string cab matches [a-z]*ab but not [a-z]*+ab. In the first case, the expression [a-z]* only matches the character c, so that the characters ab match the remainder of the pattern. But the possessive version [a-z]*+ matches the characters cab, leaving the remainder of the pattern unmatched.
You can use groups to define subexpressions. Enclose the groups in ( ), for example ([+-]?)([0-9]+). You can then ask the pattern matcher to return the match of each group, or refer back to a group with \n, where n is the group number (starting with \1).
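The difference between greedy, reluctant, and possessive quantifiers, and the use of a back reference, may be easier to see in a small experiment. The following is a sketch, not a definitive treatment; the class QuantifierDemo and the helper firstMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: the first substring of input that matches regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy [a-z]* gives characters back so that "ab" can still match;
        // possessive [a-z]*+ keeps everything it consumed.
        System.out.println("cab".matches("[a-z]*ab"));   // true
        System.out.println("cab".matches("[a-z]*+ab"));  // false

        // Greedy takes as much as possible, reluctant as little as possible.
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));   // <b>bold</b>
        System.out.println(firstMatch("<.+?>", "<b>bold</b>"));  // <b>

        // \1 refers back to whatever group 1 matched, here a doubled letter.
        System.out.println(firstMatch("([a-z])\\1", "bookkeeper"));  // oo
    }
}
```

In a Java string literal each backslash of the regular expression must be doubled, which is why the back reference \1 is written as "\\1".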

For example, here is a somewhat complex but potentially useful regular expression; it describes decimal or hexadecimal integers:

[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+
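To see which strings this expression accepts, one can try it on a handful of candidates. The sketch below assumes the Pattern and Matcher classes from java.util.regex; IntegerLiteralDemo and isInteger are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

class IntegerLiteralDemo {
    private static final Pattern INTEGER =
        Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

    // Hypothetical helper: true if the whole string is a decimal or hexadecimal integer.
    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] candidates = { "42", "-17", "0x1F", "0XcafeBABE", "0x", "12ab", "+0x10" };
        for (String s : candidates) {
            System.out.println(s + " -> " + isInteger(s));
        }
    }
}
```

The first four candidates are accepted and the last three are rejected. In particular, "+0x10" fails because the optional sign belongs only to the decimal alternative: the | operator binds more loosely than concatenation.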

Unfortunately, the expression syntax is not completely standardized between the various programs and libraries that use regular expressions. While there is consensus on the basic constructs, there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-4 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly and Associates, 1997).

The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string denoting the regular expression. Then get a Matcher object from the pattern, and call its matches method:

Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) . . .

Table 12-4. Regular expression syntax

Syntax                            Explanation

Characters
c                                 The character c
\unnnn, \xnn, \0n, \0nn, \0nnn    The character with the given hex or octal value
\t, \n, \r, \f, \a, \e            The control characters tab, newline, return, form feed, alert, and escape
\cc                               The control character corresponding to the character c

Character Classes
[C₁C₂. . .]                       Any of the characters represented by C₁, C₂, . . . Each Cᵢ is a character, a character range (c₁-c₂), or a character class
[^. . .]                          Complement of character class
[ . . . && . . .]                 Intersection of two character classes

Predefined Character Classes
.                                 Any character except line terminators (or any character if the DOTALL flag is set)
\d                                A digit [0-9]
\D                                A non-digit [^0-9]
\s                                A whitespace character [ \t\n\r\f\x0B]
\S                                A non-whitespace character
\w                                A word character [a-zA-Z0-9_]
\W                                A non-word character
\p{name}                          A named character class; see Table 12-5
\P{name}                          The complement of a named character class

Boundary Matchers
^ $                               Beginning, end of input (or beginning, end of line in multiline mode)
\b                                A word boundary
\B                                A non-word boundary
\A                                Beginning of input
\z                                End of input
\Z                                End of input except final line terminator
\G                                End of previous match

Quantifiers
X?                                Optional X
X*                                X, 0 or more times
X+                                X, 1 or more times
X{n} X{n,} X{n,m}                 X n times, at least n times, between n and m times

Quantifier Suffixes
?                                 Turn default (greedy) match into reluctant match
+                                 Turn default (greedy) match into possessive match

Set Operations
XY                                Any string from X, followed by any string from Y
X|Y                               Any string from X or Y

Grouping
(X)                               Capture the string matching X as a group
\n                                The match of the nth group

Escapes
\c                                The character c (must not be an alphabetic character)
\Q . . . \E                       Quote . . . verbatim
(? . . . )                        Special construct; see API notes of Pattern class
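A few of the less familiar entries in Table 12-4 (class intersection, counted quantifiers, \Q...\E quoting, and word boundaries) can be tried out as follows. This is an illustrative sketch only; SyntaxTableDemo and its helpers matches and contains are hypothetical names, and the sample strings are invented.

```java
import java.util.regex.Pattern;

class SyntaxTableDemo {
    // Hypothetical helper: true if the entire input matches regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: true if some substring of input matches regex.
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Intersection: lowercase letters that are not vowels.
        System.out.println(matches("[a-z&&[^aeiou]]+", "rhythm"));  // true
        System.out.println(matches("[a-z&&[^aeiou]]+", "rhyme"));   // false

        // Counted quantifiers: five digits, optionally followed by a hyphen and four digits.
        System.out.println(matches("\\d{5}(-\\d{4})?", "95014-2083"));  // true

        // \Q...\E quotes metacharacters, so + stands for itself.
        System.out.println(matches("\\Q1+1=2\\E", "1+1=2"));  // true
        System.out.println(matches("1+1=2", "1+1=2"));        // false: + is a quantifier here

        // \b matches only at a word boundary.
        System.out.println(contains("\\bcat\\b", "the cat sat"));  // true
        System.out.println(contains("\\bcat\\b", "concatenate"));  // false
    }
}
```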

Table 12-5. Predefined character class names

Name                      Explanation
Lower                     ASCII lowercase [a-z]
Upper                     ASCII uppercase [A-Z]
Alpha                     ASCII alphabetic [A-Za-z]
Digit                     ASCII digits [0-9]
Alnum                     ASCII alphabetic or digit [A-Za-z0-9]
XDigit                    Hex digits [0-9A-Fa-f]
Print or Graph            Printable ASCII character [\x21-\x7E]
Punct                     ASCII non-alpha or digit [\p{Print}&&\P{Alnum}]
ASCII                     All ASCII [\x00-\x7F]
Cntrl                     ASCII Control character [\p{ASCII}&&\P{Print}]
Blank                     Space or tab [ \t]
Space                     Whitespace [ \t\n\r\f\x0B]
InBlock                   Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. See http://www.unicode.org for a list of block names.
Category or IsCategory    Category is the name of a Unicode character category such as L (letter) or Sc (currency symbol). See http://www.unicode.org for a list of category names.

The input of the matcher is an object of any class that implements the CharSequence interface, such as a String, StringBuffer, or a CharBuffer from the java.nio package. When compiling the pattern, you can set one or more flags, for example

Pattern pattern = Pattern.compile(patternString,
   Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The following six flags are supported:

CASE_INSENSITIVE: Match characters independent of the letter case. By default, this flag takes only US ASCII characters into account.
UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.
MULTILINE: ^ and $ match the beginning and end of a line, not the entire input.
UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.
DOTALL: When using this flag, the . symbol matches all characters, including line terminators.
CANON_EQ: Takes canonical equivalence of Unicode characters into account. For example, u followed by ¨ (diaeresis) matches ü.
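The effect of some of these flags can be observed with a short experiment. The following sketch uses only flags listed above; FlagsDemo and the helper found are hypothetical names, and the sample text is invented. The accented letters are written as Unicode escapes so that the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Hypothetical helper: true if regex, compiled with the given flags, occurs somewhere in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nSecond LINE";

        // Without flags, ^ matches only at the very beginning of the input.
        System.out.println(found("^second", 0, text));  // false
        // MULTILINE lets ^ match after a line terminator; CASE_INSENSITIVE ignores letter case.
        System.out.println(found("^second",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, text));  // true

        // By default, . does not match a line terminator; with DOTALL it does.
        System.out.println(found("first.*LINE", 0, text));               // false
        System.out.println(found("first.*LINE", Pattern.DOTALL, text));  // true

        // CASE_INSENSITIVE alone considers only US ASCII letters.
        // \u00e9 is a lowercase e with acute accent, \u00c9 the uppercase form.
        System.out.println(found("\u00e9", Pattern.CASE_INSENSITIVE, "\u00c9"));  // false
        System.out.println(found("\u00e9",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00c9"));  // true
    }
}
```

Flags are bit masks, so they are combined with the | operator.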

If the regular expression contains groups, then the Matcher object can reveal the group boundaries. The methods

int start(int groupIndex)
int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group. You can simply extract the matched string by calling

String group(int groupIndex)

Group 0 is the entire match; the group index for the first actual group is 1. Call the groupCount method to get the total group count. Nested groups are ordered by the opening parentheses. For example, given the pattern

((1?[0-9]):([0-5][0-9]))[ap]m

and the input

11:59am

the matcher reports the following groups:

Group index   Start   End   String
0             0       7     11:59am
1             0       5     11:59
2             0       2     11
3             3       5     59

The following program prompts for a pattern, then for strings to match. It prints out whether or not the input matches the pattern. If the input matches, and the pattern contains groups, then the program prints the group boundaries as parentheses, such as

((11):(59))am
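The group boundaries for the time-of-day pattern can also be listed without a dialog box. The following is a minimal sketch under the assumption that the whole input must match; GroupDemo and describe are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: one line "index start end string" per group,
    // or "no match" if the entire input does not match regex.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return "no match";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(i).append(' ')
              .append(m.start(i)).append(' ')
              .append(m.end(i)).append(' ')
              .append(m.group(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"));
        // prints:
        // 0 0 7 11:59am
        // 1 0 5 11:59
        // 2 0 2 11
        // 3 3 5 59
    }
}
```

The loop runs up to and including groupCount() because group 0, the entire match, is not included in the count.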

Example RegexTest.java

import java.util.regex.*;
import javax.swing.*;

/**
   This program tests regular expression matching.
   Enter a pattern and strings to match, or hit Cancel
   to exit. If the pattern contains groups, the group
   boundaries are displayed in the match.
*/
public class RegexTest
{
   public static void main(String[] args)
   {
      String patternString = JOptionPane.showInputDialog(
         "Enter pattern:");
      if (patternString == null) System.exit(0); // dialog was canceled

      Pattern pattern = null;
      try
      {
         pattern = Pattern.compile(patternString);
      }
      catch (PatternSyntaxException exception)
      {
         System.out.println("Pattern syntax error");
         System.exit(1);
      }

      while (true)
      {
         String input = JOptionPane.showInputDialog(
            "Enter string to match:");
         if (input == null) System.exit(0);

         Matcher matcher = pattern.matcher(input);
         if (matcher.matches())
         {
            System.out.println("Match");
            int g = matcher.groupCount();
            if (g > 0)
            {
               // print the input, marking each group boundary with a parenthesis
               for (int i = 0; i < input.length(); i++)
               {
                  for (int j = 1; j <= g; j++)
                     if (i == matcher.start(j))
                        System.out.print('(');
                  System.out.print(input.charAt(i));
                  for (int j = 1; j <= g; j++)
                     if (i + 1 == matcher.end(j))
                        System.out.print(')');
               }
               System.out.println();
            }
         }
         else
            System.out.println("No match");
      }
   }
}

Usually, you don't want to match the entire input against a regular expression, but you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.

while (matcher.find())
{
 int start = matcher.start();
 int end = matcher.end();
 String match = input.substring(start, end);
 . . .
}
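To make the difference between matches, lookingAt, and find concrete, here is a small self-contained sketch. FindDemo and findAll are hypothetical names, the sample sentence is invented, and the code assumes Java 7 or later for the diamond operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: all non-overlapping matches of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(input.substring(matcher.start(), matcher.end()));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "2 apples and 15 pears";
        Matcher matcher = Pattern.compile("[0-9]+").matcher(input);

        System.out.println(matcher.matches());         // false: the entire input is not a number
        System.out.println(matcher.lookingAt());       // true: the input starts with a number
        System.out.println(findAll("[0-9]+", input));  // [2, 15]
    }
}
```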

The following sample program puts this mechanism to work. It locates all hypertext references in a web page and prints them out. To run the program, supply a URL on the command line, such as

java HrefMatch http://www.horstmann.com

Example HrefMatch.java

import java.io.*;
import java.net.*;
import java.util.regex.*;

/**
   This program displays all URLs in a web page by
   matching a regular expression that describes the
   <a href=...> HTML tag. Start the program as
   java HrefMatch URL
*/
public class HrefMatch
{
   public static void main(String[] args)
   {
      try
      {
         // get URL string from command line or use default
         String urlString;
         if (args.length > 0) urlString = args[0];
         else urlString = "http://java.oracle.com";

         // open reader for URL
         InputStreamReader in = new InputStreamReader(
            new URL(urlString).openStream());

         // read contents into string buffer
         StringBuffer input = new StringBuffer();
         int ch;
         while ((ch = in.read()) != -1) input.append((char) ch);
         in.close();

         // search for all occurrences of pattern; the attribute value is
         // either a quoted string or a run of characters other than
         // white space and >
         String patternString
            = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
         Pattern pattern = Pattern.compile(patternString,
            Pattern.CASE_INSENSITIVE);
         Matcher matcher = pattern.matcher(input);

         while (matcher.find())
         {
            int start = matcher.start();
            int end = matcher.end();
            String match = input.substring(start, end);
            System.out.println(match);
         }
      }
      catch (IOException exception)
      {
         exception.printStackTrace();
      }
      catch (PatternSyntaxException exception)
      {
         exception.printStackTrace();
      }
   }
}
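If you want to experiment without a network connection, or if you only need the link target rather than the whole tag, a variant along the following lines may help. It is a sketch under several assumptions: the HTML is already in memory, the unquoted alternative of the pattern is written as [^\s>]* so that it accepts values of any length, and HrefExtractDemo and hrefs are hypothetical names. Group 1 captures the attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractDemo {
    // Attribute value is either a quoted string or a run of characters other than whitespace and >.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: the href values of all matching tags, without surrounding quotes.
    static List<String> hrefs(CharSequence html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String value = matcher.group(1);
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);  // strip the quotes
            }
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://www.horstmann.com\">home</a> "
            + "<A HREF=index.html>index</A></p>";
        System.out.println(hrefs(html));  // [http://www.horstmann.com, index.html]
    }
}
```

Like the program above, this only recognizes tags in which href is the sole attribute; it is not a substitute for an HTML parser.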

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.

Pattern pattern = Pattern.compile("[0-9]+");
Matcher matcher = pattern.matcher(input);
String output = matcher.replaceAll("#");

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a $ character in the replacement text. There is also a replaceFirst method that replaces only the first occurrence of the pattern.

Finally, the Pattern class has a split method that works like a string tokenizer on steroids. It splits an input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional whitespace.

Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
String[] tokens = pattern.split(input);
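Group references in replacement strings and the optional limit argument of split are easy to get wrong, so here is a short sketch that exercises both. The class ReplaceSplitDemo, its helper methods, and the sample strings are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ReplaceSplitDemo {
    // Turns "last, first" into "first last" by referring to groups 1 and 2 in the replacement.
    static String swapNames(String input) {
        return Pattern.compile("(\\w+),\\s*(\\w+)").matcher(input).replaceAll("$2 $1");
    }

    // Puts a literal dollar sign (written \$ in the replacement) before every run of digits.
    static String addDollar(String input) {
        return Pattern.compile("([0-9]+)").matcher(input).replaceAll("\\$$1");
    }

    // Splits at punctuation marks surrounded by optional whitespace.
    static String[] tokens(String input, int limit) {
        return Pattern.compile("\\s*\\p{Punct}\\s*").split(input, limit);
    }

    public static void main(String[] args) {
        System.out.println(swapNames("Hopper, Grace"));  // Grace Hopper
        System.out.println(addDollar("pay 15 now"));     // pay $15 now

        // replaceFirst touches only the first run of digits.
        System.out.println(Pattern.compile("[0-9]+")
            .matcher("a1b22c333").replaceFirst("#"));    // a#b22c333

        System.out.println(Arrays.toString(tokens("a , b;c . d", 0)));  // [a, b, c, d]
        System.out.println(Arrays.toString(tokens("a , b;c . d", 2)));  // [a, b;c . d]
    }
}
```

With a limit of 2, splitting stops after the first delimiter, and the remainder of the input becomes the last token.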

`java.util.regex.Pattern` 1.4

static Pattern compile(String expression)
static Pattern compile(String expression, int flags)
compile the regular expression string into a pattern object for fast processing of matches.
Parameters:	`expression`	the regular expression
	`flags`	one or more of the flags `CASE_INSENSITIVE`, `UNICODE_CASE`, `MULTILINE`, `UNIX_LINES`, `DOTALL`, and `CANON_EQ`

Matcher matcher(CharSequence input)
returns a matcher object that you can use to locate the matches of the pattern in the input.

String[] split(CharSequence input)
String[] split(CharSequence input, int limit)
split the input string into tokens, where the pattern specifies the form of the delimiters. Returns an array of tokens. The delimiters are not part of the tokens.
Parameters:	`input`	the string to be split into tokens
	`limit`	the maximum number of strings to produce. If `limit - 1` matching delimiters have been found, then the last entry of the returned array contains the remaining unsplit input. If `limit` is ≤ 0, then the entire input is split. If `limit` is 0, then trailing empty strings are not placed in the returned array.

`java.util.regex.Matcher` 1.4

boolean matches()
returns true if the input matches the pattern.

boolean lookingAt()
returns true if the beginning of the input matches the pattern.

boolean find()
boolean find(int start)
attempt to find the next match and return true if another match is found.
Parameters:	`start`	the index at which to start searching

int start()
int end()
return the start and past-the-end position of the current match.

String group()
returns the current match.

int groupCount()
returns the number of groups in the input pattern.

int start(int groupIndex)
int end(int groupIndex)
return the start and past-the-end position of a given group in the current match.
Parameters:	`groupIndex`	the group index (starting with 1), or 0 to indicate the entire match

String group(int groupIndex)
returns the string matching a given group.
Parameters:	`groupIndex`	the group index (starting with 1), or 0 to indicate the entire match

String replaceAll(String replacement)
String replaceFirst(String replacement)
return a string obtained from the matcher input by replacing all matches, or the first match, with the replacement string.
Parameters:	`replacement`	the replacement string. It can contain references to a pattern group as `$n`. Use `\$` to include a `$` symbol.

Matcher reset()
Matcher reset(CharSequence input)
reset the matcher state. The second method makes the matcher work on a different input. Both methods return this.

You have now reached the end of the first volume of Core Java. This volume covered the fundamentals of the Java programming language and the parts of the standard library that you need for most programming projects. We hope that you enjoyed your tour through the Java fundamentals and that you found useful information along the way. For advanced topics, such as networking, multithreading, security, and internationalization, please turn to the second volume.