REGULAR EXPRESSIONS

Let us say that a language is a set of strings; a string is a finite sequence of symbols. The symbols themselves are taken from a finite alphabet. The Pascal language is the set of all strings that constitute legal Pascal programs; the language of primes is the set of all decimal-digit strings that represent prime numbers; and the language of C reserved words is the set of all alphabetic strings that cannot be used as identifiers in the C coding language. The first two of these languages are infinite sets; the last is a finite set. In all of these cases, the alphabet is the ASCII character set. When we speak of languages in this way, we will not assign any meaning to the strings; we will just be attempting to classify each string as in the language or not. To specify some of these (possibly infinite) languages with finite descriptions, we will use the notation of regular expressions. Each regular expression stands for a set of strings. Symbol: For each symbol a in the alphabet of the language, the regular expression a denotes the language containing just the string a. Alternation: Given two regular expressions M and N, the alternation operator written as a vertical bar ‖ makes a new regular expression M ‖ N. A string is in the language of M ‖ N if it is in the language of M or in the language of N. Thus, the language of a ‖ b contains the two strings a and b. Concatenation: Given two regular expressions M and N, the concatenation operator · makes a new regular expression M · N. A string is in the language of M · N if it is the concatenation of any two strings α and β such that α is in the language of M and β is in the language of N. Thus, the regular expression (a ‖ b) · a defines the language containing the two strings aa and ba. Epsilon: The regular expression ∊ represents a language whose only string is the empty string. Thus, (a · b) ‖ ∊ represents the language {"", "ab"}. Repetition: Given a regular expression M, its Kleene closure is M*. A string is in M* if it is the concatenation of zero or more strings, all of which are in M. Thus, ((a ‖ b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaa", …}. Using symbols, alternation, concatenation, epsilon, and Kleene closure we can specify the set of ASCII characters corresponding to the lexical tokens of a coding language. First, consider some examples:

(0 \| 1)* · 0	Binary numbers that are multiples of two.
b(abb*)(a\|∊)	Strings of `a`'s and `b`'s with no consecutive `a`'s.
(a\|b)aa(a\|b)	Strings of `a`'s and `b`'s containing consecutive `a`'s.

In writing regular expressions, we will sometimes omit the concatenation symbol or the epsilon, and we will assume that Kleene closure "binds tighter" than concatenation, and concatenation binds tighter than alternation; so that ab | c means (a · b) | c, and (a |) means (a | ∊). Let us introduce some more abbreviations: [abcd] means (a | b | c | d), [b-g] means [bcdefg], [b-gM-Qkr] means [bcdefgMNOPQkr], M? means (M | ∊), and M⁺ means (M·M*). These extensions are convenient, but none extend the descriptive power of regular expressions: Any set of strings that can be described with these abbreviations could also be described by just the basic set of operators. All the operators are summarized in Image 2.1.

a	An ordinary character stands for itself.
∊	The empty string. Another way to write the empty string.
M \| N	Alternation, choosing from M or N.
M · N	Concatenation, an M followed by an N.
MN	Another way to write concatenation.
M*	Repetition (zero or more times).
M⁺	Repetition, one or more times.
M?	Optional, zero or one occurrence of M.
[a − zA − Z]	Character set alternation.
`.`	A period stands for any single character except newline.
`"a.+*"`	Quotation, a string in quotes stands for itself literally.

Image 2.1: Regular expression notation.

Using this language, we can specify the lexical tokens of a coding language (Image 2.2).

if IF
[a-z][a-z0-9]* ID
[0-9]+ NUM
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) REAL
("--"[a-z]*"\n")|(" "|"\n"|"\t")+ no token, just white space
. error

Image 2.2: Regular expressions for some tokens.

The fifth line of the description recognizes comments or white space but does not report back to the parser. Instead, the white space is discarded and the lexer resumed. The comments for this lexer begin with two dashes, contain only alphabetic characters, and end with newline. Finally, a lexical specification should be complete, always matching some initial substring of the input; we can always achieve this by having a rule that matches any single character (and in this case, prints an "illegal character" error message and continues). These rules are a bit ambiguous. For example, does if8 match as a single identifier or as the two tokens if and 8? Does the string if 89 begin with an identifier or a reserved word? There are two important disambiguation rules used by Lex, JavaCC, SableCC, and other similar lexical-analyzer generators: Longest match: The longest initial substring of the input that can match any regular expression is taken as the next token. Rule priority: For a particular longest initial substring, the first regular expression that can match determines its token-type. This means that the order of writing down the regular-expression rules has significance.

Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved word by rule-priority.