LEXICAL-ANALYZER GENERATORS

DFA construction is a mechanical task easily performed by computer, so it makes sense to have an automatic lexical-analyzer generator to translate regular expressions into a DFA. JavaCC and SableCC generate lexical analyzers and parsers written in Java. The lexical analyzers are generated from lexical specifications; and, as explained in the next chapter, the parsers are generated from grammars. For both JavaCC and SableCC, the lexical specification and the grammar are contained in the same file.
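To make "mechanical" concrete, here is a minimal hand-written sketch of the kind of table-driven recognizer a generator might produce, covering only unsigned integers and reals (the NUM and REAL tokens specified later in this section). The class name NumberDfa, the state numbering, and the character classes are illustrative assumptions; this is not the output of JavaCC or SableCC.

```java
// Hypothetical table-driven DFA for NUM and REAL; names and layout are assumptions.
final class NumberDfa {
    // Character classes: 0 = digit, 1 = '.', 2 = anything else.
    private static int charClass(char c) {
        if (c >= '0' && c <= '9') return 0;
        if (c == '.') return 1;
        return 2;
    }

    // TRANSITIONS[state][charClass] is the next state; -1 means there is no edge.
    private static final int[][] TRANSITIONS = {
        /* 0: start                    */ { 1,  3, -1 },
        /* 1: digits only (NUM)        */ { 1,  2, -1 },
        /* 2: digits and a '.' (REAL)  */ { 2, -1, -1 },
        /* 3: a lone '.' (not final)   */ { 2, -1, -1 },
    };

    // Token kind accepted in each state, or null for non-final states.
    private static final String[] ACCEPTS = { null, "NUM", "REAL", null };

    /** Returns "NUM" or "REAL" if the whole input is accepted, otherwise null. */
    static String classify(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = TRANSITIONS[state][charClass(input.charAt(i))];
            if (state < 0) return null; // no edge: the DFA is stuck
        }
        return ACCEPTS[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] { "42", "3.", ".5", ".", "1.2.3" }) {
            System.out.println("\"" + s + "\" -> " + classify(s));
        }
    }
}
```

The transition table is the part a generator computes from the regular expressions; the driver loop in classify stays the same no matter which tokens are specified.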

JAVACC

The tokens described in Image 2.2 are specified in JavaCC as shown in Program 2.9. A JavaCC specification starts with an optional list of options followed by a Java compilation unit enclosed between PARSER_BEGIN(name) and PARSER_END(name). The same name must follow PARSER_BEGIN and PARSER_END; it will be the name of the generated parser (MyParser in Program 2.9). The enclosed compilation unit must contain a class declaration of the same name as the generated parser.

Program 2.9. JavaCC specification of the tokens from Image 2.2.
PARSER_BEGIN(MyParser)
 class MyParser {}
PARSER_END(MyParser)
/* For the regular expressions on the right, the token on the left will be returned: */
TOKEN : {
 < IF: "if" >
 | < #DIGIT: ["0"-"9"] >
 | < ID: ["a"-"z"] (["a"-"z"]|<DIGIT>)* >
 | < NUM: (<DIGIT>)+ >
 | < REAL: ( (<DIGIT>)+ "." (<DIGIT>)* ) |
 ( (<DIGIT>)* "." (<DIGIT>)+ )>
}
/* The regular expressions here will be skipped during lexical analysis: */
SKIP : {
 <"--" (["a"-"z"])* ("\n" | "\r" | "\r\n")>
 | " "
 | "\t"
 | "\n"
}
/* If we have a substring that does not match any of the regular expressions in TOKEN or SKIP,
 JavaCC will automatically throw an error. */
void Start() :
{}
{ ( <IF> | <ID> | <NUM> | <REAL> )* }




After the compilation unit comes a list of grammar productions of the following kinds: a regular-expression production defines a token, a token-manager declaration can be used by the generated lexical analyzer, and two other kinds are used to define the grammar from which the parser is generated. A lexical specification uses regular-expression productions, of which there are four kinds: TOKEN, SKIP, MORE, and SPECIAL_TOKEN. We will need only TOKEN and SKIP for the compiler project in this tutorial. The kind TOKEN specifies that the matched string should be transformed into a token that is communicated to the parser. The kind SKIP specifies that the matched string should be thrown away. In Program 2.9, the specifications of ID, NUM, and REAL use the abbreviation DIGIT. The definition of DIGIT is preceded by # to indicate that it can be used only in the definition of other tokens.
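The difference between TOKEN and SKIP can be seen operationally in the hand-written Java sketch below. It is not code generated by JavaCC: the class TokenSkipSketch, the method tokenize, the rule names COMMENT and WS, and the java.util.regex translations of the token definitions are all assumptions made for illustration. The sketch assumes the longest-match rule with ties going to the earlier production, and it assumes that an ID may continue with any number of letters or digits. It signals a lexical error with IllegalArgumentException as a stand-in for the error JavaCC itself would throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of TOKEN and SKIP productions; not JavaCC output.
final class TokenSkipSketch {
    // One rule: a kind name, a pattern, and whether matches are thrown away.
    private static final class Rule {
        final String kind;
        final Pattern pattern;
        final boolean skip;

        Rule(String kind, String regex, boolean skip) {
            this.kind = kind;
            this.pattern = Pattern.compile(regex);
            this.skip = skip;
        }
    }

    // Order matters only for ties: the earlier rule wins on equal length.
    private static final Rule[] RULES = {
        new Rule("IF", "if", false),
        new Rule("ID", "[a-z][a-z0-9]*", false),
        new Rule("NUM", "[0-9]+", false),
        new Rule("REAL", "[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+", false),
        new Rule("COMMENT", "--[a-z]*(\\r\\n|\\n|\\r)", true),
        new Rule("WS", "[ \\t\\n]", true),
    };

    /** Returns the TOKEN matches as strings like "ID(x)"; SKIP matches are dropped. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule r : RULES) {
                Matcher m = r.pattern.matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = r;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("Lexical error at offset " + pos);
            }
            if (!best.skip) {
                out.add(best.kind + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IF(if), ID(if8), NUM(42), REAL(3.), ID(x)]
        System.out.println(tokenize("if if8 42 3. --abc\nx"));
    }
}
```

In this sketch the input if8 becomes a single ID because the ID rule matches three characters while IF matches only two, whereas a bare if becomes IF because both rules match two characters and IF is listed first.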

The last part of Program 2.9 begins with void Start. It is a production which, in this case, allows the generated lexer to recognize any of the four defined tokens in any order. The next chapter will explain productions in detail.

SABLECC

The tokens described in Image 2.2 are specified in SableCC as shown in Program 2.10. A SableCC specification file has six sections (all optional):

  1. Package declaration: specifies the root package for all classes generated by SableCC.
  2. Helper declarations: a list of abbreviations.
  3. State declarations: support the state feature of, for example, GNU FLEX; when the lexer is in some state, only the tokens associated with that state are recognized. States can be used for many purposes, including the detection of a beginning-of-line state, with the purpose of recognizing tokens only if they appear at the beginning of a line. For the compiler described in this tutorial, states are not needed.
  4. Token declarations: each one is used to specify that the matched string should be transformed into a token that should be communicated to the parser.
  5. Ignored tokens: each one is used to specify that the matched string should be thrown away.
  6. Productions: are used to define the grammar from which the parser is generated.
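
Although states are not needed for the compiler in this tutorial, the beginning-of-line idea from the description of state declarations can be sketched by hand in Java. This is not SableCC syntax or generated code: the class LineStateSketch, the state names, and the token kinds DIRECTIVE and WORD are hypothetical and do not belong to the language of Image 2.2. The sketch assumes that a word starting with # counts as a DIRECTIVE only when it is the first word on its line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-state lexer; DIRECTIVE and WORD are illustrative token kinds.
final class LineStateSketch {
    private enum State { LINE_START, MID_LINE }

    /** Labels each whitespace-separated word according to the current lexer state. */
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        State state = State.LINE_START;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                // Only a newline sends the lexer back to the line-start state.
                if (c == '\n') state = State.LINE_START;
                pos++;
                continue;
            }
            int end = pos;
            while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            String text = input.substring(pos, end);
            // The DIRECTIVE rule is active only in the LINE_START state.
            boolean directive = state == State.LINE_START && text.startsWith("#");
            out.add((directive ? "DIRECTIVE" : "WORD") + "(" + text + ")");
            state = State.MID_LINE;
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [DIRECTIVE(#inc), WORD(a), WORD(b), WORD(#c)]
        System.out.println(scan("#inc a\nb #c"));
    }
}
```

The same text #c is classified differently depending on where it appears, which is the effect that state declarations provide in a generated lexer.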

Program 2.10. SableCC specification of the tokens from Image 2.2.
Helpers
 digit = ['0'..'9'];
Tokens
 if = 'if';
 id = ['a'..'z'](['a'..'z'] | (digit))*;
 number = digit+;
 real = ((digit)+ '.' (digit)*) |
 ((digit)* '.' (digit)+);
 whitespace = (' ' | '\t' | '\n')+;
 comments = ('--' ['a'..'z']* '\n');
Ignored Tokens
 whitespace,
 comments;


