
LEXICAL TOKENS

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. A programming language classifies lexical tokens into a finite set of token types. For example, some of the token types of a typical programming language are:

Type      Examples
ID        foo  n14  last
NUM       082
REAL      1e67  5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )
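Inside a lexer, a finite set of token types such as the one above is commonly represented as an enumeration. The following is a minimal sketch, assuming the type names from the table; the enum and the helper function are illustrative, not taken from any particular lexer implementation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token-type enumeration; names follow the table above. */
enum token_type { ID, NUM, REAL, IF, COMMA, NOTEQ, LPAREN, RPAREN };

/* Map a token type to a printable name, e.g. for debugging output. */
const char *token_name(enum token_type t) {
    static const char *names[] =
        { "ID", "NUM", "REAL", "IF", "COMMA", "NOTEQ", "LPAREN", "RPAREN" };
    return names[t];
}
```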

Punctuation tokens such as IF, VOID, RETURN, constructed from alphabetic characters, are called reserved words and, in most languages, cannot be used as identifiers. Examples of nontokens are:

comment                       /* try again */
preprocessor directive        #include<stdio.h>
preprocessor directive        #define NUMS 5, 6
macro                         NUMS
blanks, tabs, and newlines

In languages weak enough to require a macro preprocessor, the preprocessor operates on the source character stream, producing another character stream that is then fed to the lexical analyzer. It is also possible to integrate macro processing with lexical analysis. Given a program such as

float match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.;
}


the lexical analyzer will return the stream

FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE IF LPAREN BANG ID(strncmp) LPAREN ID(s)
COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
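A token such as ID(match0) or NUM(3) pairs a token type with a semantic value. One way to represent this is a struct holding the type and a union of possible value kinds; this is a sketch under assumed names (the `struct token` layout and `make_id` helper are hypothetical, not from any specific lexer).

```c
#include <assert.h>
#include <string.h>

/* Illustrative token representation: a type tag plus a semantic value. */
enum token_type { ID, NUM, REAL, STRING, IF, LPAREN, RPAREN, EOF_TOK };

struct token {
    enum token_type type;
    union {
        const char *sval;   /* for ID and STRING tokens */
        int         ival;   /* for NUM tokens */
        double      rval;   /* for REAL tokens */
    } value;
};

/* Construct an identifier token such as ID(match0). */
struct token make_id(const char *name) {
    struct token t;
    t.type = ID;
    t.value.sval = name;
    return t;
}
```

Tokens like COMMA or LPAREN would leave the value field unused, since their type alone carries all the information.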


where the token-type of each token is reported; some of the tokens, such as identifiers and literals, have semantic values attached to them, giving auxiliary information in addition to the token-type. How should the lexical rules of a programming language be described? In what language should a lexical analyzer be written? We can describe the lexical tokens of a language in English; here is a description of identifiers in C or Java:

An identifier is a sequence of letters and digits; the first character must be a letter. The underscore _ counts as a letter. Upper- and lowercase letters are different. If the input stream has been parsed into tokens up to a given character, the next token is taken to include the longest string of characters that could possibly constitute a token. Blanks, tabs, newlines, and comments are ignored except as they serve to separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants.
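The identifier rule above, including the longest-match principle, is straightforward to code by hand. The following is a minimal sketch assuming that rule; the function name `scan_identifier` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Scan the longest prefix of s that forms an identifier: a letter
   (underscore counts as a letter) followed by letters and digits.
   Returns the identifier's length, or 0 if s does not begin with one. */
size_t scan_identifier(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;                       /* first character must be a letter */
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;                            /* longest match: keep consuming */
    return n;
}
```

For example, on the input "n14 last" this consumes the three characters of n14 and stops at the blank, which separates it from the next token.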

Any reasonable programming language can be used to implement an ad hoc lexer. But we will specify lexical tokens using the formal language of regular expressions, implement lexers using deterministic finite automata, and use mathematics to connect the two. This will lead to simpler and more readable lexical analyzers.

