A nondeterministic finite automaton (NFA) is one that has a choice of edges - labeled with the same symbol - to follow out of a state. Or it may have special edges labeled with ∊ (the Greek letter epsilon) that can be followed without eating any symbol from the input. Here is an example of an NFA:

In the start state, on input character `a`, the automaton can move either right or left. If left is chosen, then strings of `a`'s whose length is a multiple of three will be accepted. If right is chosen, then even-length strings will be accepted. Thus, the language recognized by this NFA is the set of all strings of `a`'s whose length is a multiple of two or three. On the first transition, this machine must choose which way to go. It is required to accept the string if there is *any* choice of paths that will lead to acceptance. Thus, it must "guess", and must always guess correctly. Edges labeled with ∊ may be taken without using up a symbol from the input. Here is another NFA that accepts the same language:

Again, the machine must choose which ∊-edge to take. If there is a state with some ∊-edges and some edges labeled by symbols, the machine can choose to eat an input symbol (and follow the corresponding symbol-labeled edge), or to follow an ∊-edge instead.

Nondeterministic automata are a useful notion because it is easy to convert a (static, declarative) regular expression to a (simulatable, quasi-executable) NFA. The conversion algorithm turns each regular expression into an NFA with a *tail* (start edge) and a *head* (ending state). For example, the single-symbol regular expression **a** converts to the NFA

The regular expression **ab**, made by combining **a** with **b** using concatenation, is made by combining the two NFAs, hooking the head of **a** to the tail of **b**. The resulting machine has a tail labeled by **a** and a head into which the **b** edge flows.

In general, any regular expression *M* will have some NFA with a tail and head:

We can define the translation of regular expressions to NFAs by induction. Either an expression is primitive (a single symbol or ∊) or it is made from smaller expressions. Similarly, the NFA will be primitive or made from smaller NFAs. Image 2.6 shows the rules for translating regular expressions to nondeterministic automata. We illustrate the algorithm on some of the expressions in Image 2.2 - for the tokens IF, ID, NUM, and **error**. Each expression is translated to an NFA, the "head" state of each NFA is marked final with a different token type, and the tails of all the expressions are joined to a new start node. The result - after some merging of equivalent NFA states - is shown in Image 2.7.

Image 2.6: Translation of regular expressions to NFAs.

Image 2.7: Four regular expressions translated to an NFA.

As we saw in , implementing deterministic finite automata (DFAs) as computer programs is easy. But implementing NFAs is a bit harder, since most computers don't have good "guessing" hardware. We can avoid the need to guess by trying every possibility at once. Let us simulate the NFA of Image 2.7 on the string `in`. We start in state 1. Now, instead of guessing which ∊-transition to take, we just say that at this point the NFA might take any of them, so it is in one of the states {1, 4, 9, 14}; that is, we compute the ∊-*closure* of {1}. Clearly, there are no other states reachable without eating the first character of the input. Now, we make the transition on the character `i`. From state 1 we can reach 2, from 4 we reach 5, from 9 we go nowhere, and from 14 we reach 15. So we have the set f2, 5, 15g. But again we must compute the ∊-closure: From 5 there is an ∊-transition to 8, and from 8 to 6. So the NFA must be in one of the states {2, 5, 6, 8, 15}. On the character `n`, we get from state 6 to 7, from 2 to nowhere, from 5 to nowhere, from 8 to nowhere, and from 15 to nowhere. So we have the set {7}; its ∊-closure is {6, 7, 8}. Now we are at the end of the string `in`; is the NFA in a final state? One of the states in our possible-states set is 8, which is final. Thus, `in` is an ID token. We formally define ∊-closure as follows. Let **edge**(*s, c*) be the set of all NFA states reachable by following a single edge with label *c* from state *s*. For a set of states *S*, **closure**(*S*) is the set of states that can be reached from a state in *S* without consuming any of the input, that is, by going only through ∊-edges. Mathematically, we can express the idea of going through ∊-edges by saying that **closure**(*S*) is the smallest set *T* such that

We can calculate *T* by iteration:

Why does this algorithm work? *T* can only grow in each iteration, so the final *T* must include *S*. If *T* = *T*′ after an iteration step, then *T* must also include . Finally, the algorithm must terminate, because there are only a finite number of distinct states in the NFA. Now, when simulating an NFA as described above, suppose we are in a set *d* = {*s _{i}; s_{k}; s_{l}*} of NFA states

Using **DFAedge**, we can write the NFA simulation algorithm more formally. If the start state of the NFA is *s*_{1}, and the input string is *c*_{1},…, *c _{k}*, then the algorithm is

Manipulating sets of states is expensive - too costly to want to do on every character in the source program that is being lexically analyzed. But it is possible to do all the sets-of-states calculations in advance. We make a DFA from the NFA, such that each set of NFA states corresponds to one DFA state. Since the NFA has a finite number *n* of states, the DFA will also have a finite number (at most 2^{n}) of states. DFA construction is easy once we have **closure** and **DFAedge** algorithms. The DFA start state *d*_{1} is just **closure**(*s*_{1}), as in the NFA simulation algorithm. Abstractly, there is an edge from *d _{i}* to

The algorithm does not visit unreachable states of the DFA. This is extremely important, because in principle the DFA has 2^{n} states, but in practice we usually find that only about *n* of them are reachable from the start state. It is important to avoid an exponential blowup in the size of the DFA interpreter's transition tables, which will form part of the working compiler. A state *d* is *final* in the DFA if any NFA state in states[*d*] is final in the NFA. Labeling a state *final* is not enough; we must also say what token is recognized; and perhaps several members of states[*d*] are final in the NFA. In this case we label *d* with the token-type that occurred first in the list of regular expressions that constitute the lexical specification. This is how *rule priority* is implemented. After the DFA is constructed, the "states" array may be discarded, and the "trans" array is used for lexical analysis. Applying the DFA construction algorithm to the NFA of Image 2.7 gives the automaton in Image 2.8.

Image 2.8: NFA converted to DFA.

This automaton is suboptimal. That is, it is not the smallest one that recognizes the same language. In general, we say that two states *s*_{1} and *s*_{2} are equivalent when the machine starting in *s*_{1} accepts a string σ if and only if starting in *s*_{2} it accepts σ. This is certainly true of the states labeled and in Image 2.8, and of the states labeled and . In an automaton with two equivalent states *s*_{1} and *s*_{2}, we can make all of *s*_{2}'s incoming edges point to *s*_{1} instead and delete *s*_{2}. How can we find equivalent states? Certainly, *s*_{1} and *s*_{2} are equivalent if they are both final or both nonfinal and, for any symbol *c*, trans[*s*_{1}, *c*] = trans[*s*_{2}, *c*]; and satisfy this criterion. But this condition is not sufficiently general; consider the automaton

Here, states 2 and 4 are equivalent, but trans[2, *a*] ≠ trans[4, *a*].

After constructing a DFA it is useful to apply an algorithm to minimize it by finding equivalent states; see Exercise 2.6.

JaVa |
Previous Next |