
Lexical Analysis - An application of FSM

An important application of finite state machines is lexical analysis, the first stage of any compiler. In lexical analysis, the source code is read one character at a time and tokenized. Each word in the source code must be recognized and categorized, based on its pattern, as an identifier, a specific keyword, an integer, a floating point number, a string, a left or right parenthesis, and so on.

For example, given the following piece of code,

    if (txn >= 20)
      rxn = txn * 100;
    else
      rxn = txn - 10;

the lexical analyzer scans through the code, determines the boundaries of each word (called a lexeme), and assigns each one the appropriate token. The output is shown below.

    IF LPAREN ID GEQ NUM RPAREN
      ID ASSIGN ID TIMES NUM SEMICOLON
    ELSE
      ID ASSIGN ID MINUS NUM SEMICOLON

More precisely, the analyzer also stores the lexeme along with the token, as shown below.

    IF,if   LPAREN,(   ID,txn   GEQ,>=   NUM,20   RPAREN,)

      ID,rxn   ASSIGN,=   ID,txn   TIMES,*   NUM,100   SEMICOLON,;  

    ELSE,else

      ID,rxn   ASSIGN,=   ID,txn   MINUS,-   NUM,10   SEMICOLON,;
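To make this output concrete, a (token, lexeme) pair can be modeled as a small class. The sketch below is only an illustration; TokenType and Token are hypothetical names, not part of any standard or generated API.

    // Minimal sketch of a (token type, lexeme) pair; the names here
    // are illustrative, not a standard API.
    enum TokenType { IF, ELSE, ID, NUM, ASSIGN, TIMES, MINUS, GEQ,
                     LPAREN, RPAREN, SEMICOLON }

    class Token {
        final TokenType type;    // the category, e.g. ID
        final String lexeme;     // the matched text, e.g. "txn"

        Token(TokenType type, String lexeme) {
            this.type = type;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return type + "," + lexeme;   // prints e.g. ID,txn
        }
    }

The lexical analyzer's output is then simply a stream of such Token objects, consumed by the next stage of the compiler (the parser).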

Note that the alphabet Σ in this case is the entire ASCII character set.

Σ = {0, ..., 9, A, ..., Z, a, ..., z, +, -, *, /, (, ), [, ], {, }, \n, \t, space, =, >, <,...}

Lexemes are usually specified by regular expressions similar to those given below.

    LEXEME                  TOKEN TYPE      LEXEME      TOKEN TYPE      LEXEME      TOKEN TYPE
    {alpha}{alphanum}*      ID              if          IF              (           LPAREN
    {digit}+                NUM             else        ELSE            )           RPAREN
    {digit}+(.{digit}+)?    REAL            for         FOR             ;           SEMICOLON
    "{alphanum|space}"      STRING          while       WHILE           ==          EQ
    =                       ASSIGN          int         INT             !=          NEQ
    +                       PLUS            float       FLOAT           <           LT
    -                       MINUS           switch      SWITCH          <=          LEQ
    *                       TIMES           case        CASE            >           GT
    /                       DIV             do          DO              >=          GEQ


The following shorthands are defined for convenience. They help keep the regular expressions short.

    alpha ← denotes a single letter. i.e. a|b|...|z|A|B|...|Z
    digit ← denotes a single digit. i.e. 0|1|...|9
    alphanum ← denotes either a single letter or a single digit. i.e. alpha | digit
    space ← denotes a whitespace character. i.e. ' ' | '\t' | '\n'
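In a regular expression library, these shorthands correspond directly to character classes. Below is a minimal sketch using java.util.regex; the class-to-shorthand mapping is an assumption that follows the definitions above.

    import java.util.regex.Pattern;

    public class LexPatterns {
        // Character-class equivalents of the shorthands (assumed mapping).
        static final String ALPHA    = "[A-Za-z]";
        static final String DIGIT    = "[0-9]";
        static final String ALPHANUM = "[A-Za-z0-9]";

        // Lexeme patterns built from the shorthands, as in the table above.
        static final Pattern ID   = Pattern.compile(ALPHA + ALPHANUM + "*");
        static final Pattern NUM  = Pattern.compile(DIGIT + "+");
        static final Pattern REAL = Pattern.compile(DIGIT + "+(\\." + DIGIT + "+)?");

        public static void main(String[] args) {
            System.out.println(ID.matcher("txn").matches());    // true
            System.out.println(NUM.matcher("100").matches());   // true
            System.out.println(REAL.matcher("3.14").matches()); // true
        }
    }

Note that a production lexer does not match each pattern separately like this; the patterns are compiled together into a single DFA, as described next.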

The lexeme-defining regular expressions are turned into an NFA, which is then converted to a DFA. The DFA serves as the lexical analyzer. Usually, this process of accepting the user-specified regular expressions and constructing the (minimized) DFA is automated by a tool such as Lex or JLex. The output is a C or Java program that contains the logic to categorize each lexeme.
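The following hand-written sketch shows the principle of driving such a DFA: the scanner repeatedly starts in the initial state, follows transitions as far as possible, and emits the longest lexeme matched. It handles only ID, NUM, and single-character operators (collapsed into a hypothetical OP token), and illustrates the idea rather than the table-driven code that Lex or JLex actually generates.

    // Hand-coded sketch of a DFA-style scanner for a tiny lexeme set.
    public class TinyScanner {
        public static void main(String[] args) {
            String input = "rxn = txn * 100;";
            int pos = 0;
            while (pos < input.length()) {
                char c = input.charAt(pos);
                if (Character.isWhitespace(c)) { pos++; continue; }
                int start = pos;
                if (Character.isLetter(c)) {
                    // ID state: consume {alpha}{alphanum}*
                    while (pos < input.length()
                           && Character.isLetterOrDigit(input.charAt(pos))) pos++;
                    System.out.println("ID," + input.substring(start, pos));
                } else if (Character.isDigit(c)) {
                    // NUM state: consume {digit}+
                    while (pos < input.length()
                           && Character.isDigit(input.charAt(pos))) pos++;
                    System.out.println("NUM," + input.substring(start, pos));
                } else {
                    // Single-character operators, collapsed into one OP token here.
                    pos++;
                    System.out.println("OP," + c);
                }
            }
        }
    }

Running this on the input prints ID,rxn OP,= ID,txn OP,* NUM,100 OP,; in the same token,lexeme form shown earlier.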

JLex is a lexical analyzer generator that produces Java code to recognize token types automatically. The instructions to get started with JLex are given here. For a full-fledged manual, visit the JLex website.
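As a rough illustration, a JLex specification covering a few of the lexemes above might look like the sketch below. The exact directives vary between versions (and between JLex and its successor JFlex), so treat this as the general shape of a specification and consult the manual for the precise syntax.

    %%
    %type String
    ALPHA=[A-Za-z]
    DIGIT=[0-9]
    %%
    "if"                        { return "IF"; }
    "else"                      { return "ELSE"; }
    {ALPHA}({ALPHA}|{DIGIT})*   { return "ID," + yytext(); }
    {DIGIT}+                    { return "NUM," + yytext(); }
    ">="                        { return "GEQ"; }
    "="                         { return "ASSIGN"; }
    [\ \t\n]+                   { /* skip whitespace */ }

JLex applies the longest-match rule and breaks ties by rule order, so the keyword rules must appear before the general identifier rule; otherwise if would be tokenized as an ID.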