Lexical Analysis - An application of FSM
An important application of finite state machines is lexical analysis, the first stage of any compiler. In lexical analysis, the source code is read one character at a time and tokenized: each word in the source code is recognized by its pattern and categorized as an identifier, a specific keyword, an integer, a floating point number, a string, a left or right parenthesis, and so on.
For example, given the following piece of code:
if (txn >= 20)
    rxn = txn * 100;
else
    rxn = txn - 10;
the lexical analyzer scans through the code, determines the boundaries of each word (called a lexeme), and assigns each one the appropriate token. The output is shown below.
IF LPAREN ID GEQ NUM RPAREN
ID ASSIGN ID TIMES NUM SEMICOLON
ELSE
ID ASSIGN ID MINUS NUM SEMICOLON
More precisely, it also stores the lexeme along with the token, as shown below (a sketch of such a token representation follows the listing).
IF,if LPAREN,( ID,txn GEQ,>= NUM,20 RPAREN,)
ID,rxn ASSIGN,= ID,txn TIMES,* NUM,100 SEMICOLON,;
ELSE,else
ID,rxn ASSIGN,= ID,txn MINUS,- NUM,10 SEMICOLON,;
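To make this pairing concrete, a token can be modeled as a small value class holding the token type and the lexeme. The following is a minimal Java sketch; the names TokenType and Token are illustrative, not taken from any particular tool.

    // Minimal sketch: a token is a (type, lexeme) pair. Names are illustrative.
    enum TokenType {
        IF, ELSE, LPAREN, RPAREN, ID, NUM, ASSIGN, TIMES, MINUS, GEQ, SEMICOLON
    }

    final class Token {
        final TokenType type;  // category, e.g. ID
        final String lexeme;   // the matched word itself, e.g. "txn"

        Token(TokenType type, String lexeme) {
            this.type = type;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return type + "," + lexeme;  // prints e.g. "ID,txn"
        }
    }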
Note that the alphabet Σ in this case is the entire ASCII character set.
Σ = {0, ..., 9, A, ..., Z, a, ..., z, +, -, *, /, (, ), [, ], {, }, \n, \t, space, =, >, <,...}
The lexemes are usually specified by regular expressions similar to those given below.
LEXEME | TOKEN TYPE
--- | ---
{alpha}{alphanum}* | ID
{digit}+ | NUM
{digit}+(.{digit}+)? | REAL
"{alphanum|space}" | STRING
= | ASSIGN
+ | PLUS
- | MINUS
* | TIMES
/ | DIV
if | IF
else | ELSE
for | FOR
while | WHILE
int | INT
float | FLOAT
switch | SWITCH
case | CASE
do | DO
( | LPAREN
) | RPAREN
; | SEMICOLON
== | EQ
!= | NEQ
< | LT
<= | LEQ
> | GT
>= | GEQ
The following shorthands are defined for convenience; they help keep the regular expressions short. (A scanner sketch that uses these definitions follows the list.)
alpha ← denotes a single letter, i.e. a|b|...|z|A|B|...|Z
digit ← denotes a single digit, i.e. 0|1|...|9
alphanum ← denotes either a single letter or a single digit, i.e. alpha | digit
space ← denotes a whitespace character, i.e. ' ' | '\t' | '\n'
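To make the table concrete, here is a rough sketch of a scanner that applies these patterns with Java's java.util.regex, ordering the alternatives so that longer matches win (REAL before NUM, >= before =) and separating keywords out after matching, since keywords also fit the identifier pattern. The class and group names are made up for this sketch, and a real lexer generator would compile the patterns into a single DFA rather than call Java's regex engine.

    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative regex-driven scanner; names and structure are invented for this sketch.
    public class SimpleLexer {
        // Shorthand character classes, mirroring alpha / digit / alphanum above.
        static final String ALPHA = "[A-Za-z]";
        static final String DIGIT = "[0-9]";
        static final String ALNUM = "[A-Za-z0-9]";

        // One named group per token class. REAL is listed before NUM so that
        // "3.14" is matched whole instead of stopping at "3".
        static final Pattern TOKENS = Pattern.compile(
              "(?<WS>\\s+)"
            + "|(?<REAL>" + DIGIT + "+\\." + DIGIT + "+)"
            + "|(?<NUM>" + DIGIT + "+)"
            + "|(?<ID>" + ALPHA + ALNUM + "*)"
            + "|(?<GEQ>>=)|(?<LEQ><=)|(?<EQ>==)|(?<NEQ>!=)"
            + "|(?<ASSIGN>=)|(?<PLUS>\\+)|(?<MINUS>-)|(?<TIMES>\\*)|(?<DIV>/)"
            + "|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<SEMICOLON>;)");

        // Keywords match the ID pattern; they are recognized after the match.
        static final Set<String> KEYWORDS =
            Set.of("if", "else", "for", "while", "int", "float", "switch", "case", "do");

        static final List<String> GROUPS = List.of("WS", "REAL", "NUM", "ID",
            "GEQ", "LEQ", "EQ", "NEQ", "ASSIGN", "PLUS", "MINUS", "TIMES", "DIV",
            "LPAREN", "RPAREN", "SEMICOLON");

        public static void main(String[] args) {
            String src = "if (txn >= 20) rxn = txn * 100; else rxn = txn - 10;";
            Matcher m = TOKENS.matcher(src);
            while (m.lookingAt()) {                  // match at the current position
                for (String g : GROUPS) {
                    if (m.group(g) == null) continue;
                    if (g.equals("ID") && KEYWORDS.contains(m.group()))
                        System.out.println(m.group().toUpperCase() + "," + m.group());
                    else if (!g.equals("WS"))
                        System.out.println(g + "," + m.group());
                    break;
                }
                m.region(m.end(), src.length());     // advance past the lexeme
            }
        }
    }

Running this prints token,lexeme pairs such as IF,if and GEQ,>= in the same format as the listing above.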
The lexemes are specified in the form of regular expressions. These expressions are first converted into an NFA, which is then converted to a DFA; the DFA serves as the lexical analyzer. Usually, this process of accepting user-specified regular expressions and constructing the (minimized) DFA is automated by a tool such as Lex or JLex. The output is a C or Java program that contains the logic to categorize each lexeme. A hand-coded example of such a DFA is sketched below.
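To make the regular-expression-to-DFA step concrete, here is a hand-coded sketch of the kind of table-driven DFA such a tool might emit, restricted to just the NUM ({digit}+) and REAL ({digit}+(.{digit}+)?) patterns. The state numbering, table layout, and names are invented for illustration.

    // Illustrative table-driven DFA for NUM = {digit}+ and REAL = {digit}+(.{digit}+)?.
    public class NumberDfa {
        static final int ERR = -1;
        // transition[state][charClass]; char classes: 0 = digit, 1 = '.', 2 = other
        static final int[][] TRANSITION = {
            { 1, ERR, ERR },  // state 0: start
            { 1,   2, ERR },  // state 1: seen {digit}+           -> accepts NUM
            { 3, ERR, ERR },  // state 2: seen {digit}+ '.'          (not accepting)
            { 3, ERR, ERR },  // state 3: seen {digit}+.{digit}+  -> accepts REAL
        };
        // Token reported for each state; null marks a non-accepting state.
        static final String[] ACCEPTS = { null, "NUM", null, "REAL" };

        static int charClass(char c) {
            if (c >= '0' && c <= '9') return 0;
            return c == '.' ? 1 : 2;
        }

        // Maximal munch: run the DFA while transitions exist, then report the
        // token of the last accepting state that was reached.
        static String classify(String s) {
            int state = 0, lastAccept = ERR;
            for (int i = 0; i < s.length(); i++) {
                int next = TRANSITION[state][charClass(s.charAt(i))];
                if (next == ERR) break;
                state = next;
                if (ACCEPTS[state] != null) lastAccept = state;
            }
            return lastAccept == ERR ? "reject" : ACCEPTS[lastAccept];
        }

        public static void main(String[] args) {
            System.out.println(classify("20"));    // NUM
            System.out.println(classify("3.14"));  // REAL
            System.out.println(classify("3."));    // NUM (backs up to the last accept, "3")
            System.out.println(classify("x"));     // reject
        }
    }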
JLex is a lexical analyzer generator that produces Java code to recognize token types automatically. The instructions to get started with JLex are given here. For a full-fledged manual, visit the JLex website.
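For a taste of what such a specification looks like, here is a rough fragment written in the Lex/JLex style: three sections separated by %%, with directives and macro definitions in the second section and pattern/action rules in the third. The directive names and action details below are approximate and should be checked against the JLex manual before use.

    /* user code section: copied verbatim into the generated Java file */

    %%
    %class Lexer
    %type String

    ALPHA=[A-Za-z]
    DIGIT=[0-9]
    ALNUM=[A-Za-z0-9]

    %%

    "if"              { return "IF"; }
    "else"            { return "ELSE"; }
    ">="              { return "GEQ"; }
    "="               { return "ASSIGN"; }
    {DIGIT}+          { return "NUM," + yytext(); }
    {ALPHA}{ALNUM}*   { return "ID," + yytext(); }
    [ \t\n]           { /* skip whitespace */ }

When both a keyword rule and the identifier rule match the same lexeme (e.g. if), Lex-style tools break the tie in favor of the rule listed first, which is why the keyword rules appear before the {ALPHA}{ALNUM}* rule.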