Lexical Analysis - An application of FSM
An important application of finite state machines is lexical analysis, the first stage of any compiler. In lexical analysis, the source code is read one character at a time and tokenized: each word in the source code is recognized by its pattern and categorized as an identifier, a specific keyword, an integer, a floating point number, a string, a left or right parenthesis, and so on.
For example, given the following piece of code:
if (txn >= 20)
    rxn = txn * 100;
else
    rxn = txn - 10;
the lexical analyzer scans through the code, determines the boundaries of each word (called a lexeme), and assigns each one the appropriate token. The output is shown below.
IF LPAREN ID GEQ NUM RPAREN
ID ASSIGN ID TIMES NUM SEMICOLON
ELSE
ID ASSIGN ID MINUS NUM SEMICOLON
More precisely, it also stores the lexeme along with the token, as shown below (a sketch of such a token representation follows the listing).
IF,if LPAREN,( ID,txn GEQ,>= NUM,20 RPAREN,)
ID,rxn ASSIGN,= ID,txn TIMES,* NUM,100 SEMICOLON,;
ELSE,else
ID,rxn ASSIGN,= ID,txn MINUS,- NUM,10 SEMICOLON,;
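To make this pairing concrete, a token can be modeled as a small value class holding the token type and the lexeme. The following is a minimal Java sketch; the names TokenType and Token are illustrative, not taken from any particular tool.

    // Minimal sketch: a token is a (type, lexeme) pair. Names are illustrative.
    enum TokenType {
        IF, ELSE, LPAREN, RPAREN, ID, NUM, ASSIGN, TIMES, MINUS, GEQ, SEMICOLON
    }

    final class Token {
        final TokenType type;  // category, e.g. ID
        final String lexeme;   // the matched word itself, e.g. "txn"

        Token(TokenType type, String lexeme) {
            this.type = type;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return type + "," + lexeme;  // prints e.g. "ID,txn"
        }
    }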
Note that the alphabet Σ in this case is the entire ASCII character set.
Σ = {0, ..., 9, A, ..., Z, a, ..., z, +, -, *, /, (, ), [, ], {, }, \n, \t, space, =, >, <,...}
The lexemes are usually specified by regular expressions similar to those given below.
LEXEME | TOKEN TYPE
--- | ---
{alpha}{alphanum}* | ID
{digit}+ | NUM
{digit}+(.{digit}+)? | REAL
"{alphanum|space}" | STRING
= | ASSIGN
+ | PLUS
- | MINUS
* | TIMES
/ | DIV
if | IF
else | ELSE
for | FOR
while | WHILE
int | INT
float | FLOAT
switch | SWITCH
case | CASE
do | DO
( | LPAREN
) | RPAREN
; | SEMICOLON
== | EQ
!= | NEQ
< | LT
<= | LEQ
> | GT
>= | GEQ
The following shorthands are defined for convenience; they help keep the regular expressions short. (A scanner sketch that uses these definitions follows the list.)
alpha ← denotes a single letter, i.e. a|b|...|z|A|B|...|Z
digit ← denotes a single digit, i.e. 0|1|...|9
alphanum ← denotes either a single letter or a single digit, i.e. alpha | digit
space ← denotes a whitespace character, i.e. ' ' | '\t' | '\n'
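To make the table concrete, here is a rough sketch of a scanner that applies these patterns with Java's java.util.regex, ordering the alternatives so that longer matches win (REAL before NUM, >= before =) and separating keywords out after matching, since keywords also fit the identifier pattern. The class and group names are made up for this sketch, and a real lexer generator would compile the patterns into a single DFA rather than call Java's regex engine.

    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative regex-driven scanner; names and structure are invented for this sketch.
    public class SimpleLexer {
        // Shorthand character classes, mirroring alpha / digit / alphanum above.
        static final String ALPHA = "[A-Za-z]";
        static final String DIGIT = "[0-9]";
        static final String ALNUM = "[A-Za-z0-9]";

        // One named group per token class. REAL is listed before NUM so that
        // "3.14" is matched whole instead of stopping at "3".
        static final Pattern TOKENS = Pattern.compile(
              "(?<WS>\\s+)"
            + "|(?<REAL>" + DIGIT + "+\\." + DIGIT + "+)"
            + "|(?<NUM>" + DIGIT + "+)"
            + "|(?<ID>" + ALPHA + ALNUM + "*)"
            + "|(?<GEQ>>=)|(?<LEQ><=)|(?<EQ>==)|(?<NEQ>!=)"
            + "|(?<ASSIGN>=)|(?<PLUS>\\+)|(?<MINUS>-)|(?<TIMES>\\*)|(?<DIV>/)"
            + "|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<SEMICOLON>;)");

        // Keywords match the ID pattern; they are recognized after the match.
        static final Set<String> KEYWORDS =
            Set.of("if", "else", "for", "while", "int", "float", "switch", "case", "do");

        static final List<String> GROUPS = List.of("WS", "REAL", "NUM", "ID",
            "GEQ", "LEQ", "EQ", "NEQ", "ASSIGN", "PLUS", "MINUS", "TIMES", "DIV",
            "LPAREN", "RPAREN", "SEMICOLON");

        public static void main(String[] args) {
            String src = "if (txn >= 20) rxn = txn * 100; else rxn = txn - 10;";
            Matcher m = TOKENS.matcher(src);
            while (m.lookingAt()) {                  // match at the current position
                for (String g : GROUPS) {
                    if (m.group(g) == null) continue;
                    if (g.equals("ID") && KEYWORDS.contains(m.group()))
                        System.out.println(m.group().toUpperCase() + "," + m.group());
                    else if (!g.equals("WS"))
                        System.out.println(g + "," + m.group());
                    break;
                }
                m.region(m.end(), src.length());     // advance past the lexeme
            }
        }
    }

Running this prints token,lexeme pairs such as IF,if and GEQ,>= in the same format as the listing above.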
The lexemes are specified in the form of regular expressions. These expressions are first converted into an NFA, which is then converted to a DFA; the DFA serves as the lexical analyzer. Usually, this process of accepting user-specified regular expressions and constructing the (minimized) DFA is automated by a tool such as Lex or JLex. The output is a C or Java program that contains the logic to categorize each lexeme. A hand-coded example of such a DFA is sketched below.
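To make the regular-expression-to-DFA step concrete, here is a hand-coded sketch of the kind of table-driven DFA such a tool might emit, restricted to just the NUM ({digit}+) and REAL ({digit}+(.{digit}+)?) patterns. The state numbering, table layout, and names are invented for illustration.

    // Illustrative table-driven DFA for NUM = {digit}+ and REAL = {digit}+(.{digit}+)?.
    public class NumberDfa {
        static final int ERR = -1;
        // transition[state][charClass]; char classes: 0 = digit, 1 = '.', 2 = other
        static final int[][] TRANSITION = {
            { 1, ERR, ERR },  // state 0: start
            { 1,   2, ERR },  // state 1: seen {digit}+           -> accepts NUM
            { 3, ERR, ERR },  // state 2: seen {digit}+ '.'          (not accepting)
            { 3, ERR, ERR },  // state 3: seen {digit}+.{digit}+  -> accepts REAL
        };
        // Token reported for each state; null marks a non-accepting state.
        static final String[] ACCEPTS = { null, "NUM", null, "REAL" };

        static int charClass(char c) {
            if (c >= '0' && c <= '9') return 0;
            return c == '.' ? 1 : 2;
        }

        // Maximal munch: run the DFA while transitions exist, then report the
        // token of the last accepting state that was reached.
        static String classify(String s) {
            int state = 0, lastAccept = ERR;
            for (int i = 0; i < s.length(); i++) {
                int next = TRANSITION[state][charClass(s.charAt(i))];
                if (next == ERR) break;
                state = next;
                if (ACCEPTS[state] != null) lastAccept = state;
            }
            return lastAccept == ERR ? "reject" : ACCEPTS[lastAccept];
        }

        public static void main(String[] args) {
            System.out.println(classify("20"));    // NUM
            System.out.println(classify("3.14"));  // REAL
            System.out.println(classify("3."));    // NUM (backs up to the last accept, "3")
            System.out.println(classify("x"));     // reject
        }
    }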
JLex is a lexical analyzer generator that produces Java code to recognize token types automatically. The instructions to get started with JLex are given here. For a full-fledged manual, visit the JLex website.
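For a taste of what such a specification looks like, here is a rough fragment written in the Lex/JLex style: three sections separated by %%, with directives and macro definitions in the second section and pattern/action rules in the third. The directive names and action details below are approximate and should be checked against the JLex manual before use.

    /* user code section: copied verbatim into the generated Java file */

    %%
    %class Lexer
    %type String

    ALPHA=[A-Za-z]
    DIGIT=[0-9]
    ALNUM=[A-Za-z0-9]

    %%

    "if"              { return "IF"; }
    "else"            { return "ELSE"; }
    ">="              { return "GEQ"; }
    "="               { return "ASSIGN"; }
    {DIGIT}+          { return "NUM," + yytext(); }
    {ALPHA}{ALNUM}*   { return "ID," + yytext(); }
    [ \t\n]           { /* skip whitespace */ }

When both a keyword rule and the identifier rule match the same lexeme (e.g. if), Lex-style tools break the tie in favor of the rule listed first, which is why the keyword rules appear before the {ALPHA}{ALNUM}* rule.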