Lex in compiler design is a program used to generate scanners, also called lexical analyzers or tokenizers. These scanners identify lexical patterns in the input program and convert the input text into a sequence of tokens. Lex is commonly used together with the YACC parser generator.
In this article, we will learn what Lex in compiler design is and its role in the compilation process. We will also learn about lexical analysis and the terminology related to it.
What is Lexical Analysis?
Compilation happens in multiple phases, and lexical analysis is the first phase of a compiler. It takes the preprocessed source code that comes as the preprocessor's output and generates a stream of tokens from it, removing whitespace and comments along the way. If it encounters an invalid token, it reports an error. The lexical analyzer reads the stream of characters, identifies the legal tokens, and passes the data to the syntax analyzer when asked for it.
Uses of a Lexical Analyzer
A lexical analyzer performs the following tasks:
It removes whitespace and comments from the source program.
It correlates error messages with positions in the source program.
It helps in identifying tokens.
It reads the input characters from the source code.
Advantages of Lexical Analysis
Lexical analysis has the following advantages:
Lexical analysis allows browsers to format and display a web page with the help of parsed data.
It is an essential early step toward producing a compiled binary executable.
A generated lexical analyzer is an efficient, specialized processor for its task.
Disadvantages of Lexical Analysis
Lexical analysis has the following disadvantages:
Additional runtime is required to create the symbol table and construct the tokens, which adds overhead.
Developing and debugging the lexer and its token descriptions requires considerable effort.
Significant time is needed to read the source code and break it into tokens, as the analyzer reads the characters one by one.
What is LEX in Compiler Design?
Lex is a tool or program that creates a lexical analyzer and helps us perform the task of lexical analysis, i.e., converting a character stream into tokens. The Lex tool is a compiler itself: the Lex compiler transforms the input patterns into code that recognizes them.
Eric Schmidt and Mike Lesk initially developed the code for Lex, which was intended for Unix-based systems.
Function of Lex
Lexical Analyzer Creation: The process begins by writing a program called lex.l in the Lex language. This program defines the rules and patterns for recognizing tokens in the source code.
Lex Compiler Execution: The lex.l program is then run through the Lex compiler. This step generates a C program named lex.yy.c.
C Compiler Execution: The C compiler is then used to compile the generated lex.yy.c program. The result is an object program referred to as a.out.
Lexical Analysis: The a.out object program is essentially a lexical analyzer. When this program is run, it takes an input stream of source code and transforms it into a sequence of tokens based on the rules defined in the original lex.l program.
LEX Source Program
A LEX source program is a collection of instructions and patterns written in the LEX language. LEX uses it to generate a lexical analyzer that tokenizes input source code into meaningful units called tokens.
The LEX source program defines how these tokens are recognized and processed. It consists of regular expressions that describe the patterns of tokens, together with the actions to be taken when those patterns are encountered in the source code. The program thus serves as a set of rules that instructs LEX on how to break a stream of characters from the input source code into tokens, which can represent keywords, identifiers, operators, and other language constructs. There are two main components of a Lex source program:
Auxiliary Definitions: These are often located in the "Definitions Section" of the LEX source program. They are user-defined macros or variables that simplify the expression of regular expressions and actions in the program.
Translation Rules: These are commonly found in the "Rules Section" of the LEX source program. They establish the mapping between patterns and actions. Each translation rule consists of a regular expression pattern followed by the action to be executed when that pattern is matched in the input source code. A brief sketch of both components is given below.
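In this sketch, digit is an auxiliary definition and NUMBER is a hypothetical token code chosen for illustration:
%{
#define NUMBER 256 /* hypothetical token code for this sketch */
%}
digit [0-9]
%%
{digit}+ { return NUMBER; } /* translation rule: a pattern followed by its action */
%%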
Working of Lex
The working of Lex as a lexical analyzer generator takes place in multiple steps. First, we create a file that describes the lexical analyzer to be generated. This file is written in the Lex language and has a .l extension. The Lex compiler converts this program into a C file called lex.yy.c. The C compiler then compiles this file into an a.out executable. This a.out file is our working lexical analyzer, which produces a stream of tokens from the input text.
Lex File Format
Any Lex file consists of the following three parts:
Definitions: This section of the Lex file contains declarations of constants, variables, and regular definitions.
Rules: This section of the Lex file defines the rules as regular expressions with corresponding actions that the lexer should take to identify tokens in the input stream. Each rule consists of a regular expression followed by a code block that specifies the action to take when the regular expression matches. The rules take the form p1 {action1} p2 {action2} ... pn {actionn}, where pi is the i-th regular expression and actioni is the action to be taken when pi matches.
User Subroutines: This section contains user-defined functions that the action code blocks in the rules section can call.
As we have discussed, we use regular expressions to define the rules in Lex and to check whether the input string matches a pattern. A regular expression represents a finite pattern over a finite set of symbols. Regular expressions define a regular grammar, and a regular language is defined by a regular grammar. Some typical patterns are listed below.
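For instance, the following are typical Lex patterns (the sample lexemes are illustrative):
[0-9]+ matches an unsigned integer such as 42.
[a-zA-Z][a-zA-Z0-9]* matches an identifier such as sum1.
"while" matches the keyword while literally.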
Conflicts in Lex
Conflicts arise in Lex when a given string matches two or more different rules, or when an input string shares a common prefix with two or more rules. This causes ambiguity for the lexical analyzer, so it needs to be resolved.
Conflict resolution
Conflicts in Lex are resolved by the following two rules:
The longest matching prefix is preferred over a shorter one.
If the longest possible prefix matches two or more patterns, the pattern listed first in the Lex program is picked.
Both rules are illustrated in the sketch below.
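In this sketch, the printed labels KEYWORD and IDENTIFIER are illustrative choices:
%%
"if" { printf("KEYWORD\n"); } /* listed first, so it wins ties */
[a-z]+ { printf("IDENTIFIER\n"); } /* matches any lowercase word */
%%
For the input ifelse, the second pattern matches the longer prefix, so IDENTIFIER is printed. For the input if, both patterns match a prefix of the same length, so the first-listed rule wins and KEYWORD is printed.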
The Architecture of a Lexical Analyzer
The task of the lexical analyzer is to read the input characters of the source code and produce tokens one by one. The scanner produces tokens when the parser requests them. The lexical analyzer also skips whitespace and comments while creating tokens. If any error occurs, the analyzer correlates it with the source file and line number.
Lookahead Operator
Lex reads an extra character ahead of the valid lexeme to distinguish between different patterns for a token, and then retracts before producing the token.
However, sometimes we want a particular pattern to be matched only when certain other characters follow it. In that case, we may use a slash in a pattern to indicate the end of the part of the pattern that matches the lexeme; whatever follows the slash is lookahead and is not part of the lexeme.
For example:
In some languages, keywords are not reserved. So the statements
IF (A, B) = 10 and IF (condition) THEN
result in a conflict about whether IF is an array name or a keyword. To resolve this, the Lex rule for the keyword IF can be written as
IF / \( .* \) {letter}
Here, \( and \) match literal parentheses, and {letter} is a previously defined character class. The part after the slash is only lookahead: IF is recognized as a keyword when a parenthesized expression followed by a letter comes after it (as in THEN, with blanks being insignificant in such languages), but only IF itself is consumed as the lexeme.
An Example Implementation of a Lexical Analyzer Using Lex
The following Lex program counts the number of words:
Code
%{
#include<stdio.h>
#include<string.h>
int count = 0;
%}
/* Rules Section*/
%%
([a-zA-Z0-9])+ {count++;} /* Rule for counting the number of words */
"\n" {printf("Total Number of Words : %d\n", count); count = 0;}
%%
int yywrap(void){ return 1; } /* Returning 1 signals end of input */
int main()
{
// The function that starts the analysis
yylex();
return 0;
}
How to execute
Type lex lexfile.l to generate the C file lex.yy.c.
Type gcc lex.yy.c to compile it.
Type ./a.out to run the lexical analyzer (on Windows, the executable may be named a.exe).
Output
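A sample run might look like this, where the first line is typed as input and the second is printed by the analyzer:
this is a test
Total Number of Words : 4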
Explanation
In the definitions section of the program, we have included the standard libraries for input-output and string operations, and declared a variable count to keep track of the number of words.
In the rules section, we have specified the regular expression ([a-zA-Z0-9])+, which matches any string formed from letters and digits, such as “AYUSHI28” and “BOND007”.
There is a rule for a newline too. When a newline character is encountered, the current count of words is printed, and the counter is set to zero.
Frequently Asked Questions
How to make a compiler with lex?
Creating a compiler with Lex involves a few key steps. First, we define regular expressions that describe the patterns of tokens in the source code. Then, we write a Lex specification file that associates these regular expressions with corresponding actions.
What is lex in compiler design with example?
Lex is a lexical analyzer generator tool used in compiler design. It takes regular expressions as input and produces code for scanning input streams. For example, in Lex we can define rules to recognize the keywords of a programming language, as sketched below.
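A minimal sketch of such a rule, where the printed label is an illustrative choice:
"if" { printf("KEYWORD_IF\n"); }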
What is the function of lex?
Lex generates lexical analyzers, which are programs that break input into tokens or lexemes based on specified patterns. These lexemes serve as input for the parser. The function of lex is to automate the generation of lexical analyzers from regular expressions.
What is the purpose of the lex tool?
The purpose of the lex tool is to simplify the process of creating lexical analyzers for compilers and interpreters. It achieves this by allowing developers to specify token recognition patterns using regular expressions, which lex then translates into efficient, executable code for tokenizing input streams.
What are the three basic sections of the lex program?
The three basic sections of the lex program are:
Definition Section: Defines macros and regular expression patterns.
Rules Section: Specifies patterns and corresponding actions.
C Code Section: Contains user-defined C code to be executed alongside lex-generated code.
Conclusion
In this article, we have learned what lex in compiler design is and why it is important. Lex plays a fundamental role in compiler design by automating the creation of lexical analyzers. With its ability to generate efficient code from regular expressions, Lex simplifies the development process, allowing programmers to focus on higher-level aspects of compiler construction.