Introduction
A compiler is a program that converts high-level language code into machine language. It does this by running several phases, such as lexical analysis, syntax analysis, and semantic analysis. The lexical analyzer in compiler design plays a key role in breaking the code down into manageable components for further processing.
In this article, we will learn about the first phase of compilation, i.e., lexical analysis.
What is a Lexical Analyzer in Compiler Design?
Lexical analysis is the first phase of the compiler. The lexical analyzer receives the preprocessed source code, structured as sentences, from the language preprocessor and segments it into a sequence of tokens, eliminating whitespace (and comments) along the way. If the lexical analyzer encounters an invalid token, it reports an error.
Lexical analyzers are commonly generated from pattern specifications by tools such as lex or its modern counterpart, flex.
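To make this concrete, here is a minimal sketch of a lex/flex specification for a few token classes. The token names and patterns are illustrative assumptions, not a fixed standard:

%{
#include <stdio.h>
%}
%option noyywrap
%%
[ \t\n]+                 ;  /* skip whitespace */
"int"|"return"           { printf("KEYWORD: %s\n", yytext); }
[0-9]+                   { printf("NUMBER: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*   { printf("IDENTIFIER: %s\n", yytext); }
.                        { printf("SYMBOL: %s\n", yytext); }
%%
int main(void) {
    yylex();  /* scan standard input until end of file */
    return 0;
}

Running flex on this file generates a C scanner (lex.yy.c) that can then be compiled with any C compiler.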
Example
Calculate the total number of tokens in the below code.
Implementation in C
#include <stdio.h>

int main()
{
    int x = 0, y = 5;
    printf("First number is %d and the second number %d", x, y);
    return 0;
}
In the above example of lexical analysis, we can count 27 tokens (the #include directive is handled by the preprocessor before lexical analysis of the program proper). The tokens in the code are 'int', 'main', '(', ')', '{', 'int', 'x', '=', '0', ',', 'y', '=', '5', ';', 'printf', '(', '"First number is %d and the second number %d"', ',', 'x', ',', 'y', ')', ';', 'return', '0', ';', '}'.
The sequence of characters that makes up a token is known as a lexeme. A lexeme must follow certain predefined rules to be considered a valid token. These rules are described by patterns, and the patterns are defined using regular expressions.
Keywords, numbers, punctuation, strings, and identifiers can be considered tokens in any programming language.
Specifications of Tokens
Tokens are specified in terms of concepts borrowed from formal language theory: alphabets, strings, and special symbols. We will discuss each of them here.
Alphabets
An alphabet is any finite set of symbols. For example, {0, 1} is the binary alphabet, {0-9, A-F} is the hexadecimal alphabet, and the letters {a-z, A-Z} together with the digits {0-9} form the alphabet used by most identifiers.
Strings
A string is a finite sequence of symbols from an alphabet occurring together. The length of a string is the number of characters in it.
For example, the length of the string codingninja, written |codingninja|, is 11, as it contains 11 characters.
Special Symbols
Most of the high-level languages contain some special symbols, as shown below:
Name                 Symbols
Punctuation          comma (,), semicolon (;)
Assignment           =
Special Assignment   +=, -=, *=, /=
Comparison           ==, !=, <, >, <=, >=
Preprocessor         #
Location Specifier   & (address-of)
Logical              &&, ||, !
Shift Operators      <<, >>, >>> (unsigned right shift in languages such as Java)
Examples of Tokens
Now, let’s understand tokens with the help of examples.
Consider the line
c = a*5 + b
Then the lexemes and the corresponding tokens are as below:
Lexeme   Token
c        Identifier
=        Assignment symbol
a        Identifier
*        Multiplication symbol
5        Number
+        Addition symbol
b        Identifier
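To see how a scanner could produce this table, here is a small hand-written C sketch that tokenizes the same line. The token names and the character classes are simplifying assumptions for illustration:

#include <ctype.h>
#include <stdio.h>

int main(void) {
    const char *src = "c = a*5 + b";
    const char *p = src;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {            /* skip whitespace */
            p++;
        } else if (isalpha((unsigned char)*p)) {     /* identifier lexeme */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("Identifier: %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {     /* number lexeme */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("Number: %.*s\n", (int)(p - start), start);
        } else {                                     /* single-character operator */
            printf("Operator: %c\n", *p);
            p++;
        }
    }
    return 0;
}

Running it prints one lexeme-token pair per line, matching the table above.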
Now, let us look at a proper piece of C++ code.
#include <iostream>

// Compare the two numbers and return the larger one
int maximum(int x, int y) {
    if (y > x)
        return y;
    else
        return x;
}
As discussed above, the lexical analyzer only needs to scan and identify a finite set of valid strings (tokens) belonging to the language. It finds these tokens by matching the input against patterns.
Regular expressions define such finite patterns over a finite set of symbols. A grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.
Regular languages are used to define tokens. They have efficient implementations and are easy to understand.
Languages obey certain algebraic laws. For example:
The union of two languages, A and B, can be written as
A U B = {s | s is in A or s is in B}
The concatenation of two languages, A and B, can be written as
AB = {st | s is in A and t is in B}
For instance, if A = {a, b} and B = {0, 1}, then A U B = {a, b, 0, 1} and AB = {a0, a1, b0, b1}.
Operations
In compiler design, an operator is a symbol or a sequence of symbols that represents a specific operation or computation to be performed on one or more operands. Operators are a part of lexical analysis, which is the process of breaking down the source code into a sequence of tokens.
Operators can be classified into different categories, such as arithmetic operators (e.g., +, -, *, /), relational operators (e.g., <, >, <=, >=, ==, !=), logical operators (e.g., &&, ||, !), bitwise operators (e.g., &, |, ^), and assignment operators (e.g., =, +=, -=, *=, /=).
Notations
In lexical analysis, notations refer to the symbols or characters used to represent a particular token in the source code. Notations are used to describe the structure of the language and are often defined using regular expressions. For example, in the C programming language, the notation for an identifier is defined as a letter or underscore followed by any combination of letters, digits, and underscores.
Notations are used in lexical analysis to identify and classify tokens in the source code. The lexical analyzer scans the source code and matches the input sequence of characters with the corresponding notation to generate a token. The tokens generated by the lexical analyzer are then passed to the syntax analyzer for further processing.
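As a concrete sketch, the following C program checks strings against that identifier notation using the POSIX regex API (available on Unix-like systems; the pattern string and the sample inputs are assumptions made for this example):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* A letter or underscore followed by letters, digits, or underscores. */
    const char *pattern = "^[A-Za-z_][A-Za-z0-9_]*$";
    const char *samples[] = { "count", "_tmp1", "2fast", "x_9" };

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    for (int i = 0; i < 4; i++)
        printf("%-6s -> %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0
                   ? "identifier" : "not an identifier");
    regfree(&re);
    return 0;
}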
Precedence and Associativity
In lexical analysis of the compiler design process, precedence and associativity refer to the rules used to determine the order in which operators are evaluated in an expression.
Precedence refers to the order in which different operators are evaluated: operators with higher precedence are evaluated before those with lower precedence. For example, in the expression 5 + 6 * 2, the multiplication operator has higher precedence than the addition operator, so 6 * 2 is evaluated first, resulting in 12; the addition is then evaluated, resulting in 17. Associativity resolves the order among operators of the same precedence: subtraction, for instance, is left-associative, so 10 - 4 - 3 is evaluated as (10 - 4) - 3 = 3.
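Both rules can be observed directly in C, as in this minimal sketch:

#include <stdio.h>

int main(void) {
    printf("%d\n", 5 + 6 * 2);  /* * has higher precedence than +: prints 17 */
    printf("%d\n", 10 - 4 - 3); /* - is left-associative: (10 - 4) - 3 = 3 */
    return 0;
}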
Finite Automata
A finite automaton is a state machine: it reads a string of input symbols, changes its state on each symbol it consumes, and finally accepts or rejects the string.
The automaton changes state every time it processes a symbol. The mathematical model consists of:
A finite set of states.
A finite set of input symbols (the alphabet).
A transition function.
One start state.
A set of final states.
Finite Automata Construction
Finite automata construction is a process in automata theory and computer science that involves building a mathematical model of a system that operates on a set of input symbols to generate a desired output. This model is called a finite automaton or a finite state machine.
The construction process involves several steps, starting with defining the input symbols, or the alphabet, that the system operates on. Next, the states of the system are defined, along with the transitions between the states based on the input symbols. The transitions are represented as a directed graph, where the nodes represent the states, and the edges represent the transitions between the states based on input symbols.
Once the transition graph is constructed, it is analyzed to determine if it satisfies the requirements of the system being modelled. For example, in a deterministic finite automaton, there can be only one transition from each state for each input symbol. If the transition graph does not satisfy the requirements of the system, it may need to be modified and the construction process repeated until the model accurately represents the system being modelled.
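As a rough sketch of such a construction, the following C program encodes a small deterministic finite automaton that accepts identifiers (a letter or underscore followed by letters, digits, or underscores). The state names and the encoding of the transition function are illustrative assumptions, not a fixed convention:

#include <ctype.h>
#include <stdio.h>

/* States of the automaton; IN_ID is the only final state. */
enum state { START, IN_ID, DEAD };

/* Transition function: next state for a given state and input symbol. */
static enum state step(enum state s, char c) {
    switch (s) {
    case START: return (isalpha((unsigned char)c) || c == '_') ? IN_ID : DEAD;
    case IN_ID: return (isalnum((unsigned char)c) || c == '_') ? IN_ID : DEAD;
    default:    return DEAD;
    }
}

/* Run the automaton from the start state and check for acceptance. */
static int accepts(const char *s) {
    enum state st = START;
    for (; *s != '\0'; s++)
        st = step(st, *s);
    return st == IN_ID;
}

int main(void) {
    const char *inputs[] = { "sum", "_a1", "9lives" };
    for (int i = 0; i < 3; i++)
        printf("%-7s -> %s\n", inputs[i],
               accepts(inputs[i]) ? "accepted" : "rejected");
    return 0;
}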
Longest Match Rule
A key idea in lexical analysis is the Longest Match Rule (also called maximal munch), which is applied to identify the longest possible token in an input string. It says that when several regular expressions or patterns match a prefix of the remaining input, the lexer should pick the longest matching substring as the token. For instance, if there are patterns for both "book" and "bookstore", the Longest Match Rule ensures that the input "bookstore" is recognized as a single token, giving priority to the longer match and preventing tokenization ambiguities.
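Here is a minimal C sketch of the rule under the assumptions of the example above (a hypothetical pattern set containing just "book" and "bookstore"):

#include <stdio.h>
#include <string.h>

/* Longest Match Rule: among the patterns that match a prefix of the
   input, keep the longest one. The pattern list is hypothetical. */
static const char *patterns[] = { "book", "bookstore" };

int main(void) {
    const char *input = "bookstore";
    const char *best = NULL;

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        size_t n = strlen(patterns[i]);
        /* keep this candidate only if it matches and beats the best so far */
        if (strncmp(input, patterns[i], n) == 0 &&
            (best == NULL || n > strlen(best)))
            best = patterns[i];
    }
    printf("longest match: %s\n", best ? best : "(none)");
    return 0;
}

This prints "longest match: bookstore", showing how the longer pattern wins.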
Architecture of Lexical Analyzer
A compiler's core element is the lexical analyzer. The three main components of its architecture are input buffering, scanning, and token output. Character by character, the input source code is read during the input buffering phase. The scanning phase uses regular expressions and rules to find and classify tokens. The Token Output phase creates a symbol table or tokenized output as a final product. This architecture guarantees that the source code is processed in a sequential manner, with each stage adding to the code's accurate tokenization.
Roles and Responsibility of Lexical Analyzer
The Lexical Analyzer in a compiler plays an essential role in the compilation process. Its responsibilities include:
Reading the source code and buffering it for analysis.
Tokenizing the code by identifying and categorizing tokens.
Eliminating whitespace and comments.
Generating meaningful error messages when encountering invalid tokens (lexical errors).
Building a symbol table for later stages of compilation.
Providing a well-structured output for the parser.
Advantages of Lexical analysis
The advantages of Lexical analysis are the following:
Lexical analysis enables tools such as browsers to format and display a web page with the aid of parsed data.
It is the first step towards producing executable binary code; the later compiler phases build on the token stream it produces.
It makes the compiler more efficient by separating the specialised task of scanning from the rest of compilation.
Disadvantages of Lexical analysis
The disadvantages of Lexical analysis are the following:
Reading the source code and splitting it into tokens takes a significant amount of time.
It requires additional development and testing effort for the lexer and its token descriptions.
It takes additional runtime to create the lexer tables and to construct the tokens.
Q. What is the main purpose of the lexical analysis?
Lexical analysis is like the first step of a compiler. It takes the code you've written and turns it into smaller pieces called tokens. It also gets rid of any empty spaces in the code to make it easier to work with.
Q. What is the lexical analyzer also called?
The lexical analyzer, also known as a lexer or scanner, is like a word detective. It reads code and breaks it into small pieces called tokens, making it easier for the computer to understand.
Q. What is the difference between lexical analyzer and parser?
A lexical analyzer reads code to identify words like keywords and variables. A parser takes these tokens to understand their relationships, forming a structured representation, like a syntax tree.
Q. Which characters are ignored during lexical analysis?
The lexical analyzer ignores whitespace, comments, and tabs while dividing the program into tokens. Its main purpose is to remove the extra spaces, read the program, and convert it into tokens for the later phases.
Conclusion
In this article, we have extensively discussed lexical analysis in compiler design: tokens, lexemes, specifications of tokens, examples of tokens, regular expressions, finite automata, the roles and responsibilities of the lexical analyzer, its architecture, and its advantages and disadvantages.