Table of contents
1. Introduction
2. What is Lexical Analyzer in compiler design?
   2.1. Example
   2.2. Implementation in C
3. What are tokens?
   3.1. Specifications of Tokens
        3.1.1. Alphabets
        3.1.2. Strings
        3.1.3. Special Symbols
   3.2. Examples of Tokens
4. Regular Expression
5. Operations
6. Notations
7. Precedence and Associativity
8. Finite Automata
9. Finite Automata Construction
10. Longest Match Rule
11. Architecture of Lexical Analyzer
12. Roles and Responsibility of Lexical Analyzer
13. Advantages of Lexical analysis
14. Disadvantages of Lexical analysis
15. Frequently Asked Questions
    15.1. Q. What is the main purpose of the lexical analysis?
    15.2. Q. What is the lexical analyzer also called?
    15.3. Q. What is the difference between lexical analyzer and parser?
    15.4. Q. Which characters are ignored during lexical analysis?
16. Conclusion
Last Updated: Sep 3, 2024

Introduction to Lexical Analysis

Author: Naman Kukreja

Introduction

A compiler is a program that converts high-level language code into machine language. It does this through a series of phases, such as lexical analysis, syntax analysis, and semantic analysis. The lexical analyzer in compiler design plays a key role in breaking the code into manageable components for further processing.


In this article, we will learn about the first phase of compilation, i.e., lexical analysis.

What is Lexical Analyzer in compiler design?

Lexical analysis is the initial stage of the compiler. It receives the source code, as transformed by the language preprocessor, in the form of sentences. The lexical analyzer segments these sentences into a sequence of tokens, eliminating whitespace in the source code. If the lexical analyzer encounters an invalid token, it reports an error.

Lexical analyzers are commonly generated with a tool such as lex (or its successor flex), which takes a specification of token patterns and produces the analyzer automatically.
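
For illustration, here is a minimal sketch of what a lex/flex specification might look like. The patterns and print actions below are our own toy example; a real compiler would return token codes to the parser instead of printing.

%{
#include <stdio.h>
int yylex(void);
%}

%%
[0-9]+                  { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+                ;  /* whitespace is discarded, not tokenized */
.                       { printf("INVALID: %s\n", yytext); }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }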

Example

Calculate the total number of tokens in the code below.

Implementation in C

int main()
{
    int x = 0, y = 5;
    printf("First number is %d and the second number %d", x, y);
    return 0;
}

In the above example of lexical analysis, we can count 27 tokens. The tokens in the code are: 'int', 'main', '(', ')', '{', 'int', 'x', '=', '0', ',', 'y', '=', '5', ';', 'printf', '(', '"First number is %d and the second number %d"', ',', 'x', ',', 'y', ')', ';', 'return', '0', ';', '}'.


What are tokens?

A token is a sequence of characters; the character sequence that forms a token is known as a lexeme. A lexeme must follow certain predefined rules to be considered a valid token. These rules are defined using a pattern: the pattern describes what the token can be, and these patterns are defined by means of regular expressions.

Keywords, numbers, punctuation, strings, and identifiers can be considered tokens in any programming language.

Specifications of Tokens

Tokens are specified in terms of a few basic concepts from language theory: alphabets, strings, and special symbols. We will discuss each of them here.

Alphabets

An alphabet is a finite set of symbols. For example, {0, 1} is the binary alphabet, and {0-9, A-F} is the hexadecimal alphabet. In most programming languages, the letters {a-z, A-Z} and the digits {0-9} make up the alphabet of the language.

Strings

A string is a finite sequence of symbols drawn from an alphabet, occurring continuously. The length of a string is the number of characters in it.

For example, the length of the string codingninja, written |codingninja|, is 11, as it contains 11 characters.

Special Symbols

Most of the high-level languages contain some special symbols, as shown below:

Name                  Symbols
Punctuation           Comma (,), Semicolon (;)
Assignment            =
Special Assignment    +=, -=, *=, /=
Comparison            ==, !=, <, >, <=, >=
Preprocessor          #
Location Specifier    &
Logical               &&, ||, !
Shift Operator        >>, <<, >>>

Examples of Tokens

Now, let’s understand tokens with the help of examples.

Consider the line:

c = a*5 + b

Then the lexemes and their corresponding tokens are as below:

Lexeme    Token
c         Identifier
=         Assignment symbol
a         Identifier
*         Multiplication symbol
5         Number
+         Addition symbol
b         Identifier

Now we will understand this with a proper piece of C++ code.

#include <iostream>

int maximum(int x, int y) {
    // This will compare two numbers
    if (y > x)
        return y;
    else {
        return x;
    }
}

The corresponding tokens table is shown below:

Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword

The corresponding non-token table is shown below:

Type                    Examples
Comment                 // Example
Preprocessor directive  #include <iostream>


Regular Expression

As discussed above, the lexical analyzer only needs to identify or scan a finite set of valid tokens/strings belonging to the language. It searches for these patterns by following the rules.

Regular expressions define finite patterns over the language's finite set of symbols. Regular grammar is the grammar defined by regular expressions, and regular language is the language defined by a regular grammar.

Regular languages are used to define tokens. They can be implemented efficiently and are easy to understand.

Languages follow certain laws, such as:

  • Union of two languages A and B: A U B = {s | s is in A or s is in B}
  • Concatenation of two languages A and B: AB = {st | s is in A and t is in B}

For example, if A = {a, b} and B = {c}, then A U B = {a, b, c} and AB = {ac, bc}.

Operations

In compiler design, an operator is a symbol or a sequence of symbols that represents a specific operation or computation to be performed on one or more operands. Operators are a part of lexical analysis, which is the process of breaking down the source code into a sequence of tokens.

Operators can be classified into different categories, such as arithmetic operators (e.g., +, -, *, /), relational operators (e.g., <, >, <=, >=, ==, !=), logical operators (e.g., &&, ||, !), bitwise operators (e.g., &, |, ^), and assignment operators (e.g., =, +=, -=, *=, /=).
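
As a small illustration (the table and function names here are our own, not from any particular compiler), a scanner can classify operator lexemes with a simple lookup table:

#include <stdio.h>
#include <string.h>

/* Hypothetical table mapping operator lexemes to their categories. */
struct op_entry { const char *lexeme; const char *category; };

static const struct op_entry ops[] = {
    {"+", "arithmetic"}, {"-", "arithmetic"}, {"*", "arithmetic"}, {"/", "arithmetic"},
    {"<", "relational"}, {">", "relational"}, {"==", "relational"}, {"!=", "relational"},
    {"&&", "logical"},   {"||", "logical"},   {"!", "logical"},
    {"&", "bitwise"},    {"|", "bitwise"},    {"^", "bitwise"},
    {"=", "assignment"}, {"+=", "assignment"}, {"-=", "assignment"},
};

const char *classify(const char *lexeme) {
    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++)
        if (strcmp(ops[i].lexeme, lexeme) == 0)
            return ops[i].category;
    return "unknown";
}

int main(void) {
    printf("%s\n", classify("=="));  /* prints: relational */
    printf("%s\n", classify("+="));  /* prints: assignment */
    return 0;
}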

Notations

In lexical analysis, notations refer to the symbols or characters used to represent a particular token in the source code. Notations are used to describe the structure of the language and are often defined using regular expressions. For example, in the C programming language, the notation for an identifier is defined as a letter or underscore followed by any combination of letters, digits, and underscores.

Notations are used in lexical analysis to identify and classify tokens in the source code. The lexical analyzer scans the source code and matches the input sequence of characters with the corresponding notation to generate a token. The tokens generated by the lexical analyzer are then passed to the syntax analyzer for further processing.
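
As a rough sketch of how such a notation can be checked in code, the C function below (our own illustration) tests whether a string matches the C identifier pattern just described, i.e., a letter or underscore followed by any combination of letters, digits, and underscores:

#include <ctype.h>
#include <stdio.h>

/* Return 1 if s matches [A-Za-z_][A-Za-z0-9_]*, else 0. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                          /* must start with a letter or '_' */
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                      /* later characters: letter, digit, or '_' */
    return 1;
}

int main(void) {
    printf("%d\n", is_identifier("_count1"));  /* prints 1 */
    printf("%d\n", is_identifier("9lives"));   /* prints 0 */
    return 0;
}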

Precedence and Associativity

In lexical analysis of the compiler design process, precedence and associativity refer to the rules used to determine the order in which operators are evaluated in an expression.

Precedence refers to the order in which operators are evaluated. Operators with higher precedence are evaluated before those with lower precedence. For example, in the expression 5 + 6 * 2, the multiplication operator has higher precedence than the addition operator, so 6 * 2 is evaluated first, resulting in 12. Then, the addition operator is evaluated, resulting in 17.

Associativity determines the order of evaluation when operators have the same precedence. For example, subtraction is left-associative, so 10 - 4 - 3 is evaluated as (10 - 4) - 3 = 3, not as 10 - (4 - 3) = 9.

Finite Automata

A finite automaton is a machine that changes state. It is also known as a state-transition machine. It reads a string of symbols one at a time, changes its state according to each symbol, and finally accepts or rejects the string as its output.

The automaton changes its state for each symbol of the input string it consumes. The mathematical model consists of:

  • A finite set of states.
  • One start state.
  • A transition function.
  • A set of final (accepting) states.
  • A finite set of input symbols (the alphabet).

A minimal code sketch follows this list.
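
To make this concrete, here is a minimal hand-written DFA in C. The machine chosen is our own illustration: it accepts binary strings containing an even number of 1s.

#include <stdio.h>

/* DFA over the alphabet {0, 1} that accepts strings with an even
   number of 1s. S0 is the start state and the only final state. */
enum state { S0, S1 };

int accepts(const char *input) {
    enum state s = S0;                 /* one start state */
    for (; *input; input++) {          /* apply the transition function */
        if (*input == '1')
            s = (s == S0) ? S1 : S0;   /* flip parity on every '1' */
        else if (*input != '0')
            return 0;                  /* symbol outside the alphabet */
    }
    return s == S0;                    /* set of final states: {S0} */
}

int main(void) {
    printf("%d\n", accepts("1001"));   /* prints 1: two 1s */
    printf("%d\n", accepts("111"));    /* prints 0: three 1s */
    return 0;
}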

Finite Automata Construction

Finite automata construction is a process in automata theory and computer science that involves building a mathematical model of a system that operates on a set of input symbols to generate a desired output. This model is called a finite automaton or a finite state machine.

The construction process involves several steps, starting with defining the input symbols, or the alphabet, that the system operates on. Next, the states of the system are defined, along with the transitions between the states based on the input symbols. The transitions are represented as a directed graph, where the nodes represent the states, and the edges represent the transitions between the states based on input symbols.

Once the transition graph is constructed, it is analyzed to determine if it satisfies the requirements of the system being modelled. For example, in a deterministic finite automaton, there can be only one transition from each state for each input symbol. If the transition graph does not satisfy the requirements of the system, it may need to be modified and the construction process repeated until the model accurately represents the system being modelled.
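
The transitions can equally be stored as a table with one row per state and one column per input symbol; the table is the transition graph in array form. Below is a table-driven version of the same even-number-of-1s machine sketched above:

#include <stdio.h>

/* Transition table next_state[state][symbol] for the even-1s DFA:
   column 0 is input '0', column 1 is input '1'. */
static const int next_state[2][2] = {
    { 0, 1 },   /* from state 0: '0' stays, '1' goes to state 1 */
    { 1, 0 },   /* from state 1: '0' stays, '1' returns to state 0 */
};

int accepts(const char *input) {
    int s = 0;                               /* start state */
    for (; *input; input++) {
        if (*input != '0' && *input != '1')
            return 0;                        /* not in the alphabet */
        s = next_state[s][*input - '0'];     /* follow one edge of the graph */
    }
    return s == 0;                           /* state 0 is the accepting state */
}

int main(void) {
    printf("%d\n", accepts("0110"));         /* prints 1: two 1s */
    return 0;
}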

Longest Match Rule

A key idea in lexical analysis is the Longest Match Rule (also known as maximal munch), which is applied to identify the longest possible token (a unit of language) in an input string. It says that when several regular expressions or patterns match a prefix of the input, the lexer should pick the longest matching substring as the token. For instance, if there are patterns for both "book" and "bookstore", the Longest Match Rule ensures that the input "bookstore" is recognized as a single token, giving priority to the longer match to prevent tokenization ambiguities.
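
A small sketch of the rule in C, assuming just the two fixed patterns from the example above (a real lexer would match regular expressions rather than literal keywords):

#include <stdio.h>
#include <string.h>

/* Candidate patterns; fixed keywords here purely for illustration. */
static const char *patterns[] = { "book", "bookstore" };

/* Return the length of the longest pattern that matches a prefix of
   the input, i.e., apply the Longest Match Rule. */
size_t longest_match(const char *input) {
    size_t best = 0;
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        size_t len = strlen(patterns[i]);
        if (strncmp(input, patterns[i], len) == 0 && len > best)
            best = len;
    }
    return best;
}

int main(void) {
    printf("%zu\n", longest_match("bookstore"));  /* prints 9, not 4 */
    return 0;
}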

Architecture of Lexical Analyzer

A compiler's core element is the lexical analyzer. The three main components of its architecture are input buffering, scanning, and token output. Character by character, the input source code is read during the input buffering phase. The scanning phase uses regular expressions and rules to find and classify tokens. The Token Output phase creates a symbol table or tokenized output as a final product. This architecture guarantees that the source code is processed in a sequential manner, with each stage adding to the code's accurate tokenization.
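
As a rough sketch of these three stages (our own toy example, not a production design), the loop below buffers one character at a time, scans digit and identifier lexemes, and prints each token as its output:

#include <ctype.h>
#include <stdio.h>

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {            /* input buffering: one character at a time */
        if (isspace(c))
            continue;                           /* whitespace is not a token */
        if (isdigit(c)) {                       /* scanning: classify and extend the lexeme */
            printf("TOKEN(number): ");          /* token output */
            while (c != EOF && isdigit(c)) { putchar(c); c = getchar(); }
            putchar('\n');
            if (c != EOF) ungetc(c, stdin);     /* push back the lookahead character */
        } else if (isalpha(c) || c == '_') {
            printf("TOKEN(identifier): ");
            while (c != EOF && (isalnum(c) || c == '_')) { putchar(c); c = getchar(); }
            putchar('\n');
            if (c != EOF) ungetc(c, stdin);
        } else {
            printf("TOKEN(symbol): %c\n", c);   /* single-character token */
        }
    }
    return 0;
}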

Roles and Responsibility of Lexical Analyzer

The Lexical Analyzer in a compiler plays an essential role in the compilation process. Its responsibilities include:

  • Reading the source code and buffering it for analysis.
     
  • Tokenizing the code by identifying and categorizing tokens.
     
  • Eliminating whitespace and comments.
     
  • Generating meaningful error messages when encountering syntax errors.
     
  • Building a symbol table for later stages of compilation.
     
  • Providing a well-structured output for the parser.

Advantages of Lexical analysis

The advantages of Lexical analysis are the following:

  • Lexical analysis enables browsers to format and display a web page with the aid of parsed data.
     
  • It is in charge of producing an executable binary code.
     
  • It makes the processing of the task more efficient and specialised.

Disadvantages of Lexical analysis

The disadvantages of Lexical analysis are the following:

  • The lexical analyser takes a significant amount of time to read the source code and split it into tokens.
     
  • It requires additional development and testing for the lexer and its token descriptions.
     
  • It requires additional runtime to create the lexer tables and construct the tokens.


Frequently Asked Questions

Q. What is the main purpose of the lexical analysis?

Lexical analysis is like the first step of a compiler. It takes the code you've written and turns it into smaller pieces called tokens. It also gets rid of any empty spaces in the code to make it easier to work with.

Q. What is the lexical analyzer also called?

The lexical analyzer, also known as a lexer or scanner, is like a word detective. It reads code and breaks it into small pieces called tokens, making it easier for the computer to understand.

Q. What is the difference between lexical analyzer and parser?

A lexical analyzer reads code to identify words like keywords and variables. A parser takes these tokens to understand their relationships, forming a structured representation, like a syntax tree.

Q. Which characters are ignored during lexical analysis?

The lexical analyzer ignores white spaces, comments, tabs, and newlines, and divides the program into tokens. Its main purpose is to remove these extra spaces, read the program, and convert it into tokens for the further phases.

Conclusion

In this article, we have extensively discussed lexical analysis in compiler design: tokens, lexemes, specifications and examples of tokens, regular expressions, finite automata, the architecture of the lexical analyzer, its roles and responsibilities, and its advantages and disadvantages.
