Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
What are Tokens?
2.1.
Different Types of Tokens
2.1.1.
1. Keywords
2.1.2.
2. Identifier
2.1.3.
3. Operators
2.1.4.
4. Punctuations
3.
Specification of Token
4.
1. Strings
4.1.
Operations on String
4.1.1.
1. Prefix
4.1.2.
2. Suffix
4.1.3.
3. Substring
4.1.4.
4. Subsequence
4.1.5.
5. Concatenation
5.
2. Language
5.1.
Operations on Language
5.1.1.
1. Union
5.1.2.
2. Concatenation
5.1.3.
3. Kleene closure
5.1.4.
Positive Closure
6.
3. Regular Expression
6.1.
Writing Regular Expressions
7.
Frequently Asked Questions
7.1.
What is lexical analysis in compiler design?
7.2.
What is the difference between lexemes and tokens?
7.3.
How are regular sets different from non-regular sets?
8.
Conclusion
Last Updated: Mar 27, 2024

Specification of Tokens in Compiler Design

Author Ravi Khorwal
3 upvotes
Master Python: Predicting weather forecasts
Speaker
Ashwin Goyal
Product Manager @

Introduction

Hello Ninjas! I hope you must have heard of tokens. To recall, tokens are the smallest individual unit and the building block of any programming language. Our program's tokens are keywords, numbers, punctuation, strings, and identifiers. Don't worry if these terms seem new; we'll discuss them in great detail. 

 

 specification of tokens in compiler design

 

In this article, we will learn what the specification of tokens in compiler design is meant. We will also learn why the specification of tokens in compiler design is important for us and the benefits of the specification of tokens in compiler design. 

So without any delays, let's dive deep into the topic.

What are Tokens?

A token is the smallest individual element of a program that is meaningful to the compiler. It cannot be further broken down. Identifiers, strings, keywords, etc., can be the example of the token. In the lexical analysis phase of the compiler, the program is converted into a stream of tokens.

Different Types of Tokens

There can be multiple types of tokens. Some of them are-

1. Keywords

Keywords are words reserved for particular purposes and imply a special meaning to the compilers. The keywords must not be used for naming a variable, function, etc.

2. Identifier

The names given to various components in the program, like the function's name or variable's name, etc., are called identifiers. Keywords cannot be identifiers.

3. Operators

Operators are different symbols used to perform different operations in a programming language.

4. Punctuations

Punctuations are special symbols that separate different code elements in a programming language.

Consider the following line of code in C++ language -

int x = 45;

The above statement has multiple tokens, which are- 

Keywords: int

Identifier: x45

Operators: =

Punctuators: ;

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Specification of Token

In compiler design, there are three specifications of token-

  1. String
  2. Language
  3. Regular Expressions

1. Strings

Strings are a finite set of symbols or characters. These symbols can be a digit or an alphabet. There is also an empty string which is denoted by ε

Operations on String

The operations that can be performed on a string are-

1. Prefix

The prefix of String S is any string that is extracted by removing zero or more characters from the end of string S. For example, if the String is "NINJA", the prefix can be "NIN" which is obtained by removing "JA" from that String. A string is a prefix in itself.

Proper prefixes are special types of prefixes that are not equal to the String itself or equal to ε. We obtain it by removing at least one character from the end of the String.

2. Suffix

The suffix of string S is any string that is extracted by removing any number of characters from the beginning of string S. For example, if the String is "NINJA", the suffix can be "JA," which is obtained by removing "NIN" from that String. A string is a suffix of itself.

Proper suffixes are special types of suffixes that are not equal to the String itself or equal to ε. It is obtained by removing at least one character from the beginning of the String.

3. Substring

A substring of a string is any string obtained by removing any prefixes and suffixes of that String. For example, if the String is "AYUSHI," then the substring can be "US," which is formed by removing the prefix "AY" and suffix "HI." Every String is a substring of itself.

Proper substrings are special types that are not equal to the String itself or equal to ε. It is obtained by removing at least one prefix or suffix from the String.

4. Subsequence

The subsequence of the String is a string obtained by eliminating zero or more symbols from the String. The symbols that are removed need not be consecutive. For example, if the String is "NINJAMANIA," then a subsequence can be "NIJAANIA," which is produced by removing "N" and "M." 

Proper subsequences are special subsequences that are not equal to the String itself or equal to ε. It is obtained by removing at least one symbol from the String.

5. Concatenation

Concatenation is defined as the addition of two strings. For example, if we have two strings S=" Cod" and T=" ing," then the concatenation ST would be "Coding."  

2. Language

A language can be defined as a finite set of strings over some symbols or alphabets.

Operations on Language

The following operations are performed on a language in the lexical analysis phase-

1. Union

Union is one of the most common operations we perform on a set. In terms of languages also, it will hold a similar meaning.

Suppose there are two languages, L and S. Then the union of these two languages will be

 L ∪ S will be equal to { xx belongs to either L or S }

For example If L = {a, b} and S = {c, d}Then L ∪ S = {a, b, c, d}

2. Concatenation

Concatenation links two languages by linking the strings from one language to all the strings of the other language.

If there are two languages, L and S, then the concatenation of L and  will be LS equal to { ls | where l belongs to  L and s belongs to S }.

For example, there are two languages, L and S, such that { L`L"} is the set of strings belonging to language and { S,` S, "S`"} is the set of strings belonging to language S.

Then the concatenation of L and will be LS will be {L'S`, L'S ", L``S`L``S "}

3. Kleene closure

Kleene closure of a language is denoted by L*provides a set of all the strings that can be obtained by concatenating zero or more times.

If  L = {a, b}

then L*{ε, a, b, aa, bb, aaa, bbb, …}

Positive Closure

L+ denotes the Positive closure of a language L and provides a set of all the strings that can be obtained by concatenating one or more times.

If  L{a, b}

then L+  = {a, b, aa, bb, aaa, bbb, …}

3. Regular Expression

Regular expressions are strings of characters that define a searching pattern with the help of which we can form a language, and each regular expression represents a language.

A regular expressioncan denote a language L(r) which can be built recursively over the smaller regular expression by following some rules.

Writing Regular Expressions

Following symbols are used very frequently to write regular expressions 

  • The asterisk symbol ( * ): It is used in our regular expression to instruct the compiler that the symbol that preceded the * symbol can be repeated any number of times in the pattern. For example, if the expression is ab*c, then it gives the following string- ac, abc, abbc, abbbc, abbbbbc.. and so on.
  • The plus symbol ( + ): It is used in our regular expression to tell the compiler that the symbol that preceded + can be repeated one or more times in the pattern. For example, if the expression is ab+c, then it gives the following string- abc, abbc, abbbc, abbbbbc.. and so on.
  • Wildcard Character ( . ): The. A symbol, also known as the wildcard character, is a character in our regular expression that can be replaced by another character.
  • Character Class: It is a way of representing multiple characters. For example,  [a – z] denotes the regular expression a | b | c | d | ….|z.

The following rules are used to define a regular expression r over some alphabet Σ and the languages denoted by these regular expressions.

  • It ε is a regular expression that denotes a language L(ε). The language L(ε) has a set of strings {ε} which means that this language has a single string which is the empty String.
  • If there is a symbol 'a' in Σ, then 'a' is a regular expression that denotes a language L(a). The language L(a) = {a}, i.e., the language has only one String of length, and the String holds 'a' in the first position.
  • Consider the two regular expressions r and s then:

r|s denotes the language L(r) ∪ L(s).

(r) (s) denotes the language L(r) ⋅ L(s).

(r)* denotes the language (L(r))*.

(r)+ denotes the language L(r)

Frequently Asked Questions

What is lexical analysis in compiler design?

The compilation process takes place in multiple phases, and the first phase of the compiler is lexical analysis. It compiles changed source code from the language preprocessor written as sentences, removes whitespace and comments from the source code, and converts the input strings into a list of tokens.

What is the difference between lexemes and tokens?

A token is a sequence of characters treated as the fundamental unit of a programming language that is not further broken down. In contrast, lexemes are a sequence of characters matched with the help of predefined language rules for every lexeme to be specified as a valid token.

How are regular sets different from non-regular sets?

Regular and nonregular sets are two different sets of languages in compiler design. A language that can be defined using regular expression is called a regular set. In contrast, a language that cannot be determined with the help of regular grammar is called a nonregular set. The nonregular set is represented using context-free grammar.  

Conclusion

Congrats, Ninja!! You've learned what the specification of tokens in compiler design means and why is it necessary. We have also discussed what tokens are and what are different types of tokens. We have also added some FAQs related to the specification of tokens in compiler design. 

Recommended Reading-

Thanks for reading this article. I hope you would have found the blog insightful and understood the specifications of tokens in compiler design, and please upvote if you liked the article.

Also, check out some of the Guided Paths on topics such as Data Structure and AlgorithmsCompetitive ProgrammingOperating SystemsComputer Networks, DBMSand  System Design, etc. as well as some Contests, Test SeriesInterview Bundles, and some Interview Experiences curated by top Industry Experts only on Coding Ninjas Studio.

Happy Learning!!

 

Previous article
LL Parser in Compiler Design
Next article
Recursive Descent Parser
Live masterclass