Different Types of Tokens
There can be multiple types of tokens. Some of them are-
1. Keywords
Keywords are words reserved for particular purposes and imply a special meaning to the compilers. The keywords must not be used for naming a variable, function, etc.
2. Identifier
The names given to various components in the program, like the function's name or variable's name, etc., are called identifiers. Keywords cannot be identifiers.
3. Operators
Operators are different symbols used to perform different operations in a programming language.
4. Punctuations
Punctuations are special symbols that separate different code elements in a programming language.
Consider the following line of code in C++ language -
int x = 45;
The above statement has multiple tokens, which are-
Keywords: int
Identifier: x , 45
Operators: =
Punctuators: ;
Specification of Token
In compiler design, there are three specifications of token-
- String
- Language
- Regular Expressions
1. Strings
Strings are a finite set of symbols or characters. These symbols can be a digit or an alphabet. There is also an empty string which is denoted by ε.
Operations on String
The operations that can be performed on a string are-
1. Prefix
The prefix of String S is any string that is extracted by removing zero or more characters from the end of string S. For example, if the String is "NINJA", the prefix can be "NIN" which is obtained by removing "JA" from that String. A string is a prefix in itself.
Proper prefixes are special types of prefixes that are not equal to the String itself or equal to ε. We obtain it by removing at least one character from the end of the String.
2. Suffix
The suffix of string S is any string that is extracted by removing any number of characters from the beginning of string S. For example, if the String is "NINJA", the suffix can be "JA," which is obtained by removing "NIN" from that String. A string is a suffix of itself.
Proper suffixes are special types of suffixes that are not equal to the String itself or equal to ε. It is obtained by removing at least one character from the beginning of the String.
3. Substring
A substring of a string S is any string obtained by removing any prefixes and suffixes of that String. For example, if the String is "AYUSHI," then the substring can be "US," which is formed by removing the prefix "AY" and suffix "HI." Every String is a substring of itself.
Proper substrings are special types that are not equal to the String itself or equal to ε. It is obtained by removing at least one prefix or suffix from the String.
4. Subsequence
The subsequence of the String is a string obtained by eliminating zero or more symbols from the String. The symbols that are removed need not be consecutive. For example, if the String is "NINJAMANIA," then a subsequence can be "NIJAANIA," which is produced by removing "N" and "M."
Proper subsequences are special subsequences that are not equal to the String itself or equal to ε. It is obtained by removing at least one symbol from the String.
5. Concatenation
Concatenation is defined as the addition of two strings. For example, if we have two strings S=" Cod" and T=" ing," then the concatenation ST would be "Coding."
2. Language
A language can be defined as a finite set of strings over some symbols or alphabets.
Operations on Language
The following operations are performed on a language in the lexical analysis phase-
1. Union
Union is one of the most common operations we perform on a set. In terms of languages also, it will hold a similar meaning.
Suppose there are two languages, L and S. Then the union of these two languages will be
L ∪ S will be equal to { x | x belongs to either L or S }
For example If L = {a, b} and S = {c, d}Then L ∪ S = {a, b, c, d}
2. Concatenation
Concatenation links two languages by linking the strings from one language to all the strings of the other language.
If there are two languages, L and S, then the concatenation of L and S will be LS equal to { ls | where l belongs to L and s belongs to S }.
For example, there are two languages, L and S, such that { L`, L"} is the set of strings belonging to language L and { S,` S, "S`"} is the set of strings belonging to language S.
Then the concatenation of L and S will be LS will be {L'S`, L'S ", L``S`, L``S "}
3. Kleene closure
Kleene closure of a language L is denoted by L*provides a set of all the strings that can be obtained by concatenating L zero or more times.
If L = {a, b}
then L* = {ε, a, b, aa, bb, aaa, bbb, …}
Positive Closure
L+ denotes the Positive closure of a language L and provides a set of all the strings that can be obtained by concatenating L one or more times.
If L = {a, b}
then L+ = {a, b, aa, bb, aaa, bbb, …}
3. Regular Expression
Regular expressions are strings of characters that define a searching pattern with the help of which we can form a language, and each regular expression represents a language.
A regular expression r can denote a language L(r) which can be built recursively over the smaller regular expression by following some rules.
Writing Regular Expressions
Following symbols are used very frequently to write regular expressions
- The asterisk symbol ( * ): It is used in our regular expression to instruct the compiler that the symbol that preceded the * symbol can be repeated any number of times in the pattern. For example, if the expression is ab*c, then it gives the following string- ac, abc, abbc, abbbc, abbbbbc.. and so on.
- The plus symbol ( + ): It is used in our regular expression to tell the compiler that the symbol that preceded + can be repeated one or more times in the pattern. For example, if the expression is ab+c, then it gives the following string- abc, abbc, abbbc, abbbbbc.. and so on.
- Wildcard Character ( . ): The. A symbol, also known as the wildcard character, is a character in our regular expression that can be replaced by another character.
- Character Class: It is a way of representing multiple characters. For example, [a – z] denotes the regular expression a | b | c | d | ….|z.
The following rules are used to define a regular expression r over some alphabet Σ and the languages denoted by these regular expressions.
- It ε is a regular expression that denotes a language L(ε). The language L(ε) has a set of strings {ε} which means that this language has a single string which is the empty String.
- If there is a symbol 'a' in Σ, then 'a' is a regular expression that denotes a language L(a). The language L(a) = {a}, i.e., the language has only one String of length, and the String holds 'a' in the first position.
- Consider the two regular expressions r and s then:
r|s denotes the language L(r) ∪ L(s).
(r) (s) denotes the language L(r) ⋅ L(s).
(r)* denotes the language (L(r))*.
(r)+ denotes the language L(r)
Frequently Asked Questions
What is lexical analysis in compiler design?
The compilation process takes place in multiple phases, and the first phase of the compiler is lexical analysis. It compiles changed source code from the language preprocessor written as sentences, removes whitespace and comments from the source code, and converts the input strings into a list of tokens.
What is the difference between lexemes and tokens?
A token is a sequence of characters treated as the fundamental unit of a programming language that is not further broken down. In contrast, lexemes are a sequence of characters matched with the help of predefined language rules for every lexeme to be specified as a valid token.
How are regular sets different from non-regular sets?
Regular and nonregular sets are two different sets of languages in compiler design. A language that can be defined using regular expression is called a regular set. In contrast, a language that cannot be determined with the help of regular grammar is called a nonregular set. The nonregular set is represented using context-free grammar.
Conclusion
Congrats, Ninja!! You've learned what the specification of tokens in compiler design means and why is it necessary. We have also discussed what tokens are and what are different types of tokens. We have also added some FAQs related to the specification of tokens in compiler design.
Recommended Reading-