Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Last Updated: Mar 27, 2024

MySQL ngram Full-Text Parser

Leveraging ChatGPT - GenAI as a Microsoft Data Expert
Speaker
Prerita Agarwal
Data Specialist @
23 Jul, 2024 @ 01:30 PM

Introduction

The built-in MySQL full-text parser uses white space to determine the beginning and end of words. The full-text parser has a limitation in ideographic languages such as Chinese, Japanese, and Korean because these ideographic languages do not use word delimiters.

MySQL provided the ngram full-text parser to address this issue. Since MySQL version 5.7.6, the ngram full-text parser has been included as a built-in server plugin, which means that MySQL automatically loads this plugin when the MySQL database server starts. MySQL's InnoDB and MyISAM storage engines both support the ngram full-text parser.

An ngram is defined as a contiguous sequence of several characters from a text sequence. The primary function of the ngram full-text parser is to tokenize a text sequence into a contiguous sequence of n characters.

The following example show how the ngram full-text parser tokenizes a sequence of text for different values of n:

n = 1: 'm','y','s','q','l'

n = 2: 'my', 'ys', 'sq','ql' 

n = 3: 'mys', 'ysq', 'sql'

n = 4: 'mysq', 'ysql'

n = 5: 'mysql'

Configuring ngram Token Size

By default, the token size for ngram is 2. To change the token size, use the ngram_token_sizeconfiguration option. The range of values ​​is 1to 10.

Note that smaller token sizes allow faster searches with smaller full-text search indexes.

Because ngram_token_sizeis a read-only variable, you can only set its value using two options:

The first way, in the startup string:

mysqld --ngram_token_size=1

Second way - in config file:

[mysqld]

ngram_token_size=1

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

How to create FULLTEXT Index using the ngram Parser

When using CREATE TABLE, ALTER TABLE, or CREATE INDEX, specify "WITH PARSER ngram” to create a FULLTEXT index that uses the ngram parser.

For example, the below statement creates a new posts table and adds title and body columns to the index using ngram full-text parser. 

USE testdb;

CREATE TABLE posts (

     id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,

    title VARCHAR(255),

    body TEXT,

    FULLTEXT ( title, body ) WITH PARSER NGRAM

)  ENGINE=INNODB CHARACTER SET UTF8mb4;

The following INSERT statement posts insert a new row into the table:

SET NAMES utf8;

INSERT INTO posts(title, body)

VALUES('MySQL全文搜索', 'MySQL提供了具有许多好的功能的内置全文搜索'),

      ('MySQL教程', '学习MySQL快速,简单和有趣');

Note that the SET NAMES statement sets the character set that the client and server will use to send and receive data; in this example, it uses utf8.

To see ngram how the text is marked up, use the following statement:

SET GLOBAL innodb_ft_aux_table="testdb/posts";

SELECT 

    *

FROM

    information_schema.innodb_ft_index_cache

ORDER BY doc_id , position;

ngram Parser Space Handling

The ngram parser eliminates spaces when parsing. For example:

“ab cd” is parsed to “ab”, “cd”

“a bc” is parsed to “bc”

ngram Parser Phrase Search

Searches for phrases are converted to ngram phrase searches. For instance, the search phrase "abc" is converted to "ab bc," which returns documents that contain both "abc" and "ab bc."

The search phrase "abc def" is converted to "ab bc de ef," which returns documents that include both "abc def" and "ab bc de ef." A document containing the string "abcdef" is not returned.

The following example shows posts searching for phrases in a table: 搜索:

SELECT 

    id, title, body

FROM

    posts

WHERE

MATCH (title , body) AGAINST ('搜索' );

Execute the above query statement and get the following results −

+----+-----------------------+------------------------------------------------------------------+

| id   | title                      | body                                                                       |

+----+-----------------------+------------------------------------------------------------------+

|  1  | MySQL全文搜索  | MySQL提供了具有许多好的功能的内置全文搜索  |

+----+-----------------------+------------------------------------------------------------------+

1 row in set

Processing search results with ngram

Natural Language Patterns

In natural language pattern search, search terms are transformed into unions of ngram values. Assuming the token size is 2or binary, the search terms "mysql" are converted to my ys sq and ql.

SELECT 

    *

FROM

    posts

WHERE

    MATCH (title, body) AGAINST ('简单和有趣' IN natural language MODE);

Execute the above query statement and get the following results −

+----+----------------+-----------------------------------------+

| id   | title             | body                                          |

+----+----------------+-----------------------------------------+

|  2  | MySQL教程 | 学习MySQL快速,简单和有趣 |

+----+----------------+-----------------------------------------+

1 row in set

Boolean Mode

In a BOOLEAN MODEsearch, search terms are converted into ngram phrase searches.

For Example:

SELECT 

    *

FROM

    posts

WHERE

    MATCH (title, body) AGAINST ('简单和有趣' IN BOOLEAN MODE);

Execute the above query statement and get the following results −

+----+----------------+-----------------------------------------+

| id   | title             | body                                          |

+----+----------------+-----------------------------------------+

|  2  | MySQL教程 | 学习MySQL快速,简单和有趣 |

+----+----------------+-----------------------------------------+

1 row in set

Wildcard Search

The ngram FULLTEXTindex only contains ngram, so it doesn't know the start of the phrase. When performing a wildcard search, unexpected results may be returned.

The following rules will apply to FULLTEXTwildcard searches using the ngram search index:

If the wildcard prefix phrase is shorter than the ngram token size, the query returns all documents containing a token beginning with the prefix term ngram. For example:

SELECT 

    id,title,body

FROM

    posts

WHERE

    MATCH (title , body) AGAINST ('my*' );

Execute the above query statement and get the following results −

+----+----------------------+-----------------------------------------------------------------+

| id   | title                     | body                                                                      |

+----+----------------------+-----------------------------------------------------------------+

|  1  | MySQL全文搜索 | MySQL提供了具有许多好的功能的内置全文搜索 |

|  2  | MySQL教程        | 学习MySQL快速,简单和有趣                              |

+----+----------------------+-----------------------------------------------------------------+

2 rows in set

If the prefix phrase in the wildcard is longer than the ngram token size, MySQL converts the prefix term to an ngram phrase and ignores the wildcard operator.

FAQs

  1. What is an ngram?
    In the full-text index, an n-gram is a sequence of n consecutive words in a text.
     
  2. What is the default token size for an ngram, and how to change it?
    The default token size for an ngram is 2. To change the token size, use the ngram_token_size configuration option. The range of values ​​is 1to 10.
     
  3. How does ngram handles space?
    The ngram parser eliminates spaces when parsing. For example:
    “ab cd” is parsed to “ab”, “cd”
    “a bc” is parsed to “bc”

Key Takeaways

In this blog, we have learned about MySQL ngram Full-Text Parser and have seen how to create a full-text ngram Parser, configure token size in it, handle space, phrase search, and process the results in natural mode, boolean mode, and wildcard search.

Recommended Readings:

Also, try Coding Ninjas Studio to practice programming problems for your complete interview preparation. Don't stop here, Ninja; check out the Top 100 SQL Problems to get hands-on experience with frequently asked interview questions and land your dream job.

Topics covered
1.
Introduction
2.
Configuring ngram Token Size
3.
How to create FULLTEXT Index using the ngram Parser
4.
ngram Parser Space Handling
5.
ngram Parser Phrase Search
6.
Processing search results with ngram
6.1.
Natural Language Patterns
6.2.
Boolean Mode
6.3.
Wildcard Search
7.
FAQs
8.
Key Takeaways