Table of contents
1. Introduction
2. How are Chinese Documents Supported in RediSearch?
3. Using Chinese in RediSearch
4. Frequently asked questions
4.1. What is RediSearch?
4.2. Which package is used by RediSearch to support Chinese documents?
4.3. How are Chinese documents different from other language documents?
5. Conclusion
Last Updated: Mar 27, 2024

Chinese Support In RediSearch

Author Rajat Agrawal

Introduction

RediSearch is a Redis module that adds queryability, secondary indexing, and full-text search to the database.

These features enable multi-field queries, aggregation, exact phrase matching, and numeric filtering for text queries.

Chinese documents can be added to RediSearch starting from version 0.99.0.

Let’s learn how Chinese documents are supported in RediSearch.

How are Chinese Documents Supported in RediSearch?

Chinese support enables Chinese documents to be added and tokenized using segmentation rather than conventional whitespace and punctuation tokenization.

Because of the way tokens are extracted, indexing a Chinese document differs from indexing a document in most other languages. Most languages separate tokens with whitespace and separator characters, but Chinese text generally does not. Instead, the tokenizer scans the input text and checks every character or sequence of characters against a dictionary of predefined terms, choosing the most likely match based on the surrounding terms and characters.

For this, RediSearch makes use of the Friso Chinese tokenization library. This is largely transparent to the user, and often no additional configuration is required.
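To make the dictionary-lookup idea concrete, here is a minimal sketch of greedy longest-match segmentation. It is a deliberately simplified illustration, not Friso's actual algorithm: the tiny `DICTIONARY` and the `segment` function are hypothetical names made up for this example, and a real tokenizer also weighs the surrounding context when several matches are possible.

```python
# Toy dictionary of known terms (hypothetical; real dictionaries are far larger)
DICTIONARY = {"数据", "冗余", "同步", "服务器", "主服务器", "从服务器"}
MAX_TERM_LEN = max(len(term) for term in DICTIONARY)


def segment(text, dictionary=DICTIONARY, max_len=MAX_TERM_LEN):
    """Split text into tokens using greedy forward longest-match lookup."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Whitespace splitting sees one long "word"; dictionary lookup finds the terms.
print("主服务器同步数据".split())
print(segment("主服务器同步数据"))
```

Note how plain whitespace splitting returns the whole sentence as a single token, which is why a search for 数据 would never match without segmentation.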

Using Chinese in RediSearch

Redis commands:

FT.CREATE idx SCHEMA txt TEXT
FT.ADD idx docCn 1.0 LANGUAGE chinese FIELDS txt "Redis支持主从同步。数据可以从主服务器向任意数量的从服务器上同步,从服务器可以是关联其他从服务器的主服务器。这使得Redis可执行单层树复制。从盘可以有意无意的对数据进行写操作。由于完全实现了发布/订阅机制,使得从数据库在任何地方同步树时,可订阅一个频道并接收主服务器完整的消息发布记录。同步对读取操作的可扩展性和数据冗余很有帮助。[8]"
FT.SEARCH idx "数据" LANGUAGE chinese HIGHLIGHT SUMMARIZE

# Outputs:
# <b>数据</b>?... <b>数据</b>进行写操作。由于完全实现了发布... <b>数据</b>冗余很有帮助。[8...
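If you would rather send these raw commands from a script, the sketch below builds the same three argument lists in Python. The helper names `chinese_commands` and `run` are hypothetical, and `conn` is assumed to be any object exposing an `execute_command(*args)` method, such as a redis-py `redis.Redis` connection to a server with RediSearch loaded.

```python
def chinese_commands(index, doc_id, text, query):
    """Build the argument lists for the three commands shown above."""
    create = ["FT.CREATE", index, "SCHEMA", "txt", "TEXT"]
    add = ["FT.ADD", index, doc_id, "1.0",
           "LANGUAGE", "chinese", "FIELDS", "txt", text]
    search = ["FT.SEARCH", index, query,
              "LANGUAGE", "chinese", "HIGHLIGHT", "SUMMARIZE"]
    return create, add, search


def run(conn, commands):
    """Send each command through an object exposing execute_command(*args)."""
    return [conn.execute_command(*args) for args in commands]


create, add, search = chinese_commands("idx", "docCn", "数据冗余很有帮助。", "数据")
# Show the search command as it would be typed in redis-cli
print(" ".join(search))
```

Passing each value as a separate argument avoids any quoting problems with the Chinese text, since nothing has to be escaped into a single command string.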

Using Python Client:

# -*- coding: utf-8 -*-

from redisearch.client import Client, Query
from redisearch import TextField

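# Bind the client to the 'idx' index (connects to localhost:6379 by default)
client = Client('idx')
try:
    # Start from a clean slate if the index already exists
    client.drop_index()
except Exception:
    # The index did not exist yet, so there is nothing to drop
    pass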

client.create_index([TextField('txt')])

# Add a document
client.add_document('docCn1',
                    txt='Redis支持主从同步。数据可以从主服务器向任意数量的从服务器上同步从服务器可以是关联其他从服务器的主服务器。这使得Redis可执行单层树复制。从盘可以有意无意的对数据进行写操作。由于完全实现了发布/订阅机制,使得从数据库在任何地方同步树时,可订阅一个频道并接收主服务器完整的消息发布记录。同步对读取操作的可扩展性和数据冗余很有帮助。[8]',
                    language='chinese')
# Search with the Chinese tokenizer and print the highlighted summary (Python 3 print)
print(client.search(Query('数据').summarize().highlight().language('chinese')).docs[0].txt)


Output:

<b>数据</b>?... <b>数据</b>进行写操作。由于完全实现了发布... <b>数据</b>冗余很有帮助。[8... 
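The returned string can be post-processed like any other text. As a small sketch, the snippet below pulls the highlighted terms and the individual fragments out of the output shown above, assuming the default `<b>`/`</b>` highlight tags and the default `... ` fragment separator; the variable names are illustrative only.

```python
import re

# The summarized, highlighted output exactly as printed above
output = "<b>数据</b>?... <b>数据</b>进行写操作。由于完全实现了发布... <b>数据</b>冗余很有帮助。[8... "

# Terms wrapped in the highlight tags
highlighted = re.findall(r"<b>(.*?)</b>", output)

# Individual summary fragments, dropping the empty tail after the last separator
fragments = [part for part in output.split("... ") if part]

print(len(highlighted), "highlighted matches")
for fragment in fragments:
    print(fragment)
```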

Frequently asked questions

What is RediSearch?

RediSearch is a Redis module that adds a query language, secondary indexing, and full-text search to Redis.

Which package is used by RediSearch to support Chinese documents?

RediSearch uses the Friso Chinese tokenization library. This is largely transparent to the user, and often no additional configuration is required.

How are Chinese documents different from other language documents?

Chinese support enables Chinese documents to be added and tokenized using segmentation rather than conventional whitespace and punctuation tokenization.

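Because Chinese text does not separate words with whitespace, tokens are extracted by matching character sequences against a dictionary of predefined terms, so indexing a Chinese document differs from indexing a document in most other languages.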

Conclusion

In this article, we discussed how Chinese documents are supported in RediSearch. If you want to learn more, check out our blogs on topics like Databases, MongoDB, Operational Databases, and Non-Relational Databases.


Happy Coding!
