Table of contents
1.
Introduction
2.
Design Documents
2.1.
Internal design
2.2.
Technical overview
2.3.
Garbage collection
2.4.
Document Indexing
3.
Frequently Asked Questions
3.1.
What is RediSearch?
3.2.
What are the components of Design Documents in RediSearch?
3.3.
What does the Redis-cli command do?
4.
Conclusion
Last Updated: Mar 27, 2024

Design Documents in RediSearch

Author GAZAL ARORA
0 upvote

Introduction

RediSearch is a source available Redis module that adds query ability, secondary indexing, and full-text search to the database. These features enable multi-field queries, aggregation, exact phrase matching, and numeric filtering for text queries.

Design Documents

Design Documents details include:

Internal design

Details about decision making and implementation

Technical overview

Technical details about RediSearch's internal indexing and querying system.

Garbage collection

Information about garbage collection

Document Indexing

This page explains how to add documents to the index.

Let’s learn about them in detail.

Internal design

RediSearch provides inverted indexes on top of Redis. Still, unlike previous Redis inverted index implementations, it uses custom data encoding, allowing for more memory and CPU-efficient searches and more advanced search functionality.

Redis Modules Strings DMA, or Direct Memory Access, is the major functionality of this module.

This feature is simple yet quite effective. It allows modules to allocate data on Redis string keys and then acquire a direct pointer to that data without copying or serializing it.

This provides speedy access to large amounts of memory, and because the string value is exposed simply as char * from the module's perspective, it can be cast to any data structure.

Technical overview

Main features

  • Full-Text indexing of multiple fields inside a document, including:
    • Stemming in many languages.
    • Exact phrase matching.
    • Chinese tokenization support.
    • Optional, negative, and union queries.
    • Prefix queries.
       
  • Numeric property indexing.
  • Distributed search on billions of documents.
  • Incremental indexing without performance loss.
  • Geographical indexing and radius filters.
     
  • A structured query language for advanced queries:
    • Optional and negative queries
    • Unions and intersections
    • Tag filtering
    • Prefix matching
       
  • A powerful fuzzy matching auto-complete engine.
  • Multiple scoring models and value sorting
  • Document insertion and changes with little delay.
  • Long-running queries are possible with concurrent searches since Redis is not blocked.
  • Custom scoring models and query extensions are possible thanks to an extension mechanism.
  • Existing Hash objects can be indexed in Redis databases.

Garbage collection

A single term inverted index comprises an array of "blocks," each containing an encoded list of records - document id delta plus other information based on the index encoding scheme. Garbage originates when some of these entries refer to documents that have been erased.

The Need For GC

  • Deleted documents are not truly deleted. It marks the document as removed in the global document table to save time.
  • This means that a document no longer has an internal numeric id. We check for deletion as we traverse the index.
  • As a result, all inverted index entries for this document id are useless.
  • We don't want to delete them explicitly when removing a document because that will take a lengthy time, depending on the length of the document.
  • Furthermore, upgrading a document entails deleting it and then re-adding it with a new incremental internal id. We don't do any diffing and only append to the indexes. Thus the result is the same.

All of this means that if we have a lot of updates and deletes, a lot of our inverted index will become garbage, slowing things down and wasting memory.

As a result, we wish to improve the index. However, we do not want to disturb normal function. This indicates that optimization or garbage collection should be a non-intrusive background process. It only needs to be faster than the deletion rate for a long enough time to avoid producing more garbage than we can collect.

Document Indexing

The indexing process begins with creating a new RSAddDocumentCtx and the addition of a document to it. This is broken into several steps.

1. Submission

From input, a DocumentContext is created and associated with a document (as received). Preliminary caching will be performed during the submission process.

2. Preprocessing

A document is preprocessed once it has been submitted. All document input fields are processed statelessly during preprocessing. This includes tokenizing the document and constructing a forward index for text fields. Within the AddDocumentCtx, the preprocessors will save this information in per-field variables.  The computed result is written to the (permanent) index during the indexing step.

3. Indexing

The pre-computed results of the preprocessing phase are written down during indexing. It is performed in a single thread and takes the form of a queue.

4. Skipping already indexed documents

A document is marked as complete after it has been fully indexed. When the thread iterates over the queue, it will only process/index items that have not yet been marked as done.

5. Term Merging

When there are many documents in the queue, term merging, also known as forward index merging, is performed. Each document in the queue has its forward index scanned, and a larger,' master' forward index is created in its stead. Each forward index item includes a reference to the original document and the standard offset/score/frequency data.

6. Document ID assignment

The GIL is now locked, and each document in the queue has been allocated a document ID. The assignment is performed right before the index is written to decrease the number of times the GIL is locked. As a result, the GIL is only locked once before the index is written.

7. Writing to Indexes

Any unprocessed index data is written to the indexes when the GIL is locked. This often involves opening one or more Redis keys and writing/copying calculated data into them. The response for the specified document is then sent, and the AddDocumentCtx is freed.

Frequently Asked Questions

What is RediSearch?

RediSearch is a source available Redis module that adds query ability, secondary indexing, and full-text search to the database. 

What are the components of Design Documents in RediSearch?

Design Documents details include:

  • Internal design
  • Technical overview
  • Garbage collection
  • Document Indexing 

What does the Redis-cli command do?

Redis-cli is a terminal tool that allows you to send instructions to the Redis server and get responses.

Conclusion

In this article, we learned about Design Documents in RediSearch. We learned that Design Documents details include:

  • Internal designDetails about decision making and implementation
  • Technical overviewTechnical details about RediSearch's internal indexing and querying system.
  • Garbage collection - Information about garbage collection
  • Document Indexing - This page explains how to add documents to the index.
     

We then learned about them in detail.

Click here to know about RedisGraph or RedisCLI.

You can use Coding Ninjas Studio to practice various DSA questions asked in different interviews. It will assist you in mastering effective coding techniques, and you will also get interview experiences with people working in big companies.

Happy Learning!!

Live masterclass