Introduction
RediSearch is a source available Redis module that adds query ability, secondary indexing, and full-text search to the database. These features enable multi-field queries, aggregation, exact phrase matching, and numeric filtering for text queries.
Design Documents
Design Documents details include:
Details about decision making and implementation
Technical details about RediSearch's internal indexing and querying system.
Information about garbage collection
This page explains how to add documents to the index.
Let’s learn about them in detail.
Internal design
RediSearch provides inverted indexes on top of Redis. Still, unlike previous Redis inverted index implementations, it uses custom data encoding, allowing for more memory and CPU-efficient searches and more advanced search functionality.
Redis Modules Strings DMA, or Direct Memory Access, is the major functionality of this module.
This feature is simple yet quite effective. It allows modules to allocate data on Redis string keys and then acquire a direct pointer to that data without copying or serializing it.
This provides speedy access to large amounts of memory, and because the string value is exposed simply as char * from the module's perspective, it can be cast to any data structure.
Technical overview
Main features
-
Full-Text indexing of multiple fields inside a document, including:
- Stemming in many languages.
- Exact phrase matching.
- Chinese tokenization support.
- Optional, negative, and union queries.
-
Prefix queries.
- Numeric property indexing.
- Distributed search on billions of documents.
- Incremental indexing without performance loss.
-
Geographical indexing and radius filters.
-
A structured query language for advanced queries:
- Optional and negative queries
- Unions and intersections
- Tag filtering
-
Prefix matching
- A powerful fuzzy matching auto-complete engine.
- Multiple scoring models and value sorting
- Document insertion and changes with little delay.
- Long-running queries are possible with concurrent searches since Redis is not blocked.
- Custom scoring models and query extensions are possible thanks to an extension mechanism.
- Existing Hash objects can be indexed in Redis databases.
Garbage collection
A single term inverted index comprises an array of "blocks," each containing an encoded list of records - document id delta plus other information based on the index encoding scheme. Garbage originates when some of these entries refer to documents that have been erased.
The Need For GC
- Deleted documents are not truly deleted. It marks the document as removed in the global document table to save time.
- This means that a document no longer has an internal numeric id. We check for deletion as we traverse the index.
- As a result, all inverted index entries for this document id are useless.
- We don't want to delete them explicitly when removing a document because that will take a lengthy time, depending on the length of the document.
- Furthermore, upgrading a document entails deleting it and then re-adding it with a new incremental internal id. We don't do any diffing and only append to the indexes. Thus the result is the same.
All of this means that if we have a lot of updates and deletes, a lot of our inverted index will become garbage, slowing things down and wasting memory.
As a result, we wish to improve the index. However, we do not want to disturb normal function. This indicates that optimization or garbage collection should be a non-intrusive background process. It only needs to be faster than the deletion rate for a long enough time to avoid producing more garbage than we can collect.
Document Indexing
The indexing process begins with creating a new RSAddDocumentCtx and the addition of a document to it. This is broken into several steps.
1. Submission
From input, a DocumentContext is created and associated with a document (as received). Preliminary caching will be performed during the submission process.
2. Preprocessing
A document is preprocessed once it has been submitted. All document input fields are processed statelessly during preprocessing. This includes tokenizing the document and constructing a forward index for text fields. Within the AddDocumentCtx, the preprocessors will save this information in per-field variables. The computed result is written to the (permanent) index during the indexing step.
3. Indexing
The pre-computed results of the preprocessing phase are written down during indexing. It is performed in a single thread and takes the form of a queue.
4. Skipping already indexed documents
A document is marked as complete after it has been fully indexed. When the thread iterates over the queue, it will only process/index items that have not yet been marked as done.
5. Term Merging
When there are many documents in the queue, term merging, also known as forward index merging, is performed. Each document in the queue has its forward index scanned, and a larger,' master' forward index is created in its stead. Each forward index item includes a reference to the original document and the standard offset/score/frequency data.
6. Document ID assignment
The GIL is now locked, and each document in the queue has been allocated a document ID. The assignment is performed right before the index is written to decrease the number of times the GIL is locked. As a result, the GIL is only locked once before the index is written.
7. Writing to Indexes
Any unprocessed index data is written to the indexes when the GIL is locked. This often involves opening one or more Redis keys and writing/copying calculated data into them. The response for the specified document is then sent, and the AddDocumentCtx is freed.






