Zero-downtime re-indexing with Elasticsearch
Information retrieval (IR) is the area of computer science that focuses on finding information in a document or a collection of documents efficiently. The basic principle behind IR is to build an “inverted index” and provide the means to read from or write to it. A simple analogy for an inverted index is the index at the back of a book: it quickly tells you the page numbers on which any word occurs. Applying this concept to IR, think of pages as documents and of search terms as the individual words you are looking for in those documents. Instead of scanning the entire book page after page for a word, you consult the index to find which page(s) particular terms occur on, which is much faster. For example, consider three documents with one line each; the inverted index for them might look something like the figure below.
Simplified view of an Inverted index
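To make the idea concrete, here is a minimal sketch in Python of how such an index could be built; the three one-line documents are made up purely for illustration.

```python
# Three hypothetical one-line documents.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog sleeps",
    3: "the quick dog barks",
}

# Build the inverted index: each term maps to the set of document ids containing it.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Looking up a term is now a dictionary access instead of a scan of every document.
print(inverted_index["quick"])  # {1, 3}
print(inverted_index["dog"])    # {2, 3}
```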
As you add more documents, this index is automatically updated, allowing us to quickly look up documents. This is how IR works at a very high level. Elasticsearch and Solr are top-notch open-source IR systems that are commonly used in the industry. Both of these projects are built on top of Apache Lucene, which is considered the gold standard in indexing and search libraries.
Historically, information retrieval has revolved around the idea of a static set of input documents that are indexed into an information retrieval system periodically. When a new set of information became available, the delta of documents or, in some cases, the entire corpus (a fancy term for a collection of documents) would be re-indexed.
There are scenarios where all documents in the index need to be updated. This can be due to a data format change (e.g. a new field being added), a data fix, or a new requirement such as “I want to search against the title, description, and abstract at the same time.”
Re-indexing typically means creating a new version of the inverted index from an updated set of data. If you are considering re-indexing without taking your cluster offline, read on to learn how we did this at Intuit.
A bit of Background
The Intuit Persistence Service (IPS) is one of Intuit’s foundational data services and is used by multiple flagship products offered by Intuit. IPS primarily uses a NoSQL data store and is a system tuned for long-term persistence and access via well-defined APIs. For systems that treat persistence as the primary responsibility, the data access patterns are limited to predefined CRUD operations. Specifically, read operations have to be performed using an Id or some other indexed property.
A frequent requirement from IPS customers is the ability to do a full-text search on the data they store in IPS. For efficient free-text searches, we need to know beforehand which fields we want to search on and maintain secondary indexes on those fields. Databases also limit the number of indexes we can have, and too many indexes can degrade performance during insert/update operations.
These constraints led us to consider augmenting IPS with full-text search. The Search Platform was born out of this simple idea: IPS customers must be able to run full-text searches on the data they store in IPS. Today, we offer search as a standalone service across Intuit for both transactional and analytical use cases.
Search with streaming updates
Early IR systems followed a model of indexing once and reading many times, with a full re-index happening at some point in the future.
However, as data streams became widely adopted, this model became less attractive. To address this, Lucene began supporting near-real-time visibility of document updates starting with version 2.9. This enabled IR systems built on top of Lucene to operate with near-real-time capabilities.
Re-Indexing
While streaming individual updates from the source is great for providing near-real-time search, there are scenarios where the entire document collection needs to be updated by republishing all the data. This is done via a full re-index. Re-indexing needs careful consideration when the system also supports streaming updates: we are performing a full rewrite of all documents while keeping the system operational. A useful analogy for the complexity involved is replacing the tires of a car while it is in motion.
The re-indexing problem gets a bit more complex because the IR system is expected to keep serving queries and taking updates and deletes while the index is being rebuilt. The official Elasticsearch documentation on rebuilding an index is primarily built around the premise of a static index whose data you copy over to a new index while applying some transformations. The documentation is a bit ambiguous about updates that happen to documents that were already copied over. In a nutshell, what the Elasticsearch documentation says is the following (sketched against the REST API right after the list):
1. Add an alias that points to your old index.
2. Have your application use the alias instead of the physical index name.
3. Set up the new index with the updated mapping.
4. Reindex your data from the old index into the new index.
5. Switch the alias to point from the old index to the new index.
6. Drop your old index.
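For readers who want to see what these steps look like in practice, below is a rough sketch against the Elasticsearch REST API using Python’s `requests` library. The index names (`user-v1`, `user-v2`), the alias name (`user`), and the mapping are placeholders; adapt them to your setup.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# 1 & 2. Point an alias at the old index and have the application query the alias,
#        never the physical index name.
requests.post(f"{ES}/_aliases", json={
    "actions": [{"add": {"index": "user-v1", "alias": "user"}}]
})

# 3. Set up the new index with the updated mapping (placeholder mapping shown).
requests.put(f"{ES}/user-v2", json={
    "mappings": {"properties": {"title": {"type": "text"}}}
})

# 4. Copy the data from the old index into the new one.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "user-v1"},
    "dest": {"index": "user-v2"}
})

# 5. Atomically switch the alias from the old index to the new one.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"remove": {"index": "user-v1", "alias": "user"}},
        {"add": {"index": "user-v2", "alias": "user"}},
    ]
})

# 6. Drop the old index once you are confident in the new one.
requests.delete(f"{ES}/user-v1")
```

Because the alias switch in step 5 is a single atomic action, clients querying `user` never see an empty or half-built index.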
This works perfectly in the scenario where the index is static. But what about use cases where the updates are streamed and the lag should be minimal?
When designing a search system that has to operate with high availability and reliability, we laid down certain tenets:
* Clients’ search capability should never be impacted.
* Every event must be a fully hydrated event (the complete document, not just the changed fields).
* Event streams can be unbounded and events may arrive out of order.
* Make the system idempotent; use external versioning.
* Carry the events in a stream; use an event-driven system.
* Use aliases to atomically switch between versions or indices.
* Do not permit backward-incompatible changes to the event format.
* Control the indexing and read mechanisms independently.
In practical terms, these principles translate into the following design:
1. Use a streaming mechanism like Apache Kafka to stream and buffer the events to the indexer. That way, the source and the sink are decoupled.
2. Use well-defined events that carry the following three key pieces of information (a sketch of such an event follows this list):
* A unique identifier that remains constant for the life of the document.
* The document creation timestamp (GMT, millisecond precision), used to route the document to the right index under a time-window-based index lifecycle management strategy.
* The GMT timestamp (millisecond precision) at which the document was last modified at the source.
3. Have an indexer that can read from a streaming data source and apply transformations to the data on the fly. (Transformations come in handy for data fixes while re-indexing.)
4. Only allow backward-compatible changes to the schema. For example, changing a numeric field to a boolean or an integer to an object would not be permitted; if such a change is needed, leave the old field as is and introduce a new field.
5. Updates are always full documents; no delta-only updates.
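To make this concrete, here is a sketch of what such an event and the corresponding index call could look like. The field names, index naming scheme, and payload are illustrative assumptions; the important part is that the creation timestamp picks the index and the source modification timestamp is used as the external version.

```python
from datetime import datetime, timezone
import requests

ES = "http://localhost:9200"  # assumed local cluster

# A hypothetical fully hydrated event carrying the three key pieces of information.
event = {
    "id": "doc-42",                # unique identifier, constant for the document's life
    "created_at": 1672531200000,   # GMT creation timestamp (ms): routes to the right index
    "modified_at": 1672617600000,  # GMT modification timestamp (ms) at the source
    "payload": {"title": "Q4 invoice", "status": "PAID"},  # always the full document
}

# Time-window-based routing: pick a monthly index based on the creation timestamp.
window = datetime.fromtimestamp(event["created_at"] / 1000, tz=timezone.utc).strftime("%Y-%m")
index = f"user-v1-{window}"

# External versioning makes the write idempotent: Elasticsearch accepts the document
# only if the supplied version is higher than the one it already holds, so replays and
# out-of-order deliveries of older copies are rejected instead of overwriting newer data.
resp = requests.put(
    f"{ES}/{index}/_doc/{event['id']}",
    params={"version": event["modified_at"], "version_type": "external"},
    json=event["payload"],
)
print(resp.status_code)  # 200/201 on success, 409 if a newer version is already indexed
```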
There are two approaches we take to perform re-indexing on streaming data; you can choose between them based on your latency and staleness tolerance.
a) Re-indexing with blocked-off updates.
b) Re-indexing with streaming updates.
Let’s examine them case by case.
Re-indexing with blocked-off updates
In this scenario, we create a new index or cluster that will replace the current one. Here, re-indexing is performed with updates blocked off till the new cluster is ready. The process can be broken into three stages as described below.
Disable all updates to your cluster by shutting down your indexer. Start a reindex from your existing index (blue) into the new one that is being built (green). Queries are still served from the old cluster, but updates are not visible. No data is lost because the updates are still queued in the event buffer.
When the re-indexing is complete, flip reads and writes to the new cluster. Turn your indexer back on to start applying the buffered updates to the new cluster and wait for it to catch up.
Eventually, when the buffered updates have been processed and you are handling events without lag, run your smoke tests and decommission or archive your old cluster. Queries are now served from the new cluster.
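A minimal sketch of this orchestration, assuming the Kafka-buffered setup described above and an indexer that writes through the alias; `pause_indexer` and `resume_indexer` are stand-ins for however you stop and restart your own consumer.

```python
import time
import requests

ES = "http://localhost:9200"                    # assumed local cluster
OLD, NEW, ALIAS = "user-v1", "user-v2", "user"  # placeholder names

def pause_indexer():
    """Stand-in: stop your consumer; events keep queuing safely in the event buffer."""

def resume_indexer():
    """Stand-in: restart the consumer so it drains the buffered events through the alias."""

# Stage 1: block updates and kick off the copy from the old index into the new one.
pause_indexer()
task_id = requests.post(
    f"{ES}/_reindex",
    params={"wait_for_completion": "false"},
    json={"source": {"index": OLD}, "dest": {"index": NEW}},
).json()["task"]

# Wait for the reindex task to finish (queries are still served by the old index).
while not requests.get(f"{ES}/_tasks/{task_id}").json()["completed"]:
    time.sleep(10)

# Stage 2: flip reads and writes to the new index, then drain the buffered updates.
requests.post(f"{ES}/_aliases", json={"actions": [
    {"remove": {"index": OLD, "alias": ALIAS}},
    {"add": {"index": NEW, "alias": ALIAS}},
]})
resume_indexer()

# Stage 3: once the indexer has caught up, smoke-test and retire the old index.
```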
Pros
Simple to implement
Useful in scenarios where you can afford a lag in seeing the latest data.
Cons
There can be a considerable lag in seeing the latest data.
Re-indexing with streaming updates
Now, let’s consider the case where we have to keep applying the mutations (creates, updates, and deletes) to the cluster in real time while the re-index is in progress. There are two variants of this scenario:
1) Mapping remains the same, but a data fix has to be applied.
2) There is a mapping change (backward compatible) that requires a re-index.
To keep things simple, let’s consider a cluster with a single index `user-v1`. Following the same model as above, we create a new index `user-v2` and retain the old index `user-v1`. The data will be moved from `user-v1` to `user-v2`. If we were to do the re-indexing in this setup, the following questions pop up immediately.
1. Will the creates go to the index user-v1 or user-v2?
2. Will the updates and deletes go to the index user-v1 or user-v2?
3. How do we make sure that the queries present a consistent view while the re-indexing is happening?
4. A snapshot is literally a view of the Elasticsearch index at a point in time. What happens to updates/deletes made to documents while the re-indexing is running?
5. If we let the updates and deletes go to the new index, what is the guarantee that the documents exist in the new index? Even if we use upserts, we could lose deletes.
6. If we do a dual write to both indices, and treat all creates and updates as upserts, how do we guarantee that the client does not see duplicate results?
To address these, we create a new index and a new buffer (Kafka topic). We start copying all data coming into the indexer into this new buffer, starting from a point slightly before the reindex process is kicked off, so that nothing is missed. The reindexing process is then started to move data from your existing index (blue) into the new one (green) that is being built. Queries are still served from the old cluster, updates are visible, and deletes are honored.
When the re-indexing is complete, the new index is built, but there is still a delta of updates to be applied: updates or deletes may have been applied to data that the reindex had already scrolled past. To reconcile the state between the old and new indices, we replay the buffer, with the necessary transformations, onto the new index and wait for it to catch up. Queries are still served from the old cluster; updates are visible and deletes are honored, but writes are now happening to both clusters.
Once the delta is processed and you are handling events without lag on the new index (the old cluster and the new cluster should now be receiving the same operations), you are ready to cut off writes to your old index. Run your smoke tests, flip reads and writes to the new cluster (aliases come in handy here), and decommission or archive your old cluster. Queries are now served from the new cluster, and updates are applied to the new cluster.
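Below is a rough sketch of the reconciliation step, assuming the new buffer is a Kafka topic read with the `kafka-python` client and that events look like the sample shown earlier, plus a hypothetical `deleted` flag for delete events. Because every write uses external versioning, any buffered event that is older than what the reindex already copied is simply rejected, which is what makes replaying from slightly before the reindex started safe.

```python
import json
import requests
from kafka import KafkaConsumer  # kafka-python

ES = "http://localhost:9200"     # assumed local cluster
NEW_INDEX = "user-v2"            # the index being built

# Replay the buffer that was started slightly before the reindex kicked off.
consumer = KafkaConsumer(
    "user-reindex-buffer",                       # hypothetical buffer topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    event = message.value
    if event.get("deleted"):
        # Deletes are honored on the new index as well.
        requests.delete(f"{ES}/{NEW_INDEX}/_doc/{event['id']}")
    else:
        # External versioning: stale copies already superseded in the new index get a
        # 409 conflict and are skipped, while genuinely newer updates win.
        requests.put(
            f"{ES}/{NEW_INDEX}/_doc/{event['id']}",
            params={"version": event["modified_at"], "version_type": "external"},
            json=event["payload"],
        )
```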
Re-index to apply a mapping change
Now let’s look at the more complicated scenario involving a schema change, i.e. the source wants to send data in a new format. Surprisingly, the steps used for re-indexing with a data fix work well for this use case too, with a couple of modifications. The essential point is that the source should only start emitting data in the new format once the old index is ready to take writes using the new mapping. The following steps outline what needs to be done.
1) Apply the new mapping to the old index (a sketch follows this list). This allows the old index to take writes in the new format.
2) Allow the source to start emitting data in the new format.
3) Follow the steps discussed earlier under “Re-indexing with streaming updates”.
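Step 1 is just a mapping update on the old index. A minimal sketch on a recent Elasticsearch version, with a placeholder field name:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Add the new, backward-compatible field to the old index so it can already accept
# writes in the new format while the re-index is being prepared.
resp = requests.put(
    f"{ES}/user-v1/_mapping",
    json={"properties": {"abstract": {"type": "text"}}},
)
print(resp.json())  # {'acknowledged': True} on success
```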
Pros
Negligible lag
Consistent view of data at all stages.
No duplicates on reads.
Cons
More complex to implement.
Conclusion
At Intuit, we treat customer delight as a cornerstone of our products. To ensure this, we start with the fundamentals by building reliable, highly available services. We use the patterns described above to provide zero-downtime search services to our client business units. To date, we have rebuilt indices for several of our customers on production systems with zero downtime. We hope these patterns are of use to your organization too.