What is the problem?
Let's assume you have a mid- to large-size SaaS platform with multiple tech teams and stakeholders. Different teams need to analyze production data independently. However, for security and stability reasons, the tech team doesn't want to give all of these stakeholders direct access to the production databases.
A simple one-way solution
Create standalone databases outside of your production database servers, with the same names as production, and sync the production data of specific tables or collections to those standalone databases.
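To make the idea concrete, such a one-way sync can be described declaratively: a source, a target, and the collections to mirror. The snippet below is a hypothetical sketch for illustration only, not any tool's actual configuration format:

```yaml
# Hypothetical one-way sync configuration (illustrative only).
source:
  uri: mongodb://prod-cluster:27017          # production cluster (read-only access)
  database: app
target:
  uri: mongodb://analytics-standalone:27017  # standalone instance for stakeholders
  database: app                              # same database name as production
collections:
  - transactions
  - carts
```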
Target audience
- Data science and ML teams, to train models and run experiments
- Operations, CS, and sales teams, to analyze the data by themselves
SkipCart platform
SkipCart is a smart shopping cart platform serving 220+ retail stores around Japan. Our platform serves 4.0M+ monthly transactions and is growing every month. The platform is big, and its data I/O, storage, and processing requirements are heavy. To support this, we built SDP (the SkipCart Data Platform). In SDP, we have a MongoDB sharded cluster alongside multiple other systems. The high-level architecture looks like this:
SDP contains various data types, from SkipCart hardware activity to customer activity, so multiple stakeholders are interested in these important data points. However, the tech team realized that allowing different people into the production databases is not ideal. Instead, the tech team started thinking about a simple solution that lets each stakeholder work with the data points they need in their own space: the data is automatically synced one way from production to each stakeholder's specific standalone database.
Simple design diagram
Development
We started an open-source project called sync (https://github.com/retail-ai-inc/sync), written in Go, based on the diagram above. Currently, sync supports the following features:
- Initial Sync: Bulk synchronization of data from the MongoDB cluster or MongoDB replica set to the standalone MongoDB instance.
- Incremental Sync: Synchronizes newly updated or inserted data since the last sync using timestamps.
- Change Stream Monitoring: Watches for real-time changes (insert, update, replace, delete) in the cluster's collections and reflects them in the standalone instance.
- Batch Processing: Handles synchronization in batches for optimized performance.
- Concurrent Execution: Supports parallel synchronization for multiple collections.
After implementing the first version of our sync service, the data flow became smooth and maintainable, with less operational burden on all stakeholders. Finally, our data flow looks as follows:
Contributing
We welcome any feedback through pull requests or issues in the repo: https://github.com/retail-ai-inc/sync.
Tomorrow's article is @mizoguchi_ryosuke's Mathematical Optimization Case Studies in Retail. Stay tuned!
Retail AI and TRIAL are looking for great engineers. If you are interested, please take a quick look at the following pages: