Safely sync a database cluster to a standalone database for local use

Posted at 2024-11-29

What is the problem

Let's assume you have a mid-size to large SaaS platform with multiple tech teams and stakeholders. Different teams need to analyze the production data independently. However, for security and stability reasons, the tech team doesn't want to give all these stakeholders direct access to the production databases.

A simple one-way solution

Create standalone databases outside your production database servers, each with the same name as its production counterpart, and sync the production data of specific tables or collections into them. Data flows in one direction only, from production to the standalone copy.
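To make "specific tables or collections" concrete, here is a minimal sketch in Go of an allowlist that decides what a standalone database receives. The `SyncPlan` type, its fields, and the collection names are assumptions made up for illustration; they are not taken from any real tool's configuration.

```go
package main

import (
	"fmt"
	"sort"
)

// SyncPlan is a hypothetical description of what one standalone database
// should receive from production. All names here are illustrative only.
type SyncPlan struct {
	Database    string   // same name on production and on the standalone copy
	Collections []string // tables or collections that are allowed to be synced
}

// Select returns the production collections that the plan allows, sorted by
// name. Anything not listed in the plan stays in production only.
func (p SyncPlan) Select(production []string) []string {
	allowed := make(map[string]bool, len(p.Collections))
	for _, name := range p.Collections {
		allowed[name] = true
	}
	var selected []string
	for _, name := range production {
		if allowed[name] {
			selected = append(selected, name)
		}
	}
	sort.Strings(selected)
	return selected
}

func main() {
	plan := SyncPlan{Database: "shop", Collections: []string{"orders", "carts"}}
	production := []string{"users", "orders", "payments", "carts"}
	fmt.Println(plan.Database, plan.Select(production))
}
```

An allowlist (rather than a denylist) is the safer default in this sketch: a collection added to production later is not exposed until someone lists it explicitly.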

Target audience

  • Data science and ML teams that train models and run experiments
  • Operations, CS, and sales teams that want to analyze the data themselves

SkipCart platform

SkipCart is a smart shopping cart platform serving 220+ retail stores across Japan. The platform handles 4.0M+ transactions per month and keeps growing. It is large, and its data I/O, storage, and processing requirements are heavy. To support this, we built SDP (SkipCart Data Platform). SDP includes a MongoDB sharded cluster alongside several other systems. The high-level architecture looks like this:

[Figure: high-level SDP architecture (Screenshot 2024-11-29 at 15.39.53.png)]

SDP contains various types of data, from SkipCart hardware activity to customer activity, so multiple stakeholders are interested in working with these data points. However, the tech team realized that letting different people into the production databases is not ideal. Instead, the team started looking for a simple way to let each stakeholder work with the data they need in their own space. The data is synced automatically, in one direction, from production to each stakeholder's standalone database.
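One way to picture this one-way flow is as a stream of change events that is replayed onto the standalone copy and never sent back. The Go sketch below uses an in-memory map as a stand-in for a stakeholder's database. The `Event` and `Standalone` types and the operation names are assumptions for illustration, not the data model of any real tool.

```go
package main

import "fmt"

// Event is a hypothetical change record flowing from production to a
// standalone copy. Op is one of "insert", "update", "replace" or "delete".
type Event struct {
	Op   string
	ID   string
	Data map[string]string
}

// Standalone is an in-memory stand-in for one stakeholder's database,
// keyed by document ID.
type Standalone map[string]map[string]string

// Apply replays one production event onto the standalone copy.
// Nothing is ever sent back to production, which keeps the flow one way.
func (s Standalone) Apply(ev Event) error {
	switch ev.Op {
	case "insert", "replace":
		// Store a copy so later changes to ev.Data do not leak in.
		doc := make(map[string]string, len(ev.Data))
		for k, v := range ev.Data {
			doc[k] = v
		}
		s[ev.ID] = doc
	case "update":
		// Merge the changed fields into the existing document.
		doc, ok := s[ev.ID]
		if !ok {
			doc = map[string]string{}
			s[ev.ID] = doc
		}
		for k, v := range ev.Data {
			doc[k] = v
		}
	case "delete":
		delete(s, ev.ID)
	default:
		return fmt.Errorf("unknown operation %q", ev.Op)
	}
	return nil
}

func main() {
	analytics := Standalone{}
	events := []Event{
		{Op: "insert", ID: "cart-1", Data: map[string]string{"status": "open"}},
		{Op: "update", ID: "cart-1", Data: map[string]string{"status": "paid"}},
		{Op: "insert", ID: "cart-2", Data: map[string]string{"status": "open"}},
		{Op: "delete", ID: "cart-2"},
	}
	for _, ev := range events {
		if err := analytics.Apply(ev); err != nil {
			fmt.Println("skip:", err)
		}
	}
	fmt.Println(len(analytics), analytics["cart-1"]["status"])
}
```

Because the standalone copy only consumes events, a stakeholder can experiment freely in their own space without any risk of writing to production.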

Simple design diagram

[Figure: sync design diagram (sync.png)]

Development

We started an open-source project called sync (https://github.com/retail-ai-inc/sync), written in Go and based on the diagram above. sync currently supports the following features:

  • Initial Sync: Bulk synchronization of data from the MongoDB cluster or MongoDB replica set to the standalone MongoDB instance.
  • Incremental Sync: Synchronizes newly updated or inserted data since the last sync using timestamps.
  • Change Stream Monitoring: Watches for real-time changes (insert, update, replace, delete) in the cluster's collections and reflects them in the standalone instance.
  • Batch Processing: Handles synchronization in batches for optimized performance.
  • Concurrent Execution: Supports parallel synchronization for multiple collections.
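To illustrate how timestamp-based incremental sync, batching, and concurrent execution can fit together, here is a minimal, self-contained sketch in Go. It uses in-memory slices instead of MongoDB, and every name in it (`Doc`, `Collection`, `IncrementalSync`, `SyncAll`) is an assumption made up for this example. It is not the actual implementation in the sync repository.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Doc is a hypothetical document with a last-modified timestamp (Unix seconds).
type Doc struct {
	ID        string
	UpdatedAt int64
}

// Collection is an in-memory stand-in for a source collection.
type Collection struct {
	Name string
	Docs []Doc
}

// IncrementalSync passes documents modified after lastSync to write, in
// batches of at most batchSize. It returns the number of documents copied
// and the new watermark to use as lastSync on the next run.
func IncrementalSync(src Collection, lastSync int64, batchSize int, write func([]Doc)) (int, int64) {
	if batchSize < 1 {
		batchSize = 1
	}
	var changed []Doc
	for _, d := range src.Docs {
		if d.UpdatedAt > lastSync {
			changed = append(changed, d)
		}
	}
	// Oldest first, so the watermark only moves forward.
	sort.Slice(changed, func(i, j int) bool { return changed[i].UpdatedAt < changed[j].UpdatedAt })

	watermark := lastSync
	for start := 0; start < len(changed); start += batchSize {
		end := start + batchSize
		if end > len(changed) {
			end = len(changed)
		}
		batch := changed[start:end]
		write(batch)
		watermark = batch[len(batch)-1].UpdatedAt
	}
	return len(changed), watermark
}

// SyncAll runs IncrementalSync for every collection in its own goroutine
// and returns the number of documents copied per collection.
func SyncAll(sources []Collection, lastSync int64, batchSize int) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		copied = make(map[string]int)
	)
	for _, src := range sources {
		wg.Add(1)
		go func(src Collection) {
			defer wg.Done()
			n, _ := IncrementalSync(src, lastSync, batchSize, func([]Doc) {})
			mu.Lock()
			copied[src.Name] = n
			mu.Unlock()
		}(src)
	}
	wg.Wait()
	return copied
}

func main() {
	sources := []Collection{
		{Name: "orders", Docs: []Doc{{"a", 100}, {"b", 205}, {"c", 310}}},
		{Name: "carts", Docs: []Doc{{"x", 50}, {"y", 150}}},
	}
	fmt.Println(SyncAll(sources, 120, 2))
}
```

Note that the strict `>` comparison in this sketch would miss a document written later with the same timestamp as the watermark. A real implementation needs a tie-breaking rule, which is one reason to combine incremental sync with change stream monitoring.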

After we rolled out the first version of our sync service, the data flow became smooth and maintainable, with less operational burden for all stakeholders. Our data flow now looks as follows:

[Figure: data flow after introducing sync (Screenshot 2024-11-29 at 19.22.43.png)]

Contributing

We welcome feedback through pull requests or issues in the repo: https://github.com/retail-ai-inc/sync.

Tomorrow's article is @mizoguchi_ryosuke's Mathematical Optimization Case Studies in Retail. Stay tuned!

Retail AI and TRAIl are looking for great engineers. If you are interested, please take a look at the following pages:
