What is the problem?
Let's assume you have a mid- to large-size SaaS platform with multiple tech teams and stakeholders. Different teams need to analyze production data independently. However, for security and stability reasons, the tech team doesn't want to give all of these stakeholders direct access to the production databases.
A simple one-way solution
Create standalone databases outside of your production database servers, with the same names as production, and sync the production data of specific tables or collections to those standalone databases.
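To make the idea concrete, such a one-way sync can be described declaratively: a source, a target, and the collections to mirror. The snippet below is a hypothetical sketch for illustration only, not any tool's actual configuration format:

```yaml
# Hypothetical one-way sync configuration (illustrative only).
source:
  uri: mongodb://prod-cluster:27017          # production cluster (read-only access)
  database: app
target:
  uri: mongodb://analytics-standalone:27017  # standalone instance for stakeholders
  database: app                              # same database name as production
collections:
  - transactions
  - carts
```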
Target audience
- Data science and ML teams, to train models and run experiments
- Operations, CS, and sales teams, to analyze the data by themselves
SkipCart platform
SkipCart is a smart shopping cart platform serving 220+ retail stores around Japan. Our platform serves 4.0M+ monthly transactions and is growing every month. The platform is big, and its data I/O, storage, and processing requirements are heavy. To support this, we built SDP (the SkipCart Data Platform). In SDP, we have a MongoDB sharded cluster alongside multiple other systems. The high-level architecture looks like this:
SDP contains various data types, from SkipCart hardware activity to customer activity, so multiple stakeholders are interested in these important data points. However, the tech team realized that allowing different people into the production databases is not ideal. Instead, the tech team started thinking about a simple solution that lets each stakeholder work with the data points they need in their own space: the data is automatically synced one way from production to each stakeholder's specific standalone database.
Simple design diagram
Development
We started an open-source project called sync (https://github.com/retail-ai-inc/sync), written in Go, based on the diagram above. Currently, sync supports the following features:
- Initial Sync: Bulk synchronization of data from the MongoDB cluster or MongoDB replica set to the standalone MongoDB instance.
- Incremental Sync: Synchronizes newly updated or inserted data since the last sync using timestamps.
- Change Stream Monitoring: Watches for real-time changes (insert, update, replace, delete) in the cluster's collections and reflects them in the standalone instance.
- Batch Processing: Handles synchronization in batches for optimized performance.
- Concurrent Execution: Supports parallel synchronization for multiple collections.
After implementing the first version of our sync service, the data flow became smooth and maintainable, with less operational burden on all stakeholders. Finally, our data flow looks as follows:
Contributing
We welcome any feedback through pull requests or issues in the repo: https://github.com/retail-ai-inc/sync.
Tomorrow's article is @mizoguchi_ryosuke's Mathematical Optimization Case Studies in Retail. Stay tuned!
Retail AI and TRIAL are looking for great engineers. If you are interested, please take a quick look at the following pages: