This post is about how I designed the schema for a MongoDB collection that will potentially hold billions of documents.

I'm working on some code that will let us see how many times a user has seen each dish, the main form of content at SnapDish. To do this I designed a new collection called `impression count`, which comes with a few requirements.
Data Requirements
I want to be able to:

1. query by `user_id` and get `impression count` docs for all the dishes that the user has seen
2. query by `dish_id` and get `impression count` docs for all the users who have seen the dish
3. query by `count` itself to see which dishes are being viewed the most
4. batch update thousands of `impression count` docs with a number of database queries that does not scale with the number of documents
Because of these requirements I decided to go with a long-form data format. An example document looks like this:

```python
{'user_id': ObjectId('5d9d345b0ff2aecca14b7423'),
 'dish_id': ObjectId('5d9d345b0ff2aecca14b7424'),
 'count': 0}
```
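To make requirements 1 through 3 concrete, the queries look roughly like this with pymongo (the `snapdish` database and `impression_counts` collection names are placeholders, not the real SnapDish names):

```python
from bson.objectid import ObjectId
from pymongo import DESCENDING, MongoClient

client = MongoClient()  # assumes a local mongod; adjust the URI as needed
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

some_user_id = ObjectId("5d9d345b0ff2aecca14b7423")
some_dish_id = ObjectId("5d9d345b0ff2aecca14b7424")

# 1. all impression count docs for one user
user_docs = impression_counts.find({"user_id": some_user_id})

# 2. all impression count docs for one dish
dish_docs = impression_counts.find({"dish_id": some_dish_id})

# 3. the most viewed dishes overall
most_viewed = impression_counts.find().sort("count", DESCENDING).limit(10)
```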
With indexes on the three keys, this data format satisfies requirements 1 through 3, but not 4. To update the count in many `impression count` documents at the same time I need a compound key. Since `(user_id, dish_id)` is a unique combination, I use it as the compound key. I could simply use the tuple I just showed you, but it takes up an awful lot of memory, so I decided to use a hash digest of the tuple instead.
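As a minimal sketch (reusing the placeholder names from above, and the long key names, which get shortened later in the post), the indexes could be set up like this:

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient()  # assumes a local mongod
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

# one index per query pattern: by user, by dish, and by count
impression_counts.create_index([("user_id", ASCENDING)])
impression_counts.create_index([("dish_id", ASCENDING)])
impression_counts.create_index([("count", DESCENDING)])

# the hash of the (user_id, dish_id) tuple, used as the compound key for batch updates
impression_counts.create_index([("hash", ASCENDING)])
```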
Picking a hashing function
I thought about whether to use the simple built-in `hash()` or `hashlib.sha256().hexdigest()`.
Using `hash()`

Pros:

- Small data size. `hash()` generates an `int`, which only takes 8 bytes when stored in mongo.

Cons:

- "High" hash collision probability. `hash()` produces an `int`, which means there are `2**64` buckets. The risk of a hash collision is larger than with `sha256`, but still small. And if two dish/user pairs do cause a hash collision, it simply means that the `impression count` data gets a little noisy. We can live with that.
Using `sha256`

Pros:

- Very small risk of hash collisions.

Cons:

- Large data size. The hexdigest of `sha256` is a 64-character string, which is much larger than the 8 bytes required to store `hash()` digests.
What surprised me a little is that `hash()` is only about twice as fast as `hashlib.sha256`, and that's even including the extra bit of data manipulation needed to convert the two `ObjectId` objects into a string before calculating the `sha256` digest. Still, I went with `hash()`, because keeping the size of the data stored in the db as small as possible takes priority.
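For reference, the comparison can be timed along these lines (a rough sketch; how exactly the ObjectIds are turned into bytes for `sha256` is a detail, and the numbers will vary by machine and Python version):

```python
import hashlib
import timeit

from bson.objectid import ObjectId

pairs = [(ObjectId(), ObjectId()) for _ in range(10**5)]

def with_builtin_hash():
    return [hash(pair) for pair in pairs]

def with_sha256():
    # ObjectId pairs have to be converted to bytes before sha256 can digest them
    return [hashlib.sha256((str(u) + str(d)).encode()).hexdigest() for u, d in pairs]

print("hash()  :", timeit.timeit(with_builtin_hash, number=10))
print("sha256():", timeit.timeit(with_sha256, number=10))
```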
Testing `hash()` collisions

Just to make sure, I wrote a quick test where I generated 100 million `(user_id, dish_id)` tuples, hashed them, and counted the hashes to see if there were any collisions.
```python
import time

from bson.objectid import ObjectId
import pandas as pd

# hash 100 million (user_id, dish_id)-style pairs of fresh ObjectIds and time it
start = time.time()
hashes = [hash((ObjectId(), ObjectId())) for _ in range(10**8)]
end = time.time()  # end - start gives the total hashing time

# count occurrences of each hash; any count > 1 would be a collision
pd.Series(hashes).value_counts()[:10]
```
```
Out[28]:
 7865030724768499228    1
 3593037806229148232    1
-7578863228603399659    1
 5892295761844497950    1
-6940781337181938681    1
-2709651244688460355    1
-3635947529078774682    1
-7736769897498555780    1
-5491088629124975649    1
 8601880920309379170    1
dtype: int64
```
So, no collisions. Great! I ran the test around a hundred times, and not a single collision occurred. And remember, even if a collision does happen at some point, it just adds a tiny bit of noise, which is fine for this kind of data.
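That matches the birthday bound: assuming `hash()` spreads the tuples roughly uniformly over its `2**64` possible values, the expected number of colliding pairs among 100 million keys is tiny:

```python
n = 10**8        # number of (user_id, dish_id) pairs hashed per test run
buckets = 2**64  # size of hash()'s output space

# birthday approximation: expected number of colliding pairs
expected_collisions = n * (n - 1) / (2 * buckets)
print(expected_collisions)  # ~0.00027, i.e. roughly one collision per 3700 runs
```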
Final data format

```python
{'user_id': ObjectId('5d9d345b0ff2aecca14b7423'),
 'dish_id': ObjectId('5d9d345b0ff2aecca14b7424'),
 'count': 0,
 'hash': -8777556310190452075}
```
Let's take a look at how much memory a single `impression count` document will use in MongoDB:

- `user_id`: BSON ObjectId uses 12 bytes
- `dish_id`: BSON ObjectId uses 12 bytes
- `count`: BSON NumberInt uses 4 bytes
- `hash`: BSON NumberLong uses 8 bytes
- 4 bytes for the document keys, once I rename `user_id` to `u`, `dish_id` to `d`, `count` to `c` and `hash` to `h`

So a total of `12 + 12 + 4 + 8 + 4 = 40` bytes. While I could shave off 8 bytes by also hashing `user_id` and `dish_id`, I want to be able to query directly for users and dishes. So now we know that we can store 1 billion user/dish `impression count` documents with roughly 40 GB of database disk used, while still satisfying the original data requirements. Pretty neat!
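The exact wire size can be checked by encoding a document with PyMongo's `bson` module (`bson.encode` needs PyMongo 3.9+); the result lands a bit above the back-of-envelope figure because BSON adds a few bytes of framing per field, and MongoDB adds an `_id` on insert:

```python
import bson
from bson.objectid import ObjectId

doc = {
    "u": ObjectId("5d9d345b0ff2aecca14b7423"),
    "d": ObjectId("5d9d345b0ff2aecca14b7424"),
    "c": 0,
    "h": -8777556310190452075,
}

# length of the raw BSON bytes: the field values plus per-field type/key framing
print(len(bson.encode(doc)))
```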
Batch Updating impression counts
```python
from typing import Dict, List

import pymongo
from bson.objectid import ObjectId


def increment_impression_counts(
    impression_counts: pymongo.collection.Collection,
    impressions: List[Dict[str, ObjectId]],
):
    """
    Accepts a list of impression documents of the form

        {'u': ObjectId('5d9d41c20ff2aecca14b7427'),
         'd': ObjectId('5d9d41c20ff2aecca14b7428')}

    increments the count of existing `impression_count` documents that match on
    the hash, and creates new `impression_count` documents with `c=1` for the rest.
    """
    # use the hash of the (user_id, dish_id) tuple to find impression count
    # documents already in the collection
    hash_to_doc = {hash((i["u"], i["d"])): i for i in impressions}
    hashes_all = list(hash_to_doc.keys())
    hashes_in_db = impression_counts.find(
        {"h": {"$in": hashes_all}}, projection={"_id": False, "h": True}
    )
    hashes_in_db = [ic["h"] for ic in hashes_in_db]

    # increment the count of existing impression count documents
    impression_counts.update_many({"h": {"$in": hashes_in_db}}, {"$inc": {"c": 1}})

    # insert impression count documents that are not yet in the collection
    hashes_not_in_db = set(hashes_all) - set(hashes_in_db)
    docs_to_be_inserted = [
        {"h": key, "c": 1, "u": hash_to_doc[key]["u"], "d": hash_to_doc[key]["d"]}
        for key in hashes_not_in_db
    ]
    if docs_to_be_inserted:  # insert_many does not accept an empty list
        impression_counts.insert_many(docs_to_be_inserted)
```
I use a total of three queries:

- Find existing `impression count` documents
- Increment the count for existing `impression count` documents
- Create new `impression count` documents with `count = 1`
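With the function above in scope, calling it looks like this (again using placeholder database and collection names):

```python
from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

# a batch of raw impressions, e.g. collected from application logs
impressions = [
    {"u": ObjectId("5d9d41c20ff2aecca14b7427"), "d": ObjectId("5d9d41c20ff2aecca14b7428")},
    {"u": ObjectId("5d9d41c20ff2aecca14b7427"), "d": ObjectId("5d9d345b0ff2aecca14b7424")},
]

increment_impression_counts(impression_counts, impressions)
```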
While it's certainly possible to design the collection with a data format that uses less disk space, the requirement to batch update documents means that having each user/dish pair as a separate document is necessary. And even though this means more documents, each document is small (40 bytes), and the user/dish relation matrix is very sparse: we have over 5 million dishes, but most users have only ever seen a tiny fraction of them.
This was a fun bit of design work :).