This post is about how I designed the schema for a MongoDB collection that will potentially hold billions of documents.

I'm working on some code that will let us see how many times a user has seen each dish, the main form of content at SnapDish. To do this I designed a new collection called `impression count`, which comes with a few requirements.
Data Requirements
I want to be able to:

1. query by `user_id` and get `impression count` docs for all the dishes that the user has seen
2. query by `dish_id` and get `impression count` docs for all the users who have seen the dish
3. query by `count` itself to see which dishes are being viewed the most
4. batch update thousands of `impression count` docs with a number of database queries that does not scale with the number of documents
Because of these requirements I decided to go with a long-form data format. An example document looks like this:

```python
{'user_id': ObjectId('5d9d345b0ff2aecca14b7423'),
 'dish_id': ObjectId('5d9d345b0ff2aecca14b7424'),
 'count': 0}
```
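To make requirements 1 through 3 concrete, the queries look roughly like this with pymongo (the `snapdish` database and `impression_counts` collection names are placeholders, not the real SnapDish names):

```python
from bson.objectid import ObjectId
from pymongo import DESCENDING, MongoClient

client = MongoClient()  # assumes a local mongod; adjust the URI as needed
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

some_user_id = ObjectId("5d9d345b0ff2aecca14b7423")
some_dish_id = ObjectId("5d9d345b0ff2aecca14b7424")

# 1. all impression count docs for one user
user_docs = impression_counts.find({"user_id": some_user_id})

# 2. all impression count docs for one dish
dish_docs = impression_counts.find({"dish_id": some_dish_id})

# 3. the most viewed dishes overall
most_viewed = impression_counts.find().sort("count", DESCENDING).limit(10)
```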
With indexes on the three keys, this data format satisfies requirements 1 through 3, but not 4. To update the count in many `impression count` documents at the same time I need a compound key. Since `(user_id, dish_id)` is a unique combination, I use it as the compound key. I could simply use the tuple I just showed you, but it takes up an awful lot of memory, so I decided to use a hash digest of the tuple instead.
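As a minimal sketch (reusing the placeholder names from above, and the long key names, which get shortened later in the post), the indexes could be set up like this:

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient()  # assumes a local mongod
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

# one index per query pattern: by user, by dish, and by count
impression_counts.create_index([("user_id", ASCENDING)])
impression_counts.create_index([("dish_id", ASCENDING)])
impression_counts.create_index([("count", DESCENDING)])

# the hash of the (user_id, dish_id) tuple, used as the compound key for batch updates
impression_counts.create_index([("hash", ASCENDING)])
```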
Picking a hashing function
I thought about whether to use the simple built-in `hash()` or `hashlib.sha256().hexdigest()`.
Using `hash()`

Pros:

- Small data size. `hash()` generates an `int`, which only takes 8 bytes when stored in mongo.

Cons:

- "High" hash collision probability. `hash()` produces an `int`, which means there are `2**64` buckets. The risk of a hash collision is larger than with `sha256`, but still small. And if two dish/user pairs do cause a hash collision, it simply means that the `impression count` data gets a little noisy. We can live with that.
Using `sha256`

Pros:

- Very small risk of hash collisions.

Cons:

- Large data size. The hexdigest of `sha256` is a 64-character string, which is much larger than the 8 bytes required to store `hash()` digests.
What surprised me a little is that `hash()` is only about twice as fast as `hashlib.sha256`, and that's even including the extra bit of data manipulation needed to convert the two `ObjectId` objects into a string before calculating the `sha256` digest. Still, I went with `hash()`, because keeping the size of the data stored in the db as small as possible takes priority.
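For reference, the comparison can be timed along these lines (a rough sketch; how exactly the ObjectIds are turned into bytes for `sha256` is a detail, and the numbers will vary by machine and Python version):

```python
import hashlib
import timeit

from bson.objectid import ObjectId

pairs = [(ObjectId(), ObjectId()) for _ in range(10**5)]

def with_builtin_hash():
    return [hash(pair) for pair in pairs]

def with_sha256():
    # ObjectId pairs have to be converted to bytes before sha256 can digest them
    return [hashlib.sha256((str(u) + str(d)).encode()).hexdigest() for u, d in pairs]

print("hash()  :", timeit.timeit(with_builtin_hash, number=10))
print("sha256():", timeit.timeit(with_sha256, number=10))
```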
Testing `hash()` collisions

Just to make sure, I wrote a quick test where I generated 100 million `(user_id, dish_id)` tuples, hashed them, and counted the hashes to see if there were any collisions.
```python
import time

from bson.objectid import ObjectId
import pandas as pd

# hash 100 million (user_id, dish_id)-style pairs of fresh ObjectIds and time it
start = time.time()
hashes = [hash((ObjectId(), ObjectId())) for _ in range(10**8)]
end = time.time()  # end - start gives the total hashing time

# count occurrences of each hash; any count > 1 would be a collision
pd.Series(hashes).value_counts()[:10]
```
```
Out[28]:
 7865030724768499228    1
 3593037806229148232    1
-7578863228603399659    1
 5892295761844497950    1
-6940781337181938681    1
-2709651244688460355    1
-3635947529078774682    1
-7736769897498555780    1
-5491088629124975649    1
 8601880920309379170    1
dtype: int64
```
So, no collisions. Great! I ran the test around a hundred times, and not a single collision occurred. And remember, even if a collision does happen at some point, it just adds a tiny bit of noise, which is fine for this kind of data.
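That matches the birthday bound: assuming `hash()` spreads the tuples roughly uniformly over its `2**64` possible values, the expected number of colliding pairs among 100 million keys is tiny:

```python
n = 10**8        # number of (user_id, dish_id) pairs hashed per test run
buckets = 2**64  # size of hash()'s output space

# birthday approximation: expected number of colliding pairs
expected_collisions = n * (n - 1) / (2 * buckets)
print(expected_collisions)  # ~0.00027, i.e. roughly one collision per 3700 runs
```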
Final data format

```python
{'user_id': ObjectId('5d9d345b0ff2aecca14b7423'),
 'dish_id': ObjectId('5d9d345b0ff2aecca14b7424'),
 'count': 0,
 'hash': -8777556310190452075}
```
Let's take a look at how much memory a single `impression count` document will use in MongoDB:

- `user_id`: BSON ObjectId uses 12 bytes
- `dish_id`: BSON ObjectId uses 12 bytes
- `count`: BSON NumberInt uses 4 bytes
- `hash`: BSON NumberLong uses 8 bytes
- 4 bytes for the document keys, once I rename `user_id` to `u`, `dish_id` to `d`, `count` to `c` and `hash` to `h`

So a total of `12 + 12 + 4 + 8 + 4 = 40` bytes. While I could shave off 8 bytes by also hashing `user_id` and `dish_id`, I want to be able to query directly for users and dishes. So now we know that we can store 1 billion user/dish `impression count` documents with roughly 40 GB of database disk used, while still satisfying the original data requirements. Pretty neat!
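The exact wire size can be checked by encoding a document with PyMongo's `bson` module (`bson.encode` needs PyMongo 3.9+); the result lands a bit above the back-of-envelope figure because BSON adds a few bytes of framing per field, and MongoDB adds an `_id` on insert:

```python
import bson
from bson.objectid import ObjectId

doc = {
    "u": ObjectId("5d9d345b0ff2aecca14b7423"),
    "d": ObjectId("5d9d345b0ff2aecca14b7424"),
    "c": 0,
    "h": -8777556310190452075,
}

# length of the raw BSON bytes: the field values plus per-field type/key framing
print(len(bson.encode(doc)))
```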
Batch Updating impression counts
```python
from typing import Dict, List

import pymongo
from bson.objectid import ObjectId


def increment_impression_counts(
    impression_counts: pymongo.collection.Collection,
    impressions: List[Dict[str, ObjectId]],
):
    """
    Accepts a list of impression documents of the form

        {'u': ObjectId('5d9d41c20ff2aecca14b7427'),
         'd': ObjectId('5d9d41c20ff2aecca14b7428')}

    increments the count of existing `impression_count` documents that match on
    the hash, and creates new `impression_count` documents with `c=1` for the rest.
    """
    # use the hash of the (user_id, dish_id) tuple to find impression count
    # documents already in the collection
    hash_to_doc = {hash((i["u"], i["d"])): i for i in impressions}
    hashes_all = list(hash_to_doc.keys())
    hashes_in_db = impression_counts.find(
        {"h": {"$in": hashes_all}}, projection={"_id": False, "h": True}
    )
    hashes_in_db = [ic["h"] for ic in hashes_in_db]

    # increment the count of existing impression count documents
    impression_counts.update_many({"h": {"$in": hashes_in_db}}, {"$inc": {"c": 1}})

    # insert impression count documents that are not yet in the collection
    hashes_not_in_db = set(hashes_all) - set(hashes_in_db)
    docs_to_be_inserted = [
        {"h": key, "c": 1, "u": hash_to_doc[key]["u"], "d": hash_to_doc[key]["d"]}
        for key in hashes_not_in_db
    ]
    if docs_to_be_inserted:  # insert_many does not accept an empty list
        impression_counts.insert_many(docs_to_be_inserted)
```
I use a total of three queries:

- Find existing `impression count` documents
- Increment the count for existing `impression count` documents
- Create new `impression count` documents with `count = 1`
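With the function above in scope, calling it looks like this (again using placeholder database and collection names):

```python
from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
impression_counts = client["snapdish"]["impression_counts"]  # placeholder names

# a batch of raw impressions, e.g. collected from application logs
impressions = [
    {"u": ObjectId("5d9d41c20ff2aecca14b7427"), "d": ObjectId("5d9d41c20ff2aecca14b7428")},
    {"u": ObjectId("5d9d41c20ff2aecca14b7427"), "d": ObjectId("5d9d345b0ff2aecca14b7424")},
]

increment_impression_counts(impression_counts, impressions)
```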
While it's certainly possible to design the collection with a data format that uses less disk space, the requirement to batch update documents means that having each user/dish pair as a separate document is necessary. And even though this means more documents, each document is small (40 bytes), and the user/dish relation matrix is very sparse: we have over 5 million dishes, but most users have only ever seen a tiny fraction of them.
This was a fun bit of design work :).