Preamble
After using DynamoDB for a while, I have concluded that it is not well suited to APIs with complex data-access needs.
A common pattern is to combine API Gateway + Lambda Function + DynamoDB to build a serverless backend for applications (a minimal sketch follows the component descriptions below).
API Gateway
- Acts as the entry point to handle incoming HTTP/HTTPS requests from clients (e.g., web or mobile apps)
- Routes requests to the appropriate backend services or Lambda functions
- Can also handle tasks like authentication, throttling, and CORS
Lambda Function
- Executes the business logic triggered by API Gateway
- Processes incoming requests, performs operations such as data validation, computation, or transformation, and interacts with other AWS services
- Stateless and serverless, meaning it scales automatically and only incurs costs when the function is invoked
DynamoDB
- A NoSQL database used to store and retrieve data
- Designed for high availability, scalability, and low latency
- Ideal for applications requiring fast, predictable performance with flexible schemas
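To make the combo concrete, here is a minimal sketch of a Lambda handler behind API Gateway reading from DynamoDB. The users table, the user_id key and the pathParameters shape are assumptions for the example, not taken from a specific project.
import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('users')  # assumed table name

def lambda_handler(event, context):
    # With an API Gateway proxy integration, path parameters arrive in the event
    user_id = event['pathParameters']['user_id']
    response = table.get_item(Key={'user_id': user_id})
    item = response.get('Item')
    if item is None:
        return {'statusCode': 404, 'body': json.dumps({'message': 'Not found'})}
    return {'statusCode': 200, 'body': json.dumps(item, default=str)}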
However, there are some limitations in DynamoDB that may cause frustration and disappointment.
Let me explain.
1 - Item size limit
A single DynamoDB item (similar to a "row" in a relational database) cannot exceed 400KB.
In comparison,
PostgreSQL allows up to 1.6TB per row (1600 columns × 1GB each).
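Back on the DynamoDB side, a write that exceeds the 400KB limit is rejected by the API. A minimal sketch; the documents table and its attributes are hypothetical:
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('documents')  # hypothetical table

big_payload = 'x' * (450 * 1024)  # ~450KB of text, above the 400KB item limit

try:
    table.put_item(Item={'doc_id': 'report-2024', 'body': big_payload})
except ClientError as err:
    # DynamoDB rejects the oversized write with a ValidationException
    print(err.response['Error']['Code'], err.response['Error']['Message'])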
2 - Page size limit
A single DynamoDB request cannot return more than 1MB of data, for example when retrieving a group of items.
When a request does not return the entire list of matching items, the response includes a LastEvaluatedKey
property that can be passed to the next request to read the next page.
For example, in a simple table:

| subscriptions |
|---|
| user_id (Partition Key) |
| subscription_type (Sort Key) |

To retrieve all items for subscription_type = 'daily-newsletter',
multiple queries are required due to pagination:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('subscriptions')

# Note: with user_id as the partition key, querying on subscription_type alone
# assumes an index (e.g. a GSI) keyed on subscription_type.
# Perform the query
response = table.query(
    KeyConditionExpression=Key('subscription_type').eq('daily-newsletter')
)
user_ids = [item['user_id'] for item in response['Items']]

# Handle pagination if the result set is larger than 1MB
while 'LastEvaluatedKey' in response:
    response = table.query(
        KeyConditionExpression=Key('subscription_type').eq('daily-newsletter'),
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    user_ids.extend([item['user_id'] for item in response['Items']])
In comparison,
PostgreSQL can handle it with a single query and no row limit (depending on settings):
SELECT user_id FROM subscriptions WHERE subscription_type = 'daily-newsletter';
The same limitation also applies to a table scan (selecting all the elements of a table):
table = dynamodb.Table('subscriptions')

# Perform the scan
response = table.scan()
user_ids = [item['user_id'] for item in response['Items']]

# Handle pagination if the result set is larger than 1MB
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    user_ids.extend([item['user_id'] for item in response['Items']])
In comparison,
PostgreSQL can handle it simply by doing:
SELECT user_id FROM subscriptions;
We could try to reduce the payload by selecting only the column(s) we actually need with the ProjectionExpression parameter, but that does not change the number of results returned per page under the 1MB limit; it only reduces the amount of data transferred and improves the response time.
table = dynamodb.Table('subscriptions')
scan_params = {"ProjectionExpression": "user_id"}

# Perform the scan, returning only the user_id attribute
response = table.scan(**scan_params)
user_ids = [item['user_id'] for item in response['Items']]

# Handle pagination if the result set is larger than 1MB
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ExclusiveStartKey=response['LastEvaluatedKey'],
        **scan_params
    )
    user_ids.extend([item['user_id'] for item in response['Items']])
3 - Pagination Limit
We just explained that results are paginated because of the 1MB response limit, but what about the "real" pagination system used by most applications?
Pagination can be handled by adding the Limit
parameter, but there is no way to set a page number or an offset.
To handle it, we need to loop over queries, reading page after page until we reach the right one:
table = dynamodb.Table('subscriptions')
scan_params = {"Limit": 20}
page = 5  # Received from the client
items = []
last_evaluated_key = None

# We must loop over the results until we reach the requested page
for current_page in range(1, page + 1):
    if last_evaluated_key:
        scan_params["ExclusiveStartKey"] = last_evaluated_key
    response = table.scan(**scan_params)
    last_evaluated_key = response.get("LastEvaluatedKey")
    if current_page == page:
        items = response.get("Items", [])
        break
    if last_evaluated_key is None:
        break
This method is not great and can introduce errors due to the problem mentioned in #2.
Since the response size is limited to 1MB, there is no guarantee that the number of elements per page (Limit)
is respected.
The first page can contain 20 elements, the second 17, the third 19, and so on.
In comparison,
PostgreSQL handles it simply with:
SELECT *
FROM subscriptions
ORDER BY id
LIMIT 20 OFFSET 20;
4 - Batch limits
DynamoDB uses an API-based interaction model where each request incurs a cost, so it is better to group operations into a single API call when possible.
Fewer requests also reduce network overhead and improve response times.
DynamoDB provides batch operations:
- retrieving multiple items in a single call
- performing multiple write operations (puts or deletes) in a single call (see the write-side sketch at the end of this section)
Batch operations can target one or more tables within a single request.
For example, since DynamoDB cannot perform the SQL IN
operation, we need to retrieve each item individually by its primary key:
# Retrieve users from the database
table = dynamodb.Table('users')
batch_size = 100
user_profiles = []

# unique_user_ids is the list of user ids we want to fetch (built elsewhere)
keys = [{'user_id': user_id} for user_id in unique_user_ids]

for i in range(0, len(keys), batch_size):
    batch_keys = keys[i:i + batch_size]
    response = dynamodb.batch_get_item(
        RequestItems={
            table.name: {
                'Keys': batch_keys,
                'ConsistentRead': True
            }
        }
    )
    user_profiles.extend(response['Responses'].get(table.name, []))

    # Handle unprocessed keys
    while 'UnprocessedKeys' in response and response['UnprocessedKeys']:
        response = dynamodb.batch_get_item(
            RequestItems=response['UnprocessedKeys']
        )
        user_profiles.extend(response['Responses'].get(table.name, []))
Batch operations come with many constraints:
- Limited to 100 primary keys per batch get request
- Limited to 25 items per batch put/delete request
- No duplicate primary keys within a single batch get-item request
- Maximum of 16MB of input
- Maximum of 16MB of output
- Results may be paginated
In comparison,
PostgreSQL has no direct equivalent since it is not API-based, but we can retrieve the users
like this:
SELECT * FROM users WHERE user_id IN (1, 2, 3);
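On the DynamoDB write side, boto3's batch_writer() helper can take care of the 25-item grouping and of retrying unprocessed items. A minimal sketch, with illustrative ids:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('subscriptions')

# batch_writer() groups puts/deletes into BatchWriteItem calls of at most
# 25 items and automatically resends unprocessed items.
with table.batch_writer() as batch:
    for user_id in ['u1', 'u2', 'u3']:  # illustrative ids
        batch.put_item(Item={
            'user_id': user_id,
            'subscription_type': 'daily-newsletter',
        })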
5 - No Table Join
With DynamoDB it is not possible to cross-reference tables; this is intrinsic to NoSQL databases.
For example, with two simple tables:

| subscriptions |
|---|
| user_id (Partition Key) |
| subscription_type (Sort Key) |

| users |
|---|
| user_id (Partition Key) |
To retrieve the email
of every subscribed user, we need to chain a query and a batch get:
table = dynamodb.Table('subscriptions')

# Retrieve the list of newsletter subscribers
# (as noted earlier, this assumes an index keyed on subscription_type)
response = table.query(
    KeyConditionExpression=Key('subscription_type').eq('daily-newsletter')
)
user_ids = [item['user_id'] for item in response['Items']]

# Handle pagination if the result set is larger than 1MB
while 'LastEvaluatedKey' in response:
    response = table.query(
        KeyConditionExpression=Key('subscription_type').eq('daily-newsletter'),
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    user_ids.extend([item['user_id'] for item in response['Items']])

# Remove duplicates
unique_user_ids = list(set(user_ids))

batch_size = 100
user_profiles = []
keys = [{'user_id': user_id} for user_id in unique_user_ids]
table = dynamodb.Table('users')

# Create batches of at most 100 user_id and loop until all users are retrieved
for i in range(0, len(keys), batch_size):
    batch_keys = keys[i:i + batch_size]
    response = dynamodb.batch_get_item(
        RequestItems={
            table.name: {
                'Keys': batch_keys,
                'ConsistentRead': True
            }
        }
    )
    user_profiles.extend(response['Responses'].get(table.name, []))

    # Handle unprocessed keys
    while 'UnprocessedKeys' in response and response['UnprocessedKeys']:
        response = dynamodb.batch_get_item(
            RequestItems=response['UnprocessedKeys']
        )
        user_profiles.extend(response['Responses'].get(table.name, []))
In comparison,
PostgreSQL can easily join tables, with joins as complex as needed:
SELECT *
FROM users AS u, subscriptions AS s
WHERE s.user_id = u.user_id
AND s.subscription_type = 'daily-newsletter';
6 - No Direct Filtering on Non-Key Attributes
In DynamoDB, the primary mechanism for retrieving data is through queries and scans based on the primary key attributes: the Partition Key and the Sort Key.
However, querying directly on attributes that are not part of the primary key is not supported in the same way. To query on non-key attributes efficiently, you need to use Global Secondary Indexes (GSIs) or Local Secondary Indexes (LSIs).
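For reference, a query through a GSI could look like the sketch below. The index name type-lang-index, with subscription_type as its partition key and lang as its sort key, is an assumption for the example and not part of the table defined earlier.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('subscriptions')

# Assumes a GSI named 'type-lang-index' (partition key: subscription_type,
# sort key: lang) was added to the table; pagination still applies.
response = table.query(
    IndexName='type-lang-index',
    KeyConditionExpression=(
        Key('subscription_type').eq('daily-newsletter') & Key('lang').eq('english')
    ),
)
english_subscribers = response['Items']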
Without a GSI or LSI, a filter can be applied with the FilterExpression
parameter, but this filter is applied after the data has been read. It is a post-processing filter, similar to application-level filtering; the filtered-out items are simply not transferred to the application, which spares unnecessary data over the network.
But a post-processing filter is very inefficient when combined with pagination.
from boto3.dynamodb.conditions import Attr

table = dynamodb.Table('subscriptions')

# Perform the query; the lang filter is applied after the items are read
response = table.query(
    KeyConditionExpression=Key('subscription_type').eq('daily-newsletter'),
    FilterExpression=Attr('lang').eq('english')
)
user_ids = [item['user_id'] for item in response['Items']]

# Handle pagination if the result set is larger than 1MB
while 'LastEvaluatedKey' in response:
    response = table.query(
        KeyConditionExpression=Key('subscription_type').eq('daily-newsletter'),
        FilterExpression=Attr('lang').eq('english'),
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    user_ids.extend([item['user_id'] for item in response['Items']])
In comparison,
PostgreSQL can easily filter on any column, combining conditions as needed:
SELECT *
FROM subscriptions
WHERE subscription_type = 'daily-newsletter'
AND lang = 'english';
7 - Comparison with PostgreSQL
| Feature | DynamoDB | PostgreSQL |
|---|---|---|
| Item Size Limit | 400KB per item | Up to 1.6TB per row |
| Pagination | Requires manual handling with LastEvaluatedKey, no OFFSET | Supports LIMIT and OFFSET for direct pagination |
| Page Size | Maximum 1MB per request (response payload limit) | No strict page size limits, depends on memory and settings |
| Batch Operations | Limited to 100 keys for BatchGetItem or 25 items for BatchWriteItem, max 16MB request/response | No batch; SQL queries can handle complex conditions |
| Filtering | Filtering with FilterExpression is post-processing and inefficient for large datasets and pagination | Can filter on any attribute in SQL queries |
| Joins | No table joins; must perform multiple queries or denormalize data | Supports complex table joins with JOIN |
| Access Model | API-based interaction (HTTPS) | Direct connection via SQL |
| Query Language | NoSQL API with basic query and scan operations | SQL for complex and ad-hoc queries |
| Indexing | Requires Global/Local Secondary Indexes (GSI/LSI) for non-key queries | Automatic and custom indexes |
| Atomicity | Transactions available but limited to 25 operations per transaction | Transactions supported, ACID-compliant |
| Scalability | Scales horizontally, highly distributed and serverless | Scales vertically (add hardware) and horizontally (partitioning) |
| Data Model | NoSQL (Key-Value, Document); encourages denormalized data | Relational, supports normalized and complex relationships |
| Performance | Optimized for high throughput and predictable access patterns | Optimized for complex queries with proper indexing |
| Use Case | Ideal for high-performance key-value or document-based use cases | Ideal for complex applications with relational data |
Of course, DynamoDB and PostgreSQL don't serve the same purpose.
Conclusion
I don't encourage using DynamoDB for:
- User-related tables where the queries are complex
- A direct id ←→ value mapping used as a dictionary: DynamoDB can feel overkill here, and a simpler solution such as ElastiCache (Redis) would be better suited for this scenario (see the short sketch after this list)
- Values whose size is measured in MB: for large binary data or files, DynamoDB is not designed to handle "big data" directly; instead, it is common to store such data in S3 and keep references (e.g., URLs) in DynamoDB, since small records with high-frequency access patterns are a better fit for DynamoDB
- Pagination: pagination is inefficient and requires reading the table page after page until the requested page is reached
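For the id ←→ value case, here is a minimal sketch of the Redis alternative; the endpoint and key names are placeholders, and an ElastiCache cluster would expose a similar hostname:
import redis

# Placeholder endpoint for an ElastiCache (Redis) cluster
r = redis.Redis(host='my-cache.example.com', port=6379, decode_responses=True)

# A plain key-value mapping covers the id <-> value dictionary use case
r.set('user:42:plan', 'daily-newsletter')
print(r.get('user:42:plan'))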
On the other hand,
DynamoDB is designed to handle applications where there are large numbers of read or write operations per second, often in the range of thousands or millions.
It is built for speed and scalability.
Horizontal Scalability
- DynamoDB automatically partitions data across multiple nodes in a distributed system, spreading the load evenly
- The capacity of your database scales automatically as your read/write workload increases
Performance Optimization
- DynamoDB provides single-digit millisecond latency for both reads and writes, regardless of the size of your dataset
- By leveraging Provisioned Throughput or On-Demand Mode, you can configure DynamoDB to handle spikes in traffic seamlessly without affecting performance (a minimal sketch follows)
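For example, a table can be created in on-demand mode so that capacity scales with traffic; the events table and its key are assumptions for this sketch:
import boto3

dynamodb = boto3.resource('dynamodb')

# On-demand capacity: nothing to provision, billed per request.
# For provisioned mode, use a ProvisionedThroughput setting instead of BillingMode.
dynamodb.create_table(
    TableName='events',  # assumed table name
    KeySchema=[{'AttributeName': 'event_id', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'event_id', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST',
)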