
Cassandra anti-patterns: Queues and queue-like datasets

Posted at 2014-01-16, last updated at 2014-01-16

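Using Cassandra as a queue is reportedly a well-known anti-pattern. I found a good article on the DataStax blog explaining what actually goes wrong, so I read through it. The article is slightly old (2013/04/26), but it targets Cassandra 1.2.x, so I think most of it still applies to the current Cassandra 2.0.x. This is a loose translation, so apologies for any mistakes.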

Author

By Aleksey Yeschenko - April 26, 2013
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

Deletes in Cassandra

Cassandra uses a log-structured storage engine. Because of this, deletes do not remove the rows and columns immediately and in-place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the gc_grace_seconds per-table setting. Only then a tombstone can be permanently discarded by compaction.

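Cassandra uses a log-structured storage engine. Because of this, deletes do not remove rows or columns immediately. Instead, Cassandra writes a special marker called a tombstone, which indicates that the row, the column, or a range of columns has been deleted. These tombstones are retained for at least the period defined by the per-table gc_grace_seconds setting. After that, compaction can discard them.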
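The tombstone lifecycle described above can be sketched with a toy model. This is only an illustration, not Cassandra's storage engine: the `ToyTable` class and its methods are hypothetical names, and real Cassandra keeps cells in memtables and SSTables rather than in a dictionary. The value 864000 is used here as an example grace period of ten days.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyTable:
    """Toy model of deletes as writes. Hypothetical, not Cassandra's implementation."""
    gc_grace_seconds: int
    # column name -> (value, write time); a value of None marks a tombstone
    cells: Dict[str, Tuple[Optional[bytes], float]] = field(default_factory=dict)

    def write(self, column: str, value: bytes, now: float) -> None:
        self.cells[column] = (value, now)

    def delete(self, column: str, now: float) -> None:
        # A delete does not remove anything in place: it records a tombstone.
        self.cells[column] = (None, now)

    def compact(self, now: float) -> None:
        # Live cells are always kept; tombstones are kept until they are
        # at least gc_grace_seconds old, and only then discarded.
        self.cells = {
            col: (val, ts)
            for col, (val, ts) in self.cells.items()
            if val is not None or now - ts < self.gc_grace_seconds
        }

    def tombstone_count(self) -> int:
        return sum(1 for val, _ in self.cells.values() if val is None)


table = ToyTable(gc_grace_seconds=864000)
table.write("a", b"x", now=0)
table.delete("a", now=10)

table.compact(now=20)                # too early: the tombstone must survive
before = table.tombstone_count()

table.compact(now=10 + 864000)       # grace period has elapsed
after = table.tombstone_count()
```

In this sketch an early compaction leaves the tombstone in place, and only a compaction after the grace period removes it.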

This scheme allows for very fast deletes (and writes in general), but it’s not free: aside from the obvious RAM/disk overhead of tombstones, you might have to pay a certain price when reading data back if you haven’t modelled your data well.

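This scheme allows very fast deletes (and writes in general), but it comes at a cost. Aside from the RAM and disk overhead of the tombstones themselves, you may have to pay a certain price when reading data back if the data is not modelled well.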

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

Specifically, tombstones cause problems if you perform many deletes (especially column-level deletes): slice queries against rows that hold many tombstones become slow.

Symptoms of a wrong data model

To illustrate this scenario, let’s consider the most extreme case – using Cassandra as a durable queue, a known anti-pattern:

To illustrate this scenario, consider the most extreme case: using Cassandra as a durable queue (a known anti-pattern).

CREATE TABLE queues (
    name text,
    enqueued_at timeuuid,
    payload blob,
    PRIMARY KEY (name, enqueued_at)
);

Having enqueued 10000 10-byte messages and then dequeued 9999 of them, one by one, let’s peek at the last remaining message using cqlsh with TRACING ON:

Enqueue 10,000 messages of 10 bytes each, then dequeue 9,999 of them one by one. Next, fetch the last remaining message in cqlsh with TRACING enabled.

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      48
                        Peparing statement | 127.0.0.3 |     362
          Message received from /127.0.0.3 | 127.0.0.1 |      42
             Sending message to /127.0.0.1 | 127.0.0.3 |     718
Executing single-partition query on queues | 127.0.0.1 |     145
              Acquiring sstable references | 127.0.0.1 |     158
                 Merging memtable contents | 127.0.0.1 |     189
Merging data from memtables and 0 sstables | 127.0.0.1 |     235
    Read 1 live and 19998 tombstoned cells | 127.0.0.1 |  251102
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |  252976
             Sending message to /127.0.0.3 | 127.0.0.1 |  253052
          Message received from /127.0.0.1 | 127.0.0.3 |  324314
       Processing response from /127.0.0.1 | 127.0.0.3 |  324535
                          Request complete | 127.0.0.3 |  324812

Now even though the whole row was still in memory, the request took more than 300 milliseconds (all the numbers are from a 3-node ccm cluster running on a 2012 MacBook Air).

The whole row fit in memory, yet the request took more than 300 milliseconds. The test environment was a 3-node Cassandra cluster built with ccm on a 2012 MacBook Air.

Why did the query take so long to complete?

A slice query will keep reading columns until one of the following conditions is met (assuming regular, non-reverse order):

the specified limit of live columns has been read
a column beyond the finish column has been read (if specified)
all columns in the row have been read

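A slice query keeps reading columns until one of the following conditions is met:

the specified limit of live columns has been read
the finish column of the range has been reached (if one was specified)
all columns in the row have been read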
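The three stopping conditions listed above can be sketched as a small simulation. This is a toy model under stated assumptions, not Cassandra's read path: `Cell` and `slice_query` are hypothetical names, integers stand in for the `enqueued_at` clustering column, and each deleted message is counted as a single tombstone.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    name: int                 # stands in for the clustering column (enqueued_at)
    payload: Optional[bytes]  # None marks a tombstone


def slice_query(
    row: List[Cell], limit: int, finish: Optional[int] = None
) -> Tuple[List[Cell], List[Cell]]:
    """Scan a row in forward order; return (live cells, tombstones collected)."""
    live: List[Cell] = []
    tombstones: List[Cell] = []
    for cell in row:                      # cells are assumed sorted by name
        if finish is not None and cell.name > finish:
            break                         # a column beyond the finish column was read
        if cell.payload is None:
            tombstones.append(cell)       # tombstones are collected, not skipped for free
            continue
        live.append(cell)
        if len(live) >= limit:
            break                         # the limit of live columns was reached
    return live, tombstones               # leaving the loop normally: row exhausted


# 10000 messages enqueued, the first 9999 dequeued (deleted) one by one.
row = [Cell(i, None) for i in range(9999)] + [Cell(9999, b"0123456789")]
live, tombstones = slice_query(row, limit=1)
```

With `limit=1`, the scan in this model has to pass every tombstone before it finds the single live cell, which mirrors the behaviour seen in the trace.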

In the previous scenario Cassandra had to read 9999 tombstones (and create 9999 DeletedColumn objects) before it could get to the only live entry. And all the collected tombstones 1) were consuming heap and 2) had to be serialised and sent back to the coordinator node along with the single live column.

In the previous scenario, Cassandra had to read 9,999 tombstones (and create 9,999 DeletedColumn objects) to reach the only live entry. The collected tombstones also consumed heap and had to be serialised and sent back to the coordinator node together with the single live column.

For comparison, it took less than 1 millisecond for the same query to complete when no column-level tombstones were involved.

For comparison, the same query finished in under 1 millisecond when no tombstones were involved.

The queue example might be extreme, but you’ll see the same behaviour when performing slice queries on any row with lots of deleted columns. Also, expiring columns, while more subtle, are going to have the same effect on slice queries once they expire and become tombstones.

The queue example may be extreme, but you will see the same behaviour when running slice queries on any row that contains many deleted columns. Expiring columns that use a TTL behave the same way, because they turn into tombstones once they expire.

Potential workarounds

If you are seeing this pattern (have to read past many deleted columns before getting to the live ones), chances are that you got your data model wrong and must fix it.

For example, consider partitioning data with heavy churn rate into separate rows and deleting the entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore.

In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close ‘expiration date’ together and getting rid of them in a single move.
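One way to picture the grouping idea above is time bucketing: entries written close together share a partition, and a whole partition is dropped at once. The sketch below is a hypothetical in-memory model, not a Cassandra client. `BUCKET_SECONDS`, `bucket_key`, `enqueue`, and `drop_buckets_older_than` are illustrative names, and the one-hour bucket width is an arbitrary assumption.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

BUCKET_SECONDS = 3600  # hypothetical bucket width: one hour

PartitionKey = Tuple[str, int]
partitions: DefaultDict[PartitionKey, List[Tuple[float, bytes]]] = defaultdict(list)


def bucket_key(queue_name: str, enqueued_at: float) -> PartitionKey:
    """Entries enqueued within the same hour land in the same partition."""
    return (queue_name, int(enqueued_at // BUCKET_SECONDS))


def enqueue(queue_name: str, enqueued_at: float, payload: bytes) -> None:
    partitions[bucket_key(queue_name, enqueued_at)].append((enqueued_at, payload))


def drop_buckets_older_than(queue_name: str, cutoff: float) -> int:
    """Remove whole buckets that end before the cutoff; return how many were dropped."""
    cutoff_bucket = int(cutoff // BUCKET_SECONDS)
    stale = [k for k in partitions if k[0] == queue_name and k[1] < cutoff_bucket]
    for key in stale:
        del partitions[key]  # one whole-partition delete instead of many column deletes
    return len(stale)


enqueue("queue-1", 10, b"a")
enqueue("queue-1", 20, b"b")
enqueue("queue-1", 3700, b"c")
enqueue("queue-1", 7300, b"d")
dropped = drop_buckets_older_than("queue-1", cutoff=7200)
remaining = sorted(partitions)
```

In a real schema the bucket would presumably become part of the partition key, so that readers of the current bucket never scan past the deletions of older ones.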

When you know where your live columns begin

Note that it’s possible to improve on this hypothetical queue scenario. Specifically, a consumer that knows what the last entry was can specify the start column and thus somewhat mitigate the effect of tombstones, because the query no longer has to 1) start scanning at the beginning of the row or 2) collect and keep all the irrelevant tombstones in memory.

Note that this hypothetical queue scenario can be improved. In particular, a consumer that knows what the last entry was can specify the start column and thereby somewhat reduce the impact of tombstones.

To show what I mean, let’s modify the original example by using the previously consumed entry’s key as the start column for the query, i.e.

Run the slice query from the original example again, this time specifying a start column.

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
   AND enqueued_at > 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      45
                        Peparing statement | 127.0.0.3 |     329
             Sending message to /127.0.0.1 | 127.0.0.3 |     965
          Message received from /127.0.0.3 | 127.0.0.1 |      34
Executing single-partition query on queues | 127.0.0.1 |     339
              Acquiring sstable references | 127.0.0.1 |     355
                 Merging memtable contents | 127.0.0.1 |     461
 Partition index lookup over for sstable 3 | 127.0.0.1 |    1122
Merging data from memtables and 1 sstables | 127.0.0.1 |    2268
        Read 1 live and 0 tombstoned cells | 127.0.0.1 |    4404
          Message received from /127.0.0.1 | 127.0.0.3 |    6109
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |    4492
             Sending message to /127.0.0.3 | 127.0.0.1 |    4606
       Processing response from /127.0.0.1 | 127.0.0.3 |    6608
                          Request complete | 127.0.0.3 |    6901

Despite reading from disk this time, the complete request took 7 milliseconds. Specifying a start column allowed the scan to begin close to the actual live column and to skip collecting all the tombstones. The difference grows larger as the size of the row increases.

Despite reading from disk this time, the request completed in 7 milliseconds. Specifying a start column lets the scan begin near the actual live column, which makes it possible to skip collecting all the tombstones. The difference grows as the row gets larger.
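The effect of a start column can be sketched by seeking into a sorted row before scanning. This is a toy model, not how Cassandra locates columns on disk: `slice_from` is a hypothetical helper, integers stand in for timeuuid values, and a binary search stands in for the index lookup.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def slice_from(
    names: List[int], payloads: List[Optional[bytes]], start: int, limit: int
) -> Tuple[List[Tuple[int, bytes]], int]:
    """Return (live cells, tombstones scanned) for cells with name > start.

    names must be sorted; payloads[i] is None when cell i is a tombstone.
    """
    i = bisect_right(names, start)        # first cell strictly after the start column
    live: List[Tuple[int, bytes]] = []
    scanned = 0
    while i < len(names) and len(live) < limit:
        payload = payloads[i]
        if payload is None:
            scanned += 1                  # a tombstone the query still has to read
        else:
            live.append((names[i], payload))
        i += 1
    return live, scanned


names = list(range(10000))
payloads: List[Optional[bytes]] = [None] * 9999 + [b"0123456789"]

# No useful hint: the scan starts at the beginning of the row.
live_all, scanned_all = slice_from(names, payloads, start=-1, limit=1)

# Hint with the previously consumed entry: the scan starts next to the live cell.
live_hint, scanned_hint = slice_from(names, payloads, start=9998, limit=1)
```

Both calls return the same live cell in this model; only the number of tombstones read along the way differs.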

Summary

Lots of deleted columns (also expiring columns) and slice queries don’t play well together. If you observe this pattern in your cluster, you should correct your data model. If you know where your live data begins, hint Cassandra with a start column to reduce the scan time and the number of tombstones to collect. Do not use Cassandra to implement a durable queue.

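Many deleted columns (and expired columns) slow down slice queries. If this pattern shows up, you should fix your data model. If you know where your live data begins, give Cassandra a start column as a hint to reduce the number of tombstones collected and shorten the scan. Do not use Cassandra to implement a durable queue.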

Related reading

Is it possible to use a cassandra table as a basic queue
http://stackoverflow.com/questions/17945924/is-it-possible-to-use-a-cassandra-table-as-a-basic-queue
Safety valve on number of tombstones skipped on read path to prevent a full heap
https://issues.apache.org/jira/browse/CASSANDRA-5143
Avoid death-by-tombstone by default
https://issues.apache.org/jira/browse/CASSANDRA-6117
Message Queue
https://github.com/Netflix/astyanax/wiki/Message-Queue
