More than 5 years have passed since last update.

Spark + Cassandraのfilter, where句の使い方メモ

Last updated at 2016-01-08Posted at 2016-01-06

いまだクリアーになっていない点は

=, <= でrange filterを効率的にできるかどうか？
columnがpartition key, clustering keyで挙動が違うがその差異を完全に理解できてない
さらに、cqlshでの場合と、sparkで処理させた場合の違い、spark上でのpartitionがどうなるかが完全理解できてない。

いまの解は、

datetime(e.g. 2016010501)をdate, hourで作って、partition keyにする。
sparkで引くときは、

sc.CassandraTable("select * from table datetime = 2016010500")

みたいにして、dailyデータは24回クエリーして、unionしてrepartitionしている。

in, >=, <=でフィルタリングしていが、その場合、全なめして遅くて使えない。

SparkSQLの場合

全件とってカウント

val cc = new CassandraSQLContext(sc)
val df = cc.sql("select * from table")
df.show
df.count

しても、

where句でfilterしても

val cc = new CassandraSQLContext(sc)
val df = cc.sql("select * from table where datetime >= '2016-01-01 00:00:00' and datetime <= '2016-01-01 23:59:59'")
df.show
df.count

もpartitions数は同じになって謎
ここで、datetimeはpartition key

CassandraTableの場合

sc.cassandraTable("select * from table datetime in (2016010500, ..., 2016010523)")などとして、rddをとって#partitions=1となって使えない。
0 -23で取得して、unionして、repartitionしてするなどして対応

なぞなぞ、勘違いメモ

partition keyの不等号、range filtertokenを使うが当然、不等号は変な挙動になる。
　tokenはハッシュ値なので当たり前。。。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up