
A comparison summary of data warehouses (DWH)

Posted at 2021-06-20

Feature summary

	Redshift	BigQuery	Snowflake
Billing factors	node type (ds2 / dc2 / RA3; avoid d*1 node types), number of nodes, reservations	storage (~2 cents per GB / month for warm data, 1 cent per GB / month for colder data), bytes scanned (on demand), slots (fixed / flex) and streaming inserts	the number of warehouses, the size of these warehouses and capacity storage
Best fit (in short)	Running hundreds or thousands of queries per day	On-demand query processing; users who do heavy data mining, or whose processing activity spikes on particular days	Users who need steady run time, or who run hundreds or thousands of data-heavy queries every day
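
To make the BigQuery billing factors concrete, here is a minimal sketch of a monthly estimate. The storage rates are the approximate figures quoted in the table; the function name and the `on_demand_usd_per_tb` parameter are hypothetical, and the on-demand rate must be supplied by the caller because the table does not state it. Slot and streaming-insert charges are left out.

```python
# Approximate storage rates from the table above (USD per GB per month).
WARM_USD_PER_GB = 0.02   # "warm" data
COLD_USD_PER_GB = 0.01   # "colder" data


def bigquery_monthly_cost(warm_gb, cold_gb, scanned_tb=0.0, on_demand_usd_per_tb=0.0):
    """Rough monthly estimate: storage plus on-demand scan charges.

    on_demand_usd_per_tb is an assumption supplied by the caller;
    check the current price list before relying on it.
    """
    storage = warm_gb * WARM_USD_PER_GB + cold_gb * COLD_USD_PER_GB
    scans = scanned_tb * on_demand_usd_per_tb
    return round(storage + scans, 2)


# Storage only: 500 GB warm + 2,000 GB cold.
print(bigquery_monthly_cost(500, 2000))  # 30.0

# Same storage plus 10 TB scanned at a placeholder rate of 5 USD per TB.
print(bigquery_monthly_cost(500, 2000, scanned_tb=10, on_demand_usd_per_tb=5.0))  # 80.0
```

The point of the sketch is that storage is usually the small term; with on-demand pricing the bill is driven by how many bytes the queries scan.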

Best-fit scenarios

Redshift

The capacity to run hundreds or thousands of queries per day
Best applied to scenarios that require constant computation, for example:
- NASDAQ daily reporting: Time-sensitive workload for data reporting
- Automated ad-bidding: Bids across certain ad networks are adjusted via predictive models on top of Redshift on a near real-time basis
- Live dashboards: Having live data streaming with continuous querying via refreshing

BigQuery

On-demand query processing
A few jobs per day
Best applied to scenarios with spiky workloads (i.e. you’re running lots of queries occasionally, with high idle time)
- Recommendation models: Run once a day for e-commerce applications
- Ad-hoc reporting: Occasional complex queries for a quarterly report
- Sales intelligence: For sales or marketing teams to make ad-hoc discoveries by analysing the data in any way they wish
- Machine learning: To discover new patterns in the data, especially in consumer behaviour

Snowflake

Workloads that need steady run time, or that run hundreds or thousands of data-heavy queries every day
Best applied to steadier, more continuous usage patterns that still require constant upscaling and downscaling
- Business Intelligence companies: Many concurrent users (100s to 1,000s) querying the data at the same time to discover a pattern in the data
- Providing data as a service: Giving thousands of clients access to your data for analysis purposes in the form of an analytics user interface or data APIs

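In summary, the following split seems about right:

BigQuery: workloads with idle time and occasional bursts of high load
Redshift/Snowflake: workloads with a constantly large number of queries to run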
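
The rule of thumb above can be written down as a tiny helper. This is only a sketch: the function name, the `mostly_idle` flag and the threshold of 100 queries per day (a stand-in for "hundreds") are assumptions made for illustration, not measured cut-offs.

```python
def suggest_warehouses(queries_per_day, mostly_idle):
    """Return candidate warehouses for a workload, per the summary above.

    mostly_idle: True when the system sits idle most of the time and
    load arrives in occasional bursts.
    The 100-queries-per-day threshold is an illustrative assumption.
    """
    if mostly_idle or queries_per_day < 100:
        return ["BigQuery"]
    return ["Redshift", "Snowflake"]


print(suggest_warehouses(5, mostly_idle=True))      # ['BigQuery']
print(suggest_warehouses(2000, mostly_idle=False))  # ['Redshift', 'Snowflake']
```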

How do Redshift and Snowflake differ?

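Fundamentally, Redshift instances bundle compute and storage together (ds2 and dc2; RA3 can reportedly separate them), whereas Snowflake provides compute and storage separately
- Redshift resize operations can quickly become very expensive and can lead to significant downtime. In Snowflake, compute and storage are separated, so there is no need to copy data in order to scale up or down. You can switch compute capacity freely.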
- Redshift: Elastic resizes generally complete quite quickly, in roughly 3 minutes (about the time it takes light to travel from Mars to Earth at its closest), with the caveat that you can only scale up or down by a factor of 2.
- Snowflake: It is fast to switch between warehouse sizes (e.g., small (S) => medium (M)), which roughly correlate to the number of vCPUs / memory you get, and to change the cluster count manually if required. Scaling policies can adjust the number of clusters automatically according to running workloads, in either standard or economy mode. It is a particularly elegant scaling mechanism.
Redshift takes more hands-on work (e.g., you have to handle compression yourself), whereas Snowflake is almost entirely automated
Pause and resume
- Redshift: At the moment there is no smart sleep / smart resume functionality based on workloads.
- Snowflake: Pause, resume semantics (both manual and automated based on workload).
Security
- Snowflake: Security depends on the options you choose (in some cases it may not meet your requirements)
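
To illustrate the Snowflake resize, scaling-policy and pause/resume points above, here is a hedged sketch that builds the corresponding `ALTER WAREHOUSE` statements as strings. The helper names and the warehouse name `analytics_wh` are hypothetical; the statements use the `WAREHOUSE_SIZE`, `MIN_CLUSTER_COUNT`, `MAX_CLUSTER_COUNT`, `SCALING_POLICY`, `AUTO_SUSPEND` and `AUTO_RESUME` warehouse properties, and whether multi-cluster settings are available depends on your Snowflake edition.

```python
SIZES = ("XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE")
POLICIES = ("STANDARD", "ECONOMY")


def resize_sql(warehouse, size):
    """Statement that switches a warehouse to another size (e.g. S => M)."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = {size}"


def autoscale_sql(warehouse, min_clusters, max_clusters, policy="STANDARD"):
    """Statement that lets Snowflake vary the cluster count with the workload."""
    if policy not in POLICIES:
        raise ValueError(f"unknown scaling policy: {policy}")
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("need 1 <= min_clusters <= max_clusters")
    return (
        f"ALTER WAREHOUSE {warehouse} SET "
        f"MIN_CLUSTER_COUNT = {min_clusters} "
        f"MAX_CLUSTER_COUNT = {max_clusters} "
        f"SCALING_POLICY = {policy}"
    )


def auto_pause_sql(warehouse, idle_seconds):
    """Statement that suspends after idle_seconds and resumes on the next query."""
    return (
        f"ALTER WAREHOUSE {warehouse} SET "
        f"AUTO_SUSPEND = {idle_seconds} AUTO_RESUME = TRUE"
    )


# 'analytics_wh' is a made-up warehouse name used only for illustration.
for statement in (
    resize_sql("analytics_wh", "MEDIUM"),
    autoscale_sql("analytics_wh", 1, 4, policy="ECONOMY"),
    auto_pause_sql("analytics_wh", 300),
):
    print(statement)
```

None of these statements copies data, which is the practical consequence of compute and storage being separate; run them through whichever Snowflake client you already use.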

Problems I noticed while using Redshift

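Compute and storage have to be scaled together.
- Scaling up therefore requires expensive instances. Options that address this:
- RA3 instances
- Snowflake, which separates storage from compute
- BigQuery, where storage is unlimited (though billed)
It is extremely slow at times.
Once the load exceeds its processing capacity, it becomes painfully slow.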

Other

Redshift lets you check usage status.

