GoogleCloudで汎用Database構築2 - DataFlow1 -

Last updated at 2016-10-31Posted at 2016-10-28

GYAOのtsです。
我々のチームは、オールパブリッククラウドで、Microservice Architectureを採用した次期バックエンドを設計中です。

経緯

前回の投稿で、目指すゴールはだいたい決まったので、それぞれのサブシステムについて、チーム内でプロトタイピングを行っていく。
今回はDataFlowをやってみる。

前提

いくつかCloudDataFlowの事前知識としてまとめておいたほうが良いものがあったので、まとめておく。
※順次増やします。

概要

CloudDataFlowは下記のようなpipeline（route）を作成する。実装は、現時点ではjava SDKとPython SDK（β）が提供されている。
Pipelineオブジェクトに対して、transforms（データ変換）をapplyしていく。input、outputに関しても、BigQueryやDatastore等、SDKによって提供される。もちろん拡張も可能。PipelineIOに関してはこちらを参照。

Source: https://cloud.google.com/dataflow/pipelines/design-principles#a-basic-pipeline
どことなくApache Camelを思い出す。。。

deploy

EclipseやMavenで作成した成果物をCloudStorageに転送する。
CloudStorageにjarファイル等の成果物が転送されるとほぼ同時に、GCEのインスタンスが自動で作成され、
そこにデプロイされる模様。今回はこちらを参考にEclipseで構築する。

Runner

定義したパイプラインの起動mode（Runner）としていくつか提供されている。

PipelineRunner	概要
DataflowPipelineRunner	パイプラインは、Googleのクラウド上で非同期に実行される。ジョブの進捗状況を監視することが可能。
BlockingDataflowPipelineRunner	DataflowPipelineRunnerと同じように、クラウド上で実行されるが、立ち上げたジョブが終了するのを待ちます。ジョブを実行しながら、待っている間、ジョブステータスの更新とコンソールメッセージを出力する。
DirectPipelineRunner	local execution, not in Google Cloud. ローカルマシンからの実行（クラウド上で実行しない）

Source: https://cloud.google.com/dataflow/pipelines/specifying-exec-params#asynchronous-execution

bounded & unbounded

上述のようにデータはPCollectionオブジェクトでやり取りするのだが、2種類の形式が存在する。

PCollection	概要	in,out
bounded	有限リソース	TextIO,BigQueryIO,DatastoreIO
unbounded	断続的リソース（stream）	PubsubIO,BigQueryIO

Source: https://cloud.google.com/dataflow/model/pcollection#bounded-and-unbounded-pcollections

今回は in（PubsubIO）-out(DatastoreIO)なので、streamデータをある程度の塊で扱わなければならない。
ソリューションとしてWindowingが提供されているのでそれを使用する。

Streaming Autoscaling

2016/10現在、

Beta
This is a Beta release of Streaming Autoscaling. This feature might be changed in backward-incompatible ways and is not subject to any SLA or deprecation policy.

Streaming Autoscalingに関してはβだ。。。うーむ。。。

The number of workers used for a streaming autoscaling pipeline ranges between N/15 and N workers where N is the value of --maxNumWorkers. For example, if your pipeline needs 3 or 4 workers in steady state, you could set --maxNumWorkers=15 and the pipeline will automatically scale between 1 and 15 workers.
--maxNumWorkers can be 1000 at most.

worker 3 or 4で --maxNumWorkers=15を指定する例が提示されている。実際に処理をしていく上でパフォチューが必要そう。

Source: https://cloud.google.com/dataflow/faq

Cloud Pub/Subのメッセージ順序性（余談）

途中で気付いたが、Cloud Pub/Subはメッセージの順序性を一切担保しない。というか、下記を見る限り順序性って必要？というスタンスっぽい。そこまでは言っていないが、要は拡張性、高可用性とトレードオフですよと。

なるほど。おっしゃる通り。Apache kafkaと同じ勢いでこんな感じで設計したが、考え直さなければ。。。
やっぱり皆でいじっていくことは必要ですね。

Message Ordering

次回

予備知識が相当必要ですね。自分のロースキルが悩ましい。。。
次回やっとハンズオン！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up