GCP: Dataproc

Posted at 2022-09-25

Overview

Hadoop / Spark / Flink / Presto ...
Custom images can be created
Clusters can be updated
- Number of worker nodes
- Graceful decommissioning: remove nodes only after their running jobs finish
- Labels
Vertex AI notebooks can be launched
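
The cluster update options listed above correspond to flags of `gcloud dataproc clusters update`. A minimal sketch that assembles such a command as an argument list follows. The helper name `build_update_command`, the cluster name, the region, and the values are hypothetical placeholders, not taken from the original notes.

```python
def build_update_command(cluster, region, num_workers=None,
                         graceful_timeout=None, labels=None):
    """Assemble a `gcloud dataproc clusters update` argument list.

    Only the options that are passed in are added to the command.
    """
    cmd = ["gcloud", "dataproc", "clusters", "update", cluster,
           f"--region={region}"]
    if num_workers is not None:
        # Resize the primary worker group.
        cmd.append(f"--num-workers={num_workers}")
    if graceful_timeout is not None:
        # Give running work time to finish before nodes are removed.
        cmd.append(f"--graceful-decommission-timeout={graceful_timeout}")
    if labels:
        pairs = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        cmd.append(f"--update-labels={pairs}")
    return cmd


# Hypothetical values for illustration only.
cmd = build_update_command(
    "example-cluster", "us-central1",
    num_workers=2, graceful_timeout="1h", labels={"team": "research"},
)
print(" ".join(cmd))
```

Building the command as a list (rather than one string) makes it safe to pass to `subprocess.run` without shell quoting issues.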

Best Practice

Autoscaling
Scale only the secondary workers: they do not run HDFS DataNodes
Use a long graceful decommission timeout
Preemptible VMs: cheap, use only for fault-tolerant workloads
Use GCS in the same region / use HDFS for I/O-heavy workloads
- Local SSD is much faster than persistent disk
Enhanced Flexibility Mode
- Primary worker shuffle: mappers write shuffle data to primary workers
- HCFS shuffle: mappers write shuffle data to an HCFS implementation (HDFS by default)
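
The first three practices above (autoscale, scale only the secondary workers, long graceful decommission) can be sketched as an autoscaling policy body. The field names follow the Dataproc autoscaling policy resource as I understand it; the function name and all numeric values are assumptions chosen for illustration, not recommendations from the original notes.

```python
def secondary_only_policy(primary_workers, max_secondary,
                          graceful_timeout="3600s"):
    """Build an autoscaling policy that keeps primary workers fixed.

    Primary workers hold HDFS data, so min == max pins their count;
    only the secondary worker group is allowed to grow and shrink.
    """
    return {
        "workerConfig": {
            "minInstances": primary_workers,
            "maxInstances": primary_workers,
        },
        "secondaryWorkerConfig": {
            "minInstances": 0,
            "maxInstances": max_secondary,
        },
        "basicAlgorithm": {
            "cooldownPeriod": "240s",  # assumed value
            "yarnConfig": {
                "scaleUpFactor": 1.0,    # assumed value
                "scaleDownFactor": 1.0,  # assumed value
                # A long timeout lets running work finish before removal.
                "gracefulDecommissionTimeout": graceful_timeout,
            },
        },
    }


policy = secondary_only_policy(primary_workers=2, max_secondary=20)
print(policy["secondaryWorkerConfig"])
```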

Machine Type

Predefined
- General purpose
  - N1, N2, E2
- Compute optimized
  - C2
- Memory optimized
  - M1, M2
Custom
GPU
- NVIDIA Tesla

Data

BigQuery
GCS
Bigtable
Cloud Logging
Cloud Monitoring
Pub/Sub Lite

HDFS or GCS

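Characteristics of using GCS

I/O increases
No append or truncate support
File system information is not available
Latency increases
Storage cost decreases
Data persists independently of the cluster
Data can be shared with other services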
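
In practice, switching a job from HDFS to GCS is mostly a matter of changing the path prefix from `hdfs://` to `gs://`. A minimal sketch of that rewrite, assuming the directory layout is kept as-is under a single bucket; the function name `to_gcs_uri`, the NameNode address, and the bucket name are hypothetical.

```python
from urllib.parse import urlsplit


def to_gcs_uri(hdfs_uri, bucket):
    """Map an hdfs:// URI to a gs:// URI, keeping the path unchanged."""
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    # The NameNode host and port are dropped; the bucket takes their place.
    return f"gs://{bucket}{parts.path}"


gcs_uri = to_gcs_uri("hdfs://namenode:8020/data/events/2022/", "example-bucket")
print(gcs_uri)
```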

Metastore

Hive metastore
Centralized metadata repository
Builds a unified view of the data
Migration from MySQL is also possible

Labels

Team or cost center labels: Add labels based on team or cost center to distinguish Dataproc clusters and jobs owned by different teams (for example, team:research and team:analytics). You can use this type of label for cost accounting or budgeting.
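
A minimal sketch of turning such labels into a `--labels` flag for `gcloud dataproc clusters create` or `gcloud dataproc jobs submit`. The validation is a simplified, ASCII-only approximation of the label rules (lowercase letters, digits, underscores, dashes, at most 63 characters, keys starting with a letter); the helper names are hypothetical.

```python
import re

# Simplified approximation of the label rules; assumed, not exhaustive.
_KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
_VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def validate_labels(labels):
    """Raise ValueError if any key or value breaks the simplified rules."""
    for key, value in labels.items():
        if not _KEY_RE.fullmatch(key):
            raise ValueError(f"invalid label key: {key!r}")
        if not _VALUE_RE.fullmatch(value):
            raise ValueError(f"invalid label value: {value!r}")
    return labels


def to_labels_flag(labels):
    """Render validated labels as a single --labels=k=v,... flag."""
    pairs = sorted(validate_labels(labels).items())
    return "--labels=" + ",".join(f"{k}={v}" for k, v in pairs)


flag = to_labels_flag({"team": "research", "cost-center": "cc-1234"})
print(flag)
```

Note that the flag uses `key=value`, while the text above writes labels in the `key:value` form.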

Migration from on-premises

https://cloud.google.com/architecture/hadoop?hl=ja
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc?hl=ja

Design small, short-lived (ephemeral) clusters
Move the data to GCS
- hadoop distcp
- pull / push model
Test at small scale and run a PoC
HBase → Bigtable, Impala → BigQuery
- Create a snapshot, export it to GCS, import it into Bigtable
- Dataflow can also be used
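
The data move with `hadoop distcp` can be planned one directory at a time, which also suits the "test at small scale first" step. A minimal sketch that generates one command per source directory, assuming the Cloud Storage connector is available on the cluster that runs the copy; the function name, NameNode address, directories, and bucket are hypothetical.

```python
def plan_distcp(namenode, directories, bucket):
    """Return one `hadoop distcp` argument list per HDFS directory.

    Each directory is copied to the same path under the target bucket.
    """
    commands = []
    for directory in directories:
        path = "/" + directory.strip("/")
        source = f"hdfs://{namenode}{path}"
        target = f"gs://{bucket}{path}"
        commands.append(["hadoop", "distcp", source, target])
    return commands


plan = plan_distcp("namenode:8020", ["/data/events", "logs/2022/"], "example-bucket")
for command in plan:
    print(" ".join(command))
```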

Security

https://cloud.google.com/architecture/hadoop/hadoop-migration-security-guide
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/confidential-compute

Dataproc: managed by GCP
GCE: managed by the user
Authentication: service account / user account
Authorization: IAM / Service account with access scopes
Cloud Audit Logs: who did what, where and when?
Confidential VMs provide in-memory encryption
CMEK can encrypt Persistent Disk and GCS data

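Workflow templates

Reusable workflow configuration
DAG
Managed cluster: ephemeral
Cluster selector: targets an existing cluster
Integration with Composer, Functions, and Scheduler
- Can be scheduled
- Fire and Forget
Parameterizable fields
- Labels, cluster name, job arguments, script variables, instance count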
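
A minimal sketch of what such a template can look like as a plain dictionary: a managed (ephemeral) cluster, three jobs forming a DAG through `prerequisiteStepIds`, and one parameterized field. The field names follow the workflow template resource as I understand it; the template ID, step IDs, bucket paths, and the `execution_order` helper are hypothetical. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# All names and URIs below are placeholders for illustration.
template = {
    "id": "example-template",
    "placement": {
        "managedCluster": {"clusterName": "ephemeral-cluster", "config": {}},
    },
    "jobs": [
        {"stepId": "ingest",
         "pysparkJob": {"mainPythonFileUri": "gs://example-bucket/ingest.py"}},
        {"stepId": "transform",
         "pysparkJob": {"mainPythonFileUri": "gs://example-bucket/transform.py"},
         "prerequisiteStepIds": ["ingest"]},
        {"stepId": "report",
         "pysparkJob": {"mainPythonFileUri": "gs://example-bucket/report.py"},
         "prerequisiteStepIds": ["transform"]},
    ],
    "parameters": [
        {"name": "CLUSTER_NAME",
         "fields": ["placement.managedCluster.clusterName"]},
    ],
}


def execution_order(template):
    """Return the step IDs in an order that respects the DAG."""
    graph = {
        job["stepId"]: set(job.get("prerequisiteStepIds", []))
        for job in template["jobs"]
    }
    # Raises graphlib.CycleError if the steps do not form a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(template)
print(order)
```

Checking the order locally is a cheap way to catch a cyclic or misspelled prerequisite before the template is submitted.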

Serverless deployment

Autoscaling
on Kubernetes / GCE / Serverless

Dataproc on Kubernetes

Manage YARN and Kubernetes together
Focus on analytics in a containerized environment
GKE makes it easy to monitor the health of containers and jobs

ML

BigQuery connector for Spark
Spark ML

Price

on GCE
- size of cluster: vCPUs
- duration: billed per second, 1-minute minimum
- Persistent Disk
on GKE
- vCPU
- duration
- node pools that stay alive continue to incur cost
Serverless
- number of Data Compute Units (DCUs): 1 DCU = 1 vCPU + 4 GB RAM
- storage
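
The on-GCE rule above (vCPUs × duration, billed per second with a 1-minute minimum) can be sketched as follows. The rate is a caller-supplied parameter, and the value used in the example is a placeholder, not a quoted price; the function names are hypothetical. The sketch covers only the Dataproc fee, so Compute Engine and Persistent Disk charges would come on top.

```python
def billable_seconds(runtime_seconds, minimum=60):
    """Apply the 1-minute minimum to a per-second billed runtime."""
    return max(runtime_seconds, minimum)


def dataproc_fee(total_vcpus, runtime_seconds, rate_per_vcpu_hour):
    """Estimate the Dataproc fee for a cluster on GCE.

    total_vcpus: sum of vCPUs across all nodes in the cluster.
    rate_per_vcpu_hour: price per vCPU-hour, supplied by the caller.
    """
    hours = billable_seconds(runtime_seconds) / 3600
    return total_vcpus * hours * rate_per_vcpu_hour


# Placeholder rate; check the pricing page for the real value.
EXAMPLE_RATE = 0.01

# 3 nodes x 4 vCPUs running for 30 seconds are billed as 60 seconds.
short_run = dataproc_fee(12, 30, EXAMPLE_RATE)
one_minute = dataproc_fee(12, 60, EXAMPLE_RATE)
print(short_run == one_minute)
```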

Closing

That's all from the field.
