Connecting to Google Cloud Storage from Spark on EMR

Posted at 2017-04-14

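For the most part you just follow the official instructions, but on EMR you change files such as core-site.xml by passing configuration parameters when the cluster is created.

In addition, this time I placed the gcs-connector and the credentials on the nodes with a bootstrap action.

My notes on the procedure follow.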

Environment

emr-5.4.0
Spark 2.1.0
Hadoop 2.7.3

1. Preparing the app

Build the app with sbt assembly and put the resulting jar at s3://<your-bucket>/path/to/app.jar.

This walkthrough uses the following sample app:

https://github.com/ocadaruma/spark-gcloud-example

2. Preparing the credentials

Create a service account in the Google Cloud Platform console and download a key in p12 format.
- The JSON format does not seem to be usable.
Put the key at s3://<your-bucket>/path/to/credentials.p12.

3. Launching the EMR cluster

Set up the bootstrap action and the configuration, then launch the cluster.

bootstrap action

bootstrap.sh

#!/bin/bash

cd ~
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
aws s3 cp s3://<your-bucket>/path/to/credentials.p12 .
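
As a possible hardening of the bootstrap action above, the sketch below fails early when a file did not arrive. `check_files` is a hypothetical helper name, not something EMR provides; the file names in the usage note are the ones bootstrap.sh downloads.

```shell
#!/bin/bash
# Sketch: return non-zero if any of the given files is missing or empty.
# check_files is a hypothetical helper, not part of EMR or the AWS CLI.
check_files() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
  return 0
}
```

In bootstrap.sh it could be called after the downloads as `check_files ~/gcs-connector-latest-hadoop2.jar ~/credentials.p12 || exit 1`. A bootstrap action that exits non-zero should make the cluster launch fail, which is easier to notice than a missing class at job time.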

configuration

configuration.json

[
    {
        "Classification": "core-site",
        "Properties": {
            "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
            "google.cloud.auth.service.account.enable": "true",
            "google.cloud.auth.service.account.email": "example@<your-account>",
            "google.cloud.auth.service.account.keyfile": "/home/hadoop/credentials.p12",
            "fs.gs.project.id": "<your-gcp-project-id>"
        }
    },
    {
        "Classification": "hadoop-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "HADOOP_CLASSPATH": "\"$HADOOP_CLASSPATH:/home/hadoop/gcs-connector-latest-hadoop2.jar\""
                }
            }
        ]
    }
]
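
One way to launch the cluster with these two pieces is the AWS CLI. The following is a minimal sketch, not the exact command used for this article: the cluster name, instance type, instance count, use of the default EMR roles, and the S3 location of bootstrap.sh are all assumptions to adapt. `create_cluster` and the `AWS_CLI` override are hypothetical names introduced only so the command can be printed without calling AWS.

```shell
#!/bin/bash
# Sketch: create the cluster with the bootstrap action and configuration.json.
# Assumptions (not from the article): name, instance type/count, default roles,
# and where bootstrap.sh was uploaded.
# Set AWS_CLI=echo to print the command instead of running it.
create_cluster() {
  bootstrap_uri="$1"   # e.g. s3://<your-bucket>/path/to/bootstrap.sh
  ${AWS_CLI:-aws} emr create-cluster \
    --name "spark-gcs-example" \
    --release-label emr-5.4.0 \
    --applications Name=Spark \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions "Path=$bootstrap_uri" \
    --configurations file://configuration.json
}
```

Calling `AWS_CLI=echo create_cluster "s3://<your-bucket>/path/to/bootstrap.sh"` from the directory that holds configuration.json prints the full command for review before running it for real.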

4. Running the job

Add a custom JAR step like the following:

jar:

command-runner.jar

Arguments:

spark-submit --deploy-mode cluster \
  --class com.example.Main \
  --master yarn \
  --conf spark.jars=/home/hadoop/gcs-connector-latest-hadoop2.jar \
  s3://<your-bucket>/path/to/app.jar \
  100 gs://<your-gcs-bucket>/output
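
The same step can probably be added from the AWS CLI instead of the console. This is a sketch under assumptions: the step name is made up, `add_spark_step` and the `AWS_CLI` override are hypothetical helpers, and the cluster ID must be supplied by you. It passes `--master yarn` together with `--deploy-mode cluster`, since Spark 2.x reports the `yarn-cluster` master as deprecated.

```shell
#!/bin/bash
# Sketch: submit the spark-submit step through command-runner.jar.
# The step name is an assumption; set AWS_CLI=echo for a dry run.
add_spark_step() {
  cluster_id="$1"   # ID of the running cluster (j-...)
  args="spark-submit,--deploy-mode,cluster"
  args="$args,--class,com.example.Main,--master,yarn"
  args="$args,--conf,spark.jars=/home/hadoop/gcs-connector-latest-hadoop2.jar"
  args="$args,s3://<your-bucket>/path/to/app.jar"
  args="$args,100,gs://<your-gcs-bucket>/output"
  ${AWS_CLI:-aws} emr add-steps \
    --cluster-id "$cluster_id" \
    --steps "Type=CUSTOM_JAR,Name=spark-gcs-example,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[$args]"
}
```

If the shorthand syntax mis-parses one of the values, the step definition can instead be written as JSON and passed with `--steps file://...`.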

Check that the results have been written to GCS.
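
One way to check, assuming you are logged in to the master node (where the connector jar and core-site settings from step 3 should also apply), is to list the output path with the Hadoop CLI. `list_gcs_output` and the `HADOOP_CMD` override are hypothetical names used only for this sketch.

```shell
#!/bin/bash
# Sketch: list the job output through the gcs-connector.
# Set HADOOP_CMD=echo to print the command instead of running it.
list_gcs_output() {
  out_dir="$1"   # e.g. gs://<your-gcs-bucket>/output
  ${HADOOP_CMD:-hadoop} fs -ls "$out_dir"
}
```

For example, `list_gcs_output "gs://<your-gcs-bucket>/output"` should show the files the job wrote if the connector is configured correctly.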
