
Connecting to Google Cloud Storage from Spark on EMR

Basically you just follow the official setup instructions, but on EMR, settings such as core-site.xml are changed by passing configuration parameters at cluster creation time.

Also, this time I placed the gcs-connector JAR and the credentials on the nodes with a bootstrap action.

Below are my notes on the procedure.

Environment

  • emr-5.4.0
  • Spark 2.1.0
  • Hadoop 2.7.3

1. Prepare the application

Build the app with sbt assembly and put the resulting JAR at s3://<your-bucket>/path/to/app.jar (example commands below).

The sample app used here is a Spark job with main class com.example.Main; judging from the spark-submit step in section 4, it takes a numeric argument and a gs:// output path as its arguments.
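For example (a minimal sketch; the scala-2.11 path and JAR name depend on your build settings and are assumptions):

# Build a fat JAR with the sbt-assembly plugin
sbt assembly

# Upload it to S3 where EMR can fetch it (path is an assumption)
aws s3 cp target/scala-2.11/app.jar s3://<your-bucket>/path/to/app.jar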

2. Prepare the credentials

  • Create a service account from the Google Cloud Platform console and download its key in p12 format
    • The JSON key format doesn't seem to work here
  • Put the key at s3://<your-bucket>/path/to/credentials.p12 (example commands after this list)
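The same can also be done from the command line; a minimal sketch, assuming the gcloud and aws CLIs are configured (the service account name spark-gcs is an arbitrary example):

# Create a service account (the name is a placeholder)
gcloud iam service-accounts create spark-gcs --display-name "spark-gcs"

# Download its key in p12 format
gcloud iam service-accounts keys create credentials.p12 \
  --iam-account spark-gcs@<your-gcp-project-id>.iam.gserviceaccount.com \
  --key-file-type p12

# Grant the service account read/write access to the target bucket
gsutil iam ch \
  serviceAccount:spark-gcs@<your-gcp-project-id>.iam.gserviceaccount.com:objectAdmin \
  gs://<your-gcs-bucket>

# Put the key where the bootstrap action expects it
aws s3 cp credentials.p12 s3://<your-bucket>/path/to/credentials.p12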

3. Launch the EMR cluster

Set up the bootstrap action and configuration below, then launch the cluster (an aws emr create-cluster sketch follows the configuration).

bootstrap action

bootstrap.sh
#!/bin/bash

cd ~
# Download the GCS connector JAR for Hadoop 2
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
# Fetch the service account key from S3 onto the node
aws s3 cp s3://<your-bucket>/path/to/credentials.p12 .

configuration

configuration.json
[
    {
        "Classification": "core-site",
        "Properties": {
            "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
            "google.cloud.auth.service.account.enable": "true",
            "google.cloud.auth.service.account.email": "example@<your-account>",
            "google.cloud.auth.service.account.keyfile": "/home/hadoop/credentials.p12",
            "fs.gs.project.id": "<your-gcp-project-id>"
        }
    },
    {
        "Classification": "hadoop-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "HADOOP_CLASSPATH": "\"$HADOOP_CLASSPATH:/home/hadoop/gcs-connector-latest-hadoop2.jar\""
                }
            }
        ]
    }
]
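For example, with bootstrap.sh uploaded to S3 and configuration.json saved locally, the cluster can be launched roughly like this (a sketch; instance types, counts, and names are placeholders):

aws emr create-cluster \
  --name "spark-gcs-test" \
  --release-label emr-5.4.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=<your-key-pair> \
  --bootstrap-actions Path=s3://<your-bucket>/path/to/bootstrap.sh \
  --configurations file://./configuration.json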

4. Run the job

Add a custom JAR step like the following (an equivalent AWS CLI invocation is sketched after the list):

  • jar:

command-runner.jar

  • Arguments:
spark-submit --deploy-mode cluster \
  --class com.example.Main \
  --master yarn \
  --conf spark.jars=/home/hadoop/gcs-connector-latest-hadoop2.jar \
  s3://<your-bucket>/path/to/app.jar \
  100 gs://<your-gcs-bucket>/output
  • Confirm that the results have been written to GCS
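For reference, the same step can be added from the AWS CLI roughly like this (a sketch; the cluster ID and step name are placeholders):

aws emr add-steps \
  --cluster-id <your-cluster-id> \
  --steps 'Type=CUSTOM_JAR,Name=spark-gcs,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[spark-submit,--deploy-mode,cluster,--class,com.example.Main,--master,yarn,--conf,spark.jars=/home/hadoop/gcs-connector-latest-hadoop2.jar,s3://<your-bucket>/path/to/app.jar,100,gs://<your-gcs-bucket>/output]'

# Check the output on the GCS side
gsutil ls gs://<your-gcs-bucket>/output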