
Glue usage notes ㉗ (Using a Jupyter Notebook for Glue development)

Last updated at 2018-10-10 / Posted at 2018-10-10

Integrating with a SageMaker Jupyter Notebook

Overall flow

Create the Glue development endpoint
Create the IAM policy and IAM role
Create the notebook
Test

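Create the Glue development endpoint

This article uses the Glue development endpoint created in Glue usage notes ㉓ (Getting started with development endpoints and notebooks, summer 2018).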

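Create the IAM policy and IAM role

The steps follow the official links listed at the end of this article. In outline:

Create the IAM policy

From the IAM menu, choose Create policy and create the following policy.

Policy name: AWSGlueSageMakerNotebook (any name)
S3 bucket name: * (* is used here; to narrow access, specify the name of the bucket you use)
AWS account ID: xxxxxxxxxxxx (your own AWS account ID, a 12-digit number)
Region code: ap-northeast-1 (any region code)
Glue development endpoint name: tmp1 (the name of the Glue development endpoint you created. Note: this is not the long, URL-like endpoint address)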

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Action": [
                "s3:GetObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "logs:CreateLogGroup"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:logs:ap-northeast-1:xxxxxxxxxxxx:log-group:/aws/sagemaker/*",
                "arn:aws:logs:ap-northeast-1:xxxxxxxxxxxx:log-group:/aws/sagemaker/*:log-stream:aws-glue-*"
            ]
        },
        {
            "Action": [
                "glue:UpdateDevEndpoint",
                "glue:GetDevEndpoint",
                "glue:GetDevEndpoints"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:glue:ap-northeast-1:xxxxxxxxxxxx:devEndpoint/tmp1*"
            ]
        }
    ]
}
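As a rough sketch, the same policy document could be generated from the four values listed above instead of being edited by hand. The function name `build_notebook_policy` and its parameters are hypothetical and not part of the original article. One assumption beyond the original: when a specific bucket is given, the `s3:GetObject` resource is written as `<bucket>/*`, because that action applies to objects rather than to the bucket itself.

```python
import json


def build_notebook_policy(account_id, region, endpoint_name, bucket="*"):
    """Build the notebook policy document shown above as a Python dict.

    Hypothetical helper for illustration; not an AWS API.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    # Assumption: with a specific bucket, GetObject needs an object-level ARN.
    object_arn = bucket_arn if bucket == "*" else f"{bucket_arn}/*"
    log_group = f"arn:aws:logs:{region}:{account_id}:log-group:/aws/sagemaker/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Action": ["s3:ListBucket"], "Effect": "Allow",
             "Resource": [bucket_arn]},
            {"Action": ["s3:GetObject"], "Effect": "Allow",
             "Resource": [object_arn]},
            {"Action": ["logs:CreateLogStream", "logs:DescribeLogStreams",
                        "logs:PutLogEvents", "logs:CreateLogGroup"],
             "Effect": "Allow",
             "Resource": [log_group,
                          f"{log_group}:log-stream:aws-glue-*"]},
            {"Action": ["glue:UpdateDevEndpoint", "glue:GetDevEndpoint",
                        "glue:GetDevEndpoints"],
             "Effect": "Allow",
             "Resource": [f"arn:aws:glue:{region}:{account_id}"
                          f":devEndpoint/{endpoint_name}*"]},
        ],
    }


# Same placeholder values as in the article.
policy = build_notebook_policy("xxxxxxxxxxxx", "ap-northeast-1", "tmp1")
print(json.dumps(policy, indent=4))
```

With the default `bucket="*"` the printed JSON has the same statements and resources as the policy above. The output can be pasted into the JSON tab of the IAM policy editor.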

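Create the IAM role

In the IAM console, choose Create role -> select "AWS service" -> select "SageMaker" -> select the "SageMaker-Execution" use case, then click "Next".
Under attached permissions policies, select "AmazonSageMakerFullAccess" and click "Next".
Enter a role name that starts with "AWSGlueServiceSageMakerNotebookRole" and create the role.

Note: the permissions needed to encrypt and decrypt data with KMS are skipped this time.

Role name: AWSGlueServiceSageMakerNotebookRole-test (as a rule, the role name must start with AWSGlueServiceSageMakerNotebookRole; here it is named AWSGlueServiceSageMakerNotebookRole-test)
Attached policy 1: AmazonSageMakerFullAccess (the AWS managed policy)
Attached policy 2: AWSGlueSageMakerNotebook (the policy created in the previous step)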
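The role settings above can be summarized in a small sketch that checks the name prefix and assembles the values a scripted setup would need. The helper names `sagemaker_trust_policy` and `role_plan` are hypothetical. The trust policy shown is the standard one that lets the SageMaker service assume a role, and the custom policy ARN assumes the policy was created with the default path `/`.

```python
import json

ROLE_PREFIX = "AWSGlueServiceSageMakerNotebookRole"
SAGEMAKER_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"


def sagemaker_trust_policy():
    """Trust policy that allows the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def role_plan(role_name, account_id,
              custom_policy_name="AWSGlueSageMakerNotebook"):
    """Collect the role settings; reject names without the expected prefix.

    Hypothetical helper for illustration; it makes no AWS calls.
    """
    if not role_name.startswith(ROLE_PREFIX):
        raise ValueError(f"role name should start with {ROLE_PREFIX!r}")
    return {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(sagemaker_trust_policy()),
        "PolicyArns": [
            SAGEMAKER_FULL_ACCESS_ARN,
            # Assumption: the custom policy uses the default path "/".
            f"arn:aws:iam::{account_id}:policy/{custom_policy_name}",
        ],
    }


plan = role_plan("AWSGlueServiceSageMakerNotebookRole-test", "xxxxxxxxxxxx")
print(plan["RoleName"])
for arn in plan["PolicyArns"]:
    print(arn)
```

If you script the setup with boto3, these values map onto `create_role(RoleName=..., AssumeRolePolicyDocument=...)` and one `attach_role_policy(RoleName=..., PolicyArn=...)` call per policy ARN on the IAM client. The article itself does everything in the console.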

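Create the notebook

In the Glue console, click "Notebooks" in the left-hand menu, then click [Create notebook server] under SageMaker Notebook.

Note: the Zeppelin Notebook option has moved to the neighboring position.

Enter the following and click "Create notebook server".

Notebook name: tmpnotebook
Attach to development endpoint: tmp1 (the Glue development endpoint; select the one you created)
IAM role: "AWSGlueServiceSageMakerNotebookRole-test" (the role created above)
VPC: any VPC (the choice depends on the development endpoint's network settings and on whether you need to reach resources inside the VPC, such as RDS)
Subnet: any subnet
Security group: one with port 22 open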
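To check the security group requirement before creating the notebook server, a sketch like the following could inspect the group's rules. It assumes "port 22 open" means an inbound TCP rule covering port 22, which the article does not spell out, and it takes rules in the shape boto3 returns under `IpPermissions` from `describe_security_groups`. The function name `allows_tcp_port` is hypothetical, and the rule sources (CIDR ranges or security groups) are deliberately not checked.

```python
def allows_tcp_port(ip_permissions, port=22):
    """Return True if any inbound rule covers the given TCP port.

    ip_permissions: list of rule dicts shaped like the IpPermissions
    entries of a security group. Sources are not inspected.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            return True
        if (protocol == "tcp"
                and "FromPort" in rule and "ToPort" in rule
                and rule["FromPort"] <= port <= rule["ToPort"]):
            return True
    return False


# Example rules (made-up values for illustration).
ssh_rule = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}]}
https_rule = {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
              "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}

print(allows_tcp_port([https_rule, ssh_rule]))  # True
print(allows_tcp_port([https_rule]))            # False
```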

Wait a little; once the notebook server is ready, select its checkbox and click "Open notebook".

Jupyter Notebook opens.

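Test

Running a Glue job with PySpark

Run the code that was tried in Glue usage notes ㉓ (Getting started with development endpoints and notebooks, summer 2018).

Click "New" at the top right of the screen, then click "Sparkmagic(PySpark)".

Run the following code (1)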

from pyspark.sql.types import *
# Create a sample DataFrame
df = spark.createDataFrame([
    (1,144.5,5.9,33,'M'),
    (2,167.2,5.4,45,'M'),
    (3,124.1,5.2,23,'F'),
    (4,144.5,5.9,33,'M'),
    (5,133.2,5.7,54,'F'),
    (3,124.1,5.2,23,'F'),
    (5,129.2,5.3,42,'M')
    ],['id','weight','height','age','gender'])
df.show()

Run the following code (2)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
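### Commented out: no job arguments are passed in a notebook session
### args = getResolvedOptions(sys.argv, ['JOB_NAME'])

### Commented out: the notebook session already provides sc
### sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
### Commented out: depends on args above
### job.init(args['JOB_NAME'], args)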
## @type: DataSource
## @args: [database = "se2", table_name = "se2_in0", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "se2", table_name = "se2_in0", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("deviceid", "string", "deviceid", "string"), ("uuid", "long", "uuid", "long"), ("appid", "long", "appid", "long"), ("country", "string", "country", "string"), ("year", "long", "year", "long"), ("month", "long", "month", "long"), ("day", "long", "day", "long"), ("hour", "long", "hour", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("deviceid", "string", "deviceid", "string"), ("uuid", "long", "uuid", "long"), ("appid", "long", "appid", "long"), ("country", "string", "country", "string"), ("year", "long", "year", "long"), ("month", "long", "month", "long"), ("day", "long", "day", "long"), ("hour", "long", "hour", "long")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://test-glue00/se2/out0"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
df = dropnullfields3.toDF()
df.show()
# datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://test-glue00/se2/out0"}, format = "parquet", transformation_ctx = "datasink4")
### Commented out
### job.commit()
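The `mappings` list in the script above repeats every column name and type twice because each column is mapped to itself. As a small sketch, the list could be built from (name, type) pairs instead. The helper name `identity_mappings` is hypothetical; the columns are the ones used in the script.

```python
# Columns of the se2_in0 table as used in the script above.
COLUMNS = [
    ("deviceid", "string"),
    ("uuid", "long"),
    ("appid", "long"),
    ("country", "string"),
    ("year", "long"),
    ("month", "long"),
    ("day", "long"),
    ("hour", "long"),
]


def identity_mappings(columns):
    """Expand (name, type) pairs into ApplyMapping-style 4-tuples
    (source name, source type, target name, target type)."""
    return [(name, dtype, name, dtype) for name, dtype in columns]


mappings = identity_mappings(COLUMNS)
print(mappings[0])  # ('deviceid', 'string', 'deviceid', 'string')
```

In the notebook, the result could then be passed as `ApplyMapping.apply(frame = datasource0, mappings = mappings, transformation_ctx = "applymapping1")`, which is the same call as in the script with the literal list replaced.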

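Running the Glue Examples

Click "Glue Examples".

Click "Joining, Filtering, and Loading Relational Data with AWS Glue.ipynb".

Place the cursor on a code cell and run the cells one by one with Shift+Enter.

Modify any part of the code and run it with Shift+Enter to try it out:
database: change the sample's legislators => se2 (se2 is a database I created)
table_name: change the sample's persons_json => se2_in0 (se2_in0 is a table I created)

Try the other sample code as well, modifying the same kinds of values (shown in red in the original screenshots).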

See also

Creating the IAM policy and role for a Jupyter notebook (official documentation)
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/create-sagemaker-notebook-policy.html
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html

Summary of the Glue usage notes
https://qiita.com/pioho07/items/32f76a16cbf49f9f712f
