
Building a Data Analytics Platform: Migrating Data with RDS, Glue, and Redshift Using Terraform

Last updated at 2024-07-04, posted at 2024-06-27

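Introduction

This repository summarizes the steps for building a data analytics platform on AWS with Terraform and migrating data from RDS to Redshift. The goal is to make infrastructure management more efficient and to speed up data analysis.

The repository I created is here.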

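Purpose

Build the data analytics platform on AWS automatically to make infrastructure management more efficient
Automate the data migration from RDS to Redshift to speed up data analysis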

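Architecture

The platform consists of the following components.

RDS: hosts the MySQL source database
Glue: serverless service for transforming and loading data
Redshift: data warehouse for processing large volumes of data efficiently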

Implementation

IAM

IAM (Identity and Access Management) controls access to AWS services and resources. Here we use Terraform to define an IAM role and attach a policy to it so that Glue has the permissions it needs.

infra/modules/iam/aws_iam_role.tf

resource "aws_iam_role" "glue_role" {
  name = "glue_role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = ""
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "glue.amazonaws.com"
        }
      },
    ]
  })

  tags = {
    Name = "${var.app_name}-glue-iam-role"
  }
}
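As a rough illustration of what the jsonencode call above produces, the following Python sketch builds the same trust policy as a dictionary and serializes it to JSON. The function name build_assume_role_policy is hypothetical and exists only for this example; the field values are copied from the Terraform resource.

```python
import json


def build_assume_role_policy(service_principal):
    """Return a trust policy that lets the given AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
            }
        ],
    }


# Serialize the policy the same way Terraform's jsonencode would hand it to IAM.
policy_json = json.dumps(build_assume_role_policy("glue.amazonaws.com"), indent=2)
print(policy_json)
```

The only part that changes between services is the Principal, which is why the role above can be assumed by Glue and by nothing else.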

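Next, attach the AWSGlueServiceRole policy to this role.

What is AWSGlueServiceRole?

AWSGlueServiceRole is an AWS managed policy for the AWS Glue service role. It allows access to related services such as EC2, S3, and CloudWatch Logs.

Note, however, that the S3 bucket name must start with aws-glue- for the policy's S3 permissions to apply to it.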

infra/modules/iam/aws_iam_policy_attachment.tf

resource "aws_iam_role_policy_attachment" "glue_service_role_attachment" {
  role       = aws_iam_role.glue_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
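Because of the aws-glue- naming requirement mentioned above, it can help to check a bucket name before passing it to Terraform. The sketch below is only an approximation: the pattern aws-glue-* is taken from the naming rule stated in this article, the helper name is hypothetical, and fnmatch wildcards are not identical to IAM resource wildcards.

```python
from fnmatch import fnmatchcase

# Assumption: the naming rule described in the article, expressed as a glob pattern.
GLUE_BUCKET_PATTERN = "aws-glue-*"


def is_covered_by_glue_service_role(bucket_name):
    """Return True if the bucket name starts with the aws-glue- prefix."""
    return fnmatchcase(bucket_name, GLUE_BUCKET_PATTERN)


# Hypothetical bucket names used only for illustration.
for name in ("aws-glue-scripts-example", "my-etl-scripts"):
    print(name, is_covered_by_glue_service_role(name))
```

A bucket that fails this check would need an additional policy granting the Glue role access to it.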


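Network

The network configuration determines how the database and the data warehouse can communicate securely. Here we set up the VPC, subnets, and security groups.

1. Creating the VPC and internet gateway

Create the VPC and the internet gateway. These form the foundation of the network.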

infra/modules/network/vpc.tf

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  instance_tenancy     = "default"

  tags = {
    Name = "${var.app_name}-main-vpc"
  }
}

Creating the internet gateway

infra/modules/network/internet_gateway.tf

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.app_name}-igw"
  }
}

2. Creating the subnets

Next, create the public and private subnets. The public subnets have internet access, while the private subnets are for internal communication only.

infra/modules/network/subnet.tf

resource "aws_subnet" "public_subnet_1a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.11.0/24"
  availability_zone       = "ap-northeast-1a"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.app_name}-public-1a"
  }
}

resource "aws_subnet" "public_subnet_1c" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.12.0/24"
  availability_zone       = "ap-northeast-1c"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.app_name}-public-1c"
  }
}

resource "aws_subnet" "private_subnet_1a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.21.0/24"
  availability_zone       = "ap-northeast-1a"
  map_public_ip_on_launch = false

  tags = {
    Name = "${var.app_name}-private-1a"
  }
}

resource "aws_subnet" "private_subnet_1c" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.22.0/24"
  availability_zone       = "ap-northeast-1c"
  map_public_ip_on_launch = false

  tags = {
    Name = "${var.app_name}-private-1c"
  }
}
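The four subnet CIDR blocks must all sit inside the VPC range and must not overlap each other. As a minimal sketch, the following Python snippet checks this with the standard ipaddress module. The CIDR values are copied from the Terraform above; the function name validate_subnets is hypothetical.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet CIDR blocks as declared in subnet.tf.
SUBNETS = {
    "public_subnet_1a": "10.0.11.0/24",
    "public_subnet_1c": "10.0.12.0/24",
    "private_subnet_1a": "10.0.21.0/24",
    "private_subnet_1c": "10.0.22.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in networks.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    names = sorted(networks)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(validate_subnets(VPC_CIDR, SUBNETS))
```

An empty list means the layout is consistent; Terraform itself would only report such a conflict when AWS rejects the subnet at apply time.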

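3. Route tables and route table associations

Next, create route tables for the public and private subnets and associate each route table with its subnets. The same file also defines the subnet groups that RDS and Redshift use.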

infra/modules/network/route_table.tf

resource "aws_route_table" "public_rt" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
  tags = {
    Name = "${var.app_name}-public-rt"
  }
}

resource "aws_route_table_association" "public_rt_1a" {
  route_table_id = aws_route_table.public_rt.id
  subnet_id      = aws_subnet.public_subnet_1a.id
}

resource "aws_route_table_association" "public_rt_1c" {
  route_table_id = aws_route_table.public_rt.id
  subnet_id      = aws_subnet.public_subnet_1c.id
}

resource "aws_route_table" "private_rt" {
  vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.app_name}-private-rt"
  }
}

resource "aws_route_table_association" "private_rt_1a" {
  route_table_id = aws_route_table.private_rt.id
  subnet_id      = aws_subnet.private_subnet_1a.id
}

resource "aws_route_table_association" "private_rt_1c" {
  route_table_id = aws_route_table.private_rt.id
  subnet_id      = aws_subnet.private_subnet_1c.id
}

resource "aws_db_subnet_group" "db-subnet-group" {
  name       = "db-sg"
  subnet_ids = [aws_subnet.private_subnet_1a.id, aws_subnet.private_subnet_1c.id]
}

resource "aws_redshift_subnet_group" "redshift-subnet-group" {
  name       = "redshift-subnet-group"
  subnet_ids = [aws_subnet.private_subnet_1a.id, aws_subnet.private_subnet_1c.id]
}

4. Creating the security groups

Create one security group per service so that each service can communicate only as needed.

infra/modules/network/security_group.tf

# SecurityGroup for RDS
resource "aws_security_group" "rds_sg" {
  name   = "rds-source-sg"
  vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.app_name}-rds-sg"
  }
}

# SecurityGroup for opmng
resource "aws_security_group" "opmng_sg" {
  name   = "opmng-sg"
  vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.app_name}-opmng-sg"
  }
}

# SecurityGroup for glue
resource "aws_security_group" "glue_sg" {
  name   = "glue-sg"
  vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.app_name}-glue-sg"
  }
}

# SecurityGroup for RedShift
resource "aws_security_group" "redshift_sg" {
  name   = "redshift-sg"
  vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.app_name}-redshift-sg"
  }
}

5. Configuring the security group rules

Add the required rules to each security group.

infra/modules/network/security_group_rule.tf

locals {
  opmng_sg_id_list = [
    { type = "ingress", port = "22" },
    { type = "egress", port = "80" },
    { type = "egress", port = "443" },
    { type = "egress", port = "${var.db_ports[0].internal}" },
    { type = "egress", port = "5439" },
  ]
}

# SecurityGroupRules for opmng
resource "aws_security_group_rule" "opmng_web" {
  for_each = { for i in local.opmng_sg_id_list : i.port => i }

  type              = each.value.type
  from_port         = tonumber(each.value.port)
  to_port           = tonumber(each.value.port)
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.opmng_sg.id
}


# SecurityGroupRules for source db
resource "aws_security_group_rule" "db_in_tcp3306_from_opmng" {
  type                     = "ingress"
  from_port                = var.db_ports[0].internal
  to_port                  = var.db_ports[0].external
  protocol                 = var.db_ports[0].protocol
  source_security_group_id = aws_security_group.opmng_sg.id
  security_group_id        = aws_security_group.rds_sg.id
}

resource "aws_security_group_rule" "db_in_tcp3306_from_glue" {
  type                     = "ingress"
  from_port                = var.db_ports[0].internal
  to_port                  = var.db_ports[0].external
  protocol                 = var.db_ports[0].protocol
  source_security_group_id = aws_security_group.glue_sg.id
  security_group_id        = aws_security_group.rds_sg.id
}

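# SecurityGroupRules for Glue
# Allow all outbound traffic from Glue.
resource "aws_security_group_rule" "glue_in_tcp65535" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.glue_sg.id
}

# Glue needs a self-referencing inbound rule on all TCP ports so that its
# components can reach each other. Limiting the source to the group itself
# avoids opening every port to 0.0.0.0/0.
resource "aws_security_group_rule" "glue_inbound_all" {
  type              = "ingress"
  from_port         = 0
  to_port           = 65535
  protocol          = "tcp"
  self              = true
  security_group_id = aws_security_group.glue_sg.id
}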

# SecurityGroupRules for Redshift
resource "aws_security_group_rule" "redshift_in_tcp65535_from_glue" {
  type                     = "ingress"
  from_port                = 5439
  to_port                  = 5439
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.glue_sg.id
  security_group_id        = aws_security_group.redshift_sg.id
}

resource "aws_security_group_rule" "redshift_in_tcp65535_from_opmng" {
  type                     = "ingress"
  from_port                = 5439
  to_port                  = 5439
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.opmng_sg.id
  security_group_id        = aws_security_group.redshift_sg.id
}

resource "aws_security_group_rule" "redshift_out_tcp665535" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.redshift_sg.id
}
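The opmng rules are generated with for_each over a map keyed by port, so each port can appear only once in the list. The following Python sketch mimics how the expression { for i in local.opmng_sg_id_list : i.port => i } turns the list into a map. It assumes var.db_ports[0].internal is 3306, which the article does not state explicitly, and the function name key_rules_by_port is hypothetical.

```python
# Mirror of local.opmng_sg_id_list.
# Assumption: var.db_ports[0].internal resolves to 3306.
opmng_rules = [
    {"type": "ingress", "port": "22"},
    {"type": "egress", "port": "80"},
    {"type": "egress", "port": "443"},
    {"type": "egress", "port": "3306"},
    {"type": "egress", "port": "5439"},
]


def key_rules_by_port(rules):
    """Build a port-keyed map, rejecting duplicate keys as Terraform's for expression does."""
    keyed = {}
    for rule in rules:
        port = rule["port"]
        if port in keyed:
            raise ValueError(f"duplicate key {port!r}")
        keyed[port] = rule
    return keyed


for port, rule in key_rules_by_port(opmng_rules).items():
    print(port, rule["type"])
```

If the same port were ever needed for both ingress and egress, the key would have to combine type and port, for example "ingress-22".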

VPC endpoint

Create a gateway VPC endpoint so that resources in the private subnets can reach S3.

infra/modules/network/aws_vpc_endpoint.tf

resource "aws_vpc_endpoint" "glue_s3" {
  service_name      = "com.amazonaws.ap-northeast-1.s3"
  vpc_endpoint_type = "Gateway"
  vpc_id            = aws_vpc.main.id
  route_table_ids   = [aws_route_table.private_rt.id]

  tags = {
    "Name" = "ecr-s3-endpoint"
  }
}

Redshift

Amazon Redshift is a fast, scalable data warehouse service. Here we use Terraform to configure a Redshift cluster.

infra/modules/redshift/aws_redshift_cluster.tf

resource "aws_redshift_cluster" "main-redshift" {
  cluster_identifier        = "main-redshift-cluster"
  database_name             = var.db_name
  master_username           = var.db_username
  master_password           = var.db_password
  node_type                 = "dc2.large"
  cluster_type              = "single-node"
  publicly_accessible       = false
  skip_final_snapshot       = true
  vpc_security_group_ids    = ["${var.security-group-redshift-id}"]
  cluster_subnet_group_name = var.redshift-subnet-group-name
}

Glue

AWS Glue is a serverless ETL (Extract, Transform, Load) service for discovering, preparing, and integrating data. Here we configure Glue.

1. Setting up the Glue script

Create the script at the center of the Glue job. It reads data from RDS and writes it to Redshift.

infra/modules/glue/src/glue_script.py

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

## @params: [JOB_NAME, REDSHIFT_DATABASE, REDSHIFT_USER, REDSHIFT_PASSWORD, REDSHIFT_HOST, REDSHIFT_PORT, RDS_DATABASE, RDS_USER, RDS_PASSWORD, RDS_HOST, RDS_PORT]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'REDSHIFT_DATABASE', 'REDSHIFT_USER', 'REDSHIFT_PASSWORD', 'REDSHIFT_HOST', 'REDSHIFT_PORT', 'RDS_DATABASE', 'RDS_USER', 'RDS_PASSWORD', 'RDS_HOST', 'RDS_PORT'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from RDS
rds_url = f"jdbc:mysql://{args['RDS_HOST']}:{args['RDS_PORT']}/{args['RDS_DATABASE']}"
try:
    df_rds = spark.read.format("jdbc") \
        .option("url", rds_url) \
        .option("driver", "com.mysql.cj.jdbc.Driver") \
        .option("user", args['RDS_USER']) \
        .option("password", args['RDS_PASSWORD']) \
        .option("dbtable", "users") \
        .load()
except Exception as e:
    print(f"Failed to read from RDS: {e}")
    raise

# Select specific columns
df_selected = df_rds.select("id", "email", "age", "gender", "occupation")

# Write data to Redshift
redshift_url = f"jdbc:redshift://{args['REDSHIFT_HOST']}:{args['REDSHIFT_PORT']}/{args['REDSHIFT_DATABASE']}"
try:
    df_selected.write \
        .format("jdbc") \
        .option("url", redshift_url) \
        .option("driver", "com.amazon.redshift.jdbc42.Driver") \
        .option("user", args['REDSHIFT_USER']) \
        .option("password", args['REDSHIFT_PASSWORD']) \
        .option("dbtable", "users") \
        .mode("append") \
        .save()
except Exception as e:
    print(f"Failed to write to Redshift: {e}")
    raise

job.commit()
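Both JDBC URLs in the script follow the same jdbc:subprotocol://host:port/database shape. As a minimal sketch, the construction could be pulled into one helper that also validates the port. The function name build_jdbc_url and the host names below are hypothetical placeholders, not values from the repository.

```python
def build_jdbc_url(subprotocol, host, port, database):
    """Build a JDBC URL and reject ports outside the valid TCP range."""
    port_number = int(port)  # job arguments arrive as strings
    if not 1 <= port_number <= 65535:
        raise ValueError(f"invalid port: {port}")
    return f"jdbc:{subprotocol}://{host}:{port_number}/{database}"


# Hypothetical hosts for illustration only.
rds_url = build_jdbc_url("mysql", "source-db.example.internal", "3306", "newproject")
redshift_url = build_jdbc_url("redshift", "redshift.example.internal", "5439", "newproject")
print(rds_url)
print(redshift_url)
```

Failing early on a malformed port gives a clearer error than a JDBC connection failure deep inside the Spark job.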

2. Creating the Glue catalog database

Create the Data Catalog database used by Glue.

infra/modules/glue/aws_glue_catalog_database.tf

resource "aws_glue_catalog_database" "my_glue_db" {
  name = "my_glue_db"
}

3. Configuring the Glue connections

Create the Glue connections used to reach RDS and Redshift.

infra/modules/glue/aws_glue_connection.tf

resource "aws_glue_connection" "rds_to_glue" {
  name = "rds-to-glue"
  connection_properties = {
    "JDBC_CONNECTION_URL" = "jdbc:mysql://${var.db_address}:3306/${var.db_name}"
    "USERNAME"            = var.db_username
    "PASSWORD"            = var.db_password
  }
  physical_connection_requirements {
    availability_zone      = "ap-northeast-1a"
    security_group_id_list = [var.security-group-glue-id]
    subnet_id              = var.subnet-private-subnet-1a-id
  }
}

resource "aws_glue_connection" "glue_to_redshift" {
  name = "glue-to-redshift"
  connection_properties = {
    JDBC_CONNECTION_URL = "jdbc:redshift://${var.redshift-endpoint}/${var.db_name}"
    "USERNAME"          = var.db_username
    "PASSWORD"          = var.db_password
  }
  physical_connection_requirements {
    availability_zone      = "ap-northeast-1a"
    security_group_id_list = [var.security-group-glue-id]
    subnet_id              = var.subnet-private-subnet-1a-id
  }
}

4. Configuring the Glue crawler

Configure a Glue crawler to detect the RDS database schema automatically.

infra/modules/glue/aws_glue_crawler.tf

resource "aws_glue_crawler" "my_crawler" {
  name          = "my_crawler"
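  # Write the discovered tables into the Glue catalog database created above.
  database_name = aws_glue_catalog_database.my_glue_db.name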
  role          = var.glue_role_arn

  jdbc_target {
    connection_name = aws_glue_connection.rds_to_glue.name
    path            = "${var.db_name}/%"
  }

  schema_change_policy {
    delete_behavior = "LOG"
  }
}

5. Configuring the Glue job

Finally, configure the Glue job that automates the ETL process.

infra/modules/glue/aws_glue_job.tf

resource "aws_glue_job" "my_glue_job" {
  name     = "my_glue_job"
  role_arn = var.glue_role_arn
  connections = [
    aws_glue_connection.rds_to_glue.name,
    aws_glue_connection.glue_to_redshift.name
  ]
  command {
    name            = "glueetl"
    script_location = "s3://${var.s3_bucket_name}/glue_script.py"
    python_version  = "3"
  }
  default_arguments = {
    "--JOB_NAME"          = "rds_to_redshift_job"
    "--REDSHIFT_DATABASE" = var.db_name
    "--REDSHIFT_USER"     = var.db_username
    "--REDSHIFT_PASSWORD" = var.db_password
    "--REDSHIFT_HOST"     = var.redshift-dns-name
    "--REDSHIFT_PORT"     = 5439
    "--RDS_DATABASE"      = var.db_name
    "--RDS_USER"          = var.db_username
    "--RDS_PASSWORD"      = var.db_password
    "--RDS_HOST"          = var.db_address
    "--RDS_PORT"          = 3306
  }
}
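Each key in default_arguments reaches the script as a --KEY value pair on the command line, which is what getResolvedOptions reads from sys.argv. The sketch below is a simplified stand-in written with argparse to show the idea; it is not the real awsglue implementation, and the function name resolve_options is hypothetical.

```python
import argparse


def resolve_options(argv, option_names):
    """Simplified stand-in for getResolvedOptions: collect the named --KEY value pairs."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    # Ignore any extra arguments that were not requested.
    known, _unknown = parser.parse_known_args(argv[1:])
    return vars(known)


# Example command line resembling what the job's default_arguments would produce.
example_argv = [
    "glue_script.py",
    "--JOB_NAME", "rds_to_redshift_job",
    "--RDS_PORT", "3306",
    "--extra", "ignored",
]
resolved = resolve_options(example_argv, ["JOB_NAME", "RDS_PORT"])
print(resolved)
```

Note that every value arrives as a string, so the numeric ports set in Terraform are strings by the time the script uses them in the JDBC URLs.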

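This completes the ETL setup with AWS Glue. When the Glue job runs, it reads the users table from RDS, selects the required columns, and loads the data into Redshift.

Results

Connecting to EC2 over SSH and checking the MySQL users table

First, connect to the EC2 instance over SSH.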

cd infra/modules/ec2/src
ssh -i "todolist-keypair.pem" ec2-user@ec2-xxxxxxxxx.ap-northeast-1.compute.amazonaws.com

Next, connect to the MySQL database.

mysql -h source-db.xxxxxxxxx.ap-northeast-1.rds.amazonaws.com -u admindbuser -pMysql1aaaaA newproject

Finally, check the contents of the users table.

select * from users;

Running the Glue job

To run the AWS Glue job, follow these steps.

Log in to the AWS Management Console
Go to the Glue service
Select the job you want to run and click the [Run Job] button
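The same job can also be started from code instead of the console. The following sketch starts a run and polls until it reaches a terminal state. It takes the Glue client as a parameter so that it can be exercised without AWS access; the function name run_glue_job and the polling interval are assumptions, while start_job_run and get_job_run are the boto3 Glue client calls it relies on.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and wait until it finishes; return (run_id, final_state)."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)
```

With boto3 installed and credentials configured, this could be called as run_glue_job(boto3.client("glue", region_name="ap-northeast-1"), "my_glue_job"), where my_glue_job is the job name defined in aws_glue_job.tf.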

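Connecting to EC2 over SSH and checking the Redshift users table

After the job ran, the users table was created in Redshift, as confirmed below.

First, connect to the EC2 instance over SSH.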

cd infra/modules/ec2/src
ssh -i "todolist-keypair.pem" ec2-user@ec2-xxxxxxxxx.ap-northeast-1.compute.amazonaws.com

Next, connect to the Redshift cluster.

psql -h main-redshift-cluster.xxxxxxxxx.ap-northeast-1.redshift.amazonaws.com -U admindbuser -d newproject -p 5439

Finally, check the contents of the users table.

select * from users;

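With these steps, you can access the users table in both the MySQL and Redshift databases from the EC2 instance and confirm that the data was migrated. The AWS Glue job itself can be run with just a few clicks.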
