目的:
Cloudera Data PlatformのCDWに、既にIcebergテーブルに保存したデータがあります。
Snowflake からアクセスしてみたいです。
1) アーキテクチャ
Image source : https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume
2) 新規作成
2.1) Snowflake の外部Volumeを新規作成
Step 1: Configure access permissions for the S3 bucket
Create a policy in AWS IAM.
I created a policy called zzeng-Snowflake-ext
:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:GetObjectVersion",
"s3:DeleteObject",
"s3:DeleteObjectVersion"
],
"Resource": "arn:aws:s3:::<my-bucket>/data/zzeng/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::<my-bucket>",
"Condition": {
"StringLike": {
"s3:prefix": [
"data/zzeng/*"
]
}
}
}
]
}
Step 2: AWSのIAMロールを作成
Step 3: SSE-KMS暗号化のIAMロール (optional)¶
今回暗号化しないので、SKIP
Step 4: Snowflakeの外部Volumeを作成
Snowflake のSQL実行:
CREATE OR REPLACE EXTERNAL VOLUME extIcebergVolC
STORAGE_LOCATIONS =
(
(
NAME = 'zzeng-iceberg-se-s3-ap-northeast-1'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://<my-bucket>/data/zzeng/'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<my-AWS-id*****>:role/zzeng-Snowflake-ext-role'
)
);
```
### Step 5: Snowflake accountとAWS IAM userを紐づけ
Snowflake で下記SQLを実行:
```sql
DESC EXTERNAL VOLUME extIcebergVolC;
実行結果:
zzeng#COMPUTE_WH@ZZENG.PUBLIC>DESC EXTERNAL VOLUME extIcebergVolC;
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
| parent_property | property | property_type | property_value | property_default |
|-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------|
| | ALLOW_WRITES | Boolean | true | true |
| STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String | {"NAME":"zzeng-iceberg-se-s3-ap-northeast-1","STORAGE_PROVIDER":"S3","STORAGE_BASE_URL":"s3://<my-bucket-id>/data/zzeng/","STORAGE_ALLOWED_LOCATIONS":["s3://<my-bucket-id>/data/zzeng/*"],"STORAGE_REGION":"us-east-2","PRIVILEGES_VERIFIED":true,"STORAGE_AWS_ROLE_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_IAM_USER_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_EXTERNAL_ID":"<a long string for STORAGE_AWS_EXTERNAL_ID>","ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""} | |
| STORAGE_LOCATIONS | ACTIVE | String | zzeng-iceberg-se-s3-ap-northeast-1 | |
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
3 Row(s) produced. Time Elapsed: 0.949s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>
STORAGE_AWS_IAM_USER_ARN と STORAGE_AWS_EXTERNAL_IDを控える
Step 6: Object StoreのBucketeアクセスのIAM user権限を設定
前のStep 5で貰った値で、AWSのIAM RoleのTrustshipを編集。
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"AWS": "<snowflake_user_arn>"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<snowflake_external_id>"
}
}
}
]
}
2.2) CDWで新しいIcebergテーブルを作成
下記S3フォルダで作成する予定。
s3a://${my-test-bucket}/data/${user_id}/airlines/airlines
HUE画面で、下記SQLを実行
CREATE DATABASE ${user_id}_airlines_ice;
drop table if exists ${user_id}_airlines_ice.airlines;
CREATE EXTERNAL TABLE ${user_id}_airlines_ice.airlines (code string, description string)
STORED BY ICEBERG
STORED AS PARQUET
LOCATION 's3a://${cdp_env_bucket}/data/${user_id}/airlines/airlines'
tblproperties("format-version"="2",'external.table.purge'='true');
INSERT INTO ${user_id}_airlines_ice.airlines
SELECT * FROM ${user_id}_airlines_csv.airlines_csv;
select * from ${user_id}_airlines_ice.airlines;
select count(*) from ${user_id}_airlines_ice.airlines;
SQLを実行したら下記フォルダが出来上がる。
Metadataの情報をチェック:
SHOW CREATE TABLE airlines;
実行結果:
CREATE EXTERNAL TABLE `zzeng_airlines_ice`.`airlines`(
`code` string,
`description` string)
ROW FORMAT SERDE
'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION
's3a://<my-bucket-name>/data/zzeng/airlines/airlines'
TBLPROPERTIES (
'bucketing_version'='2',
'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"code","required":false,"type":"string"},{"id":2,"name":"description","required":false,"type":"string"}]}',
'current-snapshot-id'='6686807318178502869',
'current-snapshot-summary'='{"added-data-files":"1","added-records":"1491","added-files-size":"25686","changed-partition-count":"1","total-records":"1491","total-files-size":"25686","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}',
'current-snapshot-timestamp-ms'='1710324490640',
'external.table.purge'='true',
'format-version'='2',
'iceberg.orc.files.only'='false',
'metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json',
'previous_metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00000-9179e76f-369b-47ca-b01f-20e6bffd36a5.metadata.json',
'serialization.format'='1',
'snapshot-count'='1',
'table_type'='ICEBERG',
'transient_lastDdlTime'='1710324491',
'uuid'='1a86667c-6c7c-4318-a976-ba7fd49c13b0',
'write.delete.mode'='merge-on-read',
'write.format.default'='parquet',
'write.merge.mode'='merge-on-read',
'write.update.mode'='merge-on-read')
2.3) Snowflake でIceberg Table を作成し、先ほどのCDW Iceberg Tableをアクセス
SQL
CREATE OR REPLACE ICEBERG TABLE airlines
CATALOG='zzengIcebergCatalogInt'
EXTERNAL_VOLUME='extIcebergVolC'
BASE_LOCATION='airlines/airlines'
METADATA_FILE_PATH='metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json'
;
3) 結果チェック
In Snowflake:
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
| 1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.393s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select * from AIRLINES limit 3;
+------+--------------------+
| CODE | DESCRIPTION |
|------+--------------------|
| 02Q | Titan Airways |
| 04Q | Tradewind Aviation |
| 05Q | "Comlux Aviation |
+------+--------------------+
3 Row(s) produced. Time Elapsed: 4.705s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>
In CDW Hive HUE:
select count(*) from ${user_id}_airlines_ice.airlines;
4) データ変更
4.1) データ追加
現時点、SnowflakeのIcebergカタログを使わず、外部のIcebergテーブルを利用する場合、Insertできないです。
Reference URL:
- https://docs.snowflake.com/en/user-guide/tables-iceberg-manage
- https://docs.snowflake.com/en/user-guide/tables-iceberg
こちらの説明では、
You can use INSERT and UPDATE statements to modify an Iceberg table that uses Snowflake as the catalog.
Snowflake をIcebergのカタログとして利用する時はINSERT/UPDATEできますが、Catalog Integrationの場合は更新できません。
実際やってみました。
zzeng#COMPUTE_WH@ZZENG.PUBLIC>INSERT INTO airlines (code, description) VALUES
('A1', 'Airline 1 Description'),
('A2', 'Airline 2 Description'),
('A3', 'Airline 3 Description'),
('A4', 'Airline 4 Description'),
('A5', 'Airline 5 Description'),
('A6', 'Airline 6 Description'),
('A7', 'Airline 7 Description'),
('A8', 'Airline 8 Description'),
('A9', 'Airline 9 Description'),
('A10', 'Airline 10 Description'),
('A11', 'Airline 11 Description'),
('A12', 'Airline 12 Description'),
('A13', 'Airline 13 Description'),
('A14', 'Airline 14 Description'),
('A15', 'Airline 15 Description'),
('A16', 'Airline 16 Description'),
('A17', 'Airline 17 Description'),
('A18', 'Airline 18 Description'),
('A19', 'Airline 19 Description'),
('A20', 'Airline 20 Description');
091357 (42601): SQL Compilation error: Iceberg table AIRLINES with an external catalog integration is a read-only table and cannot be modified
zzeng#COMPUTE_WH@ZZENG.PUBLIC>
エラーです。
CDWのHiveでやってみると、
INSERT INTO `zzeng_airlines_ice`.`airlines` (code, description) VALUES
('A1', 'Airline 1 Description'),
('A2', 'Airline 2 Description'),
('A3', 'Airline 3 Description'),
('A4', 'Airline 4 Description'),
('A5', 'Airline 5 Description'),
('A6', 'Airline 6 Description'),
('A7', 'Airline 7 Description'),
('A8', 'Airline 8 Description'),
('A9', 'Airline 9 Description'),
('A10', 'Airline 10 Description'),
('A11', 'Airline 11 Description'),
('A12', 'Airline 12 Description'),
('A13', 'Airline 13 Description'),
('A14', 'Airline 14 Description'),
('A15', 'Airline 15 Description'),
('A16', 'Airline 16 Description'),
('A17', 'Airline 17 Description'),
('A18', 'Airline 18 Description'),
('A19', 'Airline 19 Description'),
('A20', 'Airline 20 Description');
成功。
4.2) CDWでIcebergテーブルを更新したら、Snowflake 側に反映
現在の課題:Hive側でIcebergテーブルを更新しましたが、Snowflake 側は反映されていない。
なぜ?
Icebergを更新したら、Metadataが変わりました。
Snowflake側はそれを検知していないから、前の世代のMetadataを見ていた。
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
| 1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.678s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>
Snowflake 公式ドキュメント:
In Snowflake, The metadata files do not identify the most recent snapshot of an Iceberg table.
Ref URL: https://docs.snowflake.com/en/user-guide/tables-iceberg
手動でSnowflake のMetadata設定変更。
まずは新しいMetadataのファイル名を確認。
これでALTER文作成:
ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
更新、動作確認結果:
zzeng#COMPUTE_WH@ZZENG.PUBLIC>ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 10.199s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
| 1511 |
+----------+
1 Row(s) produced. Time Elapsed: 0.204s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>
5) 制約事項
Copied from Snowflake documents:
Metadata files
The metadata files do not identify the most recent snapshot of an Iceberg table.
You cannot modify the location of the data files or snapshot using the ALTER ICEBERG TABLE command. To modify either of these settings, you must recreate the table (using the CREATE OR REPLACE ICEBERG TABLE syntax).