2
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

Snowflake でCloudera Data PlatformのCDW Iceberg テーブルをアクセス

Last updated at Posted at 2024-03-15

目的:
Cloudera Data PlatformのCDWに、既にIcebergテーブルに保存したデータがあります。
Snowflake からアクセスしてみたいです。

1) アーキテクチャ

image.png

Image source : https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume

2) 新規作成

2.1) Snowflake の外部Volumeを新規作成

Step 1: Configure access permissions for the S3 bucket

Create a policy in AWS IAM.

I created a policy called zzeng-Snowflake-ext :

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<my-bucket>/data/zzeng/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<my-bucket>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "data/zzeng/*"
                    ]
                }
            }
        }
    ]
}

AWS Console の設定:
image.png

Step 2: AWSのIAMロールを作成

zzeng-Snowflake-ext-role:
image.png

作成した後、ARNを控える
04Untitled.png

Step 3: SSE-KMS暗号化のIAMロール (optional)¶

今回暗号化しないので、SKIP

Step 4: Snowflakeの外部Volumeを作成

Snowflake のSQL実行:

 

CREATE OR REPLACE EXTERNAL VOLUME extIcebergVolC
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'zzeng-iceberg-se-s3-ap-northeast-1'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<my-bucket>/data/zzeng/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<my-AWS-id*****>:role/zzeng-Snowflake-ext-role'
         )
      );
      ```


### Step 5: Snowflake accountとAWS IAM userを紐づけ

Snowflake で下記SQLを実行:

```sql
DESC EXTERNAL VOLUME extIcebergVolC;

実行結果:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>DESC EXTERNAL VOLUME extIcebergVolC;
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
| parent_property   | property           | property_type | property_value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | property_default |
|-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------|
|                   | ALLOW_WRITES       | Boolean       | true                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | true             |
| STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String        | {"NAME":"zzeng-iceberg-se-s3-ap-northeast-1","STORAGE_PROVIDER":"S3","STORAGE_BASE_URL":"s3://<my-bucket-id>/data/zzeng/","STORAGE_ALLOWED_LOCATIONS":["s3://<my-bucket-id>/data/zzeng/*"],"STORAGE_REGION":"us-east-2","PRIVILEGES_VERIFIED":true,"STORAGE_AWS_ROLE_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_IAM_USER_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_EXTERNAL_ID":"<a long string for STORAGE_AWS_EXTERNAL_ID>","ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""} |                  |
| STORAGE_LOCATIONS | ACTIVE             | String        | zzeng-iceberg-se-s3-ap-northeast-1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                  |
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
3 Row(s) produced. Time Elapsed: 0.949s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

STORAGE_AWS_IAM_USER_ARN と STORAGE_AWS_EXTERNAL_IDを控える

Step 6: Object StoreのBucketeアクセスのIAM user権限を設定

前のStep 5で貰った値で、AWSのIAM RoleのTrustshipを編集。

image.png

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<snowflake_external_id>"
        }
      }
    }
  ]
}

2.2) CDWで新しいIcebergテーブルを作成

下記S3フォルダで作成する予定。

s3a://${my-test-bucket}/data/${user_id}/airlines/airlines

HUE画面で、下記SQLを実行

image.png

CREATE DATABASE ${user_id}_airlines_ice;
drop table if exists ${user_id}_airlines_ice.airlines;

CREATE EXTERNAL TABLE ${user_id}_airlines_ice.airlines (code string, description string) 
STORED BY ICEBERG
STORED AS PARQUET
LOCATION 's3a://${cdp_env_bucket}/data/${user_id}/airlines/airlines'
tblproperties("format-version"="2",'external.table.purge'='true');

INSERT INTO ${user_id}_airlines_ice.airlines
  SELECT * FROM ${user_id}_airlines_csv.airlines_csv;
  
  
select * from ${user_id}_airlines_ice.airlines;
  
select count(*) from ${user_id}_airlines_ice.airlines;

SQLを実行したら下記フォルダが出来上がる。

image.png

Metadataの情報をチェック:

SHOW CREATE TABLE airlines;

実行結果:

CREATE EXTERNAL TABLE `zzeng_airlines_ice`.`airlines`(
  `code` string, 
  `description` string)
ROW FORMAT SERDE 
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe' 
STORED BY 
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 

LOCATION
    's3a://<my-bucket-name>/data/zzeng/airlines/airlines'
  TBLPROPERTIES (
    'bucketing_version'='2', 
    'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"code","required":false,"type":"string"},{"id":2,"name":"description","required":false,"type":"string"}]}', 
    'current-snapshot-id'='6686807318178502869', 
    'current-snapshot-summary'='{"added-data-files":"1","added-records":"1491","added-files-size":"25686","changed-partition-count":"1","total-records":"1491","total-files-size":"25686","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}', 
    'current-snapshot-timestamp-ms'='1710324490640', 
    'external.table.purge'='true', 
    'format-version'='2', 
    'iceberg.orc.files.only'='false', 
    'metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json', 
    'previous_metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00000-9179e76f-369b-47ca-b01f-20e6bffd36a5.metadata.json', 
    'serialization.format'='1', 
    'snapshot-count'='1', 
    'table_type'='ICEBERG', 
    'transient_lastDdlTime'='1710324491', 
    'uuid'='1a86667c-6c7c-4318-a976-ba7fd49c13b0', 
    'write.delete.mode'='merge-on-read', 
    'write.format.default'='parquet', 
    'write.merge.mode'='merge-on-read', 
    'write.update.mode'='merge-on-read')

2.3) Snowflake でIceberg Table を作成し、先ほどのCDW Iceberg Tableをアクセス

SQL

CREATE OR REPLACE ICEBERG TABLE airlines
  CATALOG='zzengIcebergCatalogInt'
  EXTERNAL_VOLUME='extIcebergVolC'
  BASE_LOCATION='airlines/airlines'
  METADATA_FILE_PATH='metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json'
;

3) 結果チェック

In Snowflake:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.393s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select * from AIRLINES limit 3;
+------+--------------------+
| CODE | DESCRIPTION        |
|------+--------------------|
| 02Q  | Titan Airways      |
| 04Q  | Tradewind Aviation |
| 05Q  | "Comlux Aviation   |
+------+--------------------+
3 Row(s) produced. Time Elapsed: 4.705s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

In CDW Hive HUE:

select count(*) from ${user_id}_airlines_ice.airlines;

image.png

4) データ変更

4.1) データ追加

現時点、SnowflakeのIcebergカタログを使わず、外部のIcebergテーブルを利用する場合、Insertできないです。
Reference URL:

image.png

こちらの説明では、

You can use INSERT and UPDATE statements to modify an Iceberg table that uses Snowflake as the catalog.

Snowflake をIcebergのカタログとして利用する時はINSERT/UPDATEできますが、Catalog Integrationの場合は更新できません。

実際やってみました。

zzeng#COMPUTE_WH@ZZENG.PUBLIC>INSERT INTO airlines (code, description) VALUES
                                   ('A1', 'Airline 1 Description'),
                                   ('A2', 'Airline 2 Description'),
                                   ('A3', 'Airline 3 Description'),
                                   ('A4', 'Airline 4 Description'),
                                   ('A5', 'Airline 5 Description'),
                                   ('A6', 'Airline 6 Description'),
                                   ('A7', 'Airline 7 Description'),
                                   ('A8', 'Airline 8 Description'),
                                   ('A9', 'Airline 9 Description'),
                                   ('A10', 'Airline 10 Description'),
                                   ('A11', 'Airline 11 Description'),
                                   ('A12', 'Airline 12 Description'),
                                   ('A13', 'Airline 13 Description'),
                                   ('A14', 'Airline 14 Description'),
                                   ('A15', 'Airline 15 Description'),
                                   ('A16', 'Airline 16 Description'),
                                   ('A17', 'Airline 17 Description'),
                                   ('A18', 'Airline 18 Description'),
                                   ('A19', 'Airline 19 Description'),
                                   ('A20', 'Airline 20 Description');
091357 (42601): SQL Compilation error: Iceberg table AIRLINES with an external catalog integration is a read-only table and cannot be modified
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

エラーです。

CDWのHiveでやってみると、

INSERT INTO `zzeng_airlines_ice`.`airlines` (code, description) VALUES
('A1', 'Airline 1 Description'),
('A2', 'Airline 2 Description'),
('A3', 'Airline 3 Description'),
('A4', 'Airline 4 Description'),
('A5', 'Airline 5 Description'),
('A6', 'Airline 6 Description'),
('A7', 'Airline 7 Description'),
('A8', 'Airline 8 Description'),
('A9', 'Airline 9 Description'),
('A10', 'Airline 10 Description'),
('A11', 'Airline 11 Description'),
('A12', 'Airline 12 Description'),
('A13', 'Airline 13 Description'),
('A14', 'Airline 14 Description'),
('A15', 'Airline 15 Description'),
('A16', 'Airline 16 Description'),
('A17', 'Airline 17 Description'),
('A18', 'Airline 18 Description'),
('A19', 'Airline 19 Description'),
('A20', 'Airline 20 Description');

成功。

image.png

4.2) CDWでIcebergテーブルを更新したら、Snowflake 側に反映

現在の課題:Hive側でIcebergテーブルを更新しましたが、Snowflake 側は反映されていない。

なぜ?

Icebergを更新したら、Metadataが変わりました。
Snowflake側はそれを検知していないから、前の世代のMetadataを見ていた。

zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.678s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

Snowflake 公式ドキュメント:

In Snowflake, The metadata files do not identify the most recent snapshot of an Iceberg table.

Ref URL: https://docs.snowflake.com/en/user-guide/tables-iceberg

手動でSnowflake のMetadata設定変更。

まずは新しいMetadataのファイル名を確認。

image.png

これでALTER文作成:

ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';

更新、動作確認結果:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
+----------------------------------+
| status                           |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 10.199s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1511 |
+----------+
1 Row(s) produced. Time Elapsed: 0.204s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

5) 制約事項

Copied from Snowflake documents:

Metadata files

    The metadata files do not identify the most recent snapshot of an Iceberg table.

    You cannot modify the location of the data files or snapshot using the ALTER ICEBERG TABLE command. To modify either of these settings, you must recreate the table (using the CREATE OR REPLACE ICEBERG TABLE syntax).
2
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
2
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?