Trying out the basics of Iceberg

Posted at 2026-01-01

We will create Iceberg data, upload it to S3, register it in Glue, and take it as far as querying it from Athena.

Create the S3 bucket

aws s3 mb s3://my-iceberg-tokyo-20251206/athena-results/ --region ap-northeast-1

Create the Glue database

aws glue create-database \
    --database-input '{"Name": "default"}' \
    --region ap-northeast-1
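
The append script later in this post loads default.sample_table and assumes it already exists, but the post does not show how the table was created. Below is a minimal sketch of one way to create it with PyIceberg, assuming the four-column layout implied by the Arrow schema in that script (required long id, optional string name, optional double amount, required microsecond timestamp created_at). The names COLUMNS, build_schema, and ensure_sample_table, as well as the field IDs, are hypothetical and introduced only for illustration.

```python
# Sketch only: creates default.sample_table if the catalog does not know it yet.
# The column layout is inferred from the Arrow schema in the append script;
# helper names and field IDs are illustrative assumptions.

# (column name, Iceberg type name, required?)
COLUMNS = [
    ("id", "long", True),
    ("name", "string", False),
    ("amount", "double", False),
    ("created_at", "timestamp", True),  # microsecond precision, no time zone
]


def build_schema():
    """Translate COLUMNS into a pyiceberg Schema."""
    # Imported lazily so COLUMNS can be inspected without pyiceberg installed.
    from pyiceberg.schema import Schema
    from pyiceberg.types import (
        DoubleType,
        LongType,
        NestedField,
        StringType,
        TimestampType,
    )

    type_map = {
        "long": LongType,
        "string": StringType,
        "double": DoubleType,
        "timestamp": TimestampType,
    }
    fields = [
        NestedField(
            field_id=field_id,
            name=name,
            field_type=type_map[type_name](),
            required=required,
        )
        for field_id, (name, type_name, required) in enumerate(COLUMNS, start=1)
    ]
    return Schema(*fields)


def ensure_sample_table(catalog, identifier="default.sample_table"):
    """Load the table, creating it first if it does not exist."""
    from pyiceberg.exceptions import NoSuchTableError

    try:
        return catalog.load_table(identifier)
    except NoSuchTableError:
        return catalog.create_table(identifier, schema=build_schema())
```

With a catalog object like the one built in the script, calling ensure_sample_table(catalog, f"{database}.sample_table") could stand in for the plain load_table call on a first run.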

.env

AWS_ACCESS_KEY_ID=AKIXXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_REGION=ap-northeast-1
ICEBERG_BUCKET=my-iceberg-tokyo-20251206    # ← change this to the name of your new Tokyo bucket
GLUE_DATABASE=default
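
A missing or blank variable in .env tends to show up later as a confusing S3 or Glue error, so a small check before building the catalog can help. The sketch below uses only the standard library. The function name missing_settings and the choice of which variables count as required are assumptions for illustration; GLUE_DATABASE is left out because the script falls back to "default" when it is unset.

```python
import os

# Variables from the .env above that have no fallback in the script.
# (Which ones to treat as required is an illustrative choice.)
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "ICEBERG_BUCKET",
)


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]
```

Called right after load_dotenv(), a non-empty result from missing_settings() is a signal to stop before any AWS call is made.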

from dotenv import load_dotenv
load_dotenv()

import os
import pandas as pd
from pyiceberg.catalog import load_catalog
import pyarrow as pa

# ------------------- Settings -------------------
bucket = os.getenv("ICEBERG_BUCKET")
database = os.getenv("GLUE_DATABASE", "default")

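region = os.getenv("AWS_REGION", "ap-northeast-1")

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "warehouse": f"s3://{bucket}/warehouse/",
        "glue.region": region,  # pass the region to the Glue client explicitly
        "s3.region": region,
    }
)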

# The table already exists, so just load it
table = catalog.load_table(f"{database}.sample_table")

# ------------------- Data to append (the only part to change on each run) -------------------
df = pd.DataFrame({
    "id": [100, 101, 102, 103],
    "name": ["青森", "秋田", "山形", "岩手"],
    "amount": [999999.0, 888888.0, 777777.0, 666666.0],
    "created_at": pd.to_datetime(["2025-12-11", "2025-12-11", "2025-12-12", "2025-12-12"])
})

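# The key conversions (these settle the required/optional and timestamp[us] mismatches).
# pandas defaults to nanosecond timestamps, while the Iceberg timestamp type is microsecond precision.
df["created_at"] = df["created_at"].dt.tz_localize(None).astype("datetime64[us]")
df = df.astype({"id": "int64"})  # make id an int64

# Most important: tell PyArrow explicitly which fields are required (non-nullable),
# because from_pandas marks every field as nullable by default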
arrow_table = pa.Table.from_pandas(df).cast(
    target_schema=pa.schema([
        pa.field("id", pa.int64(), nullable=False),           # required long
        pa.field("name", pa.string(), nullable=True),         # optional string
        pa.field("amount", pa.float64(), nullable=True),      # optional double
        pa.field("created_at", pa.timestamp('us'), nullable=False)  # required timestamp(us)
    ])
)

# Write!
table.append(arrow_table)

print("Append succeeded!")
print(f"Current row count: {table.scan().to_arrow().num_rows}")
print("In Athena: SELECT * FROM default.sample_table;")
print("The schema mismatch errors should be gone now.")
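
The last step above is run by hand in the Athena console. The same check could be scripted with boto3. Below is a minimal sketch, assuming query results are written under the athena-results/ prefix named in the bucket command at the top (S3 prefixes come into existence when the first object is written, so nothing needs to be created beforehand). The function names build_athena_request and run_athena_query and the one-second polling interval are assumptions for illustration.

```python
import time


def build_athena_request(sql, database, bucket):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{bucket}/athena-results/"},
    }


def run_athena_query(sql, database, bucket, region="ap-northeast-1", poll_seconds=1.0):
    """Run a query, wait for it to finish, and return the raw result rows."""
    import boto3  # imported here so the request builder works without boto3

    athena = boto3.client("athena", region_name=region)
    query_id = athena.start_query_execution(
        **build_athena_request(sql, database, bucket)
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    # First page of results only; for a SELECT the first row holds the column headers.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

For example, run_athena_query("SELECT * FROM sample_table", "default", bucket) would return the header row followed by the appended rows, provided the credentials in .env are allowed to use Athena and write to the bucket.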
