【PLATEAU×Python】東京23区の建物DB構築（後編）— Pythonで建物データベースを構築する

Last updated at 2026-04-12Posted at 2026-04-12

シリーズ記事

前編: CityGMLのデータ構造を完全理解

後編（この記事）: Pythonで建物データベースを構築する

はじめに

前編では、PLATEAU CityGML のデータ構造を解説しました。ZIP の中身、GML ファイルの構造、建物の18属性、コードリストの仕組みを理解しました。

この後編では、その知識をもとに Python だけで東京23区約100万棟の建物データベースを構築 します。

処理の全体像

環境構築

pip install pandas pyarrow

標準ライブラリの xml.etree.ElementTree と zipfile を使うため、追加の XML パーサーは不要です。

Step 1: ZIP からGMLファイルを探索する

まず、指定フォルダ内の CityGML ZIP を一覧化し、各 ZIP 内の udx/bldg/*.gml を取得します。

from pathlib import Path
import zipfile
import pandas as pd

INPUT_ZIP_ROOT = Path("./citygml_input_zips")

# ファイル名に "citygml" を含む ZIP を再帰検索
zip_paths = sorted([
    p.resolve()
    for p in INPUT_ZIP_ROOT.rglob("*.zip")
    if "citygml" in p.name.lower()
])

print(f"取得ZIP数: {len(zip_paths)}")
# → 取得ZIP数: 23

各 ZIP の udx/bldg/ 配下にある GML ファイルを取得します。

zip_gml_map = {}
for zip_path in zip_paths:
    with zipfile.ZipFile(str(zip_path), "r") as zf:
        gml_list = sorted([
            n for n in zf.namelist()
            if n.startswith("udx/bldg/") and n.lower().endswith(".gml")
        ])
    zip_gml_map[str(zip_path)] = gml_list

total_gml = sum(len(v) for v in zip_gml_map.values())
print(f"対象GML総数: {total_gml}")
# → 対象GML総数: 985

Step 2: XML 解析のためのヘルパー関数

CityGML は名前空間付きの XML です。名前空間を毎回指定するのは面倒なので、local_name でプレフィックスを除去するヘルパーを用意します。

import xml.etree.ElementTree as ET

def local_name(tag):
    """名前空間プレフィックスを除去"""
    return tag.split("}", 1)[-1] if "}" in tag else tag

def direct_children(elem, key):
    """直下の子要素で local_name が一致するもの"""
    return [child for child in elem if local_name(child.tag) == key]

def first_direct_child(elem, key):
    children = direct_children(elem, key)
    return children[0] if children else None

def first_text(elem, key):
    """直下の子要素のテキストを取得"""
    child = first_direct_child(elem, key)
    if child is None:
        return None
    text = (child.text or "").strip()
    return text if text != "" else None

def descendant_first_text(elem, target_local_name):
    """子孫要素を再帰探索してテキストを取得"""
    for child in elem.iter():
        if local_name(child.tag) == target_local_name:
            text = (child.text or "").strip()
            if text != "":
                return text
    return None

関数	用途	戻り値
`direct_children(elem, key)`	直下の子要素を取得	リスト
`first_direct_child(elem, key)`	最初の子要素を取得	要素 or None
`first_text(elem, key)`	子要素のテキストを取得	文字列 or None
`descendant_first_text(elem, key)`	子孫を再帰探索してテキスト取得	文字列 or None

Step 3: 建物属性の抽出関数

各 bldg:Building 要素から、DB化に必要な属性を取り出す関数群を実装します。

建物ID と自治体コード

def extract_building_id_and_city_code(building_elem):
    attr_elem = first_direct_child(building_elem, "buildingIDAttribute")
    if attr_elem is None:
        return None, None
    building_id = descendant_first_text(attr_elem, "buildingID")
    city_code = descendant_first_text(attr_elem, "city")
    return building_id, city_code

# 使用例
# → ("13103-bldg-904", "13103")

住所

def extract_address_text(building_elem):
    address_elem = first_direct_child(building_elem, "address")
    if address_elem is None:
        return None
    # LocalityName → AddressLine → CountryName の順で探索
    for tag in ("LocalityName", "AddressLine", "CountryName"):
        value = descendant_first_text(address_elem, tag)
        if value is not None:
            return value
    return None

# 使用例
# → "東京都港区赤坂一丁目"

lod0RoofEdge（建物外周座標）

def extract_lod0_poslists(building_elem):
    lod0_elem = first_direct_child(building_elem, "lod0RoofEdge")
    if lod0_elem is None:
        return None
    poslists = []
    for child in lod0_elem.iter():
        if local_name(child.tag) == "posList":
            text = (child.text or "").strip()
            if text != "":
                poslists.append(text)
    return poslists if poslists else None

# 使用例: 緯度 経度 高さ の3つ組が並ぶ座標列
# → ["35.667... 139.742... 0 35.668... 139.742... 0 ..."]

Step 4: 高さの補完ロジック

PLATEAU データでは、measuredHeight に無効値（0, 9999, 負値）が入っていることがあります。その場合、3D ジオメトリの Z 座標差から高さを推定します。

INVALID_HEIGHT_VALUES = {9999, -9999, 0, -1}

def parse_poslist_z_values(pos_text):
    """posList から Z値（3番目ごと）を抽出"""
    vals = [float(x) for x in pos_text.strip().split()]
    if len(vals) >= 3 and len(vals) % 3 == 0:
        return vals[2::3]
    return []

def z_range_from_elem(elem):
    """要素内の全 posList から Z の max-min を計算"""
    if elem is None:
        return None
    z_values = []
    for child in elem.iter():
        if local_name(child.tag) == "posList":
            text = (child.text or "").strip()
            if text != "":
                z_values.extend(parse_poslist_z_values(text))
    if not z_values:
        return None
    return max(z_values) - min(z_values)

def is_invalid_height(value):
    if value is None:
        return True
    try:
        v = float(value)
    except (ValueError, TypeError):
        return True
    return v in INVALID_HEIGHT_VALUES or v <= 0

def resolve_height(building_elem):
    """高さを解決する（フォールバック付き）"""
    measured = first_text(building_elem, "measuredHeight")
    if not is_invalid_height(measured):
        return float(measured), "measuredHeight"

    # lod1Solid → lod2Solid → boundedBy の順で推定
    for source_name, key in [
        ("lod1Solid_z", "lod1Solid"),
        ("lod2Solid_z", "lod2Solid"),
    ]:
        elem = first_direct_child(building_elem, key)
        h = z_range_from_elem(elem)
        if h is not None and h > 0:
            return float(h), source_name

    # boundedBy（複数要素をまとめて計算）
    bounded_list = direct_children(building_elem, "boundedBy")
    if bounded_list:
        dummy = ET.Element("dummy")
        for b in bounded_list:
            dummy.append(b)
        h = z_range_from_elem(dummy)
        if h is not None and h > 0:
            return float(h), "boundedBy_z"

    return None, None

Step 5: 全 ZIP を処理してDBを構築する

準備が整ったので、23区分の全 ZIP を処理します。メモリ効率のため、100,000件ごとに Parquet ファイルに分割保存 します。

import json
import time
from pathlib import Path

OUTPUT_DIR = Path("./plateau_building_db")
PARTS_DIR = OUTPUT_DIR / "parts"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PARTS_DIR.mkdir(parents=True, exist_ok=True)

ROWS_PER_PART = 100_000
OUTPUT_COLUMNS = [
    "source_zip", "gml_file", "gml_id", "building_id", "city_code",
    "area_code_full", "class", "usage", "measuredHeight", "height_source",
    "storeysAboveGround", "storeysBelowGround", "address", "lod0RoofEdge_posList",
]

TARGET_SA_NAME = "13+区市町村コード+大字・町コード+町・丁目コード"

メインループ

rows_buffer = []
part_infos = []
error_rows = []
part_no = 1
building_count = 0

for zip_path in zip_paths:
    zip_name = Path(zip_path).name
    gml_list = zip_gml_map[str(zip_path)]
    print(f"--- {zip_name} | GML数={len(gml_list)} ---")

    with zipfile.ZipFile(str(zip_path), "r") as zf:
        for gml_path in gml_list:
            raw = zf.read(gml_path)
            root = ET.fromstring(raw)

            for elem in root.iter():
                if local_name(elem.tag) != "Building":
                    continue

                gml_id = elem.attrib.get("{http://www.opengis.net/gml}id")
                building_id, city_code = extract_building_id_and_city_code(elem)
                usage_values = all_texts(elem, "usage")
                height_value, height_source = resolve_height(elem)
                lod0_poslists = extract_lod0_poslists(elem)

                record = {
                    "source_zip": zip_name,
                    "gml_file": gml_path,
                    "gml_id": gml_id,
                    "building_id": building_id,
                    "city_code": city_code,
                    "area_code_full": extract_string_attribute_value(
                        elem, TARGET_SA_NAME
                    ),
                    "class": first_text(elem, "class"),
                    "usage": "|".join(usage_values) if usage_values else None,
                    "measuredHeight": height_value,
                    "height_source": height_source,
                    "storeysAboveGround": first_text(elem, "storeysAboveGround"),
                    "storeysBelowGround": first_text(elem, "storeysBelowGround"),
                    "address": extract_address_text(elem),
                    "lod0RoofEdge_posList": (
                        json.dumps(lod0_poslists, ensure_ascii=False)
                        if lod0_poslists else None
                    ),
                }
                rows_buffer.append(record)
                building_count += 1

                # 100,000件ごとに分割保存
                if len(rows_buffer) >= ROWS_PER_PART:
                    df = pd.DataFrame(rows_buffer, columns=OUTPUT_COLUMNS)
                    path = PARTS_DIR / f"buildings_part_{part_no:05d}.parquet"
                    df.to_parquet(path, index=False)
                    print(f"Parquet保存: {path} | rows={len(df)}")
                    part_no += 1
                    rows_buffer = []

# 残りを保存
if rows_buffer:
    df = pd.DataFrame(rows_buffer, columns=OUTPUT_COLUMNS)
    path = PARTS_DIR / f"buildings_part_{part_no:05d}.parquet"
    df.to_parquet(path, index=False)

print(f"\n=== 完了 ===")
print(f"抽出建物件数: {building_count:,}")

実行結果

実際にColabで23区分を処理した結果：

--- 13101_chiyoda-ku_pref_2023_citygml_2_op.zip | GML数=21 ---
--- 13102_chuo-ku_pref_2023_citygml_2_op.zip | GML数=22 ---
--- 13103_minato-ku_pref_2023_citygml_2_op.zip | GML数=39 ---
  ...
Parquet保存: buildings_part_00001.parquet | rows=100000
Parquet保存: buildings_part_00002.parquet | rows=100000
  ...
Parquet保存: buildings_part_00014.parquet | rows=100000

14パート × 100,000件 = 140万棟以上 のデータが Parquet 形式で保存されました。

Step 6: 構築したDBを確認する

サンプルデータの確認（10,000件）

sample_df = pd.read_csv("building_sample_100.csv")
print(f"件数: {len(sample_df)}")
print(sample_df.isna().sum())

件数: 10000

欠損数:
gml_id                     0
gml_file                   0
building_id                0
city_code                  0
area_code_full          1308   ← 約13%
class                      0
usage                      0
measuredHeight             0
height_source              0
storeysAboveGround         0
storeysBelowGround         0
address                 1305   ← 約13%
lod0RoofEdge_posList       0

area_code_full と address に約13%の欠損がありますが、それ以外は完全に取得できています。

基本統計量

print(sample_df[["measuredHeight", "storeysAboveGround", "storeysBelowGround"]].describe())

       measuredHeight  storeysAboveGround  storeysBelowGround
count    10000.000000        10000.000000        10000.000000
mean        16.550754         1311.728700         1307.995800
std         14.700770         3370.150747         3371.597256
min          2.243000            0.000000            0.000000
25%          7.900000            2.000000            0.000000
50%         11.900000            4.000000            0.000000
75%         21.200000            8.000000            0.000000
max        264.600000         9999.000000         9999.000000

注意: storeysAboveGround の mean が 1311 と異常に高いのは、無効値 9999 が混在しているためです。実利用時はフィルタリングが必要です。

全パートの結合読み込み

from pathlib import Path
import pandas as pd

parts = sorted(Path("./plateau_building_db/parts").glob("*.parquet"))
df = pd.concat([pd.read_parquet(p) for p in parts], ignore_index=True)
print(f"総建物数: {len(df):,}")
print(df.groupby("city_code").size().sort_index())

ファイル構成

ツールを再利用しやすいように、モジュール分割しています。

plateau-building-db/
├── main.py               # CLI エントリポイント
├── config.py             # 設定・定数
├── xml_utils.py          # XML解析ヘルパー + 高さ補完ロジック
├── zip_scanner.py        # ZIP探索
├── building_extractor.py # 建物レコード生成
├── requirements.txt
└── README.md

実行コマンド

# 全件処理
python main.py --input_dir ./citygml_input_zips

# サンプル（10,000件で停止）
python main.py --input_dir ./citygml_input_zips --sample 10000

# 出力先とパートサイズを変更
python main.py --input_dir ./zips --output_dir ./output --rows_per_part 50000

まとめ

標準ライブラリ（xml.etree.ElementTree + zipfile）だけで CityGML を解析できる
measuredHeight の無効値は 3Dジオメトリの Z差分で補完する
100,000件ずつ Parquet 分割保存でメモリ効率を確保
東京23区で 190万棟以上 の建物DBが構築できた
Parquet 形式なので pandas や DuckDB で直接クエリ可能

このDBを使えば、区ごとの建物高さ分布、用途別の統計、地図上のポリゴン可視化など、さまざまな分析が可能になります。

GitHub

ソースコード一式は以下のリポジトリで公開しています。

シリーズ記事

前編: CityGMLのデータ構造を完全理解
後編（この記事）: Pythonで建物データベースを構築する

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up