
MIERUNE

Use pyogrio for fast GeoPandas (GeoDataFrame) read/write: it can read 10 million records in about 10 seconds!

Last updated at 2022-07-06, posted at 2022-06-07

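Everything except GeoParquet is too slow

The other day I wrote the article below, showing just how fast GeoParquet is to write and read.
"If you use GeoPandas, use GeoParquet, which is 10x faster than FlatGeobuf!"

The results looked like this.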

write

GeoParquet: 31.5 s
FlatGeobuf: 8min 6s
Shapefile: 9min 30s
GeoJSON: 10min
GeoPackage: 13min 37s

read

GeoParquet: 5.67 s
FlatGeobuf: 2min 57s
GeoPackage: 3min 12s
Shapefile: 3min 46s
GeoJSON: 4min 1s

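GeoParquet is the fastest by a wide margin.

However, GeoParquet is still an experimental format at this stage.
It has plenty of uses, such as temporarily dumping intermediate results in the middle of a computation, but using it casually in a production environment is somewhat risky.

So let's speed up I/O for the other formats as well.

Fortunately, there is a package that makes this easy, so let's gratefully put it to use.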

pyogrio: https://github.com/geopandas/pyogrio

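Fast read/write with pyogrio

Creating the demo data

For environment setup, please refer to the following, as usual.

GeoParquet is fast to begin with, so let's try the other formats: GeoPackage, FlatGeobuf, GeoJSON, and Shapefile.

First, generate random point data with GeoPandas, numpy, and shapely, as follows.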

%%time
import geopandas as gpd
import numpy as np
from shapely.geometry import box

%%time
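# Use a distinct name so the imported box() function is not shadowed
bbox = box(-180, -90, 180, 90)
data = {"geometry": [bbox.wkt]}

# unary_union works here; GeoPandas 1.0+ deprecates it in favor of union_all()
poly = gpd.GeoSeries.from_wkt(data["geometry"]).unary_union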

x_min, y_min, x_max, y_max = poly.bounds
# CPU times: user 1.28 ms, sys: 1.09 ms, total: 2.37 ms
# Wall time: 1.48 ms

%%time
n = 10_000_000

x = np.random.uniform(x_min, x_max, n)
y = np.random.uniform(y_min, y_max, n)
# CPU times: user 150 ms, sys: 30.1 ms, total: 180 ms
# Wall time: 179 ms

%%time
points = gpd.GeoSeries(gpd.points_from_xy(x, y))
points = points[points.within(poly)]
# CPU times: user 3.39 s, sys: 387 ms, total: 3.78 s
# Wall time: 3.78 s
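
A side note on the `points.within(poly)` filter: x and y are drawn from the bounds of `poly` itself, so the filter can only drop points that land exactly on the boundary, which is practically never. The following is a minimal standard-library sketch of that reasoning, not the GeoPandas code path. The helper `strictly_inside`, the seed, and the sample size are assumptions made for illustration.

```python
import random

# Same bounding box as the demo data (whole-world extent in degrees)
X_MIN, Y_MIN, X_MAX, Y_MAX = -180.0, -90.0, 180.0, 90.0


def strictly_inside(x, y):
    """Mimic 'within' for a box: boundary points do not count as inside."""
    return X_MIN < x < X_MAX and Y_MIN < y < Y_MAX


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 100_000              # far smaller than the article's 10 million

samples = [(rng.uniform(X_MIN, X_MAX), rng.uniform(Y_MIN, Y_MAX)) for _ in range(n)]
kept = [p for p in samples if strictly_inside(*p)]

print(f"{len(kept)} of {n} points kept")
```

This is why the record count is still 10 million after the filter in this article.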

%%time
gdf = gpd.GeoDataFrame(geometry=points)
# CPU times: user 406 ms, sys: 108 ms, total: 513 ms
# Wall time: 512 ms
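
The `CPU times` / `Wall time` comments in these cells come from Jupyter's `%%time` magic, which is not available in a plain Python script. If you want similar numbers outside a notebook, something like the following sketch would work. The name `timed` and the `timings` dictionary are hypothetical helpers, and unlike `%%time` this reports a single CPU figure (user + sys combined).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print CPU and wall time for the enclosed block, in the spirit of %%time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        print(f"{label}: CPU time {cpu:.3f} s, wall time {wall:.3f} s")
        if results is not None:
            results[label] = wall  # keep the wall time for later comparison


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(1_000_000))
```

In a script, each write or read call in this article could be wrapped in `with timed("..."):` in the same way.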

write

Next, import pyogrio and write out the GeoDataFrame.

import pyogrio

GeoPackage

%%time
# GeoPackage
pyogrio.write_dataframe(
    gdf,
    "pyogrio_export.gpkg",
    index=False,
    driver="GPKG",
    layer="layer",
    spatial_index="NO",
)
# CPU times: user 3min 17s, sys: 2min 40s, total: 5min 57s
# Wall time: 6min 1s

FlatGeobuf

%%time
# FlatGeobuf
pyogrio.write_dataframe(
    gdf,
    "pyogrio_export.fgb",
    index=False,
    driver="FlatGeobuf",
    spatial_index="NO",
)
# CPU times: user 59.9 s, sys: 11.1 s, total: 1min 11s
# Wall time: 1min 11s

GeoJSON

%%time
# GeoJSON
pyogrio.write_dataframe(
    gdf,
    "pyogrio_export.geojson",
    index=False,
    driver="GeoJSON",
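    # Note: GDAL's GeoJSON driver defines no SPATIAL_INDEX creation option,
    # so this keyword is not meaningful for this format.
    spatial_index="NO",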
)
# CPU times: user 59.9 s, sys: 2.75 s, total: 1min 2s
# Wall time: 1min 3s

Shapefile

%%time
# Shapefile
pyogrio.write_dataframe(
    gdf,
    "pyogrio_export.shp",
    index=False,
    driver="ESRI Shapefile",
    spatial_index="NO",
)
# CPU times: user 36.5 s, sys: 38.3 s, total: 1min 14s
# Wall time: 1min 15s

Compared with writing via the GeoDataFrame to_file method, the results are as follows.

GeoPackage: 13min 37s → 6min 1s
FlatGeobuf: 8min 6s → 1min 11s
GeoJSON: 10min → 1min 3s
Shapefile: 9min 30s → 1min 15s

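GeoPackage writes are still on the slow side, but GeoJSON, for example, dropped to roughly 1/10 of the original time (10min to 1min 3s), which is a substantial speedup.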
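
To put numbers on the before/after write timings listed above, here is a small sketch that converts the `%%time`-style strings to seconds and computes the speedup factor. The function `to_seconds` and the dictionary names are made up for this illustration; the timings themselves are copied from the list above.

```python
import re


def to_seconds(text):
    """Convert a duration such as '13min 37s', '10min' or '17.2 s' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*min)?\s*(?:(\d+(?:\.\d+)?)\s*s)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2) or 0)
    return minutes * 60 + seconds


# (to_file time, pyogrio time) for each format, as reported in this article
write_times = {
    "GeoPackage": ("13min 37s", "6min 1s"),
    "FlatGeobuf": ("8min 6s", "1min 11s"),
    "GeoJSON": ("10min", "1min 3s"),
    "Shapefile": ("9min 30s", "1min 15s"),
}

speedups = {
    name: to_seconds(before) / to_seconds(after)
    for name, (before, after) in write_times.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
# GeoPackage: 2.3x faster
# FlatGeobuf: 6.8x faster
# GeoJSON: 9.5x faster
# Shapefile: 7.6x faster
```

So "about 1/10" holds for GeoJSON, while GeoPackage improves by a little over 2x.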

Next, let's move on to reading.

read

GeoPackage

%%time
# GeoPackage
geo_package_gdf = pyogrio.read_dataframe("pyogrio_export.gpkg")
geo_package_gdf.head()
# CPU times: user 10 s, sys: 3.84 s, total: 13.8 s
# Wall time: 17.2 s
# geometry
# 0	POINT (-131.98306 55.73088)
# 1	POINT (-105.69364 -79.89538)
# 2	POINT (-128.63403 58.42864)
# 3	POINT (-96.14466 0.41497)
# 4	POINT (-152.41998 67.50038)

FlatGeobuf

%%time
# FlatGeobuf
flat_geobuf_gdf = pyogrio.read_dataframe("pyogrio_export.fgb")
flat_geobuf_gdf.head()
# CPU times: user 8.47 s, sys: 2.2 s, total: 10.7 s
# Wall time: 10.8 s
# geometry
# 0	POINT (179.92146 -89.98066)
# 1	POINT (179.85347 -89.95284)
# 2	POINT (179.94596 -89.95279)
# 3	POINT (179.97843 -89.95176)
# 4	POINT (179.96145 -89.84289)

GeoJSON

%%time
# GeoJSON
geojson_gdf = pyogrio.read_dataframe("pyogrio_export.geojson")
geojson_gdf.head()
# CPU times: user 1min 24s, sys: 2.68 s, total: 1min 27s
# Wall time: 1min 26s
# geometry
# 0	POINT (-131.98306 55.73088)
# 1	POINT (-105.69364 -79.89538)
# 2	POINT (-128.63403 58.42864)
# 3	POINT (-96.14466 0.41497)
# 4	POINT (-152.41998 67.50038)

Shapefile

%%time
# Shapefile
shapefile_gdf = pyogrio.read_dataframe("pyogrio_export.shp")
shapefile_gdf.head()
# CPU times: user 13.7 s, sys: 2.03 s, total: 15.8 s
# Wall time: 15.9 s
# FID	geometry
# 0	0	POINT (-131.98306 55.73088)
# 1	1	POINT (-105.69364 -79.89538)
# 2	2	POINT (-128.63403 58.42864)
# 3	3	POINT (-96.14466 0.41497)
# 4	4	POINT (-152.41998 67.50038)

Let's compare these, too, with the times from the previous article, this time for reading with GeoPandas' read_file.

GeoPackage: 3min 12s → 17.2 s
FlatGeobuf: 2min 57s → 10.8 s
GeoJSON: 4min 1s → 1min 26s
Shapefile: 3min 46s → 15.9 s

GeoJSON alone is somewhat slow because it is a text format, but every other format finishes reading within 20 s.
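
Another way to look at the read results is throughput. The sketch below simply divides the 10 million records by the pyogrio wall times reported above; the variable names are made up for this illustration.

```python
N_RECORDS = 10_000_000

# pyogrio read wall times in seconds, from the results above (1min 26s = 86 s)
read_wall_seconds = {
    "GeoPackage": 17.2,
    "FlatGeobuf": 10.8,
    "GeoJSON": 86.0,
    "Shapefile": 15.9,
}

throughput = {name: N_RECORDS / secs for name, secs in read_wall_seconds.items()}

for name, rate in sorted(throughput.items(), key=lambda item: -item[1]):
    print(f"{name}: {rate:,.0f} records/s")
# FlatGeobuf: 925,926 records/s
# Shapefile: 628,931 records/s
# GeoPackage: 581,395 records/s
# GeoJSON: 116,279 records/s
```

On this measurement FlatGeobuf reads a little under one million point records per second, which is where the "10 million records in about 10 seconds" in the title comes from.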

Only the FlatGeobuf output comes back in a sorted order, but all 10 million records appear to be present.
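
When a file returns rows in a different order, comparing the row count alone does not prove the contents match. One simple way to check is an order-independent comparison of the coordinates, sketched below with the standard library. `same_records` is a hypothetical helper, and `reordered` is a stand-in for reordered output, not actual FlatGeobuf data. With a real GeoDataFrame the (x, y) pairs could be taken from `gdf.geometry.x` and `gdf.geometry.y`, though building a Counter over 10 million tuples will use a fair amount of memory.

```python
from collections import Counter


def same_records(points_a, points_b):
    """Return True when both sequences hold the same (x, y) pairs, ignoring order."""
    return Counter(points_a) == Counter(points_b)


# Coordinates taken from the GeoPackage head() output earlier in this article
original = [
    (-131.98306, 55.73088),
    (-105.69364, -79.89538),
    (-128.63403, 58.42864),
    (-96.14466, 0.41497),
    (-152.41998, 67.50038),
]

# Hypothetical stand-in for a file that returns the same rows in another order
reordered = sorted(original)

print(original == reordered)              # False: the row order differs
print(same_records(original, reordered))  # True: the contents are identical
```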

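Conclusion

Like GeoParquet, the pyogrio package introduced here is still at an experimental stage. Using it in a production environment is therefore risky, and the data it writes may need to be validated. For work on a local machine, though, it is a very good option because it lets you handle even large datasets efficiently.

Give it a try.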
