
Narrowing the objects in a bucket to recent date-named files with start_offset, end_offset, (and prefix) in Python's google-cloud-storage

Posted at 2021-05-19

Background

The objects in a target bucket can be listed as follows.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('BUCKET_NAME')
blobs = client.list_blobs(bucket)
for blob in blobs:
    print(blob.name)

Example output

others/20210515.csv
others/20210516.csv
others/20210517.csv
others/20210518.csv
others/20210519.csv
others/20210520.csv
path/to/files/20210516.csv
path/to/files/20210517.csv
path/to/files/20210518.csv
path/to/files/20210519.csv
path/to/files/20210520.csv

Now suppose we want to fetch only the objects from two days before the run date onward.
prefix alone cannot express this.
start_offset can, so this post shows how to use it.

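First, filtering by prefix

For completeness, here is the commonly used approach of filtering by prefix.
It narrows the listing to objects whose names start with the path given in prefix.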

prefix.py

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('BUCKET_NAME')
blobs = client.list_blobs(bucket, prefix='path/to/files')
for blob in blobs:
    print(blob.name)

Example output

path/to/files/20210516.csv
path/to/files/20210517.csv
path/to/files/20210518.csv
path/to/files/20210519.csv
path/to/files/20210520.csv
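Note that prefix is a plain string match on the object name, so whether it ends with a slash can matter. The following is a minimal local sketch, not a call to GCS: `simulate_prefix` is a hypothetical helper that mimics the filter with `str.startswith`, and `path/to/files_backup/` is an invented sibling path used only for illustration.

```python
def simulate_prefix(names, prefix):
    """Local stand-in for prefix filtering: keep names that start with prefix."""
    return [name for name in names if name.startswith(prefix)]


names = [
    "path/to/files/20210519.csv",
    "path/to/files/20210520.csv",
    "path/to/files_backup/20210520.csv",  # hypothetical sibling path, not in the original bucket
]

# Without a trailing slash the sibling path also matches.
without_slash = simulate_prefix(names, "path/to/files")
# With a trailing slash only objects under path/to/files/ match.
with_slash = simulate_prefix(names, "path/to/files/")

print(without_slash)
print(with_slash)
```

If the bucket may contain such sibling paths, passing `prefix='path/to/files/'` with the trailing slash is the safer choice.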

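Narrowing to the last 2 days with start_offset

Suppose the script runs on 2021/05/20 and we only want the files from 2021/05/18, two days earlier, onward. Using start_offset as shown here narrows the listing accordingly.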

from google.cloud import storage
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

client = storage.Client()
bucket = client.bucket('BUCKET_NAME')
now = datetime.now(ZoneInfo("Asia/Tokyo"))
two_days_ago_str = (now - timedelta(days=2)).strftime('%Y%m%d')
blobs = list(client.list_blobs(
    bucket,
    prefix='path/to/files/',
    start_offset=f'path/to/files/{two_days_ago_str}'
))
for blob in blobs:
    print(blob.name)

Example output

path/to/files/20210518.csv
path/to/files/20210519.csv
path/to/files/20210520.csv
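The time zone used for "now" decides which date ends up in start_offset. The sketch below isolates that calculation so it can be checked without touching GCS. `start_offset_for` is a hypothetical helper, and a fixed +09:00 offset stands in for `ZoneInfo("Asia/Tokyo")` (an assumption that is safe here only because Japan has no daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for ZoneInfo("Asia/Tokyo"); assumption for this sketch.
JST = timezone(timedelta(hours=9))


def start_offset_for(now, days_back, prefix):
    """Build a start_offset for objects named <prefix>YYYYMMDD..."""
    first_day = now - timedelta(days=days_back)
    return f"{prefix}{first_day:%Y%m%d}"


# The same instant expressed in two time zones.
now_jst = datetime(2021, 5, 20, 0, 30, tzinfo=JST)
now_utc = now_jst.astimezone(timezone.utc)  # 2021-05-19 15:30 UTC

offset_jst = start_offset_for(now_jst, 2, "path/to/files/")
offset_utc = start_offset_for(now_utc, 2, "path/to/files/")

print(offset_jst)  # path/to/files/20210518
print(offset_utc)  # path/to/files/20210517
```

Shortly after midnight in Japan the UTC date is still the previous day, so computing the offset in UTC would include one extra day of files.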

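The output confirms that the listing is narrowed to objects dated on or after the date given in start_offset.
GCS lists objects in lexicographic order of their names, and when start_offset is specified it returns only the objects whose names are lexicographically equal to or after that string.
When end_offset is specified, the listing is narrowed to objects whose names are lexicographically before that string (a name equal to end_offset is excluded).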
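Combining the two offsets gives a bounded date range. The following is a minimal sketch under a few assumptions: `date_range_offsets`, `simulate_offsets`, and `list_blob_names_between` are hypothetical helpers, the object names are ASCII with zero-padded YYYYMMDD dates (so string order matches date order), and the local simulation only approximates the server-side filter as `start_offset <= name < end_offset`.

```python
from datetime import date, timedelta


def date_range_offsets(prefix, first_day, last_day):
    """Return (start_offset, end_offset) covering first_day..last_day inclusive."""
    start_offset = f"{prefix}{first_day:%Y%m%d}"
    # end_offset is exclusive, so point it at the day after last_day.
    end_offset = f"{prefix}{last_day + timedelta(days=1):%Y%m%d}"
    return start_offset, end_offset


def simulate_offsets(names, start_offset, end_offset):
    """Local stand-in for the offset filter: start_offset <= name < end_offset."""
    return [name for name in sorted(names) if start_offset <= name < end_offset]


def list_blob_names_between(client, bucket_name, prefix, first_day, last_day):
    """Same range against a real bucket; client is a google.cloud.storage.Client."""
    start_offset, end_offset = date_range_offsets(prefix, first_day, last_day)
    blobs = client.list_blobs(
        bucket_name,
        prefix=prefix,
        start_offset=start_offset,
        end_offset=end_offset,
    )
    return [blob.name for blob in blobs]


# Local check using the object names from the example output above.
names = [f"path/to/files/202105{day}.csv" for day in range(16, 21)]
start, end = date_range_offsets("path/to/files/", date(2021, 5, 17), date(2021, 5, 18))
selected = simulate_offsets(names, start, end)
print(selected)
```

Here `end` is `path/to/files/20210519`, and `path/to/files/20210519.csv` sorts after that string, so only the 17th and 18th are selected.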

References

Storage Client#list_blobs
