
Listing the files in Azure Blob Storage into a Pandas DataFrame with Python

Posted at 2023-11-16

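I had uploaded files to Blob Storage without much organization and later needed to list them and aggregate or filter them by file name, creation time, and so on, so here is a memo for future reference.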

import os
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Fill in your storage account connection string and container name
CONNECTION_STRING = ""
container_name = ""

# Connect to Blob Storage
blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(container_name)

# Create an empty DataFrame (column names match the keys built in create_row_dict)
cols = ["file_path", "directory", "file_name", "creation_time", "last_modified"]
df = pd.DataFrame(columns=cols)

# Pack the properties of one blob into a dict (one row of the DataFrame)
def create_row_dict(blob):
    row_dict = {}
    row_dict["file_path"] = blob.name
    row_dict["directory"] = os.path.dirname(blob.name)
    row_dict["file_name"] = os.path.basename(blob.name)
    row_dict["creation_time"] = blob.creation_time
    row_dict["last_modified"] = blob.last_modified
    return row_dict

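# Fetch the blob listing (a lazy, paged iterator)
blob_list = container_client.list_blobs()

# This takes a very long time, so run it in parallel.
# Note: Executor.map collects blob_list in the calling thread, so the paged
# listing requests themselves still run sequentially; only create_row_dict
# is spread across the worker threads.
with ThreadPoolExecutor(12) as e:
    ret = e.map(create_row_dict, blob_list)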

# Convert the row dicts into a DataFrame (one row per blob, index 0..n-1)
df = pd.DataFrame(list(ret))

# Keep only CSV files
# (str.contains('.csv') treats '.' as a regex wildcard, so match the literal suffix instead)
only_csv_df = df[df["file_name"].str.endswith(".csv")].reset_index(drop=True)

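# Keep only files created within a specific period
only_csv_df["creation_time"] = pd.to_datetime(only_csv_df["creation_time"], utc=True, errors="coerce")
# Drop rows whose creation time could not be parsed; calling dropna() on the
# Series before assignment has no effect because assignment realigns on the index
only_csv_df = only_csv_df.dropna(subset=["creation_time"]).sort_values("creation_time")
start = pd.Timestamp("2023-02-01", tz="UTC")
end = pd.Timestamp("2023-05-01", tz="UTC")
filtered_df = only_csv_df[only_csv_df["creation_time"].between(start, end)]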
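
The script above stops at filtering, while the goal stated at the top also includes aggregation. As a rough sketch of that step, the following counts files per directory and finds the earliest and latest creation time. The helper names `blobs_to_frame`, `summarize_by_directory`, and `_fake_blob` are hypothetical and not part of the original script, and the sample blobs are stand-ins built with `SimpleNamespace` so the example runs without an Azure connection. With a real container, the result of `container_client.list_blobs()` could presumably be passed to `blobs_to_frame` instead.

```python
import os
from datetime import datetime, timezone
from types import SimpleNamespace

import pandas as pd

COLUMNS = ["file_path", "directory", "file_name", "creation_time", "last_modified"]


def blobs_to_frame(blobs):
    """Build a DataFrame from objects exposing name, creation_time, last_modified."""
    rows = [
        {
            "file_path": b.name,
            "directory": os.path.dirname(b.name),
            "file_name": os.path.basename(b.name),
            "creation_time": b.creation_time,
            "last_modified": b.last_modified,
        }
        for b in blobs
    ]
    # Passing columns keeps the schema stable even when there are no blobs.
    return pd.DataFrame(rows, columns=COLUMNS)


def summarize_by_directory(df):
    """Count files and find the earliest/latest creation time per directory."""
    return (
        df.groupby("directory")
        .agg(
            file_count=("file_name", "count"),
            first_created=("creation_time", "min"),
            last_created=("creation_time", "max"),
        )
        .reset_index()
    )


def _fake_blob(name, day):
    """Hypothetical stand-in for a blob's properties (assumption, for illustration only)."""
    ts = datetime(2023, 3, day, tzinfo=timezone.utc)
    return SimpleNamespace(name=name, creation_time=ts, last_modified=ts)


sample_blobs = [
    _fake_blob("logs/a.csv", 1),
    _fake_blob("logs/b.csv", 5),
    _fake_blob("raw/c.csv", 3),
]
sample_df = blobs_to_frame(sample_blobs)
summary = summarize_by_directory(sample_df)
print(summary)
```

Building the DataFrame from a list of dicts in one call avoids appending row by row, and named aggregation keeps the output column names readable.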
