
Listing Azure Blob Storage files in a pandas DataFrame with Python


I needed to list the files I had uploaded to Blob Storage and aggregate/filter them by file name, creation time, and so on, so I'm leaving this here as a memo.
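Before the script itself: rather than hardcoding the connection string, it can be pulled from an environment variable so it stays out of the source. A minimal sketch (the variable name `AZURE_STORAGE_CONNECTION_STRING` is my own convention here, not something the SDK requires):

```python
import os

# Read the connection string from an environment variable instead of
# hardcoding it in the script (the variable name is an assumption;
# pick any name you like)
CONNECTION_STRING = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")
```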

import os
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = ""
container_name = ""

# Connect to Blob Storage
blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(container_name)

# Build a dict of metadata for a single blob
def create_row_dict(blob):
    row_dict = {}
    row_dict["file_path"] = blob.name
    row_dict["directory"] = os.path.dirname(blob.name)
    row_dict["file_name"] = os.path.basename(blob.name)
    row_dict["creation_time"] = blob.creation_time
    row_dict["last_modified"] = blob.last_modified
    return row_dict

# Fetch blob metadata
blob_list = container_client.list_blobs()

# This takes very long, so parallelize it
with ThreadPoolExecutor(12) as e:
    ret = e.map(create_row_dict, blob_list)

# Convert the results into a DataFrame
df = pd.DataFrame(ret)

# Keep only csv files (str.endswith avoids '.' being treated as a regex wildcard)
only_csv_df = df[df["file_name"].str.endswith(".csv")].reset_index(drop=True)

# Keep only files created within a given period
only_csv_df["creation_time"] = pd.to_datetime(only_csv_df["creation_time"], utc=True, errors="coerce")
only_csv_df = only_csv_df.dropna(subset=["creation_time"]).sort_values("creation_time")
start = pd.Timestamp("2023-02-01", tz="UTC")
end = pd.Timestamp("2023-05-01", tz="UTC")
filtered_df = only_csv_df.query("creation_time >= @start & creation_time <= @end")
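Once the metadata is in a DataFrame, the aggregation mentioned at the top is a one-liner with `groupby`. A sketch using synthetic rows (the paths below are made up for illustration; in practice `df` comes from the listing above), counting files per directory:

```python
import os

import pandas as pd

# Synthetic blob paths standing in for real listing results
paths = [
    "raw/2023/a.csv",
    "raw/2023/b.csv",
    "processed/2023/a.csv",
]
df = pd.DataFrame(
    {
        "file_path": paths,
        "directory": [os.path.dirname(p) for p in paths],
        "file_name": [os.path.basename(p) for p in paths],
    }
)

# Count files per directory, largest first
counts = df.groupby("directory")["file_name"].count().sort_values(ascending=False)
print(counts)  # raw/2023 -> 2, processed/2023 -> 1
```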