
[python] Japanese file names come back URL-encoded when you mount files in AzureML

Posted at 2023-04-03, last updated at 2023-04-03

Azure, in Japanese please

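When you read files from an Azure file share or Blob storage and mount the dataset on a temp folder with dataset.mount(), the file names returned by os.listdir() have their Japanese characters URL-encoded (percent-encoded).
Annoyingly, the actual file names are still in Japanese, so a URL-encoded file name (path) can't be used to open the file.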
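To make the symptom concrete, here is a minimal sketch using only the standard library. The file name `資料.pptx` is a made-up example, not one from the actual dataset, and `quote()` is used here only to imitate the kind of percent-encoded name that os.listdir() returns on the mount.

```python
from urllib.parse import quote, unquote

# Hypothetical Japanese file name, used only for illustration
original = "資料.pptx"

# Imitate what the listing looks like: non-ASCII bytes become %XX sequences
encoded = quote(original)

# unquote() reverses the percent-encoding and restores the Japanese name
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

The extension survives encoding unchanged because `.` and ASCII letters are not escaped; only the Japanese part is turned into `%XX` sequences.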

Code

This is the code I wrote to run on a compute instance in Azure Machine Learning Studio.
The dataset points at a folder full of PowerPoint files, and the code tries to read the pptx files from it.

python

# Environment: Python 3.10 - SDK v2 (note: the azureml.core imports below belong to the v1 SDK)
from azureml.core import Workspace, Dataset
from pptx import Presentation
import os
import urllib.parse
import tempfile

# Workspace settings: replace the placeholders with your own values.
subscription_id = "<<subscription_id>>"
resource_group = "<<resource_group>>"
workspace_name = "<<workspace_name>>"

workspace = Workspace(subscription_id, resource_group, workspace_name)

# Dataset settings: replace the placeholder with your dataset name.
dataset = Dataset.get_by_name(workspace, name="<<Dataset_name>>")

# Mount the dataset on a temporary directory
mounted_path = tempfile.mkdtemp()
mount_context = dataset.mount(mounted_path)
mount_context.start()

file_names = os.listdir(mounted_path)

for name in file_names:
    # Convert the URL-encoded name back to Japanese with urllib.parse.unquote()
    decoded_name = urllib.parse.unquote(name)

    file_path = os.path.join(mounted_path, decoded_name)

    # Load the pptx file (match on the extension, not anywhere in the path)
    if file_path.endswith('.pptx'):
        prs = Presentation(file_path)

# Unmount once you are done with the files
mount_context.stop()
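The decode step in the loop above can also be pulled out into a small helper. This is only a sketch: `decoded_paths` is a hypothetical name, and a local temp directory with made-up file names stands in for the mounted dataset so it runs without Azure.

```python
import os
import tempfile
from urllib.parse import unquote


def decoded_paths(directory, suffix=".pptx"):
    """Return full paths for entries in `directory`, with percent-encoding undone."""
    paths = []
    for listed_name in os.listdir(directory):
        # Undo percent-encoding; names without %XX sequences pass through unchanged
        real_name = unquote(listed_name)
        if real_name.lower().endswith(suffix):
            paths.append(os.path.join(directory, real_name))
    return sorted(paths)


# Local stand-in for the mounted folder (file names are made up)
demo_dir = tempfile.mkdtemp()
for name in ("資料.pptx", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

pptx_paths = decoded_paths(demo_dir)
print(pptx_paths)
```

On the real mount you would pass `mounted_path` instead of `demo_dir`. One caveat: a file whose real name contains a literal `%XX` sequence would be changed by `unquote()`, so this assumes no such names exist in the dataset.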

Checking the result

It goes without saying.

↓

I edited the code loosely for this post, so it may contain bugs.
