Part 5: Uploading a Dataset to the Hugging Face Hub

Posted at 2025-01-07

Previous post

In this post we create a dataset on the Hub, load it locally, process it, and upload it again.

Upload with Hub UI

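First, create a new dataset. The option is reachable from your profile menu.

This time we create a private dataset.

On the new dataset's "Files and versions" tab, choose "Add file" and upload the data.

Here we upload two CSV files like the following. Their contents are identical.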

train.csv

observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0

test.csv

observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0
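
The same two files could also be produced and uploaded from a script instead of the UI. The following is a minimal sketch and not part of the original walkthrough: it writes the sample rows with the standard `csv` module and defines an upload helper built on `HfApi.create_repo` and `HfApi.upload_file` from `huggingface_hub`. The helper names `write_sample_csvs` and `upload_csvs`, the output directory, and the repo id you pass in are assumptions for illustration. The upload step needs a logged-in session, so it is only defined here, not called.

```python
import csv
from pathlib import Path

HEADER = [
    "observation.state_x", "observation.state_y", "action_x", "action_y",
    "episode_index", "frame_index", "timestamp", "next.reward",
    "next.done", "next.success", "index", "task_index",
]
ROWS = [
    ["222.0", "97.0", "233.0", "71.0", "0", "0", "0.0",
     "0.19029748439788818", "False", "False", "0", "0"],
    ["225.2523956298828", "89.31253051757812", "229.0", "83.0", "0", "1",
     "0.10000000149011612", "0.19029748439788818", "False", "False", "1", "0"],
]


def write_sample_csvs(out_dir):
    """Write identical train.csv and test.csv files and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in ("train.csv", "test.csv"):
        path = out_dir / name
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(HEADER)
            writer.writerows(ROWS)
        paths.append(path)
    return paths


def upload_csvs(paths, repo_id):
    """Hypothetical helper: create a private dataset repo and upload the files.

    Requires `pip install huggingface_hub` and a prior login.
    """
    from huggingface_hub import HfApi  # imported lazily; only needed for the upload

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
    for path in paths:
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
            repo_type="dataset",
        )
```

Calling `write_sample_csvs("data")` and then `upload_csvs(paths, "<user>/<dataset>")` would mirror the manual steps above, assuming the token in use has write access.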

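Next, add a dataset card. It is essential for helping users understand the dataset.
Fields such as the license and task are picked like labels, and the selections are reflected in the card automatically.

Apparently the dataset card of a private dataset can only be viewed by Pro users.

Good datasets are very easy to understand.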
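
On the Hub, those selections are stored as a YAML metadata block at the top of the repository's README.md. A rough sketch of generating such a card as text is shown here. The function name `build_dataset_card` and the specific values (`mit`, `robotics`, the description) are placeholders chosen for illustration, not the settings used in this tutorial.

```python
def build_dataset_card(license_id, task_categories, title, description):
    """Return README.md text: a YAML metadata block followed by a Markdown body."""
    lines = ["---", f"license: {license_id}", "task_categories:"]
    lines += [f"- {task}" for task in task_categories]
    lines += ["---", "", f"# {title}", "", description, ""]
    return "\n".join(lines)


card = build_dataset_card(
    license_id="mit",              # placeholder value
    task_categories=["robotics"],  # placeholder value
    title="dataset_tutorial_lerobot",
    description="Placeholder description of the sample CSV files.",
)
print(card)
```

Saving the returned text as README.md in the dataset repository should have the same effect as filling in the fields through the UI.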

Upload with Python

To upload a dataset from Python, you need to log in to the Hugging Face Hub.

pip install huggingface_hub datasets
huggingface-cli login

Enter your token at the following prompt.

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):

A token can be created from your profile settings (https://huggingface.co/settings/tokens).

Load the dataset we just created.
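
For scripts where an interactive prompt is inconvenient, the login can also be done from Python with `huggingface_hub.login`. The sketch below assumes the token is stored in an environment variable. The helper name `login_from_env` and the choice of `HF_TOKEN` as the default variable name are illustrative assumptions.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in with a token read from an environment variable.

    Returns False without contacting the Hub when the variable is unset.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    from huggingface_hub import login  # imported lazily; requires huggingface_hub

    login(token=token)
    return True
```

Keeping the token out of the source file this way avoids committing it to a repository by accident.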

>>> from datasets import load_dataset
>>> dataset = load_dataset("Takoko/dataset_tutorial_lerobot", split="train")
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70.0/70.0 [00:00<00:00, 263kB/s]
Generating train split: 2 examples [00:00, 936.54 examples/s]
Generating test split: 2 examples [00:00, 1186.84 examples/s]

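Apply some light processing and upload the result. Here we use the map function from Part 3 to compute the per-step change in the observation from observation.state_x and observation.state_y, and add it to the data.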

>>> def add_obs_diff(examples, indices):
...     # Initialize the new keys to store the differences
...     examples["observation.state_diff_x"] = []
...     examples["observation.state_diff_y"] = []
...     for idx, (state_x, state_y) in zip(indices, zip(examples["observation.state_x"], examples["observation.state_y"])):
...         if idx == 0:  # Handle edge case for the first element
...             diff_x = 0.0
...             diff_y = 0.0
...         else:
...             # Calculate the difference with the previous element in the dataset
...             diff_x = float(state_x) - float(dataset[idx - 1]["observation.state_x"])
...             diff_y = float(state_y) - float(dataset[idx - 1]["observation.state_y"])
...         # Append the computed differences to the respective lists
...         examples["observation.state_diff_x"].append(diff_x)
...         examples["observation.state_diff_y"].append(diff_y)
...     return examples
... 
>>> # map returns a new Dataset, so keep the result before uploading
>>> dataset = dataset.map(add_obs_diff, batched=True, with_indices=True)
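
The function above looks up the previous row in the global `dataset` one index at a time, and its difference also spans episode boundaries. As an alternative sketch (not the author's method), the differences can be computed from neighbouring rows inside the batch itself. The name `add_obs_diff_in_batch` is hypothetical, and resetting the difference to 0.0 at the start of each episode is an assumption about the desired behaviour.

```python
def add_obs_diff_in_batch(examples):
    """Return per-step state differences, using 0.0 at the start of each episode."""
    xs = examples["observation.state_x"]
    ys = examples["observation.state_y"]
    episodes = examples["episode_index"]
    diff_x, diff_y = [], []
    for i in range(len(xs)):
        if i == 0 or episodes[i] != episodes[i - 1]:
            # First row of the batch or of a new episode: no previous step
            diff_x.append(0.0)
            diff_y.append(0.0)
        else:
            diff_x.append(float(xs[i]) - float(xs[i - 1]))
            diff_y.append(float(ys[i]) - float(ys[i - 1]))
    return {
        "observation.state_diff_x": diff_x,
        "observation.state_diff_y": diff_y,
    }


# Plain-dict check using the two sample rows from train.csv
batch = {
    "observation.state_x": [222.0, 225.2523956298828],
    "observation.state_y": [97.0, 89.31253051757812],
    "episode_index": [0, 0],
}
result = add_obs_diff_in_batch(batch)
print(result)
```

With `datasets`, this could be applied as `dataset.map(add_obs_diff_in_batch, batched=True, batch_size=None)`. Passing `batch_size=None` hands the whole dataset to the function as a single batch, so no difference is lost at a batch boundary.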

Upload to the Hub.

dataset.push_to_hub("Takoko/dataset_tutorial_lerobot", private=True)

That's all.

Link
