Part 4: Creating a Dataset

Last updated at 2025-01-06. Posted at 2025-01-06.

Previous part

In this part, we build a dataset ourselves.

File-based Builders

A dataset can be created from a CSV file.
As an example, prepare the following CSV, which is a fragment of lerobot/pusht.

myfile.csv

observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0

Let's build the dataset.

>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="myfile.csv")
>>> dataset['train']
Dataset({
    features: ['observation.state_x', 'observation.state_y', 'action_x', 'action_y', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
    num_rows: 2
})
>>> dataset['train'][0:2]
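
To reproduce the call above, the two-row file first has to exist on disk. The following is a minimal sketch using only the standard library; the helper name `write_sample_csv` is hypothetical, and the values are the two rows shown above.

```python
import csv
from pathlib import Path

# Column names and the two rows shown above (a fragment of lerobot/pusht).
FIELDNAMES = [
    "observation.state_x", "observation.state_y", "action_x", "action_y",
    "episode_index", "frame_index", "timestamp", "next.reward",
    "next.done", "next.success", "index", "task_index",
]
ROWS = [
    [222.0, 97.0, 233.0, 71.0, 0, 0, 0.0, 0.19029748439788818, False, False, 0, 0],
    [225.2523956298828, 89.31253051757812, 229.0, 83.0, 0, 1,
     0.10000000149011612, 0.19029748439788818, False, False, 1, 0],
]


def write_sample_csv(path):
    """Write the sample rows to `path` (hypothetical helper) and return the path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDNAMES)  # header row
        writer.writerows(ROWS)       # values are written via str()
    return path
```

Calling `write_sample_csv("myfile.csv")` once creates the file that the `load_dataset("csv", ...)` call reads.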

Folder Based Builders

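With ImageFolder or AudioFolder, a dataset can be created automatically from the folder structure.
This part follows the official tutorial as is.
For example, suppose the data is stored at the following paths.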

pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png

pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png

In this case, ImageFolder reads each path from left to right as dataset name/split/label/image file (.png, .jpg, and so on).

Reference: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png

To create a dataset like this, it is enough to specify the path.

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
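
The path convention described above (dataset name/split/label/file) can be illustrated with a small sketch. `parse_imagefolder_path` is a hypothetical helper that only mimics the naming convention; it is not how the library itself resolves files.

```python
from pathlib import PurePosixPath


def parse_imagefolder_path(path, data_dir):
    """Split a file path under `data_dir` into (split, label, file name).

    Hypothetical helper; assumes exactly three path components
    below `data_dir`, as in the pokemon example.
    """
    relative = PurePosixPath(path).relative_to(data_dir)
    split, label, file_name = relative.parts
    return split, label, file_name


paths = [
    "pokemon/train/grass/bulbasaur.png",
    "pokemon/test/fire/charmeleon.png",
]
for p in paths:
    print(parse_imagefolder_path(p, "pokemon"))
```

Running this shows which part of each path would serve as the split and which as the label.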

If captions or other metadata are included, create a metadata.csv and place it with the data.
The file_name column is required.

metadata.csv

file_name,text
bulbasaur.png,There is a plant seed on its back right from the day this Pokémon is born.
charmander.png,It has a preference for hot things.
squirtle.png,"When it retracts its long neck into its shell, it squirts out water with vigorous force."
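
Captions often contain commas, so generating the file with a CSV library avoids broken columns. The following is a minimal sketch with the standard library; `write_metadata` is a hypothetical helper, and the captions are the ones listed above.

```python
import csv
from pathlib import Path

# Captions from the metadata.csv above, keyed by image file name.
CAPTIONS = {
    "bulbasaur.png": "There is a plant seed on its back right from the day this Pokémon is born.",
    "charmander.png": "It has a preference for hot things.",
    "squirtle.png": "When it retracts its long neck into its shell, it squirts out water with vigorous force.",
}


def write_metadata(folder, captions):
    """Write metadata.csv into `folder` (hypothetical helper) and return its path."""
    path = Path(folder) / "metadata.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes any field that contains a comma
        writer.writerow(["file_name", "text"])  # file_name is the required column
        for file_name, text in captions.items():
            writer.writerow([file_name, text])
    return path
```

The `folder` argument is whatever directory the metadata file should live in.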

Where to place it

dataset/
│
├── train/
│   ├── class1/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   └── ...
│
├── validation/
│   ├── class1/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   └── ...
│
├── metadata.csv

From Python Dictionary

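A dataset can also be created from Python dicts.
from_generator is the memory-efficient option: examples are produced one at a time, and the dataset is written to disk progressively and then memory-mapped, so the full data does not have to fit in memory.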

>>> from datasets import Dataset
>>> def gen():
...     yield {
...         "observation.state": [222.0, 97.0],
...         "action": [233.0, 71.0],
...         "episode_index": 0,
...         "frame_index": 0,
...         "timestamp": 0.0,
...         "next.reward": 0.19029748439788818,
...         "next.done": False,
...         "next.success": False,
...         "index": 0,
...         "task_index": 0,
...     }
...     yield {
...         "observation.state": [225.2523956298828, 89.31253051757812],
...         "action": [229.0, 83.0],
...         "episode_index": 0,
...         "frame_index": 1,
...         "timestamp": 0.10000000149011612,
...         "next.reward": 0.19029748439788818,
...         "next.done": False,
...         "next.success": False,
...         "index": 1,
...         "task_index": 0,
...     }
... 
>>> ds = Dataset.from_generator(gen)
Generating train split: 2 examples [00:00, 319.85 examples/s]
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
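
The generator above hard-codes its two examples. As a sketch of how the same dicts could be produced lazily from the CSV in the first section, the hypothetical generator `gen_from_csv` below reads one row at a time and merges the x/y columns into lists. It assumes the column names of myfile.csv and that booleans are stored as the strings True/False.

```python
import csv


def parse_bool(text):
    """Convert the strings 'True' / 'False' stored in the CSV to bool."""
    return text == "True"


def gen_from_csv(path):
    """Yield one example dict per CSV row, merging x/y columns into lists."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # rows are read lazily, one at a time
            yield {
                "observation.state": [
                    float(row["observation.state_x"]),
                    float(row["observation.state_y"]),
                ],
                "action": [float(row["action_x"]), float(row["action_y"])],
                "episode_index": int(row["episode_index"]),
                "frame_index": int(row["frame_index"]),
                "timestamp": float(row["timestamp"]),
                "next.reward": float(row["next.reward"]),
                "next.done": parse_bool(row["next.done"]),
                "next.success": parse_bool(row["next.success"]),
                "index": int(row["index"]),
                "task_index": int(row["task_index"]),
            }
```

It can then be passed as `Dataset.from_generator(gen_from_csv, gen_kwargs={"path": "myfile.csv"})`.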

IterableDataset can be used as well.

>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
...     print(example)
... 
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
{'observation.state': [225.2523956298828, 89.31253051757812], 'action': [229.0, 83.0], 'episode_index': 0, 'frame_index': 1, 'timestamp': 0.10000000149011612, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 1, 'task_index': 0}

from_dict is the most basic approach.

>>> ds = Dataset.from_dict({
...      "observation.state": [[222.0, 97.0], [225.2523956298828, 89.31253051757812]],
...      "action": [[233.0, 71.0], [229.0, 83.0]],
...      "episode_index": [0, 0],
...      "frame_index": [0, 1],
...      "timestamp": [0.0, 0.10000000149011612],
...      "next.reward": [0.19029748439788818, 0.19029748439788818],
...      "next.done": [False, False],
...      "next.success": [False, False],
...      "index": [0, 1],
...      "task_index": [0, 0],
...  })
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
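
from_dict expects one list per column, while data is often collected as one dict per example. A minimal sketch of the conversion follows; `rows_to_columns` is a hypothetical helper, and only a few of the keys are used to keep it short.

```python
def rows_to_columns(rows):
    """Turn a list of example dicts into a dict of column lists.

    Assumes every row has the same keys as the first one.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


rows = [
    {"action": [233.0, 71.0], "frame_index": 0, "next.done": False},
    {"action": [229.0, 83.0], "frame_index": 1, "next.done": False},
]
columns = rows_to_columns(rows)
print(columns)
```

The resulting `columns` dict has the shape that `Dataset.from_dict` takes.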

For audio, use cast_column to cast the column of file paths to the Audio feature.

>>> from datasets import Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
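
The list of paths in the call above is elided. One way to build such a list from a folder is sketched below; `collect_audio_paths` and the extension set are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of extensions; adjust to the formats actually in use.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def collect_audio_paths(folder):
    """Return sorted paths (as str) of audio files anywhere under `folder`."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

The returned list can be used as the value of the "audio" key before calling cast_column.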

Link
