Part 4: Creating a Dataset

Posted at 2025-01-06

Previous post

This time we will build a dataset ourselves.

File-based Builders

You can create a dataset from a CSV file.
As an example, prepare the following CSV, which is an excerpt from lerobot/pusht.

myfile.csv
observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0

Let's create the dataset.

>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="myfile.csv")
>>> dataset['train']
Dataset({
    features: ['observation.state_x', 'observation.state_y', 'action_x', 'action_y', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
    num_rows: 2
})
>>> dataset['train'][0:2]

Folder Based Builders

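With ImageFolder or AudioFolder, a dataset can be built automatically from the folder structure.
This part follows the official tutorial as is.
For example, suppose the data is stored at the following paths.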

pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png

pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png

In this case, ImageFolder interprets each path from left to right as dataset name / split / label / image file (png, jpg, and so on).

Figure source: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png
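
To make the convention concrete, here is a minimal sketch that splits such paths into dataset name, split, label, and file name by hand. It only illustrates the naming rule described above and is not how ImageFolder is implemented; the helper name parse_example_path is hypothetical.

```python
from pathlib import PurePosixPath


def parse_example_path(path: str) -> dict:
    """Split 'name/split/label/file' into its parts (illustration only)."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected name/split/label/file, got: {path}")
    name, split, label, file_name = parts
    return {"dataset": name, "split": split, "label": label, "file_name": file_name}


paths = [
    "pokemon/train/grass/bulbasaur.png",
    "pokemon/test/fire/charmeleon.png",
]
parsed = [parse_example_path(p) for p in paths]
for row in parsed:
    print(row)
```

The label comes purely from the directory name, so renaming a folder changes the class of every image inside it.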

To create a dataset like this, just pass the path as data_dir.

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

If the data comes with captions or other extra fields, create a metadata.csv and place it in the dataset folder.
The file_name column is required.

metadata.csv
file_name,text
bulbasaur.png,There is a plant seed on its back right from the day this Pokémon is born.
charmander.png,It has a preference for hot things.
squirtle.png,"When it retracts its long neck into its shell, it squirts out water with vigorous force."
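
Because captions often contain commas, it can be safer to generate metadata.csv with Python's csv module, which quotes such fields automatically. The following is a minimal sketch, assuming the captions live in a plain dict; write_metadata and the temporary output folder are hypothetical names used only for illustration.

```python
import csv
import tempfile
from pathlib import Path

captions = {
    "bulbasaur.png": "There is a plant seed on its back right from the day this Pokémon is born.",
    "charmander.png": "It has a preference for hot things.",
    "squirtle.png": "When it retracts its long neck into its shell, it squirts out water with vigorous force.",
}


def write_metadata(folder: Path, captions: dict) -> Path:
    """Write metadata.csv with the required file_name column plus a text column."""
    out = folder / "metadata.csv"
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # the default quoting wraps fields that contain commas
        writer.writerow(["file_name", "text"])
        for file_name, text in captions.items():
            writer.writerow([file_name, text])
    return out


# Write into a temporary folder and read the file back to check that it round-trips.
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(Path(tmp), captions)
    with metadata_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(rows[2]["text"])
```

In a real project the folder argument would be the directory that holds the images rather than a temporary one.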

Where to place it

dataset/
│
├── train/
│   ├── class1/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   └── ...
│
├── validation/
│   ├── class1/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   ├── class2/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   └── ...
│   └── ...
│
└── metadata.csv

From Python Dictionary

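A dataset can also be created from Python dicts.
from_generator is the most memory-efficient option: examples are produced one at a time and the dataset is written to disk progressively instead of being held in memory all at once.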

>>> from datasets import Dataset
>>> def gen():
...     yield {
...         "observation.state": [222.0, 97.0],
...         "action": [233.0, 71.0],
...         "episode_index": 0,
...         "frame_index": 0,
...         "timestamp": 0.0,
...         "next.reward": 0.19029748439788818,
...         "next.done": False,
...         "next.success": False,
...         "index": 0,
...         "task_index": 0,
...     }
...     yield {
...         "observation.state": [225.2523956298828, 89.31253051757812],
...         "action": [229.0, 83.0],
...         "episode_index": 0,
...         "frame_index": 1,
...         "timestamp": 0.10000000149011612,
...         "next.reward": 0.19029748439788818,
...         "next.done": False,
...         "next.success": False,
...         "index": 1,
...         "task_index": 0,
...     }
... 
>>> ds = Dataset.from_generator(gen)
Generating train split: 2 examples [00:00, 319.85 examples/s]
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
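
In practice the generator usually reads from a file rather than hard-coding rows. The sketch below is an assumption about how one might do that, not something taken from the tutorial: it streams the flat CSV from the first section row by row and regroups the _x/_y columns into the list-valued fields used above. gen_from_csv is a hypothetical helper name, and the CSV text is inlined so the example is self-contained.

```python
import csv
import io

CSV_TEXT = """\
observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0
"""


def gen_from_csv(lines):
    """Yield one example dict per CSV row without holding every row in memory."""
    for row in csv.DictReader(lines):
        yield {
            "observation.state": [float(row["observation.state_x"]), float(row["observation.state_y"])],
            "action": [float(row["action_x"]), float(row["action_y"])],
            "episode_index": int(row["episode_index"]),
            "frame_index": int(row["frame_index"]),
            "timestamp": float(row["timestamp"]),
            "next.reward": float(row["next.reward"]),
            "next.done": row["next.done"] == "True",  # CSV stores booleans as text
            "next.success": row["next.success"] == "True",
            "index": int(row["index"]),
            "task_index": int(row["task_index"]),
        }


examples = list(gen_from_csv(io.StringIO(CSV_TEXT)))
print(examples[0])
```

With a real file, a zero-argument wrapper that opens myfile.csv and does `yield from gen_from_csv(f)` could be passed to Dataset.from_generator in the same way as gen in the session above.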

IterableDataset can be used as well; it produces examples lazily as you iterate.

>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
...     print(example)
... 
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
{'observation.state': [225.2523956298828, 89.31253051757812], 'action': [229.0, 83.0], 'episode_index': 0, 'frame_index': 1, 'timestamp': 0.10000000149011612, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 1, 'task_index': 0}

from_dict is the most basic approach: pass a dict that maps each column name to a list of values.

>>> ds = Dataset.from_dict({
...      "observation.state": [[222.0, 97.0], [225.2523956298828, 89.31253051757812]],
...      "action": [[233.0, 71.0], [229.0, 83.0]],
...      "episode_index": [0, 0],
...      "frame_index": [0, 1],
...      "timestamp": [0.0, 0.10000000149011612],
...      "next.reward": [0.19029748439788818, 0.19029748439788818],
...      "next.done": [False, False],
...      "next.success": [False, False],
...      "index": [0, 1],
...      "task_index": [0, 0],
...  })
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
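
Note that from_dict expects a column-oriented dict (one list per field), whereas the generator earlier yields row-oriented dicts. If your data starts out as rows, a small conversion like the following sketch can bridge the two. rows_to_columns is a hypothetical helper and not part of the datasets library; only three of the fields are used to keep the example short.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists, keeping key order."""
    if not rows:
        return {}
    columns = {key: [] for key in rows[0]}
    for row in rows:
        if row.keys() != columns.keys():
            raise ValueError("every row must have the same keys")
        for key, value in row.items():
            columns[key].append(value)
    return columns


rows = [
    {"observation.state": [222.0, 97.0], "action": [233.0, 71.0], "frame_index": 0},
    {"observation.state": [225.2523956298828, 89.31253051757812], "action": [229.0, 83.0], "frame_index": 1},
]
columns = rows_to_columns(rows)
print(columns["frame_index"])
```

The resulting columns dict has the same shape as the argument passed to Dataset.from_dict above.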

For audio, use cast_column to cast the column of file paths to the Audio feature.

>>> from datasets import Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
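
The list of paths is usually gathered from a folder rather than typed by hand. Here is a minimal sketch, assuming .wav files in a single directory; the directory, the file names, and the collect_audio_paths helper are all hypothetical, and empty placeholder files stand in for real recordings.

```python
import tempfile
from pathlib import Path


def collect_audio_paths(folder: Path, pattern: str = "*.wav") -> dict:
    """Return {"audio": [sorted file paths as strings]} for files matching pattern."""
    return {"audio": [str(p) for p in sorted(folder.glob(pattern))]}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["b.wav", "a.wav", "notes.txt"]:
        (folder / name).touch()  # empty placeholders, not real audio
    audio_columns = collect_audio_paths(folder)
    file_names = [Path(p).name for p in audio_columns["audio"]]

print(file_names)
```

The returned dict has the single "audio" column of path strings that the from_dict and cast_column call above takes as input.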

Link

Table of contents
