Continuing from the previous post, this time we build a dataset of our own.
File-based Builders
A dataset can be created directly from a CSV file.
As an example, prepare the following CSV, which is an excerpt from lerobot/pusht.
myfile.csv
observation.state_x,observation.state_y,action_x,action_y,episode_index,frame_index,timestamp,next.reward,next.done,next.success,index,task_index
222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0
225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,0.19029748439788818,False,False,1,0
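To reproduce the file above without copying it by hand, the following is a minimal sketch using only the standard library. The helper name `write_sample_csv` and the temporary directory are assumptions made for illustration; the column names and values are copied from the CSV above.

```python
import csv
import tempfile
from pathlib import Path

# Column names copied from the header row of myfile.csv.
FIELDNAMES = [
    "observation.state_x", "observation.state_y", "action_x", "action_y",
    "episode_index", "frame_index", "timestamp", "next.reward",
    "next.done", "next.success", "index", "task_index",
]

# The two sample rows, in header order.
ROWS = [
    [222.0, 97.0, 233.0, 71.0, 0, 0, 0.0, 0.19029748439788818, False, False, 0, 0],
    [225.2523956298828, 89.31253051757812, 229.0, 83.0, 0, 1,
     0.10000000149011612, 0.19029748439788818, False, False, 1, 0],
]


def write_sample_csv(path):
    """Write the header and the two sample rows to `path` (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDNAMES)
        writer.writerows(ROWS)


# Written to a temporary directory here; pass "myfile.csv" to match the text.
csv_path = Path(tempfile.mkdtemp()) / "myfile.csv"
write_sample_csv(csv_path)
print(csv_path.read_text(encoding="utf-8"))
```

Writing through the `csv` module rather than string concatenation keeps the file valid if a later column ever contains a comma.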
Now let's create the dataset:
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="myfile.csv")
>>> dataset['train']
Dataset({
    features: ['observation.state_x', 'observation.state_y', 'action_x', 'action_y', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
    num_rows: 2
})
>>> dataset['train'][0:2]
Folder-based Builders
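With ImageFolder or AudioFolder, a dataset can be built automatically from the folder structure.
This part follows the tutorial as-is.
For example, suppose the data is stored at the following paths: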
pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png
pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png
In this case, ImageFolder reads each path, from left to right, as dataset name/split/label/image file (.png, .jpg, ...).
Reference: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png
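The path convention described above can be sketched in a few lines of standard-library Python. This only mirrors the naming rule from the text (name/split/label/file); it is not how ImageFolder is implemented internally, and `parse_example_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def parse_example_path(path):
    """Split 'name/split/label/file' into its four parts (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected name/split/label/file, got: {path}")
    name, split, label, file_name = parts
    return {"name": name, "split": split, "label": label, "file_name": file_name}


paths = [
    "pokemon/train/grass/bulbasaur.png",
    "pokemon/test/water/wartortle.png",
]
for p in paths:
    print(parse_example_path(p))
```

Running a check like this over a directory listing before loading is one way to spot files that sit at the wrong depth.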
To build a dataset from a layout like this, just point data_dir at the folder:
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
If the data comes with extra information such as captions, create a metadata.csv file and place it with the data.
The file_name column is required.
metadata.csv
file_name,text
bulbasaur.png,There is a plant seed on its back right from the day this Pokémon is born.
charmander.png,It has a preference for hot things.
squirtle.png,"When it retracts its long neck into its shell, it squirts out water with vigorous force."
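Captions often contain commas, so generating metadata.csv with the `csv` module is safer than writing it by hand. The following is a minimal sketch; `build_metadata_csv` is a hypothetical helper, and the captions are the ones shown above.

```python
import csv
import io

# (file_name, caption) pairs taken from the metadata.csv example.
captions = [
    ("bulbasaur.png", "There is a plant seed on its back right from the day this Pokémon is born."),
    ("charmander.png", "It has a preference for hot things."),
    ("squirtle.png", "When it retracts its long neck into its shell, it squirts out water with vigorous force."),
]


def build_metadata_csv(rows):
    """Return metadata.csv content with a file_name,text header (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["file_name", "text"])
    # Fields containing a comma are quoted automatically by csv.writer.
    writer.writerows(rows)
    return buf.getvalue()


metadata_text = build_metadata_csv(captions)
print(metadata_text)
```

The returned string can then be written to metadata.csv; the caption with an embedded comma stays a single field because it is quoted.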
Where to place it:
dataset/
│
├── train/
│ ├── class1/
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ ├── class2/
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ └── ...
│
├── validation/
│ ├── class1/
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ ├── class2/
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ └── ...
│
└── metadata.csv
From Python Dictionary
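A dataset can also be created from a Python dict.
from_generator is the memory-efficient option, since examples are produced one at a time instead of being held in one large in-memory object.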
>>> from datasets import Dataset
>>> def gen():
... yield {
... "observation.state": [222.0, 97.0],
... "action": [233.0, 71.0],
... "episode_index": 0,
... "frame_index": 0,
... "timestamp": 0.0,
... "next.reward": 0.19029748439788818,
... "next.done": False,
... "next.success": False,
... "index": 0,
... "task_index": 0,
... }
... yield {
... "observation.state": [225.2523956298828, 89.31253051757812],
... "action": [229.0, 83.0],
... "episode_index": 0,
... "frame_index": 1,
... "timestamp": 0.10000000149011612,
... "next.reward": 0.19029748439788818,
... "next.done": False,
... "next.success": False,
... "index": 1,
... "task_index": 0,
... }
...
>>> ds = Dataset.from_generator(gen)
Generating train split: 2 examples [00:00, 319.85 examples/s]
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
IterableDataset can be used as well:
>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
... print(example)
...
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
{'observation.state': [225.2523956298828, 89.31253051757812], 'action': [229.0, 83.0], 'episode_index': 0, 'frame_index': 1, 'timestamp': 0.10000000149011612, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 1, 'task_index': 0}
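The hand-written generator above can also be driven from the flat CSV shown earlier. The following is a minimal sketch, assuming the same column names as myfile.csv; `gen_from_csv` is a hypothetical helper that merges the `_x`/`_y` columns into list-valued fields and converts the string cells to Python types.

```python
import csv
import tempfile
from pathlib import Path

# Same content as myfile.csv, written to a temporary file so the sketch runs standalone.
CSV_TEXT = (
    "observation.state_x,observation.state_y,action_x,action_y,episode_index,"
    "frame_index,timestamp,next.reward,next.done,next.success,index,task_index\n"
    "222.0,97.0,233.0,71.0,0,0,0.0,0.19029748439788818,False,False,0,0\n"
    "225.2523956298828,89.31253051757812,229.0,83.0,0,1,0.10000000149011612,"
    "0.19029748439788818,False,False,1,0\n"
)
path = Path(tempfile.mkdtemp()) / "myfile.csv"
path.write_text(CSV_TEXT, encoding="utf-8")


def gen_from_csv(path):
    """Yield one nested example per CSV row (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "observation.state": [float(row["observation.state_x"]),
                                      float(row["observation.state_y"])],
                "action": [float(row["action_x"]), float(row["action_y"])],
                "episode_index": int(row["episode_index"]),
                "frame_index": int(row["frame_index"]),
                "timestamp": float(row["timestamp"]),
                "next.reward": float(row["next.reward"]),
                # CSV cells are strings, so booleans are compared explicitly.
                "next.done": row["next.done"] == "True",
                "next.success": row["next.success"] == "True",
                "index": int(row["index"]),
                "task_index": int(row["task_index"]),
            }


records = list(gen_from_csv(path))
print(records[0])
```

Because the function takes a path rather than an open file, it should be usable as `Dataset.from_generator(gen_from_csv, gen_kwargs={"path": "myfile.csv"})`, which reads the rows lazily instead of building the list first.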
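from_dict is the most basic approach. Each key is a column name and each value is the list of that column's values across all rows.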
>>> ds = Dataset.from_dict({
... "observation.state": [[222.0, 97.0], [225.2523956298828, 89.31253051757812]],
... "action": [[233.0, 71.0], [229.0, 83.0]],
... "episode_index": [0, 0],
... "frame_index": [0, 1],
... "timestamp": [0.0, 0.10000000149011612],
... "next.reward": [0.19029748439788818, 0.19029748439788818],
... "next.done": [False, False],
... "next.success": [False, False],
... "index": [0, 1],
... "task_index": [0, 0],
... })
>>> ds[0]
{'observation.state': [222.0, 97.0], 'action': [233.0, 71.0], 'episode_index': 0, 'frame_index': 0, 'timestamp': 0.0, 'next.reward': 0.19029748439788818, 'next.done': False, 'next.success': False, 'index': 0, 'task_index': 0}
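from_dict expects column-oriented data, while records are often collected row by row. The following is a minimal sketch of the conversion; `rows_to_columns` is a hypothetical helper, and the rows are shortened to three fields for brevity.

```python
def rows_to_columns(rows):
    """Turn a list of per-example dicts into a dict of per-column lists."""
    if not rows:
        return {}
    keys = list(rows[0])
    columns = {key: [] for key in keys}
    for row in rows:
        # Every row must have exactly the same fields as the first one.
        if set(row) != set(keys):
            raise ValueError(f"inconsistent keys: {sorted(row)}")
        for key in keys:
            columns[key].append(row[key])
    return columns


rows = [
    {"observation.state": [222.0, 97.0], "action": [233.0, 71.0], "frame_index": 0},
    {"observation.state": [225.2523956298828, 89.31253051757812],
     "action": [229.0, 83.0], "frame_index": 1},
]
columns = rows_to_columns(rows)
print(columns["frame_index"])
```

The resulting `columns` dict has the same shape as the argument passed to Dataset.from_dict above.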
For audio, build the dataset from file paths and then use cast_column to cast that column to the Audio feature:
>>> from datasets import Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())