
How to create a dataset on the Hugging Face Hub

Posted at 2023-10-11





Share a dataset using the CLI

At Hugging Face, we are on a mission to democratize good Machine Learning and we believe in the value of open source. That’s why we designed 🤗 Datasets so that anyone can share a dataset with the greater ML community. There are currently thousands of datasets in over 100 languages in the Hugging Face Hub, and the Hugging Face team always welcomes new contributions!

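At Hugging Face, our mission is to democratize good machine learning (make it usable by everyone), and we believe in the value of open source. That is why Hugging Face Datasets was designed so that anyone can share a dataset with the wider machine learning community. The Hugging Face Hub currently holds thousands of datasets in more than 100 languages, and the Hugging Face team always welcomes new contributions.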

Dataset repositories offer features such as:

  • Free dataset hosting
  • Dataset versioning
  • Commit history and diffs
  • Metadata for discoverability
  • Dataset cards for documentation, licensing, limitations, etc.

This guide will show you how to share a dataset that can be easily accessed by anyone.


  • Free dataset hosting
  • Dataset versioning
  • Commit history and diffs
  • Metadata that makes the dataset easier to find
  • Dataset cards for documentation, licensing, and usage limitations

Add a dataset

You can share your dataset with the community with a dataset repository on the Hugging Face Hub. It can also be a private dataset if you want to control who has access to it.
In a dataset repository, you can host all your data files and configure your dataset to define which file goes to which split. The following formats are supported: CSV, TSV, JSON, JSON lines, text, Parquet, Arrow, SQLite. Many kinds of compressed file types are also supported, such as GZ, BZ2, LZ4, LZMA or ZSTD. For example, your dataset can be made of .json.gz files.

You can share your dataset, but you can also keep it private if you prefer. In a dataset repository you can configure for yourself which file goes to which split. Formats such as CSV, TSV, JSON, JSON lines, text, Parquet, Arrow and SQLite are supported, as are compressed files (GZ, BZ2, LZ4, LZMA, ZSTD). For example, your own dataset can be stored as .json.gz files.
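As a small illustration of the .json.gz case mentioned above, the following sketch writes a few records as gzip-compressed JSON Lines and reads them back, using only the Python standard library. The file name train.json.gz, the helper names and the record fields are assumptions made for this example, not something the Hub requires.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json_gz(records, path):
    """Write an iterable of dicts as gzip-compressed JSON Lines (one object per line)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_gz(path):
    """Read the file back into a list of dicts, skipping empty lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; a real dataset would have its own columns.
records = [{"text": "hello", "label": 0}, {"text": "こんにちは", "label": 1}]
out_path = Path(tempfile.mkdtemp()) / "train.json.gz"
write_json_gz(records, out_path)
round_trip = read_json_gz(out_path)
print(round_trip)
```

A file produced this way can be added to the dataset repository like any other data file.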

On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script.
When loading a dataset from the Hub, all the files in the supported formats are loaded, following the repository structure. However if there’s a dataset script, it is downloaded and executed to download and prepare the dataset instead.
For more information on how to load a dataset from the Hub, take a look at the load a dataset from the Hub tutorial.

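When a dataset is loaded from the Hugging Face Hub, all files in supported formats are loaded according to the repository structure. If the repository contains a dataset script, however, that script is downloaded and executed to download and prepare the data instead.
For details on how to load data from the Hugging Face Hub, see the Hub tutorial "Load a dataset from the Hub".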


  1. Create the repository
  2. Prepare your files
  3. Upload your files
  4. (Optional) Add the dataset loading script
  5. Verify the files have been correctly staged. Then you can commit and push your files
  6. Ask for help and reviews

Create the repository

Sharing a community dataset will require you to create an account on hf.co if you don’t have one yet. You can directly create a new dataset repository from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.

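If you don't have a Hugging Face Hub account yet, create one, because it is required for sharing (uploading) a dataset with the community.
You can create a new dataset repository directly from your account on the Hugging Face Hub, but here we will look at how to upload a dataset from a terminal on your computer.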

Make sure you are in the virtual environment where you installed Datasets, and run the following command:

huggingface-cli login

Log in using your Hugging Face Hub credentials, and create a new dataset repository:

huggingface-cli repo create your_dataset_name --type dataset

In a notebook such as Google Colab, prefix each command with !:

!huggingface-cli login
!huggingface-cli repo create your_dataset_name --type dataset

Add the --organization flag to create a repository under a specific organization:
huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name

To create the repository under a specific organization (group), pass the --organization flag (argument) when you run the command.

Clone the repository


Install Git LFS and clone your repository:

Install Git Large File Storage (Git LFS) first.

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
!git lfs install
!git clone https://huggingface.co/datasets/namespace/your_dataset_name

Here the namespace is either your username or your organization name.


Prepare your files

Now is a good time to check your directory to ensure the only files you’re uploading are:

  • The data files of the dataset
  • The dataset card README.md
  • (optional) your_dataset_name.py is your dataset loading script (optional if your data files are already in the supported formats csv/jsonl/json/parquet/txt). To create a dataset script, see the dataset script page.


  • The data files of the dataset
  • The dataset card (README.md)
  • (optional) your_dataset_name.py
    The script file used to load the dataset. It is not needed when the data files are already in a format supported by the Hugging Face Hub, such as csv, jsonl, json, parquet or text.

Upload your files

You can directly upload your files to your repository on the Hugging Face Hub, but this guide will show you how to upload the files from the terminal.
It is important to add the large data files first with git lfs track or else you will encounter an error later when you push your files:

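From here on we work in the terminal to upload to the repository on the Hugging Face Hub. By the way, if you don't add your large data files with Git LFS first, you will get an error later. This will be on the test (*'▽')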

cp /somewhere/data/*.json .
git lfs track "*.json"
git add .gitattributes
git add *.json
git commit -m "add json files"
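Before pushing, it can help to confirm that the pattern really is tracked by Git LFS. git lfs track records each pattern in .gitattributes as a line of the form `*.json filter=lfs diff=lfs merge=lfs -text`. The sketch below parses such text and lists the tracked patterns. The function name and the sample text are assumptions for this example, and the parser is deliberately simple (for instance, it does not handle patterns containing spaces).

```python
def lfs_tracked_patterns(gitattributes_text):
    """Return the path patterns that carry the filter=lfs attribute."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


# Sample .gitattributes content; in a real repository you would read the file instead.
sample = "*.json filter=lfs diff=lfs merge=lfs -text\n*.md text\n"
tracked = lfs_tracked_patterns(sample)
print(tracked)
```

In a cloned repository, the same check would read the text with `Path(".gitattributes").read_text()`.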

(Optional) Add the dataset loading script:


cp /somewhere/data/load_script.py .
git add --all

Verify the files have been correctly staged. Then you can commit and push your files:

Check that the files were staged correctly, then commit and push.

git status
git commit -m "First version of the your_dataset_name dataset."
git push

Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! 🥳

dataset = load_dataset("namespace/your_dataset_name")

Your dataset is now uploaded to the Hugging Face Hub, and anyone can load it in a single line (^▽^)/ Congratulations!

dataset = load_dataset("username_or_org_name/dataset_name")

Finally, don’t forget to enrich the dataset card to document your dataset and make it discoverable! Check out the Create a dataset card guide to learn more.


Ask for help and reviews

If you need help with a dataset script, feel free to check the datasets forum: it’s possible that someone had similar issues and shared how they managed to fix them.

If you need help with a dataset script, head over to the datasets forum. Someone may have run into a similar problem, and someone else may already have solved it.

Then if your script is ready and if you wish your dataset script to be reviewed by the Hugging Face team, you can open a discussion in the Community tab of your dataset with this message:

# Dataset review request for <Dataset name>
## Description

<brief description of the dataset>
## Files to review
- file1
- file2
- ...

Members of the Hugging Face team will be happy to review your dataset script and give you advice.

Once your script is finished and you would like a review from someone on the Hugging Face team, open a Discussion in the Community tab of your dataset page and post the message above.

Datasets on GitHub (legacy)

Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.
The legacy GitHub datasets were added originally on our GitHub repository and therefore don’t have a namespace on the Hub: “squad”, “glue”, etc. unlike the other datasets that are named “username/dataset_name” or “org/dataset_name”.

Datasets used to be hosted in Hugging Face's GitHub repository, but all of them have now moved to the Hugging Face Hub. Datasets that were originally added on GitHub have names without a namespace, such as "squad" or "glue", unlike other datasets, which are named "username/dataset_name" or "org/dataset_name".

The distinction between a Hub dataset within or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.


Those datasets are now maintained on the Hub: if you think a fix is needed, please use their “Community” tab to open a discussion or create a Pull Request. The code of these datasets is reviewed by the Hugging Face team.

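Those datasets are now maintained on the Hugging Face Hub. If a fix is needed, open a discussion or create a pull request in the Community tab of the dataset in question (that is, on the Hub itself, not on GitHub). The code of these datasets is reviewed by the Hugging Face team.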

What are Dataset Cards?

Each dataset may be documented by the README.md file in the repository. This file is called a dataset card, and the Hugging Face Hub will render its contents on the dataset’s main page. To inform users about how to responsibly use the data, it’s a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

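Each dataset is documented in its repository by a README.md file. This file is called a dataset card, and the Hugging Face Hub renders its contents on the dataset's main page. To tell users how to use the data responsibly, it is a good idea to include information about any potential biases in the dataset. In general, a dataset card helps users understand what the dataset contains and gives them context (an explanation) for how it should be used.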

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the README.md file.


Dataset card metadata

A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three --- at the top, then include all of the relevant metadata, and close the section with another group of --- like the example below:

---
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1
- task2
---
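A metadata section of this shape can also be generated programmatically. The sketch below builds the front matter as a string from a plain dictionary, using only the standard library. The function name and the pretty_name value are assumptions for this example. The other values mirror the cardData of the practice repository shown later in this article. The helper does not quote or escape values, so a real project would more likely use a YAML library.

```python
def build_front_matter(metadata):
    """Render a flat dict (scalars or lists of scalars) as a YAML front-matter block."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


card_metadata = {
    "language": ["en"],
    "pretty_name": "Loading Script Practice",  # hypothetical display name
    "tags": ["music"],
    "license": "cc-by-sa-4.0",
}
front_matter = build_front_matter(card_metadata)
print(front_matter)
```

The resulting text goes at the very top of README.md, followed by the body of the dataset card.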

The metadata that you add to the dataset card enables certain interactions on the Hub. For example:

  • Allow users to filter and discover datasets at https://huggingface.co/datasets.
  • If you choose a license using the keywords listed in the right column of this table, the license will be displayed on the dataset page.


When creating a README.md file in a dataset repository on the Hub, use Metadata UI to fill the main metadata:

Once a README.md file has been created in the dataset repository, you can also use the Metadata UI to fill in the main metadata.

To see metadata fields, see the detailed dataset card metadata specification here.


Dataset card creation guide

For a step-by-step guide on creating a dataset card, check out the Create a dataset card guide.


Create a dataset card

Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset. This idea was inspired by the Model Cards proposed by Mitchell, 2018. Dataset cards help users understand a dataset’s contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.


Creating a dataset card is easy and can be done in just a few steps:

  1. Go to your dataset repository on the Hub and click on Create Dataset Card to create a new README.md file in your repository.
  2. Use the Metadata UI to select the tags that describe your dataset. You can add a license, language, pretty_name, the task_categories, size_categories, and any other tags that you think are relevant. These tags help users discover and find your dataset on the Hub.
    (For a complete, but not required, set of tag options you can also look at the Dataset Card specifications. This’ll have a few more tag options like multilinguality and language_creators which are useful but not absolutely necessary.)
  3. Click on the Import dataset card template link to automatically create a template with all the relevant fields to complete. Fill out the template sections to the best of your ability. Take a look at the Dataset Card Creation Guide for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write [More Information Needed].


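  1. In your dataset repository on the Hugging Face Hub, click Create Dataset Card to create a README file in the repository.
  2. Use the Metadata UI to select the tags that describe your dataset, and add related tags such as license, language, pretty_name (the dataset's display name), task categories and size categories. These tags make your dataset easier to find when other users search.
  3. Click the Import dataset card template link to automatically create a template containing all the relevant fields. Fill in as many template sections as you can. See the Dataset Card Creation Guide for details. If there is a section you cannot fill in, write [More Information Needed].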

Once you’re done, commit the changes to the README.md file and you’ll see the completed dataset card on your repository.
YAML also allows you to customize the way your dataset is loaded by defining splits and/or configurations without the need to write any code.
Feel free to take a look at the SNLI, CNN/DailyMail, and Allociné dataset cards as examples to help you get started.

For concrete examples, refer to the SNLI, CNN/DailyMail, and Allociné dataset cards.

Reading through existing dataset cards, such as the ELI5 dataset card, is a great way to familiarize yourself with the common conventions.


Linking a Paper


If the dataset card includes a link to a paper on arXiv, the Hub will extract the arXiv ID and include it in the dataset tags with the format arxiv:<ID>. Clicking on the tag will let you:

  • Visit the Paper page
  • Filter for other models on the Hub that cite the same paper.

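If the dataset card contains a link to an arXiv paper, the Hugging Face Hub extracts the arXiv ID and adds it to the dataset's tags in the format arxiv:<ID>. Clicking the tag brings up the following options:

  • Top: a link to the Paper page
  • Bottom: a button that filters for other models on the Hugging Face Hub that cite the same paper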
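To get a feel for the tag format, the sketch below pulls arXiv IDs out of card text with a regular expression and formats them as arxiv:<ID> tags. This is only an illustration and not the Hub's actual extraction logic. The pattern, the function name and the sample sentence are assumptions for this example, and the pattern only covers new-style IDs such as 1810.03993.

```python
import re

# Matches links like https://arxiv.org/abs/1810.03993 (new-style IDs only).
ARXIV_LINK = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)


def arxiv_tags(card_text):
    """Return de-duplicated arxiv:<ID> tags in order of first appearance."""
    tags = []
    for arxiv_id in ARXIV_LINK.findall(card_text):
        tag = f"arxiv:{arxiv_id}"
        if tag not in tags:
            tags.append(tag)
    return tags


card = "Inspired by Model Cards (https://arxiv.org/abs/1810.03993)."
tags = arxiv_tags(card)
print(tags)
```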

Create an image dataset

  • I managed to make some CSV files, but a dataset made of a huge pile of CSVs is a bit too much to handle... (;´・ω・)
  • So let's look at how datasets are normally created.

There are two methods for creating and sharing an image dataset. This guide will show you how to:

  • Create an image dataset with ImageFolder and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
  • Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale image datasets.

You can control access to your dataset by requiring users to share their contact information first. Check out the Gated datasets guide for more information about how to enable this feature on the Hub.


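  1. Create an image dataset with ImageFolder and metadata. This no-code approach quickly builds a dataset of several thousand images.
  2. Create an image dataset by writing a loading script. This is a little more involved, but it gives you more flexibility in how a more complex or larger image dataset is defined, downloaded and generated.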



The ImageFolder is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

💡 Take a look at the Split pattern hierarchy to learn more about how ImageFolder creates dataset splits based on your dataset repository structure.

If you want to know how ImageFolder creates dataset splits from the structure of your dataset repository, see the "Structure your repository" page.

ImageFolder automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:




Then users can load your dataset by specifying imagefolder in load_dataset() and the directory in data_dir:


from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
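To make the directory-name convention concrete, the sketch below creates a small folder tree and derives a label for each image from the name of its parent directory. It only mimics the idea described above and is not the ImageFolder implementation. The split and class names (train, dog, cat), the file names and the helper functions are assumptions for this example, and the files are empty placeholders rather than real images.

```python
import tempfile
from pathlib import Path


def make_tree(root, relative_paths):
    """Create empty placeholder files (not real images) under root."""
    for rel in relative_paths:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")


def infer_labels(root):
    """Map each .png path (relative to root) to the name of its parent directory."""
    return {
        p.relative_to(root).as_posix(): p.parent.name
        for p in sorted(root.rglob("*.png"))
    }


root = Path(tempfile.mkdtemp())
make_tree(root, ["train/dog/0001.png", "train/dog/0002.png", "train/cat/0003.png"])
labels = infer_labels(root)
print(labels)
```

With real image files in place of the placeholders, a folder of this shape is what would be passed as data_dir.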

You can also use imagefolder to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

You can also use ImageFolder to load a dataset that contains multiple splits (for example, separate train and test sets). To do so, the dataset directory needs to have the following structure.


If all image files are contained in a single directory or if they are not on the same level of directory structure, label column won’t be added automatically. If you need it, set drop_labels=False explicitly.


If there is additional information you’d like to include about your dataset, like text captions or bounding boxes, add it as a metadata.csv file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file metadata.jsonl.



You can also zip your images:



Your metadata.csv file must have a file_name column which links image files with their metadata:


file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images

or using metadata.jsonl:

{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False in load_dataset.
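A metadata.jsonl file can be produced with a few lines of Python. The sketch below writes one JSON object per image, each with the required file_name key plus one extra column. The function name, the column name text and the temporary folder are assumptions for this example.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_jsonl(folder, captions, column="text"):
    """Write metadata.jsonl with one {"file_name": ..., column: ...} object per line."""
    path = Path(folder) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for file_name, caption in captions.items():
            f.write(json.dumps({"file_name": file_name, column: caption}) + "\n")
    return path


captions = {
    "0001.png": "This is a golden retriever playing with a ball",
    "0002.png": "A german shepherd",
}
metadata_path = write_metadata_jsonl(tempfile.mkdtemp(), captions)
rows = [json.loads(line) for line in metadata_path.read_text(encoding="utf-8").splitlines()]
print(rows)
```

The file has to sit in the same folder as the images it describes, so that each file_name resolves.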






file_name,caption
0001.png,beautiful classic music
0002.png,Beethoven


{"file_name": "0001.png", "caption": "beautiful classic music"}
{"file_name": "0002.png", "caption": "Beethoven"}
{"file_name": "0003.png", "caption": "Mozart"}


Image captioning

Image captioning datasets have text describing an image. An example metadata.csv may look like:

file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua

Load the dataset with ImageFolder, and it will create a text column for the image captions:

from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
print(dataset[0]["text"])
# >> "This is a golden retriever playing with a ball"

Upload dataset to the Hub

Once you’ve created a dataset, you can share it to the Hub with the push_to_hub() method. Make sure you have the huggingface_hub library installed and you’re logged in to your Hugging Face account (see the Upload with Python tutorial for more details).

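Once you have finished creating the dataset, you can push it to the Hub with the push_to_hub() method. Make sure the huggingface_hub library is installed and that you are logged in to your Hugging Face account.
(See the Upload with Python tutorial for details.)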

Upload your dataset with push_to_hub():

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset.push_to_hub("namespace/your_dataset_name")  # placeholder repository id

Loading Script

Write a dataset loading script to share a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.

├── README.md
├── my_dataset.py
└── data/  # optional, may contain your images or TAR archives

This structure allows your dataset to be loaded in one line:

from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")

This guide will show you how to create a dataset loading script for image datasets, which is a bit different from creating a loading script for text datasets. You’ll learn how to:

  • Create a dataset builder class.
  • Create dataset configurations.
  • Add dataset metadata.
  • Download and define the dataset splits.
  • Generate the dataset.
  • Generate the dataset metadata (optional).
  • Upload the dataset to the Hub.

The best way to learn is to open up an existing image dataset loading script, like Food-101, and follow along!
To help you get started, we created a loading script template you can copy and use as a starting point!


  • Create a dataset builder class
  • Add dataset metadata
  • Download the dataset and define the splits
  • Generate the dataset
  • Generate the dataset metadata (if needed)
  • Upload the dataset to the Hugging Face Hub

Create a dataset builder class

GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:

  • _info stores information about your dataset like its description, license, and features.
  • _split_generators downloads the dataset and defines its splits.
  • _generate_examples generates the images and labels for each split.

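GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. The class has three methods for building your dataset:

  • _info (much of its content comes from code placed before the class) stores information about the dataset, such as its description, license and features.
  • _split_generators (a method inside the class) downloads the dataset and defines how it is split.
  • _generate_examples (a method inside the class) generates the images and labels for each split.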

Start by creating your dataset class as a subclass of GeneratorBasedBuilder and add the three methods. Don’t worry about filling in each of these methods yet, you’ll develop those over the next few sections:


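import datasets


class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""

    def _info(self):
        ...  # filled in under "Add dataset metadata"

    def _split_generators(self, dl_manager):
        ...  # filled in under "Download and define the dataset splits"

    def _generate_examples(self, images, metadata_path):
        ...  # filled in under "Generate the dataset"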

Multiple configurations

In some cases, a dataset may have more than one configuration. For example, if you check out the Imagenette dataset, you’ll notice there are three subsets.


To create different configurations, use the BuilderConfig class to create a subclass for your dataset. Provide the links to download the images and labels in data_url and metadata_urls:


class Food101Config(datasets.BuilderConfig):
    """Builder Config for Food-101"""

    def __init__(self, data_url, metadata_urls, **kwargs):
        """BuilderConfig for Food-101.

        Args:
          data_url: `string`, url to download the zip file from.
          metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
          **kwargs: keyword arguments forwarded to super.
        """
        super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
        self.data_url = data_url
        self.metadata_urls = metadata_urls

Now you can define your subsets at the top of GeneratorBasedBuilder. Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.
Define your subsets with Food101Config in a list in BUILDER_CONFIGS.
For each configuration, provide a name, description, and where to download the images and labels from.


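class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""
    BUILDER_CONFIGS = [  # the data_url values are placeholders, like the metadata URLs
        Food101Config(name="breakfast",
            description="Food types commonly eaten during breakfast.",
            data_url="https://link-to-breakfast-foods.zip",
            metadata_urls={
                "train": "https://link-to-breakfast-foods-train.txt",
                "validation": "https://link-to-breakfast-foods-validation.txt"}),
        Food101Config(name="dinner",
            description="Food types commonly eaten during dinner.",
            data_url="https://link-to-dinner-foods.zip",
            metadata_urls={
                "train": "https://link-to-dinner-foods-train.txt",
                "validation": "https://link-to-dinner-foods-validation.txt"}),
    ]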

Now if users want to load the breakfast configuration, they can use the configuration name:


from datasets import load_dataset
ds = load_dataset("food101", "breakfast", split="train")

Add dataset metadata

Adding information about your dataset is useful for users to learn more about it. This information is stored in the DatasetInfo class which is returned by the info method. Users can access this information by:


from datasets import load_dataset_builder
ds_builder = load_dataset_builder("food101")
print(ds_builder.info)  # the DatasetInfo object described above

There is a lot of information you can specify about your dataset, but some important ones to include are:

  1. description provides a concise description of the dataset.
  2. features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.
  3. supervised_keys specify the input feature and label.
  4. homepage provides a link to the dataset homepage.
  5. citation is a BibTeX citation of the dataset.
  6. license states the dataset’s license.

You’ll notice a lot of the dataset information is defined earlier in the loading script, which makes it easier to read. There are also other Features you can input, so be sure to check out the full list for more details.


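  1. description: a concise description of the dataset
  2. features: specify the column types of the dataset. The Image feature goes here too
  3. supervised_keys: specify the input feature and the label
  4. homepage: a link to the dataset's homepage
  5. citation: a BibTeX citation (reference) for the dataset
  6. license: the dataset's license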


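def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features({
            "image": datasets.Image(),
            "label": datasets.ClassLabel(names=_NAMES),
        }),
        supervised_keys=("image", "label"),
        homepage=_HOMEPAGE, citation=_CITATION, license=_LICENSE,
        task_templates=[ImageClassification(image_column="image", label_column="label")],
    )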

Download and define the dataset splits


Now that you’ve added some information about your dataset, the next step is to download the dataset and generate the splits.


  1. Use the DownloadManager.download() method to download the dataset and any other metadata you’d like to associate with it. This method accepts:
     • a name to a file inside a Hub dataset repository (in other words, the data/ folder)
     • a URL to a file hosted somewhere else
     • a list or dictionary of file names or URLs

In the Food-101 loading script, you’ll notice again the URLs are defined earlier in the script.


  • The name of a file inside a Hugging Face dataset repository
  • A URL to a file hosted elsewhere
  • A list or dictionary of file names or URLs

After you’ve downloaded the dataset, use the SplitGenerator to organize the images and labels in each split. Name each split with a standard name like: Split.TRAIN, Split.TEST, and Split.VALIDATION.

After downloading the dataset, use SplitGenerator to organize the images and labels of each split. Give each split a standard name such as Split.TRAIN, Split.TEST or Split.VALIDATION.

In the gen_kwargs parameter, specify the file paths to the images to iterate over and load. If necessary, you can use DownloadManager.iter_archive() to iterate over images in TAR archives. You can also specify the associated labels in the metadata_path. The images and metadata_path are actually passed onto the next step where you’ll actually generate the dataset.


To stream a TAR archive file, you need to use DownloadManager.iter_archive()! The DownloadManager.download_and_extract() function does not support TAR archives in streaming mode.

To stream a TAR file, you need to use DownloadManager.iter_archive(). The DownloadManager.download_and_extract() function does not support TAR files in streaming mode.
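The sequential nature of TAR access can be tried out with the standard library alone. The sketch below builds a tiny archive in memory and walks it in streaming mode, yielding (path, file object) pairs one after another, which is the same shape of data that the text describes iter_archive() as producing. It is an analogy, not the datasets implementation, and the archive contents and helper names are assumptions for this example.

```python
import io
import tarfile


def build_tar(files):
    """Build an in-memory TAR archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_tar(fileobj):
    """Yield (path, file object) pairs in archive order, reading strictly forward."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


archive = build_tar({"images/a.jpg": b"aaa", "images/b.jpg": b"bb"})
# Each file object must be read before moving on to the next member.
seen = [(path, f.read()) for path, f in iter_tar(archive)]
print(seen)
```

Because the stream only moves forward, anything needed to interpret a member (such as label metadata) has to be available before that member is reached.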

def _split_generators(self, dl_manager):
    archive_path = dl_manager.download(_BASE_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={
            "images": dl_manager.iter_archive(archive_path),
            "metadata_path": split_metadata_paths["train"],
        }),
        datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={
            "images": dl_manager.iter_archive(archive_path),
            "metadata_path": split_metadata_paths["test"],
        }),
    ]

Generate the dataset

The last method in the GeneratorBasedBuilder class actually generates the images and labels in the dataset. It yields a dataset according to the structure specified in features from the _info method. As you can see, _generate_examples accepts the images and metadata_path from the previous method as arguments.

As you can see, _generate_examples receives images and metadata_path from the previous method as arguments.

To stream a TAR archive file, the metadata_path needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.
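The metadata-first order can be checked without downloading anything. In the sketch below the metadata is loaded into a set up front, and the (path, file object) pairs are then consumed in sequence, keeping only the listed images. The IMAGES_DIR prefix, the sample paths and the function signature are assumptions for this example. The real script reads the metadata from metadata_path and gets its pairs from the archive.

```python
import io

IMAGES_DIR = "food-101/images/"  # assumed archive prefix for this example


def generate_examples(images, metadata_lines):
    """Yield (key, example) pairs for images listed in the metadata."""
    files_to_keep = set(metadata_lines)  # metadata is in hand before iterating
    for file_path, file_obj in images:
        if not file_path.startswith(IMAGES_DIR):
            continue
        key = file_path[len(IMAGES_DIR):-len(".jpg")]
        if key in files_to_keep:
            yield file_path, {
                "image": {"path": file_path, "bytes": file_obj.read()},
                "label": file_path.split("/")[2],  # class name is the third path component
            }


images = [
    ("food-101/images/apple_pie/1.jpg", io.BytesIO(b"x")),
    ("food-101/images/sushi/2.jpg", io.BytesIO(b"y")),
    ("food-101/meta/train.txt", io.BytesIO(b"")),
]
examples = dict(generate_examples(images, ["apple_pie/1"]))
print(examples)
```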


Now you can write a function for opening and loading examples from the dataset:

def _generate_examples(self, images, metadata_path):
    """Generate images and labels for splits."""
    with open(metadata_path, encoding="utf-8") as f:
        files_to_keep = set(f.read().split("\n"))
    for file_path, file_obj in images:
        if file_path.startswith(_IMAGES_DIR):
            if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
                label = file_path.split("/")[2]
                yield file_path, {
                    "image": {"path": file_path, "bytes": file_obj.read()},
                    "label": label,
                }

Generate the dataset metadata (optional)

The dataset metadata can be generated and stored in the dataset card (README.md file).
Run the following command to generate your dataset metadata in README.md and make sure your new loading script works correctly:


datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs

If your loading script passed the test, you should now have the dataset_info YAML fields in the header of the README.md file in your dataset folder.

If the loading script passes the test, the dataset_info YAML fields should now appear at the top of the README.md file in your dataset folder.

Upload the dataset to the Hub

Once your script is ready, create a dataset card and upload it to the Hub.
Congratulations, you can now load your dataset from the Hub! 🥳

As soon as the script is ready, create a dataset card and upload it to the Hub. You can now load your dataset from the Hugging Face Hub!

from datasets import load_dataset
dataset = load_dataset("<username>/my_dataset")  # placeholder repository id



DatasetInfo :

  • File tree structure
  • The contents of data look like this


  • If the contents of the loading script don't make sense, trying each piece one at a time in Google Colab is also an option.
!pip install datasets
import datasets
from huggingface_hub import HfApi
from datasets import DownloadManager, DatasetInfo
from datasets.data_files import DataFilesDict
import os
import json

# I have no dedicated storage such as AWS for the data,
# so in my case I need to get the data URLs through the Hugging Face API
dl_manager = DownloadManager()
_NAME = "mickylan2367/LoadingScriptPractice"
_REVISION = "main"
hfh_dataset_info = HfApi().dataset_info(_NAME, revision=_REVISION, timeout=100.0)


DatasetInfo: { 
  {'_id': '652500e0d4b61d080797cc9a',
   'author': 'mickylan2367',
   'cardData': {'language': ['en'], 'license': 'cc-by-sa-4.0', 'tags': ['music']},
   'citation': None,
   'description': None,
   'disabled': False,
   'downloads': 0,
   'gated': False,
   'id': 'mickylan2367/LoadingScriptPractice',
   'lastModified': '2023-10-10T11:04:50.000Z',
   'likes': 0,
   'private': False,
   'sha': '....',
   'siblings': [RepoFile: {'blob_id': None, 'lfs': None, 'rfilename': '.gitattributes', 'size': None},
                RepoFile: {'blob_id': None, 'lfs': None, 'rfilename': 'LoadingScriptPractice.py', 'size': None},
                RepoFile: {'blob_id': None, 'lfs': None, 'rfilename': 'README.md', 'size': None},
                RepoFile: {'blob_id': None, 'lfs': None, 'rfilename': 'data/data-0000.zip', 'size': None},
                RepoFile: {'blob_id': None, 'lfs': None, 'rfilename': 'data/metadata.jsonl', 'size': None}],
   'tags': ['language:en', 'license:cc-by-sa-4.0', 'music', 'region:us']}
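  • Get the URLs of the **.zip files as a dict
data_path = DataFilesDict.from_hf_repo(
    {datasets.Split.TRAIN: ["**"]},
    dataset_info=hfh_dataset_info,  # assumed: the object fetched above (argument missing in the original listing)
    allowed_extensions=["zip", ".zip"],
)
print(data_path)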
{NamedSplit('train'): ['hf://datasets/mickylan2367/LoadingScriptPractice
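  • Download from the URL with dl_manager and pass the result to iter_archive().
  • dl_manager.download(URL): downloads from the URL and returns the path
  • dl_manager.iter_archive(PATH): returns an iterator (an object you can loop over with for) of (file name, file object) tuples
for item in dl_manager.iter_archive(dl_manager.download(data_path["train"][0])):
    print(item)

The output is a tuple of (file name, file object for the zipped data):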
('spectrogram_00000.png', <zipfile.ZipExtFile name='spectrogram_00000.png' mode='r' compress_type=deflate>)
('spectrogram_00001.png', <zipfile.ZipExtFile name='spectrogram_00001.png' mode='r' compress_type=deflate>)
('spectrogram_00002.png', <zipfile.ZipExtFile name='spectrogram_00002.png' mode='r' compress_type=deflate>)
('spectrogram_00003.png', <zipfile.ZipExtFile name='spectrogram_00003.png' mode='r' compress_type=deflate>)
('spectrogram_00004.png', <zipfile.ZipExtFile name='spectrogram_00004.png' mode='r' compress_type=deflate>)
('spectrogram_00005.png', <zipfile.ZipExtFile name='spectrogram_00005.png' mode='r' compress_type=deflate>)
('spectrogram_00006.png', <zipfile.ZipExtFile name='spectrogram_00006.png' mode='r' compress_type=deflate>)
('spectrogram_00007.png', <zipfile.ZipExtFile name='spectrogram_00007.png' mode='r' compress_type=deflate>)


