
Introducing datasets, a library that makes sample NLP datasets easy to work with

Posted at 2021-02-24

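Introduction

I only recently learned that Hugging Face provides a library called datasets, which makes it easy to work with sample data for natural language processing, so here are some quick notes on how to use it.

The official page is here (admittedly, its Usage section already explains how to use it...).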

https://github.com/huggingface/datasets

There is also a Quick tour.

https://huggingface.co/docs/datasets/quicktour.html

Usage

The library can be installed with pip.

pip install datasets

The datasets published through the library can be listed with datasets.list_datasets().
As of this post (February 24, 2021), 680 datasets appear to be available.

import datasets

# List the publicly available datasets (fetch the list once and reuse it)
all_datasets = datasets.list_datasets()
print(all_datasets[:10])  # show only the first 10 entries
print(len(all_datasets))  # also check the total count
# ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']
# 680

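At a glance, a wide variety of datasets are available, including the IMDb dataset, which consists of movie reviews labeled as positive or negative, and SQuAD, a question answering dataset.

For example, to load emotion, a dataset that looks useful for sentiment analysis, use datasets.load_dataset as shown below. Passing split='train' fetches only the training data.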

# load_dataset downloads the dataset with a single call
emotion_dataset = datasets.load_dataset('emotion')

# To fetch only the training data
emotion_train_dataset = datasets.load_dataset('emotion', split='train')

# Without split, a DatasetDict is returned
print(emotion_dataset)
# DatasetDict({
#   train: Dataset({
#        features: ['text', 'label'],
#        num_rows: 16000
#    })
#    validation: Dataset({
#        features: ['text', 'label'],
#        num_rows: 2000
#    })
#    test: Dataset({
#        features: ['text', 'label'],
#        num_rows: 2000
#    })
# })

The DatasetDict returned by datasets.load_dataset can be accessed like a dictionary.
The output above shows that it contains 16,000 training examples, 2,000 validation examples, and 2,000 test examples.
The training data can be accessed as follows.

# Show the first 10 training examples
print(emotion_dataset['train']['text'][:10])
# ['i didnt feel humiliated',
# 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
# 'im grabbing a minute to post i feel greedy wrong',
# 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
# 'i am feeling grouchy',
# 'ive been feeling a little burdened lately wasnt sure why that was',
# 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny',
# 'i feel as confused about life as a teenager or as jaded as a year old man',
# 'i have been with petronas for years i feel that petronas has performed well and made a huge profit',
# 'i feel romantic too']

# The ground-truth labels are stored as integers
print(emotion_dataset['train']['label'][:10])
# [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]
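
To read texts and labels side by side, the two columns can be zipped together. The following is a minimal sketch in plain Python: `texts`, `labels`, and `pairs` are names introduced here for illustration, and the lists are stand-ins holding the first few values printed above rather than the dataset object itself.

```python
# Stand-in values copied from the output above (assumption: in practice these
# would come from emotion_dataset['train']['text'] and ['label']).
texts = [
    "i didnt feel humiliated",
    "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake",
    "im grabbing a minute to post i feel greedy wrong",
]
labels = [0, 0, 3]

# Pair each text with its integer label.
pairs = list(zip(texts, labels))

for text, label in pairs:
    # Truncate long texts so each row fits on one line.
    print(label, text[:40])
```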

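Judging from the lowercase "I" and the missing apostrophes, some basic preprocessing seems to have been applied already. Very handy.
The data already looks ready to feed straight into something like BERT, but numeric labels alone do not tell you what each class means. In that case, go to the datasets/ folder of the official repository and open the detail page for the dataset in question.
For this example, the label information was listed under Data Fields in the Dataset Structure section of the emotion/ folder.
(https://github.com/huggingface/datasets/tree/master/datasets/emotion)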

default

text: a string feature.
label: a classification label, with possible values including sadness (0), joy (1), love (2), anger (3), fear (4).

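It says that 0 is sadness, 1 is joy, and so on. However, although the labels go up to 5, the page does not say what label 5 means. Perhaps it is just an omission? class_names is defined on line 40 of the actual source file emotion.py, and since label indices follow the order of that list, label 5 is presumably surprise.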

class Emotion(datasets.GeneratorBasedBuilder):
    def _info(self):
        class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.ClassLabel(names=class_names)}
            ),
            supervised_keys=("text", "label"),
            homepage=_URL,
            citation=_CITATION,
        )
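
With the `class_names` list above, the integer labels can be turned back into readable names by simple list indexing. The sketch below is plain Python and assumes that label indices follow the order of `class_names`; `label_names` is a name introduced here for illustration, and `labels` holds the ten values printed earlier in this article.

```python
# Class names copied from emotion.py as quoted above.
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# The first ten training labels printed earlier in this article.
labels = [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]

# Assumption: label i corresponds to class_names[i].
label_names = [class_names[i] for i in labels]
print(label_names)
# ['sadness', 'sadness', 'anger', 'love', 'anger', 'sadness', 'surprise', 'fear', 'joy', 'love']
```

The ClassLabel feature in datasets also offers an int2str method, so something like emotion_dataset['train'].features['label'].int2str(5) should give the same name without hard-coding the list.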

Once you know how to fetch the data and what the labels mean, you can do whatever you like with it.

That's all.
