Trying to Download the ReazonSpeech Dataset

Posted at 2023-12-12

1. What is the ReazonSpeech dataset?

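ReazonSpeech is a Japanese ASR corpus containing about 19,000 hours of speech collected from TV programs.
Both the speech corpus and the speech recognition models are provided free of charge and can be used commercially.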

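You can also try the speech recognition model on the ReazonSpeech project site, but this time I wanted to train a model myself, so this post covers the steps up to downloading the dataset.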

2. Downloading the dataset

I basically followed the ReazonSpeech HowTo guide.
I got stuck in a few places, and this was my first time loading anything from HuggingFace, so I am writing up the concrete steps as well.

Agreeing to the terms in the browser

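You need to agree to the terms of use on HuggingFace.
Even after agreeing, the Dataset Viewer still shows the error below, but this appears to be harmless, so I moved on to the next step.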

Cannot get the config names for the dataset.
Error code:   ConfigNamesError
Exception:    GatedRepoError
Message:      403 Client Error. (Request ID: Root=1-65337e1a-11cb0c8b33cd42255b7329e9;0bdf386e-d997-4ba2-baeb-a5168564b673)

Cannot access gated repo for url https://huggingface.co/api/datasets/reazon-research/reazonspeech.
Access to dataset reazon-research/reazonspeech is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/reazon-research/reazonspeech to ask for access.
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/dataset/config_names.py", line 65, in compute_config_names_response
                  for config in sorted(get_dataset_config_names(path=dataset, token=hf_token))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 351, in get_dataset_config_names
                  dataset_module = dataset_module_factory(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1512, in dataset_module_factory
                  raise e1 from None
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1479, in dataset_module_factory
                  raise e
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1453, in dataset_module_factory
                  dataset_info = hf_api.dataset_info(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
                  return fn(*args, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 1741, in dataset_info
                  hf_raise_for_status(r)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 277, in hf_raise_for_status
                  raise GatedRepoError(message, response) from e
              huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-65337e1a-11cb0c8b33cd42255b7329e9;0bdf386e-d997-4ba2-baeb-a5168564b673)
              
              Cannot access gated repo for url https://huggingface.co/api/datasets/reazon-research/reazonspeech.
              Access to dataset reazon-research/reazonspeech is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/reazon-research/reazonspeech to ask for access.

Installing the required modules

$ pip install datasets soundfile librosa

Logging in to HuggingFace

To download the corpus, you also need to log in from the CLI.

$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli login

You will be prompted for a token here. Paste an Access Token with READ permission, created under HuggingFace > Settings, and the login is complete.
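
Since the Dataset Viewer error above is not a reliable signal, it can help to check from your own machine whether the saved token is actually authorized for the gated repository. The following is a minimal sketch, not part of the HowTo guide: the helper name `can_access_dataset` is hypothetical, and it assumes that `huggingface_hub` raises an HTTP error carrying a `response` with a 401 or 403 status code when access is denied.

```python
def can_access_dataset(repo_id, dataset_info=None):
    """Return True if metadata for repo_id can be fetched, False on 401/403."""
    if dataset_info is None:
        # Imported lazily so the helper can also be exercised with a stub.
        from huggingface_hub import HfApi

        # HfApi picks up the token saved by `huggingface-cli login`.
        dataset_info = HfApi().dataset_info
    try:
        dataset_info(repo_id)
    except Exception as err:
        # Assumption: access errors expose the HTTP response on `err.response`.
        response = getattr(err, "response", None)
        status = getattr(response, "status_code", None)
        if status in (401, 403):
            return False
        raise  # anything else (network failure, typo in repo_id) is not an access problem
    return True


# Usage (requires network access and a prior `huggingface-cli login`):
#   print(can_access_dataset("reazon-research/reazonspeech"))
```

If this returns False after you have agreed to the terms in the browser, the token is the first thing worth re-checking.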

Loading the dataset

Start Python and run the following.
By default, this accesses the small dataset, which contains about 5 hours of audio (350 MB).

>>> from datasets import load_dataset
>>> ds = load_dataset("reazon-research/reazonspeech")

At this point I got the error below (as of 2023/12/11).
I could not connect to https://reazonspeech.s3.abci.ai/small.tsv from the browser either, and a search turned up that all ABCI services were suspended at the time.

>>> ds = load_dataset("reazon-research/reazonspeech")
Using the latest cached version of the module from /Users/{$PATH}/.cache/huggingface/modules/datasets_modules/datasets/reazon-research--reazonspeech/00f9d8f336dd718ea4c26dba7be9a2ce3795b9d92903c626baa912de3021ba2d (last modified on Mon Dec 11 16:05:54 2023) since it couldn't be found locally at reazon-research/reazonspeech., or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/load.py", line 2152, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/builder.py", line 948, in download_and_prepare
    self._download_and_prepare(
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/builder.py", line 1711, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/builder.py", line 1021, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/{$PATH}/.cache/huggingface/modules/datasets_modules/datasets/reazon-research--reazonspeech/00f9d8f336dd718ea4c26dba7be9a2ce3795b9d92903c626baa912de3021ba2d/reazonspeech.py", line 81, in _split_generators
    meta_path = dl_manager.download_and_extract(meta_url)
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 561, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 424, in download
    downloaded_path_or_paths = map_nested(
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 456, in map_nested
    return function(data_struct)
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/download/download_manager.py", line 450, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 182, in cached_path
    output_path = get_from_cache(
  File "/Users/{$PATH}/.env/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 599, in get_from_cache
    raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
ConnectionError: Couldn't reach https://reazonspeech.s3.abci.ai/small.tsv (ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='reazonspeech.s3.abci.ai', port=443): Max retries exceeded with url: /small.tsv (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x15c0f33d0>, 'Connection to reazonspeech.s3.abci.ai timed out. (connect timeout=100)'))")))
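
The browser check described above can also be done from Python, which is handy on a remote machine without a browser. This is a rough sketch using only the standard library; the helper name `is_reachable` and the 10-second timeout are my own choices, not anything the `datasets` library provides.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10.0):
    """Return True if the server answers at all, False on connection failure or timeout."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 403 or 404), so the host is up.
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


# Usage:
#   print(is_reachable("https://reazonspeech.s3.abci.ai/small.tsv"))
```

A False result here points to the storage host rather than to the HuggingFace login or the gated-access agreement.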

Trusting that the data will be accessible again once the service outage ends, I will try the remaining steps another day.
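
For a multi-day outage like this one, waiting is the only real fix, but for shorter interruptions a small retry wrapper can save some manual re-running. The sketch below is illustrative only: `retry` is a hypothetical helper, and it assumes the failure surfaces as Python's built-in `ConnectionError`, as the traceback above suggests.

```python
import time


def retry(fn, attempts=3, delay=60.0, exceptions=(ConnectionError,), sleep=time.sleep):
    """Call fn() until it succeeds, waiting `delay` seconds between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(delay)


# Usage (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   ds = retry(lambda: load_dataset("reazon-research/reazonspeech"))
```

The `sleep` parameter exists only so the waiting can be swapped out when trying the helper without real delays.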

I ran it again at a later date, and the download succeeded!
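
Once the download succeeds, a quick way to sanity-check the data is to summarize a single record. The sketch below rests on assumptions not confirmed in this article: that the dataset exposes a `train` split, that each record has `name`, `audio`, and `transcription` fields, and that `audio` is a dict with `array` and `sampling_rate` keys (the usual shape of an audio column in the `datasets` version used here). The helper name `describe_example` is hypothetical.

```python
def describe_example(example):
    """Summarize one record: file name, duration in seconds, and transcription."""
    audio = example["audio"]
    # Duration is the number of samples divided by the sampling rate.
    duration = len(audio["array"]) / audio["sampling_rate"]
    return {
        "name": example.get("name"),
        "seconds": round(duration, 2),
        "transcription": example.get("transcription"),
    }


# Usage after loading (field and split names are assumptions, see above):
#   ds = load_dataset("reazon-research/reazonspeech")
#   print(describe_example(ds["train"][0]))
```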


I will update this post if I make further progress.

This article was written by a beginner, so it may contain errors.
