
PUBMED_title_abstracts_2019_baseline.jsonl.zst from Chapter 5 of the NLP Course Cannot Be Downloaded

TL;DR

The dataset you are told to download in Chapter 5, Section 4 of the Hugging Face NLP Course is no longer available, so I substituted a different dataset to keep working through the course.

Environment

Google Colab Pro, CPU runtime

The problem

Chapter 5, Section 4 of the Hugging Face NLP Course tells you to load the PUBMED_title_abstracts dataset with the code below, but the URL in the code no longer works and a FileNotFoundError comes back.

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-3ed6975c9f4e> in <cell line: 5>()
      3 # This takes a few minutes to run, so go grab a tea or coffee while you wait :)
      4 data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
----> 5 pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
      6 pubmed_dataset

6 frames
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2127 
   2128     # Create a dataset builder
-> 2129     builder_instance = load_dataset_builder(
   2130         path=path,
   2131         name=name,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, **config_kwargs)
   1813         download_config = download_config.copy() if download_config else DownloadConfig()
   1814         download_config.storage_options.update(storage_options)
-> 1815     dataset_module = dataset_module_factory(
   1816         path,
   1817         revision=revision,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1428             download_config=download_config,
   1429             download_mode=download_mode,
-> 1430         ).get_module()
   1431     # Try locally
   1432     elif path.endswith(filename):

/usr/local/lib/python3.10/dist-packages/datasets/load.py in get_module(self)
    956         base_path = Path(self.data_dir or "").expanduser().resolve().as_posix()
    957         patterns = sanitize_patterns(self.data_files) if self.data_files is not None else get_data_patterns(base_path)
--> 958         data_files = DataFilesDict.from_patterns(
    959             patterns,
    960             download_config=self.download_config,

/usr/local/lib/python3.10/dist-packages/datasets/data_files.py in from_patterns(cls, patterns, base_path, allowed_extensions, download_config)
    672         for key, patterns_for_key in patterns.items():
    673             out[key] = (
--> 674                 DataFilesList.from_patterns(
    675                     patterns_for_key,
    676                     base_path=base_path,

/usr/local/lib/python3.10/dist-packages/datasets/data_files.py in from_patterns(cls, patterns, base_path, allowed_extensions, download_config)
    577             try:
    578                 data_files.extend(
--> 579                     resolve_pattern(
    580                         pattern,
    581                         base_path=base_path,

/usr/local/lib/python3.10/dist-packages/datasets/data_files.py in resolve_pattern(pattern, base_path, allowed_extensions, download_config)
    366         if allowed_extensions is not None:
    367             error_msg += f" with any supported extension {list(allowed_extensions)}"
--> 368         raise FileNotFoundError(error_msg)
    369     return out
    370 

FileNotFoundError: Unable to find 'https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst'

Cause

The URL has been taken down. There is an issue about this on GitHub, and a new URL is suggested in the thread, but that one is no longer usable now either.
https://github.com/huggingface/datasets/issues/3504#issuecomment-1674613716
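
If you want to check that the file itself is really gone (rather than something being wrong on your side), a quick HEAD request is enough. A minimal sketch using the requests library:

import requests

# The old Pile mirror URL from the course. Anything other than a 200
# response means the file is no longer being served.
data_url = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
response = requests.head(data_url, allow_redirects=True, timeout=10)
print(response.status_code)  # expect a non-200 status such as 404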

The 2023 edition of the PubMed dataset is published in .xml format, and that can still be downloaded (the whole thing is quite large, so it appears to be split into 1,166 files).
https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

However, the Hugging Face datasets library does not seem to support data in XML format, so I gave up on using it for the time being.

Workaround

Other datasets in .jsonl.zst format are available at https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/train/, so I picked one more or less at random and used that.

from datasets import load_dataset

# Load one CodeClippy shard as a stand-in for the unavailable PubMed set
data_files = "https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/train/data_1825_time1626321824_default.jsonl.zst"
code_clippy_dataset1 = load_dataset("json", data_files=data_files, split="train")
code_clippy_dataset1

Each file holds 10,000 records, so this is nowhere near the big data the original course uses, but it was at least enough to keep moving through the lesson. (If you want more data, several shards can be combined, as in the sketch below.)
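
Since load_dataset accepts a list of files for data_files, several shards can be concatenated into a single split. A minimal sketch; the second file name below is a placeholder, so substitute real names from the directory listing:

from datasets import load_dataset

base_url = "https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/train/"
# Multiple shards passed as a list are concatenated into one split.
# The second file name is a placeholder -- pick real ones from the
# directory listing linked above.
data_files = [
    base_url + "data_1825_time1626321824_default.jsonl.zst",
    base_url + "data_XXXX_timeXXXXXXXXXX_default.jsonl.zst",  # placeholder
]
code_clippy_dataset = load_dataset("json", data_files=data_files, split="train")
code_clippy_dataset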
If I find the time, I would like to put together a process that downloads the XML files and converts them to JSON; a rough sketch of what that could look like follows.
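
For reference, here is a rough, untested sketch of that conversion. The element names (PubmedArticle, PMID, ArticleTitle, AbstractText) are my reading of the PubMed baseline schema, and the output imitates the {"meta": ..., "text": ...} layout of the original Pile file; verify both before relying on this.

import gzip
import json
import urllib.request
import xml.etree.ElementTree as ET

# Download one baseline shard (file names follow the pubmed23nXXXX.xml.gz
# pattern on the FTP server) and convert it to .jsonl.
xml_name = "pubmed23n0001.xml.gz"
urllib.request.urlretrieve(
    "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" + xml_name, xml_name
)

with gzip.open(xml_name, "rb") as xml_file, \
        open("pubmed23n0001.jsonl", "w", encoding="utf-8") as out:
    # Stream through the XML so the whole file never sits in memory.
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext(".//PMID")
        title = elem.findtext(".//ArticleTitle") or ""
        abstract = " ".join(
            t.text for t in elem.iter("AbstractText") if t.text
        )
        record = {"meta": {"pmid": pmid}, "text": title + "\n" + abstract}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        elem.clear()  # free the processed article

# The result can then be loaded the usual way:
# load_dataset("json", data_files="pubmed23n0001.jsonl", split="train")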
