
tensorflow-datasets fails to load celeb_a

Last updated at 2022-01-13, posted at 2022-01-11

Overview

Following the Chapter 15 sample program from ［第3版］Python機械学習プログラミング 達人データサイエンティストによる理論と実践, an attempt to read the celeb_a dataset bundled with the tensorflow-datasets package raises an error.

import tensorflow as tf
import tensorflow_datasets as tfds

celeba_bldr = tfds.builder('celeb_a')
celeba_bldr.download_and_prepare()
celeba = celeba_bldr.as_dataset(shuffle_files=False)

Result

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM, downloaded to /home/kagimoto/tensorflow_datasets/downloads/ucexport_download_id_0B7EVK8r0v71pZjFTYXZWM3FlDDaXUAQO8EGH_a7VqGNLRtW52mva1LzDrb-V723OQN8.tmp.042afbaaffe1427db0772b3bbde42a0a/ServiceLogin, has wrong checksum:
* Expected: UrlInfo(size=1.34 GiB, checksum='46fb89443c578308acf364d7d379fe1b9efb793042c0af734b6112e4fd3a8c74', filename='img_align_celeba.zip')
* Got: UrlInfo(size=89.19 KiB, checksum='3ccf881fda9772147434f3be0e78eecb1123f374f8be108bc3e38e94b6f2f013', filename='ServiceLogin')
To debug, see: https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror

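The download URL and checksum are defined in celeb_a.txt. The error shows that an 89 KiB file named ServiceLogin was saved instead of the 1.34 GiB img_align_celeba.zip, so the cause appears to be a lack of permission to access the Google Drive file listed there.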
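To make the checksum comparison concrete, here is a minimal sketch that computes the size and digest of a local file with the standard library. It assumes the checksum is a SHA-256 hex digest, which the 64-character value in the error message suggests. The helper name `file_size_and_sha256` and the archive path are hypothetical, and this is not a reproduction of how tensorflow-datasets performs the check internally.

```python
import hashlib
import os


def file_size_and_sha256(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fd:
        # Read in chunks so a 1.34 GiB archive does not have to fit in memory.
        for chunk in iter(lambda: fd.read(chunk_size), b''):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


# Expected value copied from the error message above.
EXPECTED_SHA256 = '46fb89443c578308acf364d7d379fe1b9efb793042c0af734b6112e4fd3a8c74'

if __name__ == '__main__':
    # Hypothetical location of a manually downloaded archive (assumption).
    path = os.path.expanduser('~/img_align_celeba.zip')
    if os.path.exists(path):
        size, checksum = file_size_and_sha256(path)
        print(size, checksum == EXPECTED_SHA256)
```

A file of only a few dozen KiB, as in the error above, is a strong hint that an HTML page was saved in place of the archive.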

Environment

The following environment was used.

Software	Version
Ubuntu	20.04
Python	3.10.0
tensorflow	2.8.0rc0
tensorflow-datasets	4.4.0

Note: With Python 3.10, only tensorflow 2.8.0rc0 can be installed. As a result, tensorflow-text is not available as of January 13, 2022. For the time being, it may be better to use tensorflow 2.7.0 with Python 3.9.

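Workarounds

Reading with the tensorflow-datasets builder

This is a known problem reported in a GitHub issue. Manually download the data from the Google Drive folder given in liqinglin54951's comment there. Selecting all files and downloading them produces a file with a name like drive-download-20220111T093108Z-001.zip. Place this file in your home directory.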

mkdir -p ~/tensorflow_datasets/celeb_a/2.0.1
cd ~/tensorflow_datasets/celeb_a/2.0.1
unzip ~/drive-download-20220111T093108Z-001.zip

The code shown earlier can now load the celeb_a data.

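Reading the individual image files

From the celeb_a home page, download the zip archive containing the full set of image files, plus list_attr_celeba.txt from the Anno folder, to a suitable location. Then follow @ToppaD's article 『CelebA データがtensorflow_datasetsから読み込めないので画像をローカルに落として取り込む』 to load them for training.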

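Converting the individual image files to TFRecord format

The dataset is already published in TFRecord format on liqinglin54951's Google Drive, so there is no need to convert it yourself. Still, knowing how the conversion works seemed useful for future reference, so I looked into it.

Obtaining the dataset information

dataset_info.json and image.image.json are required, so obtain them from GitHub. The dataset version appears to be 2.0.1, however, so correct the version number in dataset_info.json.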

--- dataset_info.json.orig      2022-01-12 16:04:13.750169171 +0900
+++ dataset_info.json   2022-01-12 16:04:18.368477306 +0900
@@ -117,5 +117,5 @@
       }
     }
   ],
-  "version": "2.0.0"
+  "version": "2.0.1"
 }

Place the obtained files in the working directory ~/work/celeb_a.
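The version change shown in the diff can also be scripted rather than edited by hand. The following is a minimal sketch using only the standard library; the function name `set_dataset_version` is hypothetical, and the path assumes the working directory used in this article. Note that rewriting the file with `json.dump` may change its whitespace compared with the original.

```python
import json
import os


def set_dataset_version(info_path, version):
    """Rewrite the "version" field of a dataset_info.json file in place."""
    with open(info_path, 'r') as fd:
        info = json.load(fd)
    info['version'] = version
    with open(info_path, 'w') as fd:
        json.dump(info, fd, indent=2)
        fd.write('\n')
    return info


if __name__ == '__main__':
    # Working directory used in this article.
    path = os.path.expanduser('~/work/celeb_a/dataset_info.json')
    if os.path.exists(path):
        set_dataset_version(path, '2.0.1')
```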

Obtaining the celeba data

Download the following from the celeb_a home page.

img_align_celeba.zip
list_attr_celeba.txt
list_landmarks_celeba.txt
list_eval_partition.txt

Place the txt files in the working directory, and extract the zip file into the working directory.
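Before converting, it may be worth checking that the number of images per split in list_eval_partition.txt agrees with the shard lengths recorded in dataset_info.json. This is a minimal sketch under the assumptions that each line of the partition file has the form "file name, split id" (0: train, 1: validation, 2: test) and that each entry of `splits` carries `name` and `shardLengths`, as the conversion script in this article relies on. The function names are hypothetical.

```python
import collections
import json
import os

SPLIT_IDS = {'train': 0, 'validation': 1, 'test': 2}


def count_partition(lines):
    """Count images per split id from lines of the form '<file name> <split id>'."""
    counts = collections.Counter()
    for line in lines:
        cols = line.split()
        if len(cols) == 2:
            counts[int(cols[1])] += 1
    return counts


def expected_counts(dataset_info):
    """Sum the shardLengths of every split listed in dataset_info.json."""
    return {
        split['name']: sum(int(n) for n in split['shardLengths'])
        for split in dataset_info['splits']
    }


if __name__ == '__main__':
    work_dir = os.path.expanduser('~/work/celeb_a')
    info_path = os.path.join(work_dir, 'dataset_info.json')
    part_path = os.path.join(work_dir, 'list_eval_partition.txt')
    if os.path.exists(info_path) and os.path.exists(part_path):
        with open(info_path) as fd:
            expected = expected_counts(json.load(fd))
        with open(part_path) as fd:
            actual = count_partition(fd)
        for name, split_id in SPLIT_IDS.items():
            print(name, actual[split_id], expected.get(name))
```

If the two numbers differ for a split, the last shard written by the conversion would end up shorter than dataset_info.json declares.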

Converting to TFRecord

Use the following program to perform the conversion.

convert_to_tfrecord.py

import os
import json
import tensorflow as tf

# https://www.tensorflow.org/tutorials/load_data/tfrecord?hl=ja#tfexample
def _bytes_feature(value):
    """Return a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    """Return an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Define a function that builds tf.train.Features for one image
def create_features(image_string, attributes, landmarks):
    feature = {
        'image': _bytes_feature(image_string),
    }
    for k, v in attributes.items():
        feature['attributes/' + k] = _int64_feature(v)
    for k, v in landmarks.items():
        feature['landmarks/' + k] = _int64_feature(v)

    return tf.train.Features(feature=feature)


# Load the dataset information
work_dir = os.path.join(os.environ['HOME'], 'work/celeb_a')
with open(os.path.join(work_dir, 'dataset_info.json'), 'r') as fd:
    dataset_info = json.load(fd)
with open(os.path.join(work_dir, 'image.image.json'), 'r') as fd:
    image_info = json.load(fd)

# Load the image attribute information
attributes = {}
with open(os.path.join(work_dir, 'list_attr_celeba.txt'), 'r') as fd:
    n_records = int(fd.readline())
    header_rec = fd.readline()
    headers = header_rec.split()
    for _ in range(n_records):
        rec = fd.readline()
        cols = rec.split()
        attributes[cols[0]] = dict(zip(headers, [int(col) == 1 for col in cols[1:]]))

# Load the landmark information
landmarks = {}
with open(os.path.join(work_dir, 'list_landmarks_celeba.txt'), 'r') as fd:
    n_records = int(fd.readline())
    header_rec = fd.readline()
    headers = header_rec.split()
    for _ in range(n_records):
        rec = fd.readline()
        cols = rec.split()
        landmarks[cols[0]] = dict(zip(headers, [int(col) for col in cols[1:]]))

# Load the split information
split_info = {}
with open(os.path.join(work_dir, 'list_eval_partition.txt'), 'r') as fd:
    for rec in fd:
        cols = rec.split()
        if len(cols) < 2:
            continue  # skip blank lines
        split_info[cols[0]] = int(cols[1])

categories = {
    'train': 0,
    'validation': 1,
    'test': 2,
}

# Define the output directory (it may already exist from an earlier step)
dest_dir = os.path.join(os.environ['HOME'], 'tensorflow_datasets', 'celeb_a', dataset_info['version'])
os.makedirs(dest_dir, exist_ok=True)


# Write the data out in TFRecord format
for category, val in categories.items():
    image_file_list = [k for k, v in split_info.items() if v == val]
    idx = [idx for idx, d in enumerate(dataset_info['splits']) if d.get('name') == category][0]
    split_lengths = dataset_info['splits'][idx]["shardLengths"]
    n_splits = len(split_lengths)

    n_start, n_end = 0, 0
    for i in range(n_splits):
        output_file = os.path.join(dest_dir, f'celeb_a-{category}.tfrecord-{i:05d}-of-{n_splits:05d}')
        n_end += int(split_lengths[i])
        with tf.io.TFRecordWriter(output_file) as writer:
            for image in image_file_list[n_start:n_end]:
                with open(os.path.join(work_dir, 'img_align_celeba', image), 'rb') as img_fd:
                    image_string = img_fd.read()
                ex = tf.train.Example(features=create_features(image_string, attributes[image], landmarks[image]))
                writer.write(ex.SerializeToString())
        n_start = n_end

python convert_to_tfrecord.py
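The n_start / n_end bookkeeping in convert_to_tfrecord.py slices each split's file list into consecutive shards whose sizes come from shardLengths. The same slicing can be sketched in isolation with `itertools.accumulate`. The function names `shard_ranges` and `shard_file_name` are hypothetical and exist only for illustration; the file name pattern is the one used in the script above.

```python
import itertools


def shard_ranges(shard_lengths):
    """Turn per-shard record counts into (start, end) index pairs."""
    ends = list(itertools.accumulate(int(n) for n in shard_lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def shard_file_name(category, index, n_shards):
    """Build a shard file name in the same pattern as convert_to_tfrecord.py."""
    return f'celeb_a-{category}.tfrecord-{index:05d}-of-{n_shards:05d}'


if __name__ == '__main__':
    # Illustrative shard lengths, not the real values from dataset_info.json.
    lengths = ['3', '2', '4']
    for i, (start, end) in enumerate(shard_ranges(lengths)):
        print(shard_file_name('train', i, len(lengths)), start, end)
```

Each (start, end) pair corresponds to one `image_file_list[n_start:n_end]` slice in the script.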

Copying the dataset information

cp ~/work/celeb_a/dataset_info.json ~/tensorflow_datasets/celeb_a/2.0.1/
cp ~/work/celeb_a/image.image.json ~/tensorflow_datasets/celeb_a/2.0.1/

The data can now be read with the tensorflow-datasets builder.
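If the builder still tries to download the data, a quick look at the dataset directory can show whether something was not copied. The sketch below uses only the standard library and checks for the two JSON files and at least one shard per split, following the file names used in this article. The function name `missing_dataset_files` is hypothetical, and passing this check does not guarantee that tensorflow-datasets will accept the directory.

```python
import os


def missing_dataset_files(data_dir, splits=('train', 'validation', 'test')):
    """Return a list of things that appear to be missing from `data_dir`."""
    missing = []
    for name in ('dataset_info.json', 'image.image.json'):
        if not os.path.isfile(os.path.join(data_dir, name)):
            missing.append(name)
    names = os.listdir(data_dir) if os.path.isdir(data_dir) else []
    for split in splits:
        prefix = f'celeb_a-{split}.tfrecord-'
        if not any(n.startswith(prefix) for n in names):
            missing.append(prefix + '*')
    return missing


if __name__ == '__main__':
    data_dir = os.path.expanduser('~/tensorflow_datasets/celeb_a/2.0.1')
    print(missing_dataset_files(data_dir))
```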
