The COCO dataset's `instances_*2017.json` contains `annotation_id` values that are too large to represent as `int32`

Last updated at 2025-08-20. Posted at 2025-08-20.

Environment

Python 3.13

What happened

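I am working with instances_val2017.json from the COCO dataset.
When I tried to store annotation_id as int32 data, I ran into the integer 900100448263, which is too large for int32 (the int32 maximum is 2,147,483,647).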

In [114]: instances = json.load(open("instances_val2017.json"))

In [115]: annotations = instances["annotations"]

In [118]: annotations[36335]
Out[118]:
{'segmentation': {
  'counts': [272, 2, ...],
  'size': [240, 320]},
 'area': 18419,
 'iscrowd': 1,
 'image_id': 448263,
 'bbox': [1, 0, 276, 122],
 'category_id': 1,
 'id': 900100448263}
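To make the overflow concrete, here is a minimal sketch that checks whether a value fits in a signed 32-bit integer. The helper name `fits_int32` is hypothetical and is not part of the original session; it uses only the id and image_id quoted above.

```python
# Signed 32-bit integers cover -2**31 .. 2**31 - 1.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1  # 2147483647


def fits_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# The annotation id quoted above, and its image_id for comparison.
print(fits_int32(900100448263))  # False
print(fits_int32(448263))        # True
```

The 12-digit id is roughly 419 times larger than the int32 maximum, so no unsigned 32-bit type would hold it either.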

What I investigated

Number of annotations and images

In [124]: images = instances["images"]

In [125]: len(images)
Out[125]: 5000

In [126]: len(annotations)
Out[126]: 36781

Distribution of `annotation_id` by number of digits

In [128]: import collections

In [129]: 
     ...: digit_counts = collections.Counter(len(str(x["id"])) for x in annotations)
     ...:
     ...: for digits in sorted(digit_counts):
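     ...:     print(f"{digits} digits: {digit_counts[digits]} annotations")
     ...:
3 digits: 5 annotations
4 digits: 102 annotations
5 digits: 1879 annotations
6 digits: 12315 annotations
7 digits: 22034 annotations
12 digits: 446 annotations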

The annotation_id values jump straight from 7 digits to 12 digits, with nothing in between.
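Counting digits is an indirect test, since a 10-digit number may or may not fit in int32. As a sketch of a more direct check, the ids can be compared against the int32 maximum. The function name `ids_exceeding_int32` and the sample list are assumptions for illustration; only 900100448263 comes from the actual file.

```python
INT32_MAX = 2**31 - 1  # 2147483647


def ids_exceeding_int32(annotations):
    """Return the annotation ids that do not fit in a signed 32-bit integer."""
    return [a["id"] for a in annotations if a["id"] > INT32_MAX]


# Illustrative sample: the first two ids are made up,
# the last one is the id quoted in this article.
sample = [{"id": 123}, {"id": 1234567}, {"id": 900100448263}]
print(ids_exceeding_int32(sample))  # [900100448263]
```

Applied to the real `annotations` list, `len(ids_exceeding_int32(annotations))` should presumably equal the 12-digit count shown above.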

Filtering by `iscrowd==1`

In [131]: annotations_iscrowd1 = [a for i,a in enumerate(annotations) if a["iscrowd"]==1]

In [132]: indexes_iscrowd1 = [i for i,a in enumerate(annotations) if a["iscrowd"]==1]

In [133]: len(annotations_iscrowd1)
Out[133]: 446

In [134]: indexes_iscrowd1[0:3]
Out[134]: [36335, 36336, 36337]

In [138]: indexes_iscrowd1[-3:]
Out[138]: [36778, 36779, 36780]

In [136]: len(annotations)
Out[136]: 36781

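The number of annotations with iscrowd==1 (446) matches the number of annotations whose annotation_id has 12 digits.
So an annotation with iscrowd==1 can apparently have an annotation_id that is too large to represent as int32.

Comparing against image_id: annotations[36335] has image_id=448263 and annotation_id=900100448263, so the annotation_id ends with the image_id. The same held for the other iscrowd=1 annotations. It appears that the annotation_id of an iscrowd=1 annotation is generated by concatenating a prefix with the image_id, which is why it ends up too large for int32.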
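The suffix relationship described above can be checked programmatically. The following is a rough sketch, not the exact code used for this article; both function names are hypothetical, and the second sample record is made up.

```python
def id_ends_with_image_id(annotation: dict) -> bool:
    """True if the decimal digits of `id` end with those of `image_id`."""
    return str(annotation["id"]).endswith(str(annotation["image_id"]))


def all_crowd_ids_end_with_image_id(annotations) -> bool:
    """Check the suffix relationship for every iscrowd == 1 annotation."""
    crowd = [a for a in annotations if a["iscrowd"] == 1]
    return all(id_ends_with_image_id(a) for a in crowd)


sample = [
    # Values quoted in this article.
    {"id": 900100448263, "image_id": 448263, "iscrowd": 1},
    # Made-up non-crowd record; it is ignored by the check.
    {"id": 42, "image_id": 7, "iscrowd": 0},
]
print(all_crowd_ids_end_with_image_id(sample))  # True
```

A string suffix match is a weak test for very short image_id values, because a one- or two-digit suffix can match by coincidence, so a result of True is suggestive rather than proof of how the ids were generated.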

Supplementary notes

Every annotation from annotations[36335] onward has iscrowd=1.
instances_train2017.json behaves the same way: it also contains integers that cannot be represented as int32.

Summary

The annotation_id of an annotation with iscrowd==1 can be an integer too large to represent as int32.
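One practical consequence is that a 64-bit integer field is a safer choice for annotation_id. This article does not say which storage library was involved, so the sketch below uses the standard library's `struct` module as a stand-in for any fixed-width integer storage; `pack_id` is a hypothetical helper.

```python
import struct
from typing import Optional

ANNOTATION_ID = 900100448263  # id quoted in this article


def pack_id(value: int, fmt: str) -> Optional[bytes]:
    """Pack value with the given struct format, or return None if it does not fit."""
    try:
        return struct.pack(fmt, value)
    except struct.error:
        return None


as_int32 = pack_id(ANNOTATION_ID, "<i")  # 4-byte signed integer
as_int64 = pack_id(ANNOTATION_ID, "<q")  # 8-byte signed integer

print(as_int32)                                           # None
print(len(as_int64))                                      # 8
print(struct.unpack("<q", as_int64)[0] == ANNOTATION_ID)  # True
```

The `<` prefix selects standard sizes, so `i` is always 4 bytes and `q` is always 8 bytes regardless of platform.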
