Title: Trying MDETR, a multimodal reasoning model built on DETR + RoBERTa
MDETR, a multimodal reasoning model built on DETR + RoBERTa, has been released publicly, so I tried it out on Google Colab.
However, even when I ran the "Visual Question Answering inference" part of MDETR exactly as instructed, the output data contained many nan values, the inference result itself came out as nan, and inference failed. The trial results are recorded below.
Running MDETR as described in the article below
MDETR's ability to answer arbitrary free-text questions about an image and return the answer (the relevant word) looked interesting, so I run the "Visual Question Answering inference" part of MDETR from the following article.
Introducing "multimodal AI", which uses multiple modalities
Part 1: Introducing MDETR, a multimodal reasoning model using DETR + RoBERTa, and running inference on Colab
https://www.ogis-ri.co.jp/otc/hiroba/technical/multimodal-ai/part1.html
The tutorial notebook the above article is based on:
https://colab.research.google.com/github/ashkamath/mdetr/blob/colab/notebooks/MDETR_demo.ipynb
Preparing Google Colab and switching the runtime to GPU
Open a new Google Colab notebook and change the runtime settings to use a GPU.
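Although the article does not show it, it is worth confirming that the runtime actually exposes a GPU before proceeding; a quick check from a cell (my own addition):

import torch

# Verify the Colab runtime sees a GPU
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" (varies by runtime)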
(As in the article) 4-1. Preparing the required packages and defining pre- and post-processing
I run it as described in the article. (Only code and outputs are excerpted.)
!pip install timm transformers
Requirement already satisfied: timm in /usr/local/lib/python3.11/dist-packages (1.0.15)
Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.50.3)
Requirement already satisfied: torch in /usr/local/lib/python3.11/dist-packages (from timm) (2.6.0+cu124)
Requirement already satisfied: torchvision in /usr/local/lib/python3.11/dist-packages (from timm) (0.21.0+cu124)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/dist-packages (from timm) (6.0.2)
Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.11/dist-packages (from timm) (0.30.1)
Requirement already satisfied: safetensors in /usr/local/lib/python3.11/dist-packages (from timm) (0.5.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from transformers) (3.18.0)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2.0.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from transformers) (24.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2024.11.6)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from transformers) (2.32.3)
Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.21.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.11/dist-packages (from transformers) (4.67.1)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface_hub->timm) (2025.3.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.11/dist-packages (from huggingface_hub->timm) (4.13.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2025.1.31)
Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch->timm) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (3.1.6)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->timm)
Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->timm)
Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->timm)
Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->timm)
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch->timm)
Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch->timm)
Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch->timm)
Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch->timm)
Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch->timm)
Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (0.6.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (12.4.127)
Collecting nvidia-nvjitlink-cu12==12.4.127 (from torch->timm)
Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Requirement already satisfied: triton==3.2.0 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (3.2.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.11/dist-packages (from torch->timm) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from sympy==1.13.1->torch->timm) (1.3.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.11/dist-packages (from torchvision->timm) (11.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->torch->timm) (3.0.2)
Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
   363.4/363.4 MB 3.7 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
   13.8/13.8 MB 34.1 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
   24.6/24.6 MB 26.1 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
   883.7/883.7 kB 39.7 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
   664.8/664.8 MB 2.6 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)
   211.5/211.5 MB 5.3 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)
   56.3/56.3 MB 12.3 MB/s eta 0:00:00
Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)
   127.9/127.9 MB 7.4 MB/s eta 0:00:00
Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)
   207.5/207.5 MB 5.8 MB/s eta 0:00:00
Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
   21.1/21.1 MB 98.2 MB/s eta 0:00:00
Installing collected packages: nvidia-nvjitlink-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12
Attempting uninstall: nvidia-nvjitlink-cu12
Found existing installation: nvidia-nvjitlink-cu12 12.5.82
Uninstalling nvidia-nvjitlink-cu12-12.5.82:
Successfully uninstalled nvidia-nvjitlink-cu12-12.5.82
Attempting uninstall: nvidia-curand-cu12
Found existing installation: nvidia-curand-cu12 10.3.6.82
Uninstalling nvidia-curand-cu12-10.3.6.82:
Successfully uninstalled nvidia-curand-cu12-10.3.6.82
Attempting uninstall: nvidia-cufft-cu12
Found existing installation: nvidia-cufft-cu12 11.2.3.61
Uninstalling nvidia-cufft-cu12-11.2.3.61:
Successfully uninstalled nvidia-cufft-cu12-11.2.3.61
Attempting uninstall: nvidia-cuda-runtime-cu12
Found existing installation: nvidia-cuda-runtime-cu12 12.5.82
Uninstalling nvidia-cuda-runtime-cu12-12.5.82:
Successfully uninstalled nvidia-cuda-runtime-cu12-12.5.82
Attempting uninstall: nvidia-cuda-nvrtc-cu12
Found existing installation: nvidia-cuda-nvrtc-cu12 12.5.82
Uninstalling nvidia-cuda-nvrtc-cu12-12.5.82:
Successfully uninstalled nvidia-cuda-nvrtc-cu12-12.5.82
Attempting uninstall: nvidia-cuda-cupti-cu12
Found existing installation: nvidia-cuda-cupti-cu12 12.5.82
Uninstalling nvidia-cuda-cupti-cu12-12.5.82:
Successfully uninstalled nvidia-cuda-cupti-cu12-12.5.82
Attempting uninstall: nvidia-cublas-cu12
Found existing installation: nvidia-cublas-cu12 12.5.3.2
Uninstalling nvidia-cublas-cu12-12.5.3.2:
Successfully uninstalled nvidia-cublas-cu12-12.5.3.2
Attempting uninstall: nvidia-cusparse-cu12
Found existing installation: nvidia-cusparse-cu12 12.5.1.3
Uninstalling nvidia-cusparse-cu12-12.5.1.3:
Successfully uninstalled nvidia-cusparse-cu12-12.5.1.3
Attempting uninstall: nvidia-cudnn-cu12
Found existing installation: nvidia-cudnn-cu12 9.3.0.75
Uninstalling nvidia-cudnn-cu12-9.3.0.75:
Successfully uninstalled nvidia-cudnn-cu12-9.3.0.75
Attempting uninstall: nvidia-cusolver-cu12
Found existing installation: nvidia-cusolver-cu12 11.6.3.83
Uninstalling nvidia-cusolver-cu12-11.6.3.83:
Successfully uninstalled nvidia-cusolver-cu12-11.6.3.83
Successfully installed nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nvjitlink-cu12-12.4.127
import torch
from PIL import Image
import requests
import json
import torchvision.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import torch.nn.functional as F
from skimage.measure import find_contours
from matplotlib.patches import Polygon
from collections import defaultdict
torch.set_grad_enabled(False);
# Color map
COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
[0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]
# Preprocessing for input images
transform = T.Compose([
T.Resize(800),
T.ToTensor(),
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
def box_cxcywh_to_xyxy(x):
"""
(center_x, center_y, width, height)から(xmin, ymin, xmax, ymax)に座標変換
"""
x_c, y_c, w, h = x.unbind(1)
b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
(x_c + 0.5 * w), (y_c + 0.5 * h)]
return torch.stack(b, dim=1)
def rescale_bboxes(out_bbox, size):
"""
バウンディングボックスのリスケール
"""
img_w, img_h = size
b = box_cxcywh_to_xyxy(out_bbox)
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
return b
def apply_mask(image, mask, color, alpha=0.5):
"""
セグメンテーション用の領域を塗りつぶしたマスクを画像に適用
Parameters
----------
image : numpy.ndarray
適用元の画像
mask : tensor
適用するマスクの領域座標
color : list
カラーマップ
alpha : float
透過値
"""
for c in range(3):
image[:, :, c] = np.where(mask == 1,
image[:, :, c] *
(1 - alpha) + alpha * color[c] * 255,
image[:, :, c])
return image
def plot_results(pil_img, scores, boxes, labels, masks=None):
"""
結果表示
Parameters
----------
pil_img : PIL.Image
画像
scores : list
検出された物体の予測値のリスト
boxes : list
検出された物体のバウンディングボックス座標(center_x, center_y, width, height)のリスト
labels : list
検出された物体のラベルのリスト
masks : list
セグメンテーション用のマスクのリスト
"""
plt.figure(figsize=(16,10))
# Convert PIL.Image to numpy.ndarray
np_image = np.array(pil_img)
ax = plt.gca()
colors = COLORS * 100
if masks is None:
# If no masks are given, fill with a list of None of length len(scores)
masks = [None for _ in range(len(scores))]
# Raise if the list lengths differ
assert len(scores) == len(boxes) == len(labels) == len(masks)
for s, (xmin, ymin, xmax, ymax), l, mask, c in zip(scores, boxes.tolist(), labels, masks, colors):
ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
fill=False, color=c, linewidth=3))
text = f'{l}: {s:0.2f}'
ax.text(xmin, ymin, text, fontsize=15, bbox=dict(facecolor='white', alpha=0.8))
if mask is None:
continue
# Apply the mask
np_image = apply_mask(np_image, mask, c)
# Draw the outline of the mask region. Since the mask data is binary, trace the contour at the intermediate 0.5 level.
padded_mask = np.zeros((mask.shape[0] + 2, mask.shape[1] + 2), dtype=np.uint8)
padded_mask[1:-1, 1:-1] = mask
contours = find_contours(padded_mask, 0.5)
for verts in contours:
# Flip (y, x) to (x, y)
verts = np.fliplr(verts) - 1
# Draw the contour outline (facecolor="none", so no fill)
p = Polygon(verts, facecolor="none", edgecolor=c)
ax.add_patch(p)
plt.imshow(np_image)
plt.axis('off')
plt.show()
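To make the roles of these helpers concrete, here is a minimal sketch (my own addition, not part of the article) that feeds them synthetic data instead of real MDETR outputs; it assumes the cell above has been run:

# Synthetic example: one box in normalized (center_x, center_y, width, height) format
dummy_img = Image.new("RGB", (640, 480), color=(200, 200, 200))
dummy_box = torch.tensor([[0.5, 0.5, 0.4, 0.3]])

# Convert to pixel-space (xmin, ymin, xmax, ymax)
boxes = rescale_bboxes(dummy_box, dummy_img.size)
print(boxes)  # tensor([[192., 168., 448., 312.]])

# Draw the box with a made-up score and label
plot_results(dummy_img, scores=[0.99], boxes=boxes, labels=["example"])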
(As in the article) 4-4. Visual Question Answering inference
I run through it as described in the article. (Only code and outputs are excerpted.)
import json, requests
answer2id_by_type = json.load(requests.get("https://nyu.box.com/shared/static/j4rnpo8ixn6v0iznno2pim6ffj3jyaj8.json", stream=True).raw)
id2answerbytype = {}
for ans_type in answer2id_by_type.keys():
curr_reversed_dict = {v: k for k, v in answer2id_by_type[ans_type].items()}
id2answerbytype[ans_type] = curr_reversed_dict
print(id2answerbytype.keys())
dict_keys(['answer_rel', 'answer_obj', 'answer_global', 'answer_attr', 'answer_cat'])
import pprint
pprint.pprint(id2answerbytype["answer_rel"])
{0: 'no',
1: 'yes',
2: 'towel',
3: 'man',
4: 'backpack',
5: 'cabinet',
6: 'left',
7: 'purse',
8: 'soda',
9: 'right',
10: 'bag',
11: 'bricks',
12: 'street',
13: 'sign',
14: 'train',
15: 'arrow',
16: 'coat',
17: 'table',
18: 'tree',
19: 'airplane',
20: 'bed',
21: 'wetsuit',
22: 'sky',
23: 'lettuce',
24: 'cake',
25: 'dog',
26: 'horse',
27: 'tiles',
28: 'broccoli',
29: 'fence',
30: 'van',
31: 'boats',
32: 'bathroom',
33: 'potato',
34: 'giraffe',
35: 'car',
36: 'end table',
37: 'wine',
38: 'shelf',
39: 'bus',
40: 'tray',
41: 'cabinets',
42: 'truck',
43: 'flowers',
44: 'computer mouse',
45: 'wall',
46: 'radio',
47: 'peppers',
48: 'phone',
49: 'lady',
50: 'elephant',
51: 'floor',
52: 'skier',
53: 'helmet',
54: 'blender',
55: 'cow',
56: 'mountain',
57: 'sofa',
58: 'coffee table',
59: 'grass',
60: 'orange',
61: 'behind',
62: 'surfboard',
63: 'water',
64: 'carrots',
65: 'tie',
66: 'pecan',
67: 'players',
68: 'plate',
69: 'umbrella',
70: 'frisbee',
71: 'soup',
72: 'pizza',
73: 'chicken',
74: 'home plate',
75: 'shirt',
76: 'building',
77: 'ground',
78: 'sandwich',
79: 'people',
80: 'fire hydrant',
81: 'trees',
82: 'bottle',
83: 'woman',
84: 'papers',
85: 'chair',
86: 'cup',
87: 'trailer',
88: 'desk',
89: 'celery',
90: 'pen',
91: 'bull',
92: 'roll',
93: 'sneakers',
94: 'pillow',
95: 'flamingo',
96: 'knee pads',
97: 'field',
98: 'side table',
99: 'beans',
100: 'monitor',
101: 'sidewalk',
102: 'zebra',
103: 'coffee',
104: 'giraffes',
105: 'cat',
106: 'stop sign',
107: 'laptop',
108: 'front',
109: 'onions',
110: 'pan',
111: 'tennis ball',
112: 'hat',
113: 'lid',
114: 'artichokes',
115: 'branch',
116: 'glasses',
117: 'spice',
118: 'printer',
119: 'leaves',
120: 'cap',
121: 'baseball bat',
122: 'batter',
123: 'boat',
124: 'bear',
125: 'rock',
126: 'lake',
127: 'jacket',
128: 'train car',
129: 'bacon',
130: 'rice',
131: 'bread',
132: 'bird',
133: 'television',
134: 'bicycle',
135: 'pants',
136: 'grapes',
137: 'beer',
138: 'spinach',
139: 'boy',
140: 'child',
141: 'tag',
142: 'suitcase',
143: 'cheese',
144: 'soap',
145: 'box',
146: 'magnets',
147: 'seat',
148: 'refrigerator',
149: 'bikini',
150: 'oranges',
151: 'zoo',
152: 'keyboard',
153: 'banana',
154: 'skateboard',
155: 'glass',
156: 'hillside',
157: 'street light',
158: 'book',
159: 'girl',
160: 'bun',
161: 'jeans',
162: 'snow',
163: 'stew',
164: 'fries',
165: 'guy',
166: 'carriage',
167: 'pepper',
168: 'paper',
169: 'carrot',
170: 'post',
171: 'clock',
172: 'boots',
173: 'ice',
174: 'jar',
175: 'couch',
176: 'cars',
177: 'clouds',
178: 'mirror',
179: 'onion',
180: 'cupboard',
181: 'suit',
182: 'wagon',
183: 'remote control',
184: 'donut',
185: 'sand',
186: 'faucet',
187: 'grapefruit',
188: 'bookshelf',
189: 'hot dog',
190: 'drum',
191: 'potatoes',
192: 'bookcase',
193: 'pole',
194: 'meat',
195: 'trunk',
196: 'drawer',
197: 'player',
198: 'cabbage',
199: 'words',
200: 'rocks',
201: 'beach',
202: 'road',
203: 'seagull',
204: 'skirt',
205: 'pastry',
206: 'bowl',
207: 'pasta',
208: 'tomato',
209: 'oven',
210: 'cucumber',
211: 'eggs',
212: 'flag',
213: 'hill',
214: 'speaker',
215: 'spots',
216: 'microwave oven',
217: 'bench',
218: 'luggage',
219: 'controller',
220: 'olive',
221: 'uniform',
222: 'straw',
223: 'container',
224: 'sheep',
225: 'gorilla',
226: 'blanket',
227: 'suv',
228: 'letters',
229: 'horses',
230: 'sausage',
231: 'bush',
232: 'plant',
233: 'shorts',
234: 'goat',
235: 'wii controller',
236: 'microwave',
237: 'bracelet',
238: 'salad',
239: 'doors',
240: 'dish',
241: 'sauce',
242: 'teddy bear',
243: 'ostrich',
244: 'path',
245: 'shrimp',
246: 'radiator',
247: 'tablecloth',
248: 'chairs',
249: 'shoe',
250: 'window',
251: 'baskets',
252: 'cereal',
253: 'computer',
254: 'office',
255: 'cutting board',
256: 'donuts',
257: 'apples',
258: 'pickles',
259: 'plants',
260: 'dress',
261: 'curtains',
262: 'spatula',
263: 'onion rings',
264: 'noodles',
265: 'scarf',
266: 'cows',
267: 'books',
268: 'stove',
269: 'envelope',
270: 'graffiti',
271: 'cell phone',
272: 'school bus',
273: 'wire',
274: 'tablet',
275: 'mother',
276: 'boot',
277: 'racket',
278: 'cds',
279: 'tofu',
280: 'above',
281: 'eye glasses',
282: 'doorway',
283: 'tv stand',
284: 'doll',
285: 'necklace',
286: 'street sign',
287: 'ball',
288: 'parking meter',
289: 'mushrooms',
290: 'donkey',
291: 'hay',
292: 'blueberries',
293: 'kettle',
294: 'screen',
295: 'bananas',
296: 'dvd player',
297: 'computer monitor',
298: 'toaster',
299: 'spoon',
300: 'ducks',
301: 'turtle',
302: 'nightstand',
303: 'kite',
304: 'headphones',
305: 'mickey mouse',
306: 'ski',
307: 'elephants',
308: 'strawberries',
309: 'bathtub',
310: 'bike',
311: 'number',
312: 'camera',
313: 'egg',
314: 'picture',
315: 'door',
316: 'painting',
317: 'barrier',
318: 'jeep',
319: 'dispenser',
320: 'platform',
321: 'powdered sugar',
322: 'letter',
323: 'cart',
324: 'headband',
325: 'canoe',
326: 'gentleman',
327: 'taxi',
328: 't-shirt',
329: 'jet',
330: 'avocado',
331: 'station',
332: 'tongs',
333: 'snack',
334: 'bookshelves',
335: 'roof',
336: 'magazine',
337: 'calculator',
338: 'planter',
339: 'pepperoni',
340: 'motorcycle',
341: 'action figure',
342: 'binder',
343: 'outlet',
344: 'lamp',
345: 'cage',
346: 'collar',
347: 'benches',
348: 'garment',
349: 'cookie',
350: 'baby',
351: 'office chair',
352: 'flower',
353: 'light switch',
354: 'toilet paper',
355: 'curtain',
356: 'dishwasher',
357: 'corn',
358: 'seed',
359: 'park',
360: 'nuts',
361: 'ocean',
362: 'mountains',
363: 'at the camera',
364: 'pavement',
365: 'ice cream',
366: 'coleslaw',
367: 'salad dressing',
368: 'vase',
369: 'dessert',
370: 'men',
371: 'canister',
372: 'baseball players',
373: 'stuffed animal',
374: 'basket',
375: 'lime',
376: 'goggles',
377: 'beverage',
378: 'calf',
379: 'shelves',
380: 'napkin',
381: 'placemat',
382: 'skis',
383: 'pastries',
384: 'lamb',
385: 'snail',
386: 'cauliflower',
387: 'cords',
388: 'ice cube',
389: 'toothbrush',
390: 'customers',
391: 'logs',
392: 'cross',
393: 'bleachers',
394: 'raspberries',
395: 'game controller',
396: 'living room',
397: 'pizza pan',
398: 'below',
399: 'catcher',
400: 'toilet',
401: 'frog',
402: 'cherries',
403: 'chain',
404: 'apple',
405: 'log',
406: 'pilot',
407: 'umbrellas',
408: 'peas',
409: 'stuffed bear',
410: 'burner',
411: 'chocolate',
412: 'surfer',
413: 'light bulbs',
414: 'bedroom',
415: 'tank top',
416: 'traffic light',
417: 'tennis',
418: 'spectator',
419: 'cupcakes',
420: 'airplanes',
421: 'sailboats',
422: 'guitar',
423: 'counter',
424: 'snowboard',
425: 'vest',
426: 'cooler',
427: 'bar stool',
428: 'palm tree',
429: 'hats',
430: 'closet',
431: 'speakers',
432: 'ceiling',
433: 'pipe',
434: 'pineapple',
435: 'dryer',
436: 'sun',
437: 'waffles',
438: 'shore',
439: 'desktop computer',
440: 'cats',
441: 'blouse',
442: 'butterfly',
443: 'costume',
444: 'coffee cup',
445: 'pot',
446: 'coffee maker',
447: 'towels',
448: 'children',
449: 'under',
450: 'air',
451: 'wii game',
452: 'garland',
453: 'dirt',
454: 'ketchup',
455: 'cookies',
456: 'hard drive',
457: 'bandana',
458: 'runway',
459: 'console',
460: 'mango',
461: 'piano',
462: 'wires',
463: 'tables',
464: 'rope',
465: 'sword',
466: 'grape',
467: 'bat',
468: 'knife',
469: 'air conditioner',
470: 'squash',
471: 'baseball',
472: 'cat food',
473: 'sticker',
474: 'fork',
475: 'buildings',
476: 'locomotive',
477: 'snacks',
478: 'armchair',
479: 'hallway',
480: 'chopsticks',
481: 'tomatoes',
482: 'dresser',
483: 'tower',
484: 'trash can',
485: 'ottoman',
486: 'berry',
487: 'trash bag',
488: 'umpire',
489: 'cupcake',
490: 'women',
491: 'sweet potato',
492: 'crown',
493: 'hot dogs',
494: 'ice maker',
495: 'moss',
496: 'officer',
497: 'ham',
498: 'milk',
499: 'beads',
500: 'cupboards',
501: 'pea',
502: 'kitchen',
503: 'word',
504: 'wii',
505: 'drawers',
506: 'strawberry',
507: 'goats',
508: 'coffee pot',
509: 'sink',
510: 'cyclist',
511: 'controllers',
512: 'popcorn',
513: 'frosting',
514: 'duck',
515: 'apron',
516: 'handbag',
517: 'countertop',
518: 'dip',
519: 'restroom',
520: 'cucumbers',
521: 'pipes',
522: 'parrot',
523: 'shoes',
524: 'paddle',
525: 'desk lamp',
526: 'poster',
527: 'juice',
528: 'champagne',
529: 'olives',
530: 'fish',
531: 'angry bird',
532: 'belt',
533: 'parking lot',
534: 'bagels',
535: 'bushes',
536: 'skateboarder',
537: 'mat',
538: 'cowboy',
539: 'bicycles',
540: 'sweater',
541: 'pineapples',
542: 'stroller',
543: 'dining table',
544: 'earphones',
545: 'rug',
546: 'tractor',
547: 'cherry',
548: 'wool',
549: 'skater',
550: 'chickpeas',
551: 'utensil',
552: 'swimsuit',
553: 'branches',
554: 'muffin',
555: 'beets',
556: 'steps',
557: 'glove',
558: 'sushi',
559: 'drink',
560: 'watch',
561: 'remote controls',
562: 'toast',
563: 'toppings',
564: 'logo',
565: 'salmon',
566: 'socks',
567: 'bridge',
568: 'antenna',
569: 'puppy',
570: 'down',
571: 'parsley',
572: 'ipod',
573: 'candies',
574: 'island',
575: 'birds',
576: 'watermelon',
577: 'xbox controller',
578: 'lamps',
579: 'wave',
580: 'policeman',
581: 'pear',
582: 'jersey',
583: 'water bottle',
584: 'bags',
585: 'blinds',
586: 'pears',
587: 'crowd',
588: 'match',
589: 'stadium',
590: 'baking sheet',
591: 'mozzarella',
592: 'sheet',
593: 'microphone',
594: 'polar bear',
595: 'croissant',
596: 'minivan',
597: 'bears',
598: 'mushroom',
599: 'computer desk',
600: 'chandelier',
601: 'packet',
602: 'game',
603: 'stick',
604: 'spices',
605: 'cable',
606: 'bus stop',
607: 'tissues',
608: 'tea kettle',
609: 'forest',
610: 'candle holder',
611: 'mixing bowl',
612: 'forks',
613: 'dragon',
614: 'wine glass',
615: 'fire truck',
616: 'panda',
617: 'wine bottle',
618: 'toy',
619: 'urinal',
620: 'gas stove',
621: 'magnet',
622: 'porch',
623: 'soldier',
624: 'pans',
625: 'outfit',
626: 'shower',
627: 'bucket',
628: 'buses',
629: 'mask',
630: 'lemons',
631: 'bunny',
632: 'beds',
633: 'artichoke',
634: 'spear',
635: 'tea',
636: 'robe',
637: 'laptops',
638: 'gate',
639: 'almonds',
640: 'dishes',
641: 'rose',
642: 'dry-erase board',
643: 'hotdog bun',
644: 'owl',
645: 'liquor',
646: 'girls',
647: 'heart',
648: 'houses',
649: 'gloves',
650: 'pocket',
651: 'yard',
652: 'drinks',
653: 'bagel',
654: 'pajamas',
655: 'crosswalk',
656: 'airport',
657: 'palm trees',
658: 'video camera',
659: 'statue',
660: 'bottles',
661: 'pedestrian',
662: 'wines',
663: 'luggage cart',
664: 'pens',
665: 'peach',
666: 'ski lift',
667: 'staircase',
668: 'comforter',
669: 'hair dryer',
670: 'tennis balls',
671: 'hamburger',
672: 'medicine cabinet',
673: 'mustard',
674: 'aquarium',
675: 'herbs',
676: 'cakes',
677: 'waffle',
678: 'beer bottle',
679: 'lion',
680: 'avocados',
681: 'foil',
682: 'fudge',
683: 'dumpster',
684: 'stuffed dog',
685: 'patio',
686: 'heater',
687: 'daughter',
688: 'passenger',
689: 'utensils',
690: 'bubbles',
691: 'safety jacket',
692: 'driver',
693: 'pigeon',
694: 'hangar',
695: 'garage',
696: 'lemon',
697: 'oil',
698: 'boys',
699: 'sandal',
700: 'leaf',
701: 'cards',
702: 'shop',
703: 'sandwiches',
704: 'whisk',
705: 'mixer',
706: 'waste basket',
707: 'skate park',
708: 'ladder',
709: 'coffee mug',
710: 'dining room',
711: 'ring',
712: 'kitten',
713: 'kites',
714: 'silverware',
715: 'squirrel',
716: 'flip flop',
717: 'power line',
718: 'stuffed animals',
719: 'asparagus',
720: 'cooker',
721: 'carpet',
722: 'scissors',
723: 'macaroni',
724: 'cracker',
725: 'stairs',
726: 'sack',
727: 'buns',
728: 'brush',
729: 'ship',
730: 'mashed potatoes',
731: 'cord',
732: 'pump',
733: 'frame',
734: 'toiletries',
735: 'music',
736: 'snowboarder',
737: 'liquid',
738: 'candy',
739: 'briefcase',
740: 'artwork',
741: 'backyard',
742: 'soccer player',
743: 'menu',
744: 'food truck',
745: 'sandals',
746: 'upward',
747: 'chimney',
748: 'fans',
749: 'deck',
750: 'icing',
751: 'steam',
752: 'dough',
753: 'spectators',
754: 'light fixture',
755: 'trucks',
756: 'peaches',
757: 'newspaper',
758: 'crate',
759: 'stickers',
760: 'mouse pad',
761: 'canopy',
762: 'pie',
763: 'dog food',
764: 'tomato sauce',
765: 'sculpture',
766: 'chili',
767: 'american flag',
768: 'zebras',
769: 'platter',
770: 'coke',
771: 'train station',
772: 'roadside',
773: 'earring',
774: 'river',
775: 'pitcher',
776: 'saucer',
777: 'soap dish',
778: 'beet',
779: 'cones',
780: 'wii controllers',
781: 'sailboat',
782: 'bowls',
783: 'fan',
784: 'dressing',
785: 'train tracks',
786: 'tissue box',
787: 'cables',
788: 'gadget',
789: 'house',
790: 'hedge',
791: 'pine trees',
792: 'swimming pool',
793: 'magazines',
794: 'trumpet',
795: 'turkey',
796: 'mud',
797: 'mugs',
798: 'farm',
799: 'jockey',
800: 'dock',
801: 'can',
802: 'crab',
803: 'athlete',
804: 'marker',
805: 'smoke',
806: 'power lines',
807: 'mattress',
808: 'watermelons',
809: 'couple',
810: 'windows',
811: 'soccer ball',
812: 'desert',
813: 'soda can',
814: 'trains',
815: 'net',
816: 'bird cage',
817: 'floor lamp',
818: 'omelette',
819: 'church',
820: 'washing machine',
821: 'granola',
822: 'shirts',
823: 'chains',
824: 'ornament',
825: 'parent',
826: 'walkway',
827: 'sausages',
828: 'mayonnaise',
829: 'seal',
830: 'roast beef',
831: 'water bottles',
832: 'tools',
833: 'puddle',
834: 'biscuit',
835: 'couches',
836: 'suitcases',
837: 'soft drink',
838: 'berries',
839: 'star',
840: 'symbol',
841: 'figure',
842: 'chicken breast',
843: 'hotel',
844: 'router',
845: 'gun',
846: 'bee',
847: 'rubber duck',
848: 'containers',
849: 'appetizers',
850: 'raisin',
851: 'eagle',
852: 'topping',
853: 'biker',
854: 'bomb',
855: 'decoration',
856: 'burrito',
857: 'pig',
858: 'nut',
859: 'steak',
860: 'hammer',
861: 'paint',
862: 'cream cheese',
863: 'tea pot',
864: 'name tag',
865: 'dugout',
866: 'mug',
867: 'pond',
868: 'computers',
869: 'ambulance',
870: 'railroad',
871: 'card',
872: 'aircraft',
873: 'pizza slice',
874: 'seagulls',
875: 'store',
876: 'tent',
877: 'sugar',
878: 'blackberries',
879: 'candle',
880: 'numbers',
881: 'sconce',
882: 'antelope',
883: 'bone',
884: 'coats',
885: 'cake stand',
886: 'pizzas',
887: 'bakery',
888: 'pine tree',
889: 'meal',
890: 'cd',
891: 'snowsuit',
892: 'wolf',
893: 'ceiling light',
894: 'skateboards',
895: 'gifts',
896: 'crane',
897: 'audience',
898: 'balcony',
899: 'fog',
900: 'gravy',
901: 'life jacket',
902: 'heels',
903: 'sweatshirt',
904: 'entertainment center',
905: 'scooter',
906: 'pillows',
907: 'walnuts',
908: 'candles',
909: 'tool',
910: 'machine',
911: 'eggplant',
912: 'breakfast',
913: 'basil',
914: 'vines',
915: 'passengers',
916: 'boxes',
917: 'muffins',
918: 'blueberry',
919: 'wolves',
920: 'paper container',
921: 'cockpit',
922: 'skillet',
923: 'alcohol',
924: 'label',
925: 'oatmeal',
926: 'melon',
927: 'bus driver',
928: 'guys',
929: 'helicopter',
930: 'projector',
931: 'chef',
932: 'cloths',
933: 'dogs',
934: 'mound',
935: 'jumpsuit',
936: 'monkey',
937: 'receipt',
938: 'swan',
939: 'tape',
940: 'twigs',
941: 'pancake',
942: 'wallet',
943: 'characters',
944: 'walls',
945: 'dvds',
946: 'lambs',
947: 'shopping bag',
948: 'fur',
949: 'soap dispenser',
950: 'sea',
951: 'burger',
952: 'hair',
953: 'flip flops',
954: 'beer can',
955: 'pouch',
956: 'trash',
957: 'package',
958: 'flatbread',
959: 'hurdle',
960: 'terminal',
961: 'hook',
962: 'grater',
963: 'classroom',
964: 'rooftop',
965: 'tunnel',
966: 'dome',
967: 'food',
968: 'ladles',
969: 'control panel',
970: 'football',
971: 'balls',
972: 'melons',
973: 'clock tower',
974: 'stone',
975: 'grill',
976: 'family',
977: 'picnic table',
978: 'toddler',
979: 'geese',
980: 'beef',
981: 'pancakes',
982: 'cafe',
983: 'marshmallow',
984: 'bouquet',
985: 'stir fry',
986: 'pizza pie',
987: 'up',
988: 'bracelets',
989: 'yogurt',
990: 'coconut',
991: 'charger',
992: 'wristband',
993: 'fireplace',
994: 'pigeons',
995: 'twig',
996: 'cowboy hat',
997: 'flags',
998: 'gas station',
999: 'loaf',
1000: 'visitor',
1001: 'drape',
1002: 'lions',
1003: 'comb',
1004: 'stump',
1005: 'tortilla',
1006: 'bedspread',
1007: 'games',
1008: 'heel',
1009: 'feeder',
1010: 'drain',
1011: 'bath towel',
1012: 'video games',
1013: 'plates',
1014: 'produce',
1015: 'display',
1016: 'highway',
1017: 'broom',
1018: 'map',
1019: 'salt shaker',
1020: 'parking sign',
1021: 'sprinkles',
1022: 'light bulb',
1023: 'broth',
1024: 'whipped cream',
1025: 'goose',
1026: 'pizza crust',
1027: 'photographer',
1028: 'garden',
1029: 'trunks',
1030: 'lighthouse',
1031: 'earrings',
1032: 'rain',
1033: 'lock',
1034: 'biscuits',
1035: 'electric toothbrush',
1036: 'traffic sign',
1037: 'tissue',
1038: 'ropes',
1039: 'flower pot',
1040: 'napkins',
1041: 'napkin dispenser',
1042: 'raspberry',
1043: 'drapes',
1044: 'cameras',
1045: 'banana peel',
1046: 'wheelchair',
1047: 'cups',
1048: 'cone',
1049: 'market',
1050: 'figurine',
1051: 'wildflowers',
1052: 'blind',
1053: 'hills',
1054: 'dish drainer',
1055: 'bread loaf',
1056: 'museum',
1057: 'picture frame',
1058: 'notebook',
1059: 'mall',
1060: 'castle',
1061: 'alligator',
1062: 'almond',
1063: 'face mask',
1064: 'dress shirt',
1065: 'meadow',
1066: 'tangerine',
1067: 'vacuum',
1068: 'gym',
1069: 'father',
1070: 'fountain',
1071: 'paper towel',
1072: 'bread box',
1073: 'soap bottle',
1074: 'crust',
1075: 'sponge',
1076: 'pillowcase',
1077: 'worker',
1078: 'clocks',
1079: 'pudding',
1080: 'balloons',
1081: 'courtyard',
1082: 'bats',
1083: 'crumbs',
1084: 'rhino',
1085: 'zucchini',
1086: 'helmets',
1087: 'butter',
1088: 'battery',
1089: 'keyboards',
1090: 'wardrobe',
1091: 'soccer',
1092: 'garlic',
1093: 'gravel',
1094: 'pumpkin',
1095: 'plantains',
1096: 'back',
1097: 'knee pad',
1098: 'deer',
1099: 'video game',
1100: 'decorations',
1101: 'cream',
1102: 'pikachu',
1103: 'uniforms',
1104: 'cans',
1105: 'cooking pot',
1106: 'weeds',
1107: 'money',
1108: 'syrup',
1109: 'bell',
1110: 'roadway',
1111: 'pedestrians',
1112: 'sweet potatoes',
1113: 'cell phones',
1114: 'beach chair',
1115: 'restaurant',
1116: 'waiter',
1117: 'trays',
1118: 'gas pump',
1119: 'motorcycles',
1120: 'telephone pole',
1121: 'intersection',
1122: 'engineer',
1123: 'spider',
1124: 'workers',
1125: 'onion ring',
1126: 'fire',
1127: 'mannequins',
1128: 'toothpaste',
1129: 'tents',
1130: 'herd',
1131: 'stage',
1132: 'drawing',
1133: 'traffic lights',
1134: 'pasture',
1135: 'daisy',
1136: 'vine',
1137: 'outlets',
1138: 'christmas light',
1139: 'employee',
1140: 'monitors',
1141: 'attic',
1142: 'kiosk',
1143: 'fire extinguisher',
1144: 'satellite dish',
1145: 'barn',
1146: 'potato salad',
1147: 'mailbox',
1148: 'sea foam',
1149: 'town',
1150: 'elevator',
1151: 'teddy bears',
1152: 'tiger',
1153: 'lego',
1154: 'smoothie',
1155: 'peanut butter',
1156: 'serving dish',
1157: 'pizza oven',
1158: 'dish soap',
1159: 'egg yolk',
1160: 'paintings',
1161: 'wine glasses',
1162: 'leggings',
1163: 'croissants',
1164: 'coach',
1165: 'table lamp',
1166: 'nest',
1167: 'snow pants',
1168: 'balloon',
1169: 'ribs',
1170: 'pizza box',
1171: 'shower curtain',
1172: 'hand soap',
1173: 'shampoo',
1174: 'toy car',
1175: 'buoy',
1176: 'smoke stack',
1177: 'hilltop',
1178: 'orchid',
1179: 'harbor',
1180: 'toaster oven',
1181: 'beneath',
1182: 'thermometer',
1183: 'cash register',
1184: 'plain',
1185: 'pots',
1186: 'ladle',
1187: 'camel',
1188: 'blossoms',
1189: 'life preserver',
1190: 'vending machine',
1191: 'officers',
1192: 'hotel room',
1193: 'step',
1194: 'ice cubes',
1195: 'shelter',
1196: 'raisins',
1197: 'fruit stand',
1198: 'away',
1199: 'scaffolding',
1200: 'character',
1201: 'mannequin',
1202: 'notepad',
1203: 'horse hoof',
1204: 'lunch',
1205: 'crackers',
1206: 'raincoat',
1207: 'hose',
1208: 'pork',
1209: 'alarm clock',
1210: 'knives',
1211: 'cabin',
1212: 'shoe lace',
1213: 'jars',
1214: 'mangoes',
1215: 'seafood',
1216: 'price tag',
1217: 'dolls',
1218: 'polo shirt',
1219: 'vitamins',
1220: 'cricket',
1221: 'sunglasses',
1222: 'taco',
1223: 'bug',
1224: 'suits',
1225: 'phones',
1226: 'tourists',
1227: 'forward',
1228: 'orchids',
1229: 'folding chair',
1230: 'cigarette',
1231: 'underneath',
1232: 'shark',
1233: 'butter knife',
1234: 'sour cream',
1235: 'shampoo bottle',
1236: 'kiwi',
1237: 'sheets',
1238: 'garnish',
1239: 'sticky notes',
1240: 'cactus',
1241: 'cereal box',
1242: 'donkeys',
1243: 'bikes',
1244: 'desserts',
1245: 'penguin',
1246: 'tree branch',
1247: 'pretzel',
1248: 'casserole',
1249: 'vests',
1250: 'rolling pin',
1251: 'chickens',
1252: 'snow flakes',
1253: 'brownie',
1254: 'dinosaur',
1255: 'roses',
1256: 'blazer',
1257: 'lawn',
1258: 'shopping cart',
1259: 'guacamole',
1260: 'lemonade',
1261: 'tuna',
1262: 'wok',
1263: 'potato chips',
1264: 'performer',
1265: 'pomegranate',
1266: 'printers',
1267: 'dumplings',
1268: 'skyscraper',
1269: 'gourd',
1270: 'pencil',
1271: 'water glass',
1272: 'milkshake',
1273: 'mountain side',
1274: 'cheetah',
1275: 'seeds',
1276: 'cliff',
1277: 'chocolate chips',
1278: 'spoons',
1279: 'cane',
1280: 'keypad',
1281: 'team',
1282: 'meats',
1283: 'library',
1284: 'merchandise',
1285: 'rabbit',
1286: 'gown',
1287: 'vases',
1288: 'rice cooker',
1289: 'baking pan',
1290: 'blankets',
1291: 'city',
1292: 'swans',
1293: 'hand dryer',
1294: 'glaze',
1295: 'canisters',
1296: 'eiffel tower',
1297: 'pizza cutter',
1298: 'cotton dessert',
1299: 'mexican food',
1300: 'kittens',
1301: 'feta cheese',
1302: 'anchovies',
1303: 'papaya',
1304: 'lobster',
1305: 'hummus',
1306: 'sunflower',
1307: 'bandage',
1308: 'beverages',
1309: 'meatballs',
1310: 'seaweed',
1311: 'elmo',
1312: 'peacock',
1313: 'antelopes',
1314: 'boulders',
1315: 'surfboards',
1316: 'wallpaper',
1317: 'cappuccino',
1318: 'mirrors',
1319: 'pillars',
1320: 'food processor',
1321: 'herb',
1322: 'ostriches',
1323: 'fishing pole',
1324: 'octopus',
1325: 'rackets',
1326: 'ovens',
1327: 'shield',
1328: 'toilet brush',
1329: 'temple',
1330: 'kiwis',
1331: 'cheesecake',
1332: 'sign post',
1333: 'parrots',
1334: 'student',
1335: 'french toast',
1336: 'paper towels',
1337: 'hamburgers',
1338: 'carts',
1339: 'stones',
1340: 'serving tray',
1341: 'moose',
1342: 'shoe laces',
1343: 'honey',
1344: 'vendor',
1345: 'stapler',
1346: 'food container',
1347: 'beach umbrella',
1348: 'chinese food',
1349: 'pita',
1350: 'peanuts',
1351: 'wristwatch',
1352: 'peanut',
1353: 'straight',
1354: 'son',
1355: 'knife block',
1356: 'soda bottle',
1357: 'stuffed bears',
1358: 'door frame',
1359: 'moon',
1360: 'alien',
1361: 'cake pan',
1362: 'hospital',
1363: 'panda bear',
1364: 'toys',
1365: 'jackets',
1366: 'lunch box',
1367: 'orchard',
1368: 'cheeseburger',
1369: 'oreo',
1370: 'pizza boxes',
1371: 'pencils',
1372: 'butterflies',
1373: 'rifle',
1374: 'gummy bear',
1375: 'dolphin',
1376: 'walnut',
1377: 'out',
1378: 'crates',
1379: 'olive oil',
1380: 'shops',
1381: 'hedges',
1382: 'vegetable',
1383: 'bubble',
1384: 'bison',
1385: 'window frame',
1386: 'hippo',
1387: 'sugar packets',
1388: 'pepper shaker',
1389: 'shuttle',
1390: 'drawings',
1391: 'tourist',
1392: 'cinnamon roll',
1393: 'christmas lights',
1394: 'parmesan cheese',
1395: 'brownies',
1396: 'ravioli',
1397: 'figurines',
1398: 'baseball mitt',
1399: 'bedding',
1400: 'pretzels',
1401: 'oak tree',
1402: 'chopstick',
1403: 'poodle',
1404: 'farmer',
1405: 'soldiers',
1406: 'pasta salad',
1407: 'spray bottle',
1408: 'penguins',
1409: 'pumpkins',
1410: 'lounge',
1411: 'fruit',
1412: 'stars',
1413: 'flames',
1414: 'bar stools',
1415: 'banana bunches',
1416: 'apple logo',
1417: 'fence post',
1418: 'cheese cube',
1419: 'caramel',
1420: 'skin',
1421: 'school',
1422: 'rhinos',
1423: 'sharks',
1424: 'life jackets',
1425: 'feathers',
1426: 'whale',
1427: 'sticks',
1428: 'milk carton',
1429: 'diaper',
1430: 'masks',
1431: 'entrance',
1432: 'wig',
1433: 'towel dispenser',
1434: 'powder',
1435: 'dinosaurs',
1436: 'tags',
1437: 'chalkboard',
1438: 'nutella',
1439: 'fisherman',
1440: 'grinder',
1441: 'tree branches',
1442: 'flour',
1443: 'snake',
1444: 'banana bunch',
1445: 'kimono',
1446: 'mint',
1447: 'toothbrushes',
1448: 'customer',
1449: 'backpacks',
1450: 'cathedral',
1451: 'sunflowers',
1452: 'auditorium',
1453: 'salt',
1454: 'cookbook',
1455: 'vinegar',
1456: 'mustard bottle',
1457: 'snakes',
1458: 'wine bottles',
1459: 'robot',
1460: 'toothpicks',
1461: 'lobby',
1462: 'dinner',
1463: 'hippos',
1464: 'televisions',
1465: 'students',
1466: 'marina',
1467: 'ahead',
1468: 'garage door',
1469: 'theater',
1470: 'baker',
1471: 'manhole cover',
1472: 'gift',
1473: 'parachute',
1474: 'ear buds',
1475: 'cotton candy',
1476: 'kitchen towel',
1477: 'swamp',
1478: 'coin',
1479: 'apartment building',
1480: 'picnic tables',
1481: 'cookie dough',
1482: 'groceries',
1483: 'pigs',
1484: 'necklaces',
1485: 'lizard',
1486: 'obstacle',
1487: 'cigar',
1488: 'pandas',
1489: 'leopard',
1490: 'bartender',
1491: 'village',
1492: 'dream catcher',
1493: 'pistachio',
1494: 'sideways',
1495: 'shopping center',
1496: 'ketchup bottle',
1497: 'baseball bats',
1498: 'lily',
1499: 'street lights',
1500: 'egg carton',
1501: 'buoys',
1502: 'outside',
1503: 'coconuts',
1504: 'ornaments',
1505: 'cages',
1506: 'undershirt',
1507: 'mountain peak',
1508: 'pesto',
1509: 'goal',
1510: 'cranberry',
1511: 'hearts',
1512: 'bell tower',
1513: 'owls',
1514: 'salon',
1515: 'paint brush',
1516: 'dresses',
1517: 'blossom',
1518: 'beer mug',
1519: 'cranberries',
1520: 'athletic shoe',
1521: 'kangaroo',
1522: 'lipstick',
1523: 'panda bears',
1524: 'shaving cream',
1525: 'cemetery',
1526: 'shopper',
1527: 'egg roll',
1528: 'juice box',
1529: 'coffee shop',
1530: 'coffee beans',
1531: 'ramekin',
1532: 'blood',
1533: 'storage box',
1534: 'underwear',
1535: 'beer cans',
1536: 'taxis',
1537: 'packages',
1538: 'scooters',
1539: 'waitress',
1540: 'upwards',
1541: 'waterfall',
1542: 'coffee cups',
1543: 'hair clip',
1544: 'policemen',
1545: 'batteries',
1546: 'homes',
1547: 'antennas',
1548: 'riding boots',
1549: 'in the mirror',
1550: 'snow boots',
1551: 'jewelry',
1552: 'pizza tray',
1553: 'sugar packet',
1554: 'downward',
1555: 'apartment',
1556: 'bird house',
1557: 'utensil holder',
1558: 'dragons',
1559: 'snoopy',
1560: 'hairbrush',
1561: 'cinnamon',
1562: 'amusement park',
1563: 'parachutes',
1564: 'boar',
1565: 'wedding',
1566: 'supermarket',
1567: 'tractors',
1568: 'cookie jar',
1569: 'blenders',
1570: 'paper dispenser',
1571: 'vegetables',
1572: 'cafeteria',
1573: 'armor',
1574: 'towers',
1575: 'waves',
1576: 'pocket watch',
1577: 'dolphins',
1578: 'spray can',
1579: 'nightstands',
1580: 'elbow pad',
1581: 'seat belt',
1582: 'scrub brush',
1583: 'tree leaves',
1584: 'dragonfly',
1585: 'mattresses',
1586: 'appetizer',
1587: 'dvd players',
1588: 'swimmer',
1589: 'stores',
1590: 'toolbox',
1591: 'lilies',
1592: 'lab coat',
1593: 'unknown'}
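Each answer type thus has its own id-to-word table, and a predicted answer id is mapped back to text with a plain dictionary lookup, e.g. (values taken from the output above):

print(id2answerbytype["answer_rel"][1])     # 'yes'
print(id2answerbytype["answer_rel"][1593])  # 'unknown'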
# Load the model from Torch Hub
model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa', pretrained=True, return_postprocessor=False)
model_qa = model_qa.cuda()
model_qa.eval();
/usr/local/lib/python3.11/dist-packages/torch/hub.py:330: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
warnings.warn(
Downloading: "https://github.com/ashkamath/mdetr/zipball/main" to /root/.cache/torch/hub/main.zip
/usr/local/lib/python3.11/dist-packages/timm/models/_factory.py:126: UserWarning: Mapping deprecated model name tf_efficientnet_b5_ns to current tf_efficientnet_b5.ns_jft_in1k.
model = create_fn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
model.safetensors: 0%| | 0.00/122M [00:00<?, ?B/s]
WARNING:timm.models._builder:Unexpected keys (bn2.bias, bn2.num_batches_tracked, bn2.running_mean, bn2.running_var, bn2.weight, classifier.bias, classifier.weight, conv_head.weight) found while loading pretrained weights. This may be expected if model is being adapted.
tokenizer_config.json: 0%| | 0.00/25.0 [00:00<?, ?B/s]
vocab.json: 0%| | 0.00/899k [00:00<?, ?B/s]
merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]
config.json: 0%| | 0.00/481 [00:00<?, ?B/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
WARNING:huggingface_hub.file_download:Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
model.safetensors: 0%| | 0.00/499M [00:00<?, ?B/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading: "https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth" to /root/.cache/torch/hub/checkpoints/gqa_EB5_checkpoint.pth
100%|██████████| 2.54G/2.54G [03:43<00:00, 12.2MB/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-9-6faeb8f28838> in <cell line: 0>()
1 # Load the model from Torch Hub
----> 2 model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa', pretrained=True, return_postprocessor=False)
3 model_qa = model_qa.cuda()
4 model_qa.eval();
/usr/local/lib/python3.11/dist-packages/torch/hub.py in load(repo_or_dir, model, source, trust_repo, force_reload, verbose, skip_validation, *args, **kwargs)
645 )
646
--> 647 model = _load_local(repo_or_dir, model, *args, **kwargs)
648 return model
649
/usr/local/lib/python3.11/dist-packages/torch/hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
674
675 entry = _load_entry_from_hubconf(hub_module, model)
--> 676 model = entry(*args, **kwargs)
677
678 return model
~/.cache/torch/hub/ashkamath_mdetr_main/hubconf.py in mdetr_efficientnetB5_gqa(pretrained, return_postprocessor)
180 url="https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth", map_location="cpu", check_hash=True
181 )
--> 182 model.load_state_dict(checkpoint["model"])
183 if return_postprocessor:
184 return model, PostProcess()
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict, assign)
2579
2580 if len(error_msgs) > 0:
-> 2581 raise RuntimeError(
2582 "Error(s) in loading state_dict for {}:\n\t{}".format(
2583 self.__class__.__name__, "\n\t".join(error_msgs)
RuntimeError: Error(s) in loading state_dict for MDETR:
Unexpected key(s) in state_dict: "transformer.text_encoder.embeddings.position_ids".
An error occurred at this "Load the model from Torch Hub" step.
Debugging the error
The message "Unexpected key(s) in state_dict: "transformer.text_encoder.embeddings.position_ids"." indicates a mismatch between the model definition and the pretrained checkpoint: the checkpoint contains a key (position_ids) that the model does not expect.
I investigate this in various ways below.
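For reference, a common workaround for this class of error is to drop the offending key from the checkpoint before loading it. A minimal sketch (not verified here; whether it also avoids the nan outputs mentioned at the beginning is a separate question):

# Sketch of a possible workaround: load the model definition without
# weights, then load the checkpoint with the unexpected key removed.
model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa',
                          pretrained=False, return_postprocessor=False)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth",
    map_location="cpu")
state_dict = checkpoint["model"]
# Drop the key the current model definition no longer expects
state_dict.pop("transformer.text_encoder.embeddings.position_ids", None)
model_qa.load_state_dict(state_dict)

First, though, let's see where the mismatch comes from.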
Debug: inspecting the variable on the model side
Loading only the model definition with pretrained=False (i.e., without the pretrained weights), I inspect the value of the corresponding variable on the model.
model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa', pretrained=False, return_postprocessor=False)
Using cache found in /root/.cache/torch/hub/ashkamath_mdetr_main
WARNING:timm.models._builder:Unexpected keys (bn2.bias, bn2.num_batches_tracked, bn2.running_mean, bn2.running_var, bn2.weight, classifier.bias, classifier.weight, conv_head.weight) found while loading pretrained weights. This may be expected if model is being adapted.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model_qa.transformer.text_encoder.embeddings.position_ids
tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167,
168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251,
252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265,
266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293,
294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307,
308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321,
322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335,
336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349,
350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377,
378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391,
392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405,
406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433,
434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447,
448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461,
462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475,
476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489,
490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503,
504, 505, 506, 507, 508, 509, 510, 511, 512, 513]])
It is a plain tensor (not a learnable parameter) and already holds values.
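This is consistent with position_ids being a registered buffer rather than a learnable parameter. In recent transformers versions the buffer is registered with persistent=False, so it exists on the module but is excluded from the module's own state_dict, which would explain why the checkpoint's copy is reported as "unexpected". A quick check (a sketch; the expected outputs assume such a transformers version):

emb = model_qa.transformer.text_encoder.embeddings

# position_ids is not a learnable parameter...
print(any("position_ids" in n for n, _ in emb.named_parameters()))  # expected: False
# ...but it is a registered buffer...
print(any("position_ids" in n for n, _ in emb.named_buffers()))     # expected: True
# ...and, being non-persistent, it is absent from the module's state_dict
print("position_ids" in emb.state_dict())                           # expected: False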
Loading the pretrained checkpoint separately and looking at the corresponding entry:
state_dict = torch.hub.load_state_dict_from_url("https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth", map_location='cpu')
state_dict.keys()
dict_keys(['model', 'model_ema', 'optimizer', 'epoch', 'args'])
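The pretrained weights live under the 'model' key (alongside EMA weights, optimizer state, epoch, and args). The offending key can be located directly (a sketch):

print([k for k in state_dict["model"].keys() if "position_ids" in k])
# expected: ['transformer.text_encoder.embeddings.position_ids']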
state_dict["model"].keys()
odict_keys(['transformer.encoder.layers.0.self_attn.in_proj_weight', 'transformer.encoder.layers.0.self_attn.in_proj_bias', 'transformer.encoder.layers.0.self_attn.out_proj.weight', 'transformer.encoder.layers.0.self_attn.out_proj.bias', 'transformer.encoder.layers.0.linear1.weight', 'transformer.encoder.layers.0.linear1.bias', 'transformer.encoder.layers.0.linear2.weight', 'transformer.encoder.layers.0.linear2.bias', 'transformer.encoder.layers.0.norm1.weight', 'transformer.encoder.layers.0.norm1.bias', 'transformer.encoder.layers.0.norm2.weight', 'transformer.encoder.layers.0.norm2.bias', 'transformer.encoder.layers.1.self_attn.in_proj_weight', 'transformer.encoder.layers.1.self_attn.in_proj_bias', 'transformer.encoder.layers.1.self_attn.out_proj.weight', 'transformer.encoder.layers.1.self_attn.out_proj.bias', 'transformer.encoder.layers.1.linear1.weight', 'transformer.encoder.layers.1.linear1.bias', 'transformer.encoder.layers.1.linear2.weight', 'transformer.encoder.layers.1.linear2.bias', 'transformer.encoder.layers.1.norm1.weight', 'transformer.encoder.layers.1.norm1.bias', 'transformer.encoder.layers.1.norm2.weight', 'transformer.encoder.layers.1.norm2.bias', 'transformer.encoder.layers.2.self_attn.in_proj_weight', 'transformer.encoder.layers.2.self_attn.in_proj_bias', 'transformer.encoder.layers.2.self_attn.out_proj.weight', 'transformer.encoder.layers.2.self_attn.out_proj.bias', 'transformer.encoder.layers.2.linear1.weight', 'transformer.encoder.layers.2.linear1.bias', 'transformer.encoder.layers.2.linear2.weight', 'transformer.encoder.layers.2.linear2.bias', 'transformer.encoder.layers.2.norm1.weight', 'transformer.encoder.layers.2.norm1.bias', 'transformer.encoder.layers.2.norm2.weight', 'transformer.encoder.layers.2.norm2.bias', 'transformer.encoder.layers.3.self_attn.in_proj_weight', 'transformer.encoder.layers.3.self_attn.in_proj_bias', 'transformer.encoder.layers.3.self_attn.out_proj.weight', 'transformer.encoder.layers.3.self_attn.out_proj.bias', 'transformer.encoder.layers.3.linear1.weight', 'transformer.encoder.layers.3.linear1.bias', 'transformer.encoder.layers.3.linear2.weight', 'transformer.encoder.layers.3.linear2.bias', 'transformer.encoder.layers.3.norm1.weight', 'transformer.encoder.layers.3.norm1.bias', 'transformer.encoder.layers.3.norm2.weight', 'transformer.encoder.layers.3.norm2.bias', 'transformer.encoder.layers.4.self_attn.in_proj_weight', 'transformer.encoder.layers.4.self_attn.in_proj_bias', 'transformer.encoder.layers.4.self_attn.out_proj.weight', 'transformer.encoder.layers.4.self_attn.out_proj.bias', 'transformer.encoder.layers.4.linear1.weight', 'transformer.encoder.layers.4.linear1.bias', 'transformer.encoder.layers.4.linear2.weight', 'transformer.encoder.layers.4.linear2.bias', 'transformer.encoder.layers.4.norm1.weight', 'transformer.encoder.layers.4.norm1.bias', 'transformer.encoder.layers.4.norm2.weight', 'transformer.encoder.layers.4.norm2.bias', 'transformer.encoder.layers.5.self_attn.in_proj_weight', 'transformer.encoder.layers.5.self_attn.in_proj_bias', 'transformer.encoder.layers.5.self_attn.out_proj.weight', 'transformer.encoder.layers.5.self_attn.out_proj.bias', 'transformer.encoder.layers.5.linear1.weight', 'transformer.encoder.layers.5.linear1.bias', 'transformer.encoder.layers.5.linear2.weight', 'transformer.encoder.layers.5.linear2.bias', 'transformer.encoder.layers.5.norm1.weight', 'transformer.encoder.layers.5.norm1.bias', 'transformer.encoder.layers.5.norm2.weight', 'transformer.encoder.layers.5.norm2.bias', 
'transformer.decoder.layers.0.self_attn.in_proj_weight', 'transformer.decoder.layers.0.self_attn.in_proj_bias', 'transformer.decoder.layers.0.self_attn.out_proj.weight', 'transformer.decoder.layers.0.self_attn.out_proj.bias', 'transformer.decoder.layers.0.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.0.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.0.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.0.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.0.linear1.weight', 'transformer.decoder.layers.0.linear1.bias', 'transformer.decoder.layers.0.linear2.weight', 'transformer.decoder.layers.0.linear2.bias', 'transformer.decoder.layers.0.norm1.weight', 'transformer.decoder.layers.0.norm1.bias', 'transformer.decoder.layers.0.norm3.weight', 'transformer.decoder.layers.0.norm3.bias', 'transformer.decoder.layers.0.norm4.weight', 'transformer.decoder.layers.0.norm4.bias', 'transformer.decoder.layers.1.self_attn.in_proj_weight', 'transformer.decoder.layers.1.self_attn.in_proj_bias', 'transformer.decoder.layers.1.self_attn.out_proj.weight', 'transformer.decoder.layers.1.self_attn.out_proj.bias', 'transformer.decoder.layers.1.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.1.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.1.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.1.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.1.linear1.weight', 'transformer.decoder.layers.1.linear1.bias', 'transformer.decoder.layers.1.linear2.weight', 'transformer.decoder.layers.1.linear2.bias', 'transformer.decoder.layers.1.norm1.weight', 'transformer.decoder.layers.1.norm1.bias', 'transformer.decoder.layers.1.norm3.weight', 'transformer.decoder.layers.1.norm3.bias', 'transformer.decoder.layers.1.norm4.weight', 'transformer.decoder.layers.1.norm4.bias', 'transformer.decoder.layers.2.self_attn.in_proj_weight', 'transformer.decoder.layers.2.self_attn.in_proj_bias', 'transformer.decoder.layers.2.self_attn.out_proj.weight', 'transformer.decoder.layers.2.self_attn.out_proj.bias', 'transformer.decoder.layers.2.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.2.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.2.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.2.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.2.linear1.weight', 'transformer.decoder.layers.2.linear1.bias', 'transformer.decoder.layers.2.linear2.weight', 'transformer.decoder.layers.2.linear2.bias', 'transformer.decoder.layers.2.norm1.weight', 'transformer.decoder.layers.2.norm1.bias', 'transformer.decoder.layers.2.norm3.weight', 'transformer.decoder.layers.2.norm3.bias', 'transformer.decoder.layers.2.norm4.weight', 'transformer.decoder.layers.2.norm4.bias', 'transformer.decoder.layers.3.self_attn.in_proj_weight', 'transformer.decoder.layers.3.self_attn.in_proj_bias', 'transformer.decoder.layers.3.self_attn.out_proj.weight', 'transformer.decoder.layers.3.self_attn.out_proj.bias', 'transformer.decoder.layers.3.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.3.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.3.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.3.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.3.linear1.weight', 'transformer.decoder.layers.3.linear1.bias', 'transformer.decoder.layers.3.linear2.weight', 'transformer.decoder.layers.3.linear2.bias', 'transformer.decoder.layers.3.norm1.weight', 'transformer.decoder.layers.3.norm1.bias', 
'transformer.decoder.layers.3.norm3.weight', 'transformer.decoder.layers.3.norm3.bias', 'transformer.decoder.layers.3.norm4.weight', 'transformer.decoder.layers.3.norm4.bias', 'transformer.decoder.layers.4.self_attn.in_proj_weight', 'transformer.decoder.layers.4.self_attn.in_proj_bias', 'transformer.decoder.layers.4.self_attn.out_proj.weight', 'transformer.decoder.layers.4.self_attn.out_proj.bias', 'transformer.decoder.layers.4.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.4.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.4.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.4.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.4.linear1.weight', 'transformer.decoder.layers.4.linear1.bias', 'transformer.decoder.layers.4.linear2.weight', 'transformer.decoder.layers.4.linear2.bias', 'transformer.decoder.layers.4.norm1.weight', 'transformer.decoder.layers.4.norm1.bias', 'transformer.decoder.layers.4.norm3.weight', 'transformer.decoder.layers.4.norm3.bias', 'transformer.decoder.layers.4.norm4.weight', 'transformer.decoder.layers.4.norm4.bias', 'transformer.decoder.layers.5.self_attn.in_proj_weight', 'transformer.decoder.layers.5.self_attn.in_proj_bias', 'transformer.decoder.layers.5.self_attn.out_proj.weight', 'transformer.decoder.layers.5.self_attn.out_proj.bias', 'transformer.decoder.layers.5.cross_attn_image.in_proj_weight', 'transformer.decoder.layers.5.cross_attn_image.in_proj_bias', 'transformer.decoder.layers.5.cross_attn_image.out_proj.weight', 'transformer.decoder.layers.5.cross_attn_image.out_proj.bias', 'transformer.decoder.layers.5.linear1.weight', 'transformer.decoder.layers.5.linear1.bias', 'transformer.decoder.layers.5.linear2.weight', 'transformer.decoder.layers.5.linear2.bias', 'transformer.decoder.layers.5.norm1.weight', 'transformer.decoder.layers.5.norm1.bias', 'transformer.decoder.layers.5.norm3.weight', 'transformer.decoder.layers.5.norm3.bias', 'transformer.decoder.layers.5.norm4.weight', 'transformer.decoder.layers.5.norm4.bias', 'transformer.decoder.norm.weight', 'transformer.decoder.norm.bias', 'transformer.text_encoder.embeddings.position_ids', 'transformer.text_encoder.embeddings.word_embeddings.weight', 'transformer.text_encoder.embeddings.position_embeddings.weight', 'transformer.text_encoder.embeddings.token_type_embeddings.weight', 'transformer.text_encoder.embeddings.LayerNorm.weight', 'transformer.text_encoder.embeddings.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.0.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.0.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.0.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.0.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.0.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.0.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.0.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.0.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.0.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.0.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.0.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.0.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.0.output.dense.weight', 'transformer.text_encoder.encoder.layer.0.output.dense.bias', 'transformer.text_encoder.encoder.layer.0.output.LayerNorm.weight', 
'transformer.text_encoder.encoder.layer.0.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.1.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.1.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.1.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.1.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.1.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.1.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.1.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.1.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.1.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.1.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.1.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.1.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.1.output.dense.weight', 'transformer.text_encoder.encoder.layer.1.output.dense.bias', 'transformer.text_encoder.encoder.layer.1.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.1.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.2.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.2.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.2.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.2.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.2.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.2.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.2.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.2.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.2.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.2.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.2.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.2.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.2.output.dense.weight', 'transformer.text_encoder.encoder.layer.2.output.dense.bias', 'transformer.text_encoder.encoder.layer.2.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.2.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.3.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.3.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.3.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.3.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.3.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.3.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.3.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.3.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.3.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.3.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.3.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.3.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.3.output.dense.weight', 'transformer.text_encoder.encoder.layer.3.output.dense.bias', 'transformer.text_encoder.encoder.layer.3.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.3.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.4.attention.self.query.weight', 
'transformer.text_encoder.encoder.layer.4.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.4.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.4.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.4.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.4.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.4.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.4.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.4.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.4.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.4.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.4.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.4.output.dense.weight', 'transformer.text_encoder.encoder.layer.4.output.dense.bias', 'transformer.text_encoder.encoder.layer.4.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.4.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.5.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.5.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.5.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.5.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.5.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.5.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.5.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.5.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.5.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.5.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.5.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.5.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.5.output.dense.weight', 'transformer.text_encoder.encoder.layer.5.output.dense.bias', 'transformer.text_encoder.encoder.layer.5.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.5.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.6.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.6.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.6.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.6.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.6.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.6.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.6.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.6.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.6.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.6.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.6.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.6.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.6.output.dense.weight', 'transformer.text_encoder.encoder.layer.6.output.dense.bias', 'transformer.text_encoder.encoder.layer.6.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.6.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.7.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.7.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.7.attention.self.key.weight', 
'transformer.text_encoder.encoder.layer.7.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.7.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.7.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.7.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.7.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.7.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.7.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.7.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.7.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.7.output.dense.weight', 'transformer.text_encoder.encoder.layer.7.output.dense.bias', 'transformer.text_encoder.encoder.layer.7.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.7.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.8.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.8.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.8.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.8.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.8.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.8.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.8.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.8.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.8.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.8.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.8.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.8.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.8.output.dense.weight', 'transformer.text_encoder.encoder.layer.8.output.dense.bias', 'transformer.text_encoder.encoder.layer.8.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.8.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.9.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.9.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.9.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.9.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.9.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.9.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.9.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.9.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.9.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.9.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.9.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.9.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.9.output.dense.weight', 'transformer.text_encoder.encoder.layer.9.output.dense.bias', 'transformer.text_encoder.encoder.layer.9.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.9.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.10.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.10.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.10.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.10.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.10.attention.self.value.weight', 
'transformer.text_encoder.encoder.layer.10.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.10.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.10.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.10.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.10.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.10.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.10.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.10.output.dense.weight', 'transformer.text_encoder.encoder.layer.10.output.dense.bias', 'transformer.text_encoder.encoder.layer.10.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.10.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.11.attention.self.query.weight', 'transformer.text_encoder.encoder.layer.11.attention.self.query.bias', 'transformer.text_encoder.encoder.layer.11.attention.self.key.weight', 'transformer.text_encoder.encoder.layer.11.attention.self.key.bias', 'transformer.text_encoder.encoder.layer.11.attention.self.value.weight', 'transformer.text_encoder.encoder.layer.11.attention.self.value.bias', 'transformer.text_encoder.encoder.layer.11.attention.output.dense.weight', 'transformer.text_encoder.encoder.layer.11.attention.output.dense.bias', 'transformer.text_encoder.encoder.layer.11.attention.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.11.attention.output.LayerNorm.bias', 'transformer.text_encoder.encoder.layer.11.intermediate.dense.weight', 'transformer.text_encoder.encoder.layer.11.intermediate.dense.bias', 'transformer.text_encoder.encoder.layer.11.output.dense.weight', 'transformer.text_encoder.encoder.layer.11.output.dense.bias', 'transformer.text_encoder.encoder.layer.11.output.LayerNorm.weight', 'transformer.text_encoder.encoder.layer.11.output.LayerNorm.bias', 'transformer.text_encoder.pooler.dense.weight', 'transformer.text_encoder.pooler.dense.bias', 'transformer.resizer.fc.weight', 'transformer.resizer.fc.bias', 'transformer.resizer.layer_norm.weight', 'transformer.resizer.layer_norm.bias', 'class_embed.weight', 'class_embed.bias', 'bbox_embed.layers.0.weight', 'bbox_embed.layers.0.bias', 'bbox_embed.layers.1.weight', 'bbox_embed.layers.1.bias', 'bbox_embed.layers.2.weight', 'bbox_embed.layers.2.bias', 'query_embed.weight', 'qa_embed.weight', 'input_proj.weight', 'input_proj.bias', 'backbone.0.body.conv_stem.weight', 'backbone.0.body.bn1.weight', 'backbone.0.body.bn1.bias', 'backbone.0.body.bn1.running_mean', 'backbone.0.body.bn1.running_var', 'backbone.0.body.blocks.0.0.conv_dw.weight', 'backbone.0.body.blocks.0.0.bn1.weight', 'backbone.0.body.blocks.0.0.bn1.bias', 'backbone.0.body.blocks.0.0.bn1.running_mean', 'backbone.0.body.blocks.0.0.bn1.running_var', 'backbone.0.body.blocks.0.0.se.conv_reduce.weight', 'backbone.0.body.blocks.0.0.se.conv_reduce.bias', 'backbone.0.body.blocks.0.0.se.conv_expand.weight', 'backbone.0.body.blocks.0.0.se.conv_expand.bias', 'backbone.0.body.blocks.0.0.conv_pw.weight', 'backbone.0.body.blocks.0.0.bn2.weight', 'backbone.0.body.blocks.0.0.bn2.bias', 'backbone.0.body.blocks.0.0.bn2.running_mean', 'backbone.0.body.blocks.0.0.bn2.running_var', 'backbone.0.body.blocks.0.1.conv_dw.weight', 'backbone.0.body.blocks.0.1.bn1.weight', 'backbone.0.body.blocks.0.1.bn1.bias', 'backbone.0.body.blocks.0.1.bn1.running_mean', 'backbone.0.body.blocks.0.1.bn1.running_var', 'backbone.0.body.blocks.0.1.se.conv_reduce.weight', 
'backbone.0.body.blocks.0.1.se.conv_reduce.bias', 'backbone.0.body.blocks.0.1.se.conv_expand.weight', 'backbone.0.body.blocks.0.1.se.conv_expand.bias', 'backbone.0.body.blocks.0.1.conv_pw.weight', 'backbone.0.body.blocks.0.1.bn2.weight', 'backbone.0.body.blocks.0.1.bn2.bias', 'backbone.0.body.blocks.0.1.bn2.running_mean', 'backbone.0.body.blocks.0.1.bn2.running_var', 'backbone.0.body.blocks.0.2.conv_dw.weight', 'backbone.0.body.blocks.0.2.bn1.weight', 'backbone.0.body.blocks.0.2.bn1.bias', 'backbone.0.body.blocks.0.2.bn1.running_mean', 'backbone.0.body.blocks.0.2.bn1.running_var', 'backbone.0.body.blocks.0.2.se.conv_reduce.weight', 'backbone.0.body.blocks.0.2.se.conv_reduce.bias', 'backbone.0.body.blocks.0.2.se.conv_expand.weight', 'backbone.0.body.blocks.0.2.se.conv_expand.bias', 'backbone.0.body.blocks.0.2.conv_pw.weight', 'backbone.0.body.blocks.0.2.bn2.weight', 'backbone.0.body.blocks.0.2.bn2.bias', 'backbone.0.body.blocks.0.2.bn2.running_mean', 'backbone.0.body.blocks.0.2.bn2.running_var', 'backbone.0.body.blocks.1.0.conv_pw.weight', 'backbone.0.body.blocks.1.0.bn1.weight', 'backbone.0.body.blocks.1.0.bn1.bias', 'backbone.0.body.blocks.1.0.bn1.running_mean', 'backbone.0.body.blocks.1.0.bn1.running_var', 'backbone.0.body.blocks.1.0.conv_dw.weight', 'backbone.0.body.blocks.1.0.bn2.weight', 'backbone.0.body.blocks.1.0.bn2.bias', 'backbone.0.body.blocks.1.0.bn2.running_mean', 'backbone.0.body.blocks.1.0.bn2.running_var', 'backbone.0.body.blocks.1.0.se.conv_reduce.weight', 'backbone.0.body.blocks.1.0.se.conv_reduce.bias', 'backbone.0.body.blocks.1.0.se.conv_expand.weight', 'backbone.0.body.blocks.1.0.se.conv_expand.bias', 'backbone.0.body.blocks.1.0.conv_pwl.weight', 'backbone.0.body.blocks.1.0.bn3.weight', 'backbone.0.body.blocks.1.0.bn3.bias', 'backbone.0.body.blocks.1.0.bn3.running_mean', 'backbone.0.body.blocks.1.0.bn3.running_var', 'backbone.0.body.blocks.1.1.conv_pw.weight', 'backbone.0.body.blocks.1.1.bn1.weight', 'backbone.0.body.blocks.1.1.bn1.bias', 'backbone.0.body.blocks.1.1.bn1.running_mean', 'backbone.0.body.blocks.1.1.bn1.running_var', 'backbone.0.body.blocks.1.1.conv_dw.weight', 'backbone.0.body.blocks.1.1.bn2.weight', 'backbone.0.body.blocks.1.1.bn2.bias', 'backbone.0.body.blocks.1.1.bn2.running_mean', 'backbone.0.body.blocks.1.1.bn2.running_var', 'backbone.0.body.blocks.1.1.se.conv_reduce.weight', 'backbone.0.body.blocks.1.1.se.conv_reduce.bias', 'backbone.0.body.blocks.1.1.se.conv_expand.weight', 'backbone.0.body.blocks.1.1.se.conv_expand.bias', 'backbone.0.body.blocks.1.1.conv_pwl.weight', 'backbone.0.body.blocks.1.1.bn3.weight', 'backbone.0.body.blocks.1.1.bn3.bias', 'backbone.0.body.blocks.1.1.bn3.running_mean', 'backbone.0.body.blocks.1.1.bn3.running_var', 'backbone.0.body.blocks.1.2.conv_pw.weight', 'backbone.0.body.blocks.1.2.bn1.weight', 'backbone.0.body.blocks.1.2.bn1.bias', 'backbone.0.body.blocks.1.2.bn1.running_mean', 'backbone.0.body.blocks.1.2.bn1.running_var', 'backbone.0.body.blocks.1.2.conv_dw.weight', 'backbone.0.body.blocks.1.2.bn2.weight', 'backbone.0.body.blocks.1.2.bn2.bias', 'backbone.0.body.blocks.1.2.bn2.running_mean', 'backbone.0.body.blocks.1.2.bn2.running_var', 'backbone.0.body.blocks.1.2.se.conv_reduce.weight', 'backbone.0.body.blocks.1.2.se.conv_reduce.bias', 'backbone.0.body.blocks.1.2.se.conv_expand.weight', 'backbone.0.body.blocks.1.2.se.conv_expand.bias', 'backbone.0.body.blocks.1.2.conv_pwl.weight', 'backbone.0.body.blocks.1.2.bn3.weight', 'backbone.0.body.blocks.1.2.bn3.bias', 'backbone.0.body.blocks.1.2.bn3.running_mean', 
'backbone.0.body.blocks.1.2.bn3.running_var', 'backbone.0.body.blocks.1.3.conv_pw.weight', 'backbone.0.body.blocks.1.3.bn1.weight', 'backbone.0.body.blocks.1.3.bn1.bias', 'backbone.0.body.blocks.1.3.bn1.running_mean', 'backbone.0.body.blocks.1.3.bn1.running_var', 'backbone.0.body.blocks.1.3.conv_dw.weight', 'backbone.0.body.blocks.1.3.bn2.weight', 'backbone.0.body.blocks.1.3.bn2.bias', 'backbone.0.body.blocks.1.3.bn2.running_mean', 'backbone.0.body.blocks.1.3.bn2.running_var', 'backbone.0.body.blocks.1.3.se.conv_reduce.weight', 'backbone.0.body.blocks.1.3.se.conv_reduce.bias', 'backbone.0.body.blocks.1.3.se.conv_expand.weight', 'backbone.0.body.blocks.1.3.se.conv_expand.bias', 'backbone.0.body.blocks.1.3.conv_pwl.weight', 'backbone.0.body.blocks.1.3.bn3.weight', 'backbone.0.body.blocks.1.3.bn3.bias', 'backbone.0.body.blocks.1.3.bn3.running_mean', 'backbone.0.body.blocks.1.3.bn3.running_var', 'backbone.0.body.blocks.1.4.conv_pw.weight', 'backbone.0.body.blocks.1.4.bn1.weight', 'backbone.0.body.blocks.1.4.bn1.bias', 'backbone.0.body.blocks.1.4.bn1.running_mean', 'backbone.0.body.blocks.1.4.bn1.running_var', 'backbone.0.body.blocks.1.4.conv_dw.weight', 'backbone.0.body.blocks.1.4.bn2.weight', 'backbone.0.body.blocks.1.4.bn2.bias', 'backbone.0.body.blocks.1.4.bn2.running_mean', 'backbone.0.body.blocks.1.4.bn2.running_var', 'backbone.0.body.blocks.1.4.se.conv_reduce.weight', 'backbone.0.body.blocks.1.4.se.conv_reduce.bias', 'backbone.0.body.blocks.1.4.se.conv_expand.weight', 'backbone.0.body.blocks.1.4.se.conv_expand.bias', 'backbone.0.body.blocks.1.4.conv_pwl.weight', 'backbone.0.body.blocks.1.4.bn3.weight', 'backbone.0.body.blocks.1.4.bn3.bias', 'backbone.0.body.blocks.1.4.bn3.running_mean', 'backbone.0.body.blocks.1.4.bn3.running_var', 'backbone.0.body.blocks.2.0.conv_pw.weight', 'backbone.0.body.blocks.2.0.bn1.weight', 'backbone.0.body.blocks.2.0.bn1.bias', 'backbone.0.body.blocks.2.0.bn1.running_mean', 'backbone.0.body.blocks.2.0.bn1.running_var', 'backbone.0.body.blocks.2.0.conv_dw.weight', 'backbone.0.body.blocks.2.0.bn2.weight', 'backbone.0.body.blocks.2.0.bn2.bias', 'backbone.0.body.blocks.2.0.bn2.running_mean', 'backbone.0.body.blocks.2.0.bn2.running_var', 'backbone.0.body.blocks.2.0.se.conv_reduce.weight', 'backbone.0.body.blocks.2.0.se.conv_reduce.bias', 'backbone.0.body.blocks.2.0.se.conv_expand.weight', 'backbone.0.body.blocks.2.0.se.conv_expand.bias', 'backbone.0.body.blocks.2.0.conv_pwl.weight', 'backbone.0.body.blocks.2.0.bn3.weight', 'backbone.0.body.blocks.2.0.bn3.bias', 'backbone.0.body.blocks.2.0.bn3.running_mean', 'backbone.0.body.blocks.2.0.bn3.running_var', 'backbone.0.body.blocks.2.1.conv_pw.weight', 'backbone.0.body.blocks.2.1.bn1.weight', 'backbone.0.body.blocks.2.1.bn1.bias', 'backbone.0.body.blocks.2.1.bn1.running_mean', 'backbone.0.body.blocks.2.1.bn1.running_var', 'backbone.0.body.blocks.2.1.conv_dw.weight', 'backbone.0.body.blocks.2.1.bn2.weight', 'backbone.0.body.blocks.2.1.bn2.bias', 'backbone.0.body.blocks.2.1.bn2.running_mean', 'backbone.0.body.blocks.2.1.bn2.running_var', 'backbone.0.body.blocks.2.1.se.conv_reduce.weight', 'backbone.0.body.blocks.2.1.se.conv_reduce.bias', 'backbone.0.body.blocks.2.1.se.conv_expand.weight', 'backbone.0.body.blocks.2.1.se.conv_expand.bias', 'backbone.0.body.blocks.2.1.conv_pwl.weight', 'backbone.0.body.blocks.2.1.bn3.weight', 'backbone.0.body.blocks.2.1.bn3.bias', 'backbone.0.body.blocks.2.1.bn3.running_mean', 'backbone.0.body.blocks.2.1.bn3.running_var', 'backbone.0.body.blocks.2.2.conv_pw.weight', 
'backbone.0.body.blocks.2.2.bn1.weight', 'backbone.0.body.blocks.2.2.bn1.bias', 'backbone.0.body.blocks.2.2.bn1.running_mean', 'backbone.0.body.blocks.2.2.bn1.running_var', 'backbone.0.body.blocks.2.2.conv_dw.weight', 'backbone.0.body.blocks.2.2.bn2.weight', 'backbone.0.body.blocks.2.2.bn2.bias', 'backbone.0.body.blocks.2.2.bn2.running_mean', 'backbone.0.body.blocks.2.2.bn2.running_var', 'backbone.0.body.blocks.2.2.se.conv_reduce.weight', 'backbone.0.body.blocks.2.2.se.conv_reduce.bias', 'backbone.0.body.blocks.2.2.se.conv_expand.weight', 'backbone.0.body.blocks.2.2.se.conv_expand.bias', 'backbone.0.body.blocks.2.2.conv_pwl.weight', 'backbone.0.body.blocks.2.2.bn3.weight', 'backbone.0.body.blocks.2.2.bn3.bias', 'backbone.0.body.blocks.2.2.bn3.running_mean', 'backbone.0.body.blocks.2.2.bn3.running_var', 'backbone.0.body.blocks.2.3.conv_pw.weight', 'backbone.0.body.blocks.2.3.bn1.weight', 'backbone.0.body.blocks.2.3.bn1.bias', 'backbone.0.body.blocks.2.3.bn1.running_mean', 'backbone.0.body.blocks.2.3.bn1.running_var', 'backbone.0.body.blocks.2.3.conv_dw.weight', 'backbone.0.body.blocks.2.3.bn2.weight', 'backbone.0.body.blocks.2.3.bn2.bias', 'backbone.0.body.blocks.2.3.bn2.running_mean', 'backbone.0.body.blocks.2.3.bn2.running_var', 'backbone.0.body.blocks.2.3.se.conv_reduce.weight', 'backbone.0.body.blocks.2.3.se.conv_reduce.bias', 'backbone.0.body.blocks.2.3.se.conv_expand.weight', 'backbone.0.body.blocks.2.3.se.conv_expand.bias', 'backbone.0.body.blocks.2.3.conv_pwl.weight', 'backbone.0.body.blocks.2.3.bn3.weight', 'backbone.0.body.blocks.2.3.bn3.bias', 'backbone.0.body.blocks.2.3.bn3.running_mean', 'backbone.0.body.blocks.2.3.bn3.running_var', 'backbone.0.body.blocks.2.4.conv_pw.weight', 'backbone.0.body.blocks.2.4.bn1.weight', 'backbone.0.body.blocks.2.4.bn1.bias', 'backbone.0.body.blocks.2.4.bn1.running_mean', 'backbone.0.body.blocks.2.4.bn1.running_var', 'backbone.0.body.blocks.2.4.conv_dw.weight', 'backbone.0.body.blocks.2.4.bn2.weight', 'backbone.0.body.blocks.2.4.bn2.bias', 'backbone.0.body.blocks.2.4.bn2.running_mean', 'backbone.0.body.blocks.2.4.bn2.running_var', 'backbone.0.body.blocks.2.4.se.conv_reduce.weight', 'backbone.0.body.blocks.2.4.se.conv_reduce.bias', 'backbone.0.body.blocks.2.4.se.conv_expand.weight', 'backbone.0.body.blocks.2.4.se.conv_expand.bias', 'backbone.0.body.blocks.2.4.conv_pwl.weight', 'backbone.0.body.blocks.2.4.bn3.weight', 'backbone.0.body.blocks.2.4.bn3.bias', 'backbone.0.body.blocks.2.4.bn3.running_mean', 'backbone.0.body.blocks.2.4.bn3.running_var', 'backbone.0.body.blocks.3.0.conv_pw.weight', 'backbone.0.body.blocks.3.0.bn1.weight', 'backbone.0.body.blocks.3.0.bn1.bias', 'backbone.0.body.blocks.3.0.bn1.running_mean', 'backbone.0.body.blocks.3.0.bn1.running_var', 'backbone.0.body.blocks.3.0.conv_dw.weight', 'backbone.0.body.blocks.3.0.bn2.weight', 'backbone.0.body.blocks.3.0.bn2.bias', 'backbone.0.body.blocks.3.0.bn2.running_mean', 'backbone.0.body.blocks.3.0.bn2.running_var', 'backbone.0.body.blocks.3.0.se.conv_reduce.weight', 'backbone.0.body.blocks.3.0.se.conv_reduce.bias', 'backbone.0.body.blocks.3.0.se.conv_expand.weight', 'backbone.0.body.blocks.3.0.se.conv_expand.bias', 'backbone.0.body.blocks.3.0.conv_pwl.weight', 'backbone.0.body.blocks.3.0.bn3.weight', 'backbone.0.body.blocks.3.0.bn3.bias', 'backbone.0.body.blocks.3.0.bn3.running_mean', 'backbone.0.body.blocks.3.0.bn3.running_var', 'backbone.0.body.blocks.3.1.conv_pw.weight', 'backbone.0.body.blocks.3.1.bn1.weight', 'backbone.0.body.blocks.3.1.bn1.bias', 
'backbone.0.body.blocks.3.1.bn1.running_mean', 'backbone.0.body.blocks.3.1.bn1.running_var', 'backbone.0.body.blocks.3.1.conv_dw.weight', 'backbone.0.body.blocks.3.1.bn2.weight', 'backbone.0.body.blocks.3.1.bn2.bias', 'backbone.0.body.blocks.3.1.bn2.running_mean', 'backbone.0.body.blocks.3.1.bn2.running_var', 'backbone.0.body.blocks.3.1.se.conv_reduce.weight', 'backbone.0.body.blocks.3.1.se.conv_reduce.bias', 'backbone.0.body.blocks.3.1.se.conv_expand.weight', 'backbone.0.body.blocks.3.1.se.conv_expand.bias', 'backbone.0.body.blocks.3.1.conv_pwl.weight', 'backbone.0.body.blocks.3.1.bn3.weight', 'backbone.0.body.blocks.3.1.bn3.bias', 'backbone.0.body.blocks.3.1.bn3.running_mean', 'backbone.0.body.blocks.3.1.bn3.running_var', 'backbone.0.body.blocks.3.2.conv_pw.weight', 'backbone.0.body.blocks.3.2.bn1.weight', 'backbone.0.body.blocks.3.2.bn1.bias', 'backbone.0.body.blocks.3.2.bn1.running_mean', 'backbone.0.body.blocks.3.2.bn1.running_var', 'backbone.0.body.blocks.3.2.conv_dw.weight', 'backbone.0.body.blocks.3.2.bn2.weight', 'backbone.0.body.blocks.3.2.bn2.bias', 'backbone.0.body.blocks.3.2.bn2.running_mean', 'backbone.0.body.blocks.3.2.bn2.running_var', 'backbone.0.body.blocks.3.2.se.conv_reduce.weight', 'backbone.0.body.blocks.3.2.se.conv_reduce.bias', 'backbone.0.body.blocks.3.2.se.conv_expand.weight', 'backbone.0.body.blocks.3.2.se.conv_expand.bias', 'backbone.0.body.blocks.3.2.conv_pwl.weight', 'backbone.0.body.blocks.3.2.bn3.weight', 'backbone.0.body.blocks.3.2.bn3.bias', 'backbone.0.body.blocks.3.2.bn3.running_mean', 'backbone.0.body.blocks.3.2.bn3.running_var', 'backbone.0.body.blocks.3.3.conv_pw.weight', 'backbone.0.body.blocks.3.3.bn1.weight', 'backbone.0.body.blocks.3.3.bn1.bias', 'backbone.0.body.blocks.3.3.bn1.running_mean', 'backbone.0.body.blocks.3.3.bn1.running_var', 'backbone.0.body.blocks.3.3.conv_dw.weight', 'backbone.0.body.blocks.3.3.bn2.weight', 'backbone.0.body.blocks.3.3.bn2.bias', 'backbone.0.body.blocks.3.3.bn2.running_mean', 'backbone.0.body.blocks.3.3.bn2.running_var', 'backbone.0.body.blocks.3.3.se.conv_reduce.weight', 'backbone.0.body.blocks.3.3.se.conv_reduce.bias', 'backbone.0.body.blocks.3.3.se.conv_expand.weight', 'backbone.0.body.blocks.3.3.se.conv_expand.bias', 'backbone.0.body.blocks.3.3.conv_pwl.weight', 'backbone.0.body.blocks.3.3.bn3.weight', 'backbone.0.body.blocks.3.3.bn3.bias', 'backbone.0.body.blocks.3.3.bn3.running_mean', 'backbone.0.body.blocks.3.3.bn3.running_var', 'backbone.0.body.blocks.3.4.conv_pw.weight', 'backbone.0.body.blocks.3.4.bn1.weight', 'backbone.0.body.blocks.3.4.bn1.bias', 'backbone.0.body.blocks.3.4.bn1.running_mean', 'backbone.0.body.blocks.3.4.bn1.running_var', 'backbone.0.body.blocks.3.4.conv_dw.weight', 'backbone.0.body.blocks.3.4.bn2.weight', 'backbone.0.body.blocks.3.4.bn2.bias', 'backbone.0.body.blocks.3.4.bn2.running_mean', 'backbone.0.body.blocks.3.4.bn2.running_var', 'backbone.0.body.blocks.3.4.se.conv_reduce.weight', 'backbone.0.body.blocks.3.4.se.conv_reduce.bias', 'backbone.0.body.blocks.3.4.se.conv_expand.weight', 'backbone.0.body.blocks.3.4.se.conv_expand.bias', 'backbone.0.body.blocks.3.4.conv_pwl.weight', 'backbone.0.body.blocks.3.4.bn3.weight', 'backbone.0.body.blocks.3.4.bn3.bias', 'backbone.0.body.blocks.3.4.bn3.running_mean', 'backbone.0.body.blocks.3.4.bn3.running_var', 'backbone.0.body.blocks.3.5.conv_pw.weight', 'backbone.0.body.blocks.3.5.bn1.weight', 'backbone.0.body.blocks.3.5.bn1.bias', 'backbone.0.body.blocks.3.5.bn1.running_mean', 'backbone.0.body.blocks.3.5.bn1.running_var', 
'backbone.0.body.blocks.3.5.conv_dw.weight', 'backbone.0.body.blocks.3.5.bn2.weight', 'backbone.0.body.blocks.3.5.bn2.bias', 'backbone.0.body.blocks.3.5.bn2.running_mean', 'backbone.0.body.blocks.3.5.bn2.running_var', 'backbone.0.body.blocks.3.5.se.conv_reduce.weight', 'backbone.0.body.blocks.3.5.se.conv_reduce.bias', 'backbone.0.body.blocks.3.5.se.conv_expand.weight', 'backbone.0.body.blocks.3.5.se.conv_expand.bias', 'backbone.0.body.blocks.3.5.conv_pwl.weight', 'backbone.0.body.blocks.3.5.bn3.weight', 'backbone.0.body.blocks.3.5.bn3.bias', 'backbone.0.body.blocks.3.5.bn3.running_mean', 'backbone.0.body.blocks.3.5.bn3.running_var', 'backbone.0.body.blocks.3.6.conv_pw.weight', 'backbone.0.body.blocks.3.6.bn1.weight', 'backbone.0.body.blocks.3.6.bn1.bias', 'backbone.0.body.blocks.3.6.bn1.running_mean', 'backbone.0.body.blocks.3.6.bn1.running_var', 'backbone.0.body.blocks.3.6.conv_dw.weight', 'backbone.0.body.blocks.3.6.bn2.weight', 'backbone.0.body.blocks.3.6.bn2.bias', 'backbone.0.body.blocks.3.6.bn2.running_mean', 'backbone.0.body.blocks.3.6.bn2.running_var', 'backbone.0.body.blocks.3.6.se.conv_reduce.weight', 'backbone.0.body.blocks.3.6.se.conv_reduce.bias', 'backbone.0.body.blocks.3.6.se.conv_expand.weight', 'backbone.0.body.blocks.3.6.se.conv_expand.bias', 'backbone.0.body.blocks.3.6.conv_pwl.weight', 'backbone.0.body.blocks.3.6.bn3.weight', 'backbone.0.body.blocks.3.6.bn3.bias', 'backbone.0.body.blocks.3.6.bn3.running_mean', 'backbone.0.body.blocks.3.6.bn3.running_var', 'backbone.0.body.blocks.4.0.conv_pw.weight', 'backbone.0.body.blocks.4.0.bn1.weight', 'backbone.0.body.blocks.4.0.bn1.bias', 'backbone.0.body.blocks.4.0.bn1.running_mean', 'backbone.0.body.blocks.4.0.bn1.running_var', 'backbone.0.body.blocks.4.0.conv_dw.weight', 'backbone.0.body.blocks.4.0.bn2.weight', 'backbone.0.body.blocks.4.0.bn2.bias', 'backbone.0.body.blocks.4.0.bn2.running_mean', 'backbone.0.body.blocks.4.0.bn2.running_var', 'backbone.0.body.blocks.4.0.se.conv_reduce.weight', 'backbone.0.body.blocks.4.0.se.conv_reduce.bias', 'backbone.0.body.blocks.4.0.se.conv_expand.weight', 'backbone.0.body.blocks.4.0.se.conv_expand.bias', 'backbone.0.body.blocks.4.0.conv_pwl.weight', 'backbone.0.body.blocks.4.0.bn3.weight', 'backbone.0.body.blocks.4.0.bn3.bias', 'backbone.0.body.blocks.4.0.bn3.running_mean', 'backbone.0.body.blocks.4.0.bn3.running_var', 'backbone.0.body.blocks.4.1.conv_pw.weight', 'backbone.0.body.blocks.4.1.bn1.weight', 'backbone.0.body.blocks.4.1.bn1.bias', 'backbone.0.body.blocks.4.1.bn1.running_mean', 'backbone.0.body.blocks.4.1.bn1.running_var', 'backbone.0.body.blocks.4.1.conv_dw.weight', 'backbone.0.body.blocks.4.1.bn2.weight', 'backbone.0.body.blocks.4.1.bn2.bias', 'backbone.0.body.blocks.4.1.bn2.running_mean', 'backbone.0.body.blocks.4.1.bn2.running_var', 'backbone.0.body.blocks.4.1.se.conv_reduce.weight', 'backbone.0.body.blocks.4.1.se.conv_reduce.bias', 'backbone.0.body.blocks.4.1.se.conv_expand.weight', 'backbone.0.body.blocks.4.1.se.conv_expand.bias', 'backbone.0.body.blocks.4.1.conv_pwl.weight', 'backbone.0.body.blocks.4.1.bn3.weight', 'backbone.0.body.blocks.4.1.bn3.bias', 'backbone.0.body.blocks.4.1.bn3.running_mean', 'backbone.0.body.blocks.4.1.bn3.running_var', 'backbone.0.body.blocks.4.2.conv_pw.weight', 'backbone.0.body.blocks.4.2.bn1.weight', 'backbone.0.body.blocks.4.2.bn1.bias', 'backbone.0.body.blocks.4.2.bn1.running_mean', 'backbone.0.body.blocks.4.2.bn1.running_var', 'backbone.0.body.blocks.4.2.conv_dw.weight', 'backbone.0.body.blocks.4.2.bn2.weight', 
'backbone.0.body.blocks.4.2.bn2.bias', 'backbone.0.body.blocks.4.2.bn2.running_mean', 'backbone.0.body.blocks.4.2.bn2.running_var', 'backbone.0.body.blocks.4.2.se.conv_reduce.weight', 'backbone.0.body.blocks.4.2.se.conv_reduce.bias', 'backbone.0.body.blocks.4.2.se.conv_expand.weight', 'backbone.0.body.blocks.4.2.se.conv_expand.bias', 'backbone.0.body.blocks.4.2.conv_pwl.weight', 'backbone.0.body.blocks.4.2.bn3.weight', 'backbone.0.body.blocks.4.2.bn3.bias', 'backbone.0.body.blocks.4.2.bn3.running_mean', 'backbone.0.body.blocks.4.2.bn3.running_var', 'backbone.0.body.blocks.4.3.conv_pw.weight', 'backbone.0.body.blocks.4.3.bn1.weight', 'backbone.0.body.blocks.4.3.bn1.bias', 'backbone.0.body.blocks.4.3.bn1.running_mean', 'backbone.0.body.blocks.4.3.bn1.running_var', 'backbone.0.body.blocks.4.3.conv_dw.weight', 'backbone.0.body.blocks.4.3.bn2.weight', 'backbone.0.body.blocks.4.3.bn2.bias', 'backbone.0.body.blocks.4.3.bn2.running_mean', 'backbone.0.body.blocks.4.3.bn2.running_var', 'backbone.0.body.blocks.4.3.se.conv_reduce.weight', 'backbone.0.body.blocks.4.3.se.conv_reduce.bias', 'backbone.0.body.blocks.4.3.se.conv_expand.weight', 'backbone.0.body.blocks.4.3.se.conv_expand.bias', 'backbone.0.body.blocks.4.3.conv_pwl.weight', 'backbone.0.body.blocks.4.3.bn3.weight', 'backbone.0.body.blocks.4.3.bn3.bias', 'backbone.0.body.blocks.4.3.bn3.running_mean', 'backbone.0.body.blocks.4.3.bn3.running_var', 'backbone.0.body.blocks.4.4.conv_pw.weight', 'backbone.0.body.blocks.4.4.bn1.weight', 'backbone.0.body.blocks.4.4.bn1.bias', 'backbone.0.body.blocks.4.4.bn1.running_mean', 'backbone.0.body.blocks.4.4.bn1.running_var', 'backbone.0.body.blocks.4.4.conv_dw.weight', 'backbone.0.body.blocks.4.4.bn2.weight', 'backbone.0.body.blocks.4.4.bn2.bias', 'backbone.0.body.blocks.4.4.bn2.running_mean', 'backbone.0.body.blocks.4.4.bn2.running_var', 'backbone.0.body.blocks.4.4.se.conv_reduce.weight', 'backbone.0.body.blocks.4.4.se.conv_reduce.bias', 'backbone.0.body.blocks.4.4.se.conv_expand.weight', 'backbone.0.body.blocks.4.4.se.conv_expand.bias', 'backbone.0.body.blocks.4.4.conv_pwl.weight', 'backbone.0.body.blocks.4.4.bn3.weight', 'backbone.0.body.blocks.4.4.bn3.bias', 'backbone.0.body.blocks.4.4.bn3.running_mean', 'backbone.0.body.blocks.4.4.bn3.running_var', 'backbone.0.body.blocks.4.5.conv_pw.weight', 'backbone.0.body.blocks.4.5.bn1.weight', 'backbone.0.body.blocks.4.5.bn1.bias', 'backbone.0.body.blocks.4.5.bn1.running_mean', 'backbone.0.body.blocks.4.5.bn1.running_var', 'backbone.0.body.blocks.4.5.conv_dw.weight', 'backbone.0.body.blocks.4.5.bn2.weight', 'backbone.0.body.blocks.4.5.bn2.bias', 'backbone.0.body.blocks.4.5.bn2.running_mean', 'backbone.0.body.blocks.4.5.bn2.running_var', 'backbone.0.body.blocks.4.5.se.conv_reduce.weight', 'backbone.0.body.blocks.4.5.se.conv_reduce.bias', 'backbone.0.body.blocks.4.5.se.conv_expand.weight', 'backbone.0.body.blocks.4.5.se.conv_expand.bias', 'backbone.0.body.blocks.4.5.conv_pwl.weight', 'backbone.0.body.blocks.4.5.bn3.weight', 'backbone.0.body.blocks.4.5.bn3.bias', 'backbone.0.body.blocks.4.5.bn3.running_mean', 'backbone.0.body.blocks.4.5.bn3.running_var', 'backbone.0.body.blocks.4.6.conv_pw.weight', 'backbone.0.body.blocks.4.6.bn1.weight', 'backbone.0.body.blocks.4.6.bn1.bias', 'backbone.0.body.blocks.4.6.bn1.running_mean', 'backbone.0.body.blocks.4.6.bn1.running_var', 'backbone.0.body.blocks.4.6.conv_dw.weight', 'backbone.0.body.blocks.4.6.bn2.weight', 'backbone.0.body.blocks.4.6.bn2.bias', 'backbone.0.body.blocks.4.6.bn2.running_mean', 
'backbone.0.body.blocks.4.6.bn2.running_var', 'backbone.0.body.blocks.4.6.se.conv_reduce.weight', 'backbone.0.body.blocks.4.6.se.conv_reduce.bias', 'backbone.0.body.blocks.4.6.se.conv_expand.weight', 'backbone.0.body.blocks.4.6.se.conv_expand.bias', 'backbone.0.body.blocks.4.6.conv_pwl.weight', 'backbone.0.body.blocks.4.6.bn3.weight', 'backbone.0.body.blocks.4.6.bn3.bias', 'backbone.0.body.blocks.4.6.bn3.running_mean', 'backbone.0.body.blocks.4.6.bn3.running_var', 'backbone.0.body.blocks.5.0.conv_pw.weight', 'backbone.0.body.blocks.5.0.bn1.weight', 'backbone.0.body.blocks.5.0.bn1.bias', 'backbone.0.body.blocks.5.0.bn1.running_mean', 'backbone.0.body.blocks.5.0.bn1.running_var', 'backbone.0.body.blocks.5.0.conv_dw.weight', 'backbone.0.body.blocks.5.0.bn2.weight', 'backbone.0.body.blocks.5.0.bn2.bias', 'backbone.0.body.blocks.5.0.bn2.running_mean', 'backbone.0.body.blocks.5.0.bn2.running_var', 'backbone.0.body.blocks.5.0.se.conv_reduce.weight', 'backbone.0.body.blocks.5.0.se.conv_reduce.bias', 'backbone.0.body.blocks.5.0.se.conv_expand.weight', 'backbone.0.body.blocks.5.0.se.conv_expand.bias', 'backbone.0.body.blocks.5.0.conv_pwl.weight', 'backbone.0.body.blocks.5.0.bn3.weight', 'backbone.0.body.blocks.5.0.bn3.bias', 'backbone.0.body.blocks.5.0.bn3.running_mean', 'backbone.0.body.blocks.5.0.bn3.running_var', 'backbone.0.body.blocks.5.1.conv_pw.weight', 'backbone.0.body.blocks.5.1.bn1.weight', 'backbone.0.body.blocks.5.1.bn1.bias', 'backbone.0.body.blocks.5.1.bn1.running_mean', 'backbone.0.body.blocks.5.1.bn1.running_var', 'backbone.0.body.blocks.5.1.conv_dw.weight', 'backbone.0.body.blocks.5.1.bn2.weight', 'backbone.0.body.blocks.5.1.bn2.bias', 'backbone.0.body.blocks.5.1.bn2.running_mean', 'backbone.0.body.blocks.5.1.bn2.running_var', 'backbone.0.body.blocks.5.1.se.conv_reduce.weight', 'backbone.0.body.blocks.5.1.se.conv_reduce.bias', 'backbone.0.body.blocks.5.1.se.conv_expand.weight', 'backbone.0.body.blocks.5.1.se.conv_expand.bias', 'backbone.0.body.blocks.5.1.conv_pwl.weight', 'backbone.0.body.blocks.5.1.bn3.weight', 'backbone.0.body.blocks.5.1.bn3.bias', 'backbone.0.body.blocks.5.1.bn3.running_mean', 'backbone.0.body.blocks.5.1.bn3.running_var', 'backbone.0.body.blocks.5.2.conv_pw.weight', 'backbone.0.body.blocks.5.2.bn1.weight', 'backbone.0.body.blocks.5.2.bn1.bias', 'backbone.0.body.blocks.5.2.bn1.running_mean', 'backbone.0.body.blocks.5.2.bn1.running_var', 'backbone.0.body.blocks.5.2.conv_dw.weight', 'backbone.0.body.blocks.5.2.bn2.weight', 'backbone.0.body.blocks.5.2.bn2.bias', 'backbone.0.body.blocks.5.2.bn2.running_mean', 'backbone.0.body.blocks.5.2.bn2.running_var', 'backbone.0.body.blocks.5.2.se.conv_reduce.weight', 'backbone.0.body.blocks.5.2.se.conv_reduce.bias', 'backbone.0.body.blocks.5.2.se.conv_expand.weight', 'backbone.0.body.blocks.5.2.se.conv_expand.bias', 'backbone.0.body.blocks.5.2.conv_pwl.weight', 'backbone.0.body.blocks.5.2.bn3.weight', 'backbone.0.body.blocks.5.2.bn3.bias', 'backbone.0.body.blocks.5.2.bn3.running_mean', 'backbone.0.body.blocks.5.2.bn3.running_var', 'backbone.0.body.blocks.5.3.conv_pw.weight', 'backbone.0.body.blocks.5.3.bn1.weight', 'backbone.0.body.blocks.5.3.bn1.bias', 'backbone.0.body.blocks.5.3.bn1.running_mean', 'backbone.0.body.blocks.5.3.bn1.running_var', 'backbone.0.body.blocks.5.3.conv_dw.weight', 'backbone.0.body.blocks.5.3.bn2.weight', 'backbone.0.body.blocks.5.3.bn2.bias', 'backbone.0.body.blocks.5.3.bn2.running_mean', 'backbone.0.body.blocks.5.3.bn2.running_var', 'backbone.0.body.blocks.5.3.se.conv_reduce.weight', 
'backbone.0.body.blocks.5.3.se.conv_reduce.bias', 'backbone.0.body.blocks.5.3.se.conv_expand.weight', 'backbone.0.body.blocks.5.3.se.conv_expand.bias', 'backbone.0.body.blocks.5.3.conv_pwl.weight', 'backbone.0.body.blocks.5.3.bn3.weight', 'backbone.0.body.blocks.5.3.bn3.bias', 'backbone.0.body.blocks.5.3.bn3.running_mean', 'backbone.0.body.blocks.5.3.bn3.running_var', 'backbone.0.body.blocks.5.4.conv_pw.weight', 'backbone.0.body.blocks.5.4.bn1.weight', 'backbone.0.body.blocks.5.4.bn1.bias', 'backbone.0.body.blocks.5.4.bn1.running_mean', 'backbone.0.body.blocks.5.4.bn1.running_var', 'backbone.0.body.blocks.5.4.conv_dw.weight', 'backbone.0.body.blocks.5.4.bn2.weight', 'backbone.0.body.blocks.5.4.bn2.bias', 'backbone.0.body.blocks.5.4.bn2.running_mean', 'backbone.0.body.blocks.5.4.bn2.running_var', 'backbone.0.body.blocks.5.4.se.conv_reduce.weight', 'backbone.0.body.blocks.5.4.se.conv_reduce.bias', 'backbone.0.body.blocks.5.4.se.conv_expand.weight', 'backbone.0.body.blocks.5.4.se.conv_expand.bias', 'backbone.0.body.blocks.5.4.conv_pwl.weight', 'backbone.0.body.blocks.5.4.bn3.weight', 'backbone.0.body.blocks.5.4.bn3.bias', 'backbone.0.body.blocks.5.4.bn3.running_mean', 'backbone.0.body.blocks.5.4.bn3.running_var', 'backbone.0.body.blocks.5.5.conv_pw.weight', 'backbone.0.body.blocks.5.5.bn1.weight', 'backbone.0.body.blocks.5.5.bn1.bias', 'backbone.0.body.blocks.5.5.bn1.running_mean', 'backbone.0.body.blocks.5.5.bn1.running_var', 'backbone.0.body.blocks.5.5.conv_dw.weight', 'backbone.0.body.blocks.5.5.bn2.weight', 'backbone.0.body.blocks.5.5.bn2.bias', 'backbone.0.body.blocks.5.5.bn2.running_mean', 'backbone.0.body.blocks.5.5.bn2.running_var', 'backbone.0.body.blocks.5.5.se.conv_reduce.weight', 'backbone.0.body.blocks.5.5.se.conv_reduce.bias', 'backbone.0.body.blocks.5.5.se.conv_expand.weight', 'backbone.0.body.blocks.5.5.se.conv_expand.bias', 'backbone.0.body.blocks.5.5.conv_pwl.weight', 'backbone.0.body.blocks.5.5.bn3.weight', 'backbone.0.body.blocks.5.5.bn3.bias', 'backbone.0.body.blocks.5.5.bn3.running_mean', 'backbone.0.body.blocks.5.5.bn3.running_var', 'backbone.0.body.blocks.5.6.conv_pw.weight', 'backbone.0.body.blocks.5.6.bn1.weight', 'backbone.0.body.blocks.5.6.bn1.bias', 'backbone.0.body.blocks.5.6.bn1.running_mean', 'backbone.0.body.blocks.5.6.bn1.running_var', 'backbone.0.body.blocks.5.6.conv_dw.weight', 'backbone.0.body.blocks.5.6.bn2.weight', 'backbone.0.body.blocks.5.6.bn2.bias', 'backbone.0.body.blocks.5.6.bn2.running_mean', 'backbone.0.body.blocks.5.6.bn2.running_var', 'backbone.0.body.blocks.5.6.se.conv_reduce.weight', 'backbone.0.body.blocks.5.6.se.conv_reduce.bias', 'backbone.0.body.blocks.5.6.se.conv_expand.weight', 'backbone.0.body.blocks.5.6.se.conv_expand.bias', 'backbone.0.body.blocks.5.6.conv_pwl.weight', 'backbone.0.body.blocks.5.6.bn3.weight', 'backbone.0.body.blocks.5.6.bn3.bias', 'backbone.0.body.blocks.5.6.bn3.running_mean', 'backbone.0.body.blocks.5.6.bn3.running_var', 'backbone.0.body.blocks.5.7.conv_pw.weight', 'backbone.0.body.blocks.5.7.bn1.weight', 'backbone.0.body.blocks.5.7.bn1.bias', 'backbone.0.body.blocks.5.7.bn1.running_mean', 'backbone.0.body.blocks.5.7.bn1.running_var', 'backbone.0.body.blocks.5.7.conv_dw.weight', 'backbone.0.body.blocks.5.7.bn2.weight', 'backbone.0.body.blocks.5.7.bn2.bias', 'backbone.0.body.blocks.5.7.bn2.running_mean', 'backbone.0.body.blocks.5.7.bn2.running_var', 'backbone.0.body.blocks.5.7.se.conv_reduce.weight', 'backbone.0.body.blocks.5.7.se.conv_reduce.bias', 'backbone.0.body.blocks.5.7.se.conv_expand.weight', 
'backbone.0.body.blocks.5.7.se.conv_expand.bias', 'backbone.0.body.blocks.5.7.conv_pwl.weight', 'backbone.0.body.blocks.5.7.bn3.weight', 'backbone.0.body.blocks.5.7.bn3.bias', 'backbone.0.body.blocks.5.7.bn3.running_mean', 'backbone.0.body.blocks.5.7.bn3.running_var', 'backbone.0.body.blocks.5.8.conv_pw.weight', 'backbone.0.body.blocks.5.8.bn1.weight', 'backbone.0.body.blocks.5.8.bn1.bias', 'backbone.0.body.blocks.5.8.bn1.running_mean', 'backbone.0.body.blocks.5.8.bn1.running_var', 'backbone.0.body.blocks.5.8.conv_dw.weight', 'backbone.0.body.blocks.5.8.bn2.weight', 'backbone.0.body.blocks.5.8.bn2.bias', 'backbone.0.body.blocks.5.8.bn2.running_mean', 'backbone.0.body.blocks.5.8.bn2.running_var', 'backbone.0.body.blocks.5.8.se.conv_reduce.weight', 'backbone.0.body.blocks.5.8.se.conv_reduce.bias', 'backbone.0.body.blocks.5.8.se.conv_expand.weight', 'backbone.0.body.blocks.5.8.se.conv_expand.bias', 'backbone.0.body.blocks.5.8.conv_pwl.weight', 'backbone.0.body.blocks.5.8.bn3.weight', 'backbone.0.body.blocks.5.8.bn3.bias', 'backbone.0.body.blocks.5.8.bn3.running_mean', 'backbone.0.body.blocks.5.8.bn3.running_var', 'backbone.0.body.blocks.6.0.conv_pw.weight', 'backbone.0.body.blocks.6.0.bn1.weight', 'backbone.0.body.blocks.6.0.bn1.bias', 'backbone.0.body.blocks.6.0.bn1.running_mean', 'backbone.0.body.blocks.6.0.bn1.running_var', 'backbone.0.body.blocks.6.0.conv_dw.weight', 'backbone.0.body.blocks.6.0.bn2.weight', 'backbone.0.body.blocks.6.0.bn2.bias', 'backbone.0.body.blocks.6.0.bn2.running_mean', 'backbone.0.body.blocks.6.0.bn2.running_var', 'backbone.0.body.blocks.6.0.se.conv_reduce.weight', 'backbone.0.body.blocks.6.0.se.conv_reduce.bias', 'backbone.0.body.blocks.6.0.se.conv_expand.weight', 'backbone.0.body.blocks.6.0.se.conv_expand.bias', 'backbone.0.body.blocks.6.0.conv_pwl.weight', 'backbone.0.body.blocks.6.0.bn3.weight', 'backbone.0.body.blocks.6.0.bn3.bias', 'backbone.0.body.blocks.6.0.bn3.running_mean', 'backbone.0.body.blocks.6.0.bn3.running_var', 'backbone.0.body.blocks.6.1.conv_pw.weight', 'backbone.0.body.blocks.6.1.bn1.weight', 'backbone.0.body.blocks.6.1.bn1.bias', 'backbone.0.body.blocks.6.1.bn1.running_mean', 'backbone.0.body.blocks.6.1.bn1.running_var', 'backbone.0.body.blocks.6.1.conv_dw.weight', 'backbone.0.body.blocks.6.1.bn2.weight', 'backbone.0.body.blocks.6.1.bn2.bias', 'backbone.0.body.blocks.6.1.bn2.running_mean', 'backbone.0.body.blocks.6.1.bn2.running_var', 'backbone.0.body.blocks.6.1.se.conv_reduce.weight', 'backbone.0.body.blocks.6.1.se.conv_reduce.bias', 'backbone.0.body.blocks.6.1.se.conv_expand.weight', 'backbone.0.body.blocks.6.1.se.conv_expand.bias', 'backbone.0.body.blocks.6.1.conv_pwl.weight', 'backbone.0.body.blocks.6.1.bn3.weight', 'backbone.0.body.blocks.6.1.bn3.bias', 'backbone.0.body.blocks.6.1.bn3.running_mean', 'backbone.0.body.blocks.6.1.bn3.running_var', 'backbone.0.body.blocks.6.2.conv_pw.weight', 'backbone.0.body.blocks.6.2.bn1.weight', 'backbone.0.body.blocks.6.2.bn1.bias', 'backbone.0.body.blocks.6.2.bn1.running_mean', 'backbone.0.body.blocks.6.2.bn1.running_var', 'backbone.0.body.blocks.6.2.conv_dw.weight', 'backbone.0.body.blocks.6.2.bn2.weight', 'backbone.0.body.blocks.6.2.bn2.bias', 'backbone.0.body.blocks.6.2.bn2.running_mean', 'backbone.0.body.blocks.6.2.bn2.running_var', 'backbone.0.body.blocks.6.2.se.conv_reduce.weight', 'backbone.0.body.blocks.6.2.se.conv_reduce.bias', 'backbone.0.body.blocks.6.2.se.conv_expand.weight', 'backbone.0.body.blocks.6.2.se.conv_expand.bias', 'backbone.0.body.blocks.6.2.conv_pwl.weight', 
'backbone.0.body.blocks.6.2.bn3.weight', 'backbone.0.body.blocks.6.2.bn3.bias', 'backbone.0.body.blocks.6.2.bn3.running_mean', 'backbone.0.body.blocks.6.2.bn3.running_var', 'answer_type_head.weight', 'answer_type_head.bias', 'answer_rel_head.weight', 'answer_rel_head.bias', 'answer_obj_head.weight', 'answer_obj_head.bias', 'answer_global_head.weight', 'answer_global_head.bias', 'answer_attr_head.weight', 'answer_attr_head.bias', 'answer_cat_head.weight', 'answer_cat_head.bias'])
state_dict["model"]["transformer.text_encoder.embeddings.position_ids"]
tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167,
168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251,
252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265,
266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293,
294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307,
308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321,
322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335,
336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349,
350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377,
378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391,
392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405,
406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433,
434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447,
448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461,
462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475,
476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489,
490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503,
504, 505, 506, 507, 508, 509, 510, 511, 512, 513]])
So on the pretrained-checkpoint side, "position_ids" is indeed stored as one of the saved entries (the 0–513 index range matches RoBERTa's 514 position embeddings).
Looking at the values of the other parameters:
state_dict["model"]["transformer.text_encoder.encoder.layer.0.attention.self.query.weight"]
tensor([[ 0.0024, -0.0432, -0.1008, ..., 0.1183, 0.0301, -0.0913],
[ 0.0030, 0.2174, 0.0536, ..., 0.0234, 0.0858, 0.1278],
[ 0.0735, 0.0854, 0.0009, ..., -0.0129, -0.0278, 0.0802],
...,
[-0.1637, 0.0062, -0.0330, ..., 0.0058, 0.0643, -0.1156],
[-0.2991, 0.0544, 0.0946, ..., -0.0035, -0.1324, 0.0509],
[-0.0649, -0.1165, 0.1456, ..., -0.2730, -0.0320, -0.1308]])
The other parameters appear to have been loaded with actual data.
Checking the variable types:
type(model_qa.transformer.text_encoder.embeddings.position_ids)
torch.Tensor
type(model_qa.transformer.text_encoder.encoder.layer[0].attention.self.query.weight)
torch.nn.parameter.Parameter
In the model, "position_ids" is a plain torch.Tensor rather than a Parameter, so it is not part of the model's own state_dict; this appears to be what caused the loading error.
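As a quick diagnostic (a sketch; "model_qa" is the model instantiated above), we can confirm how "position_ids" is registered in the model:

# Diagnostic sketch: check where "position_ids" lives in the model
key = "transformer.text_encoder.embeddings.position_ids"
print(key in dict(model_qa.named_parameters()))  # False: not a learnable Parameter
print(key in dict(model_qa.named_buffers()))     # likely True: registered as a buffer
print(key in model_qa.state_dict())              # False if the buffer is non-persistent,
                                                 # which is why load_state_dict() treats
                                                 # the checkpoint's entry as "unexpected"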
Loading the pretrained parameters into the model as-is:
model_qa.load_state_dict(state_dict["model"])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-19-5c885fe1a66b> in <cell line: 0>()
----> 1 model_qa.load_state_dict(state_dict["model"])
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict, assign)
2579
2580 if len(error_msgs) > 0:
-> 2581 raise RuntimeError(
2582 "Error(s) in loading state_dict for {}:\n\t{}".format(
2583 self.__class__.__name__, "\n\t".join(error_msgs)
RuntimeError: Error(s) in loading state_dict for MDETR:
Unexpected key(s) in state_dict: "transformer.text_encoder.embeddings.position_ids".
As in the previous attempt, the "position_ids" error occurs.
The error appears to stem from a mismatch: on the model side, "position_ids" already holds a value as a plain (non-learnable) tensor, whereas the pretrained checkpoint defines "position_ids" as a trained parameter, and the load fails when it tries to treat it as one.
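One option at this point would be to simply delete the mismatched key from the checkpoint before loading (a minimal sketch; this is not the approach taken below, but it avoids relaxing the key check for everything else — the model recreates "position_ids" on its own, so dropping the stale entry should be safe):

# Alternative sketch: drop the stale checkpoint entry instead of using strict=False
state_dict["model"].pop("transformer.text_encoder.embeddings.position_ids", None)
model_qa.load_state_dict(state_dict["model"])  # all remaining keys must now match exactly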
Debug: loading the model with "strict=False"
Since there is an entry (position_ids) that does not line up between the model definition and the pretrained parameter data, we load the model with "strict=False", which ignores such mismatches:
import torch
torch.__version__
'2.6.0+cu124'
state_dict = torch.hub.load_state_dict_from_url("https://zenodo.org/record/4721981/files/gqa_EB5_checkpoint.pth", map_location='cpu')
model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa', pretrained=False, return_postprocessor=False)
Using cache found in /root/.cache/torch/hub/ashkamath_mdetr_main
WARNING:timm.models._builder:Unexpected keys (bn2.bias, bn2.num_batches_tracked, bn2.running_mean, bn2.running_var, bn2.weight, classifier.bias, classifier.weight, conv_head.weight) found while loading pretrained weights. This may be expected if model is being adapted.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model_qa.load_state_dict(state_dict["model"], strict=False)
_IncompatibleKeys(missing_keys=[], unexpected_keys=['transformer.text_encoder.embeddings.position_ids'])
No errors: _IncompatibleKeys reports no missing keys and only the expected "position_ids" as unexpected, so the load succeeded.
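Since the original symptom was nan values in the inference output, it may also be worth sanity-checking the loaded weights themselves (a sketch; an empty list means no corrupted parameters):

import torch
# Sanity check: list any parameters containing NaN after loading the checkpoint
nan_params = [n for n, p in model_qa.named_parameters() if torch.isnan(p).any()]
print(nan_params if nan_params else "no NaN values in the loaded parameters")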
Continuing the remaining inference steps with the successfully loaded model
# Load the model from Torch Hub
# model_qa = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5_gqa', pretrained=True, return_postprocessor=False)
model_qa = model_qa.cuda()
model_qa.eval();
model_qa
MDETR(
(transformer): Transformer(
(encoder): TransformerEncoder(
(layers): ModuleList(
(0-5): 6 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
)
)
(decoder): TransformerDecoder(
(layers): ModuleList(
(0-5): 6 x TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(cross_attn_image): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(dropout4): Dropout(p=0.1, inplace=False)
)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(text_encoder): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0-11): 12 x RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSdpaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): RobertaPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(resizer): FeatureResizer(
(fc): Linear(in_features=768, out_features=256, bias=True)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(class_embed): Linear(in_features=256, out_features=256, bias=True)
(bbox_embed): MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
(query_embed): Embedding(100, 256)
(qa_embed): Embedding(6, 256)
(input_proj): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(backbone): Joiner(
(0): TimmBackbone(
(body): EfficientNetFeatures(
(conv_stem): Conv2dSame(3, 48, kernel_size=(3, 3), stride=(2, 2), bias=False)
(bn1): FrozenBatchNorm2d()
(blocks): Sequential(
(0): Sequential(
(0): DepthwiseSeparableConv(
(conv_dw): Conv2d(48, 48, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=48, bias=False)
(bn1): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(48, 12, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(12, 48, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pw): Conv2d(48, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): DepthwiseSeparableConv(
(conv_dw): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=24, bias=False)
(bn1): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(24, 6, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(6, 24, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pw): Conv2d(24, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): DepthwiseSeparableConv(
(conv_dw): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=24, bias=False)
(bn1): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(24, 6, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(6, 24, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pw): Conv2d(24, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(1): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(24, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2dSame(144, 144, kernel_size=(3, 3), stride=(2, 2), groups=144, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(144, 6, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(6, 144, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(144, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=240, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(240, 10, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(10, 240, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=240, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(240, 10, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(10, 240, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(3): InvertedResidual(
(conv_pw): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=240, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(240, 10, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(10, 240, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(4): InvertedResidual(
(conv_pw): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=240, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(240, 10, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(10, 240, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(2): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2dSame(240, 240, kernel_size=(5, 5), stride=(2, 2), groups=240, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(240, 10, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(10, 240, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(240, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(384, 384, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=384, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(16, 384, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(384, 384, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=384, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(16, 384, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(3): InvertedResidual(
(conv_pw): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(384, 384, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=384, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(16, 384, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(4): InvertedResidual(
(conv_pw): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(384, 384, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=384, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(16, 384, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(3): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2dSame(384, 384, kernel_size=(3, 3), stride=(2, 2), groups=384, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(16, 384, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(384, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(3): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(4): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(5): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(6): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(4): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(128, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(768, 768, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=768, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(768, 32, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(32, 768, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(768, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(3): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(4): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(5): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(6): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1056, 1056, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 176, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(5): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(176, 1056, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2dSame(1056, 1056, kernel_size=(5, 5), stride=(2, 2), groups=1056, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1056, 44, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(44, 1056, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1056, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(3): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(4): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(5): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(6): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(7): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(8): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 304, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
(6): Sequential(
(0): InvertedResidual(
(conv_pw): Conv2d(304, 1824, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(1824, 1824, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1824, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(1824, 76, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(76, 1824, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(1824, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(1): InvertedResidual(
(conv_pw): Conv2d(512, 3072, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(3072, 3072, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=3072, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(3072, 128, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(128, 3072, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(3072, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
(2): InvertedResidual(
(conv_pw): Conv2d(512, 3072, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d()
(conv_dw): Conv2d(3072, 3072, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=3072, bias=False)
(bn2): FrozenBatchNorm2d()
(aa): Identity()
(se): SqueezeExcite(
(conv_reduce): Conv2d(3072, 128, kernel_size=(1, 1), stride=(1, 1))
(act1): SiLU(inplace=True)
(conv_expand): Conv2d(128, 3072, kernel_size=(1, 1), stride=(1, 1))
(gate): Sigmoid()
)
(conv_pwl): Conv2d(3072, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d()
(drop_path): Identity()
)
)
)
)
)
(1): PositionEmbeddingSine()
)
(answer_type_head): Linear(in_features=256, out_features=5, bias=True)
(answer_rel_head): Linear(in_features=256, out_features=1594, bias=True)
(answer_obj_head): Linear(in_features=256, out_features=3, bias=True)
(answer_global_head): Linear(in_features=256, out_features=111, bias=True)
(answer_attr_head): Linear(in_features=256, out_features=403, bias=True)
(answer_cat_head): Linear(in_features=256, out_features=678, bias=True)
)
def plot_inference_qa(im, caption):
    img = transform(im).unsqueeze(0).cuda()
    # propagate through the model
    memory_cache = model_qa(img, [caption], encode_and_save=True)
    outputs = model_qa(img, [caption], encode_and_save=False, memory_cache=memory_cache)
    # keep only the queries with 0.7+ confidence
    probas = 1 - outputs['pred_logits'].softmax(-1)[0, :, -1].cpu()
    keep = (probas > 0.7).cpu()
    # rescale the predicted boxes to the original image size
    bboxes_scaled = rescale_bboxes(outputs['pred_boxes'].cpu()[0, keep], im.size)
    # extract the text spans predicted by each box
    positive_tokens = (outputs["pred_logits"].cpu()[0, keep].softmax(-1) > 0.1).nonzero().tolist()
    predicted_spans = defaultdict(str)
    for tok in positive_tokens:
        item, pos = tok
        if pos < 255:
            # get the token's character span from the text processed by the encoder
            span = memory_cache["tokenized"].token_to_chars(0, pos)
            # take the corresponding characters from the input caption
            predicted_spans[item] += " " + caption[span.start:span.end]
    labels = [predicted_spans[k] for k in sorted(list(predicted_spans.keys()))]
    plot_results(im, probas[keep], bboxes_scaled, labels)
    # classify the question type
    type_conf, type_pred = outputs["pred_answer_type"].softmax(-1).max(-1)
    ans_type = type_pred.item()
    types = ["obj", "attr", "rel", "global", "cat"]
    # take the best-matching answer for that question type, mapping its ID back to a word
    ans_conf, ans = outputs[f"pred_answer_{types[ans_type]}"][0].softmax(-1).max(-1)
    answer = id2answerbytype[f"answer_{types[ans_type]}"][ans.item()]
    print(f"Predicted answer: {answer}\t confidence={round(100 * type_conf.item() * ans_conf.item(), 2)}")
Run inference with the following:
url = "https://s3.us-east-1.amazonaws.com/images.cocodataset.org/val2017/000000076547.jpg"
im3 = Image.open(requests.get(url, stream=True).raw)
plot_inference_qa(im3, "What is on the table?")
Predicted answer: no confidence=nan
The inference result gives "no" as the answer and nan as the confidence, so the processing is not working correctly.
Looking at the contents of the input "img" and the outputs "memory_cache" and "outputs":
img = transform(im3).unsqueeze(0).cuda()
caption = "What is on the table?"
# propagate through the model
memory_cache = model_qa(img, [caption], encode_and_save=True)
outputs = model_qa(img, [caption], encode_and_save=False, memory_cache=memory_cache)
img
tensor([[[[-1.7754, -1.7412, -1.6727, ..., 0.3309, 0.1939, 0.0912],
[-1.7412, -1.7069, -1.6727, ..., 0.4851, 0.3823, 0.3138],
[-1.6727, -1.6727, -1.6727, ..., 0.7248, 0.6734, 0.6392],
...,
[-1.3302, -1.3987, -1.5014, ..., -0.7993, -0.7822, -0.7650],
[-1.4500, -1.4672, -1.5014, ..., -0.7822, -0.7993, -0.7993],
[-1.5185, -1.5185, -1.5014, ..., -0.7822, -0.7993, -0.8164]],
[[-1.6856, -1.6331, -1.5455, ..., 0.4853, 0.4153, 0.3627],
[-1.6681, -1.6331, -1.5630, ..., 0.6604, 0.6254, 0.6078],
[-1.6506, -1.6155, -1.5805, ..., 0.9230, 0.9580, 0.9930],
...,
[-1.1954, -1.2304, -1.3004, ..., -0.3200, -0.2850, -0.2500],
[-1.3004, -1.3004, -1.2829, ..., -0.3025, -0.3200, -0.3200],
[-1.3704, -1.3354, -1.2829, ..., -0.3025, -0.3375, -0.3725]],
[[-1.4559, -1.4036, -1.3339, ..., 0.7751, 0.7751, 0.7751],
[-1.4384, -1.4036, -1.3339, ..., 0.9668, 0.9494, 0.9145],
[-1.4036, -1.3861, -1.3513, ..., 1.2631, 1.1934, 1.1411],
...,
[-0.7587, -0.7936, -0.8633, ..., -0.6715, -0.5844, -0.5321],
[-0.9156, -0.8981, -0.8807, ..., -0.6367, -0.6367, -0.6367],
[-1.0201, -0.9678, -0.8807, ..., -0.6193, -0.6715, -0.7064]]]],
device='cuda:0')
memory_cache
{'text_memory_resized': tensor([[[ 0.0463, 0.0675, 0.1477, ..., 0.0164, 0.2151, 0.0931]],
[[-0.5830, 0.0047, 0.8694, ..., -0.5799, -0.6114, -0.4202]],
[[-0.5820, -0.0295, 0.8522, ..., -0.6351, -0.6301, -0.4519]],
...,
[[-0.7565, 0.5123, -0.3289, ..., -0.4070, 0.3803, -0.2708]],
[[ 0.0463, 0.0675, 0.1477, ..., 0.0164, 0.2151, 0.0931]],
[[ 0.0463, 0.0675, 0.1477, ..., 0.0164, 0.2151, 0.0931]]],
device='cuda:0'),
'text_memory': tensor([[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
...,
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0'),
'img_memory': tensor([[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
...,
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0'),
'text_pooled_op': None,
'img_pooled_op': None,
'mask': tensor([[False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False]],
device='cuda:0'),
'text_attention_mask': tensor([[False, False, False, False, False, False, False, False]],
device='cuda:0'),
'pos_embed': tensor([[[2.4869e-01, 9.6858e-01, 2.1593e-01, ..., 1.0000e+00,
2.1340e-05, 1.0000e+00]],
[[2.4869e-01, 9.6858e-01, 2.1593e-01, ..., 1.0000e+00,
4.2681e-05, 1.0000e+00]],
[[2.4869e-01, 9.6858e-01, 2.1593e-01, ..., 1.0000e+00,
6.4021e-05, 1.0000e+00]],
...,
[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00]],
[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00]],
[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00]]], device='cuda:0'),
'query_embed': tensor([[[ 0.0491, 1.1184, -0.2196, ..., 0.4614, -0.0693, 0.6637]],
[[-0.6772, -0.7437, -0.5867, ..., 0.1200, 1.8001, 0.5810]],
[[ 1.9782, 0.2545, -0.6935, ..., 0.1258, -1.6969, -0.6355]],
...,
[[ 0.4497, 0.5344, 0.3115, ..., -0.6557, 0.1694, -0.1874]],
[[-2.3338, -0.9938, -0.3027, ..., 0.2182, 0.9463, 0.0895]],
[[-1.5761, -0.5586, -0.4865, ..., -1.8347, -1.2846, 0.0172]]],
device='cuda:0'),
'tokenized': {'input_ids': tensor([[ 0, 2264, 16, 15, 5, 2103, 116, 2]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}}
outputs
{'pred_answer_type': tensor([[nan, nan, nan, nan, nan]], device='cuda:0'),
'pred_answer_obj': tensor([[nan, nan, nan]], device='cuda:0'),
'pred_answer_rel': tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0'),
'pred_answer_attr': tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
device='cuda:0'),
'pred_answer_cat': tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan]], device='cuda:0'),
'pred_answer_global': tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
device='cuda:0'),
'pred_logits': tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0'),
'pred_boxes': tensor([[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]]], device='cuda:0')}
The results contain a very large number of nan values.
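To quantify that at a glance, the following quick check (my own addition, assuming the "outputs" dict from the cell above) counts the nan entries per key:
import torch
# Count nan elements in every tensor of the outputs dict.
# Judging from the dump above, every prediction head should come back all-nan.
for key, value in outputs.items():
    if torch.is_tensor(value):
        bad = torch.isnan(value).sum().item()
        print(f"{key}: {bad}/{value.numel()} nan")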
Since nan values are already present at the "memory_cache" stage, the initial encoder, which computes the intermediate vectors from the image and the text, appears to be where things go wrong.
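To narrow down which layer first produces the nan, one option (a minimal sketch of my own, assuming the "model_qa", "img", and "caption" variables defined above) is to attach forward hooks to every submodule and record, in execution order, the modules whose output contains nan:
import torch

nan_modules = []

def make_hook(name):
    def hook(module, inputs, output):
        # module outputs may be a tensor or a tuple/list of tensors
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for o in outs:
            if torch.is_tensor(o) and torch.isnan(o).any():
                nan_modules.append(name)
                break
    return hook

# Hook every submodule; child modules finish before their parents,
# so the earliest entries point closest to the source of the nan.
handles = [m.register_forward_hook(make_hook(n)) for n, m in model_qa.named_modules()]
with torch.no_grad():
    model_qa(img, [caption], encode_and_save=True)
for h in handles:
    h.remove()

print(nan_modules[:10])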
Inference does not work correctly
I gave up at this point.
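(One untested idea for anyone who wants to dig further: the original notebook dates from 2021, while the Colab runtime above installs much newer timm and transformers releases, so a library-version mismatch is a plausible suspect. Pinning versions from around that era, for example as below, might be worth trying; I have not verified that this fixes the nan issue.)
!pip install "transformers==4.5.1" "timm==0.4.12"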