Notes on an OverflowError raised while running the Chainer (v1.4.1) MNIST example on a GPU.
load MNIST dataset
epoch 1
graph generated
train mean loss=0.189192938904, accuracy=0.9427833369
test mean loss=0.0907700556988, accuracy=0.971400004625
epoch 2
train mean loss=0.0753641784944, accuracy=0.977200009624
test mean loss=0.0728686049528, accuracy=0.977100006938
epoch 3
train mean loss=0.0482070475052, accuracy=0.984483343263
test mean loss=0.0654721940898, accuracy=0.981700006723
epoch 4
train mean loss=0.0369998926547, accuracy=0.98763334175
test mean loss=0.0861884641147, accuracy=0.976300006509
epoch 5
train mean loss=0.028689230929, accuracy=0.990883341332
test mean loss=0.0656151846884, accuracy=0.982000008225
epoch 6
train mean loss=0.0242170301934, accuracy=0.991766673426
test mean loss=0.092080684273, accuracy=0.976500005126
epoch 7
train mean loss=0.0214745011753, accuracy=0.993416672448
test mean loss=0.0633907641986, accuracy=0.984200007915
epoch 8
train mean loss=0.0201626637221, accuracy=0.993866672119
test mean loss=0.0928991907273, accuracy=0.979100006819
epoch 9
train mean loss=0.0160159983088, accuracy=0.994700004657
test mean loss=0.0739389384927, accuracy=0.982000007629
epoch 10
train mean loss=0.0127062624457, accuracy=0.99568333745
test mean loss=0.0723534981629, accuracy=0.983200008869
epoch 11
train mean loss=0.0158423399801, accuracy=0.994966671367
test mean loss=0.0910608516628, accuracy=0.980800007582
epoch 12
Traceback (most recent call last):
  File "train_mnist.py", line 88, in <module>
    optimizer.update(model, x, t)
  File "/***/.pyenv/versions/anaconda-2.4.0/lib/python2.7/site-packages/chainer-1.4.1-py2.7-linux-x86_64.egg/chainer/optimizer.py", line 386, in update
    self.update_one(param, states[name])
  File "/***/.pyenv/versions/anaconda-2.4.0/lib/python2.7/site-packages/chainer-1.4.1-py2.7-linux-x86_64.egg/chainer/optimizer.py", line 402, in update_one
    self.update_one_gpu(param, state)
  File "/***/.pyenv/versions/anaconda-2.4.0/lib/python2.7/site-packages/chainer-1.4.1-py2.7-linux-x86_64.egg/chainer/optimizers/adam.py", line 43, in update_one_gpu
    'adam')(param.grad, self.lr, 1 - self.beta1, 1 - self.beta2,
  File "/***/.pyenv/versions/anaconda-2.4.0/lib/python2.7/site-packages/chainer-1.4.1-py2.7-linux-x86_64.egg/chainer/optimizers/adam.py", line 56, in lr
    fix1 = 1. - self.beta1 ** self.t
OverflowError: (34, 'Numerical result out of range')
The error appears to be raised in adam.py while computing $\beta_1^t$.
Monitoring the value of self.beta1 ** self.t, I confirmed that the OverflowError occurs on the iteration right after the one where self.beta1 ** self.t = 2.3571703009247695e-308.
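As a rough cross-check, the step at which $\beta_1^t$ first drops below the smallest normalized float can be estimated with logarithms, without evaluating the power itself. This is only a sketch: it assumes beta1 = 0.9, which is not stated in the log above but is consistent with the observed value (0.9 ** 6723 is about 2.357e-308), and the helper name first_step_below_float_min is made up for illustration.

```python
import math
import sys


def first_step_below_float_min(beta):
    """Return the smallest integer t with beta ** t < sys.float_info.min.

    Assumes 0 < beta < 1. Works on logarithms only, so the tiny power
    itself is never computed.
    """
    # beta ** t < min  <=>  t * log(beta) < log(min)  <=>  t > log(min) / log(beta)
    # (the inequality flips because log(beta) is negative).
    ratio = math.log(sys.float_info.min) / math.log(beta)
    return int(math.floor(ratio)) + 1


# Assumed value of beta1 (not taken from the log above).
step = first_step_below_float_min(0.9)
print(step)  # 6724, i.e. the iteration right after t = 6723
```

Under that assumption, t = 6723 is the last step whose result is still a normalized float, which matches the value observed just before the failure.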
Let's check what Python reports about its float type.
Python 2.7.10 |Anaconda 2.4.0 (64-bit)| (default, Oct 19 2015, 18:04:42)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import sys
>>> sys.float_info.min
2.2250738585072014e-308
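Sure enough, the next value falls below sys.float_info.min, the smallest positive normalized float, and that is when the OverflowError is raised. Strictly speaking this is an underflow rather than an overflow, even though Python reports it as OverflowError.
While looking into this, I noticed some slightly odd behavior in Python (2.7.10 | Anaconda 2.4.0 (64-bit)).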
>>> 10 ** -307
1e-307
>>> 10 ** -308
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Numerical result out of range')
>>> 10 ** -350
0.0
Computing 10 ** -308 raises an OverflowError, yet 10 ** -350 is happily evaluated as 0.0.
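One plausible reading of this session, offered as a guess rather than a confirmed diagnosis: 1e-308 lies below sys.float_info.min and is therefore a subnormal number, whereas 1e-350 is smaller than the smallest subnormal (about 4.9e-324) and rounds to exactly 0.0. The sketch below only classifies the three values. The helper name classify is made up for illustration, and the code does not reproduce the error itself, which may depend on the Python version and the platform's C library.

```python
import sys


def classify(x):
    """Label a float as 'zero', 'subnormal' or 'normal' (hypothetical helper)."""
    if x == 0.0:
        return "zero"
    if abs(x) < sys.float_info.min:
        return "subnormal"
    return "normal"


# The values are written as float literals, so no power is computed here.
print(classify(1e-307))  # normal
print(classify(1e-308))  # subnormal
print(classify(1e-350))  # zero (the literal itself rounds to 0.0)
```

If this reading is right, only results that land in the subnormal range trigger the error, and results that underflow all the way to zero pass through silently.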
As a stopgap, I rewrote the lr property in adam.py as follows.
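    # Requires "import sys" at the top of adam.py.
    @property
    def lr(self):
        # Stopgap: once t reaches -sys.float_info.min_10_exp (307 for doubles),
        # skip the powers entirely and treat both correction terms as 1.
        # Caveat: this also forces fix2 to 1 from that step on, even if
        # beta2 ** t is still far from 0 at that point.
        if -self.t <= sys.float_info.min_10_exp:
            fix1 = 1.
            fix2 = 1.
        else:
            fix1 = 1. - self.beta1 ** self.t
            fix2 = 1. - self.beta2 ** self.t
        # Original code:
        # fix1 = 1. - self.beta1 ** self.t
        # fix2 = 1. - self.beta2 ** self.t
        return self.alpha * math.sqrt(fix2) / fix1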
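The OverflowError no longer occurs, but hmm...
Setting $\beta_1$ to a slightly larger value such as 0.99 is another option, but that only postpones the underflow to a later iteration, so it is not a real fix either...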
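One possible alternative, sketched here under assumptions and not tested against Chainer itself, is to guard each correction term separately. When $\beta^t$ is below sys.float_info.min, 1 - $\beta^t$ is exactly 1.0 in double precision, so the power can be skipped for that term only. The function names bias_correction and adam_lr are hypothetical, and alpha = 0.001, beta1 = 0.9, beta2 = 0.999 are assumed values used only for the usage example.

```python
import math
import sys

# Natural log of the smallest normalized positive double (about -708.4).
LOG_FLOAT_MIN = math.log(sys.float_info.min)


def bias_correction(beta, t):
    """Return 1 - beta ** t without ever computing a subnormal power.

    Assumes 0 < beta < 1 and t >= 1 (hypothetical helper).
    """
    # The margin of 1.0 keeps every power that is actually computed
    # comfortably inside the normalized range; the skipped values are
    # all below about 6e-308, where 1 - beta ** t is exactly 1.0.
    if t * math.log(beta) <= LOG_FLOAT_MIN + 1.0:
        return 1.0
    return 1.0 - beta ** t


def adam_lr(alpha, beta1, beta2, t):
    """Bias-corrected step size, mirroring the formula in the lr property."""
    fix1 = bias_correction(beta1, t)
    fix2 = bias_correction(beta2, t)
    return alpha * math.sqrt(fix2) / fix1


# Usage with assumed hyperparameters, at the step where the error appeared.
print(adam_lr(0.001, 0.9, 0.999, 6724))
```

Unlike a single check on t, this leaves the second term untouched until $\beta_2^t$ itself becomes negligible, so the learning rate is not changed abruptly at an arbitrary step.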