How large should the batch size be for stochastic gradient descent?
https://stats.stackexchange.com/questions/140811/how-large-should-the-batch-size-be-for-stochastic-gradient-descent
Words I didn't know from sabalaba's answer
be coalesced
your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly.
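A minimal illustration of the idea (not GPU code): reading an array in memory order versus in a shuffled order gives the same result, and only the access pattern differs. On real hardware, especially GPUs, the in-order (coalesced) pattern is what runs faster.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1_000_000)

# In-order access: read the memory sequentially (coalesced-style).
sequential_sum = data.sum()

# Random access: visit the same elements in a shuffled order.
perm = rng.permutation(data.size)
random_sum = data[perm].sum()

# Same result either way; coalescing affects speed, not correctness.
assert np.isclose(sequential_sum, random_sum)
```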
the learning rate schedule
I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.
Related: http://machinelearningmastery.com/using-learning-rate-schedules-deep-learning-models-python-keras/
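The "function from the epoch count to a learning rate" can be written literally. This is a sketch of one common choice (step decay); the parameter values here are illustrative, not from the answer.

```python
def lr_schedule(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Step-decay schedule: multiply the learning rate by `drop`
    every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# epsilon as a function of the epoch count:
lr_schedule(0)   # 0.1
lr_schedule(10)  # halved after 10 epochs
lr_schedule(25)  # halved twice after 20+ epochs
```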
advantage of the speed-up
Comment by Antoine (Mar 24 at 9:44)
e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products.
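A small sketch of why batching enables this speed-up: processing B inputs through a linear layer as one matrix-matrix product gives the same result as B separate matrix-vector products, but lets the library use a much faster batched kernel. The layer sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((64, 128))       # hypothetical layer weights
batch = rng.random((32, 128))   # B = 32 input vectors

# One matrix-matrix product for the whole batch ...
out_batched = batch @ W.T

# ... equals 32 separate matrix-vector products.
out_loop = np.stack([W @ x for x in batch])

assert np.allclose(out_batched, out_loop)
```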