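I have been trying to train a network with 100 inputs and 100 outputs, and it is not making much progress.
I am currently "Deep Learning" the related papers.
Among them, the following caught my attention.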
http://blog.otoro.net/2015/11/24/mixture-density-networks-with-tensorflow/
Keywords
- the classic Mixture Density Networks (Bishop ’94) model
- a Mixture Density Network (MDN)
- to fit a noisy sinusoidal data
- epsilon: standard gaussian random noise
- the fancy RMSProp gradient descent optimisation method
- a one-to-one, or many-to-one function
- What we want is a model that has the capacity to predict a range of different output values for each input
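What I have gathered so far, as an overview:
When a single input value has multiple output values (for example, a sine curve turned on its side), training a conventional neural network fails.
With Mixture Density Networks, it seems that the case of "multiple output values for a single input value" can be learned.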
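To convince myself of the "one input, several outputs" situation, here is a minimal sketch that generates such data with the Python standard library only. The exact form of the noisy sinusoid, the sampling range, and the function name `make_inverted_sine` are my own assumptions for illustration, not necessarily what the linked post uses.

```python
import math
import random


def make_inverted_sine(n_samples=1000, seed=0):
    """Return (xs, ys) where a single x can correspond to several y values."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_samples):
        t = rng.uniform(-10.5, 10.5)   # original input (range is an assumption)
        eps = rng.gauss(0.0, 1.0)      # epsilon: standard gaussian random noise
        s = 7.0 * math.sin(0.75 * t) + 0.5 * t + eps  # noisy sinusoid (assumed form)
        # Swap the roles: the noisy sinusoid becomes the input, t becomes the target.
        xs.append(s)
        ys.append(t)
    return xs, ys


xs, ys = make_inverted_sine()
# Collect the targets whose input falls in a narrow band around x = 0.
near_zero = sorted(y for x, y in zip(xs, ys) if abs(x) < 0.5)
print(len(near_zero), near_zero[:3], near_zero[-3:])
```

The targets collected around x = 0 are spread over several separate branches, which is exactly the case where a network that outputs a single value per input can only settle on something like their average.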
- the network is to predict an entire probability distribution for the output
- MDNs can also be used to model handwriting, where the next stroke is drawn from a probability distribution of multiple possibilities, rather than sticking to one prediction.
- Mixture Gaussian distributions
- the output value is modelled as a sum of many gaussian random values
- each with different means and standard deviations
- a probability distribution function (pdf) of P(Y = y | X = x)
- a restriction that the sum of PI_k(x) add up to one
- we will use a neural network of one hidden layer with 24 nodes, and also generate 24 mixtures, hence there will be 72 actual outputs of our neural network of a single input.
- Z is a vector of 72 values
- into three equal parts, Z{0to23}, Z{24to47}, Z{48to71}
- the softmax and exponential terms have some theoretical interpretations from a Bayesian framework way of looking at probability
- loss function
- A more suitable loss function is to minimise the logarithm of the likelihood of the distribution vs the training data:
- for every (x, y)point in the training data set, we can compute a cost function
- a similar approach, but with non-discretised states.
def get_lossfunc(
- there would be room for performance improvements by building a custom operator into TensorFlow with the pre-optimised gradient formulas for this loss function
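The keywords above about the 72 outputs, the softmax, and the exponential can be sketched roughly as follows. This is plain Python rather than TensorFlow, the helper names are hypothetical, and the ordering of the three parts (weights, then standard deviations, then means) is an assumption on my side.

```python
import math


def split_mixture_params(z, n_mix=24):
    """Split a flat output vector z (length 3 * n_mix) into (pi, sigma, mu)."""
    assert len(z) == 3 * n_mix
    raw_pi, raw_sigma, mu = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    # Softmax (shifted by the max for numerical stability) so the weights sum to one.
    top = max(raw_pi)
    exps = [math.exp(v - top) for v in raw_pi]
    total = sum(exps)
    pi = [e / total for e in exps]
    # The exponential keeps every standard deviation strictly positive.
    sigma = [math.exp(v) for v in raw_sigma]
    return pi, sigma, list(mu)


def gaussian_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, pi, sigma, mu):
    """P(Y = y | X = x) as a weighted sum of gaussian densities."""
    return sum(p * gaussian_pdf(y, m, s) for p, s, m in zip(pi, sigma, mu))


# Hypothetical network output for one input x: 72 arbitrary numbers.
z = [0.01 * i - 0.3 for i in range(72)]
pi, sigma, mu = split_mixture_params(z)
print(sum(pi), min(sigma), mixture_pdf(0.0, pi, sigma, mu))
```

Because the weights sum to one and each component is a proper density, the mixture itself integrates to one over y, so it can be read as a full probability distribution for the output.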
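My understanding so far (which may be wrong):
- An MDN is used for the case where multiple y values are given for a single x
- The MDN is trained so as to minimise the loss function for each point in (x, y) coordinates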
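As a check on this understanding, the per-point cost can be sketched as the negative log-likelihood of each training pair under the predicted mixture, averaged over the data set. The functions `mixture_nll`, `mean_loss`, `two_branch`, and `one_branch` are hypothetical names of mine, and the two "predictors" are fixed stand-ins rather than trained networks.

```python
import math


def mixture_nll(y, pi, sigma, mu):
    """Negative log-likelihood of one target y under a gaussian mixture."""
    log_terms = [
        math.log(p) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((y - m) / s) ** 2
        for p, s, m in zip(pi, sigma, mu)
    ]
    # log-sum-exp keeps the sum stable when every component density is tiny.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


def mean_loss(points, predict):
    """Average the per-point cost over every (x, y) pair in the data set."""
    return sum(mixture_nll(y, *predict(x)) for x, y in points) / len(points)


# Hypothetical stand-ins for a network: each returns (pi, sigma, mu) for any x.
def two_branch(x):
    return [0.5, 0.5], [0.5, 0.5], [-4.0, 4.0]


def one_branch(x):
    return [1.0], [0.5], [0.0]


# Toy data: the same x appears with targets on two separate branches.
points = [(0.0, -4.1), (0.0, 3.9), (0.1, 4.2), (0.1, -3.8)]
print(mean_loss(points, two_branch), mean_loss(points, one_branch))
```

In this sketch the cost is evaluated only at the training points themselves, and the two-component predictor gets a much lower loss than the single gaussian centred between the branches.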
The accuracy of the result may also depend on how fine a resolution is chosen for the coordinates.