Paper > link > Mixture Density Networks with TensorFlow > the capacity to predict a range of different output values for each input #TensorFlow

I have been trying to train a network with 100 inputs and 100 outputs, and it is not making much progress.

Keywords

the classic Mixture Density Networks (Bishop ’94) model
- a Mixture Density Network (MDN)
to fit noisy sinusoidal data
epsilon: standard gaussian random noise
the fancy RMSProp gradient descent optimisation method
a one-to-one, or many-to-one function
What we want is a model that has the capacity to predict a range of different output values for each input

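From what I have read so far, my rough understanding is:
When a single input value has multiple output values (for example, a sine curve turned on its side), training a conventional neural network fails.
With Mixture Density Networks, training seems to work for the case where a single input value has multiple output values.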
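
To make the "one input, several outputs" case concrete, here is a minimal Python sketch using only the standard library. The formula y = 7 sin(0.75 x) + 0.5 x + epsilon, the sampling range, the sample count and the window width are all assumptions chosen for illustration; none of them are taken from these notes. It generates the noisy sinusoid, swaps x and y to turn the curve on its side, and compares the targets seen near one input value in both settings.

```python
import math
import random
import statistics

def make_sinusoid_data(n_samples=2000, seed=0):
    """Sample y = 7*sin(0.75*x) + 0.5*x + epsilon (constants are assumptions)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-10.5, 10.5) for _ in range(n_samples)]
    # epsilon: standard Gaussian noise (mean 0, standard deviation 1)
    ys = [7.0 * math.sin(0.75 * x) + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def targets_near(inputs, targets, centre, half_width=0.5):
    """Collect the targets whose input lies within half_width of centre."""
    return [t for i, t in zip(inputs, targets) if abs(i - centre) <= half_width]

xs, ys = make_sinusoid_data()

# Forward problem: x -> y is (noisily) single-valued.
forward_slice = targets_near(xs, ys, centre=0.0)

# Inverted problem: swap the roles, so one input has several valid outputs.
inv_inputs, inv_targets = ys, xs
inverted_slice = targets_near(inv_inputs, inv_targets, centre=0.0)

forward_std = statistics.pstdev(forward_slice)
inverted_std = statistics.pstdev(inverted_slice)
inverted_mean = statistics.mean(inverted_slice)

print("forward : std of targets near input 0 =", round(forward_std, 2))
print("inverted: std of targets near input 0 =", round(inverted_std, 2))
print("inverted: targets range from", round(min(inverted_slice), 2),
      "to", round(max(inverted_slice), 2), "with mean", round(inverted_mean, 2))
```

In the inverted slice the targets sit on several separate branches of the curve. A single predicted number, such as the mean that a squared-error fit tends towards, cannot represent all of them, which is presumably why the conventional network fails here.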

the network is to predict an entire probability distribution for the output
MDNs can also be used to model handwriting, where the next stroke is drawn from a probability distribution of multiple possibilities, rather than sticking to one prediction.
Mixture Gaussian distributions
- the distribution of the output value is modelled as a weighted sum of many Gaussian distributions
- each with different means and standard deviations
a probability distribution function (pdf) of P(Y = y | X = x)
a restriction that the PI_k(x) sum to one
we will use a neural network of one hidden layer with 24 nodes, and also generate 24 mixtures, hence there will be 72 actual outputs of our neural network for a single input.
Z is a vector of 72 values
- into three equal parts, Z{0to23}, Z{24to47}, Z{48to71}
the softmax and exponential terms have some theoretical interpretations from a Bayesian framework way of looking at probability
loss function
- A more suitable loss function is to minimise the negative logarithm of the likelihood of the distribution vs the training data:
- for every (x, y) point in the training data set, we can compute a cost function
- a similar approach, but with non-discretised states.
- def get_lossfunc(
there would be room for performance improvements by building a custom operator into TensorFlow with the pre-optimised gradient formulas for this loss function
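
The keywords above (72 outputs split into three parts, softmax and exponential terms, a log-likelihood loss) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from the article. The names `split_outputs`, `gaussian_pdf` and `mdn_loss` are hypothetical, and so is the order of the three parts (mixture weights, then standard deviations, then means). The small floor inside the logarithm is my own addition to avoid a math domain error.

```python
import math

N_MIXTURES = 24  # 24 mixtures -> 3 * 24 = 72 raw network outputs

def split_outputs(z):
    """Split raw outputs Z into (pi, sigma, mu); the part order is an assumption."""
    k = len(z) // 3
    z_pi, z_sigma, z_mu = z[:k], z[k:2 * k], z[2 * k:]
    # Softmax makes the mixture weights positive and sum to one.
    # Subtracting the maximum first keeps exp() from overflowing.
    z_max = max(z_pi)
    exps = [math.exp(v - z_max) for v in z_pi]
    total = sum(exps)
    pi = [e / total for e in exps]
    # The exponential keeps every standard deviation strictly positive.
    sigma = [math.exp(v) for v in z_sigma]
    # The means are used as they are.
    mu = list(z_mu)
    return pi, sigma, mu

def gaussian_pdf(y, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mdn_loss(z, y):
    """Negative log-likelihood of one target y under the predicted mixture."""
    pi, sigma, mu = split_outputs(z)
    likelihood = sum(p * gaussian_pdf(y, m, s) for p, m, s in zip(pi, mu, sigma))
    # Floor the likelihood so log() never receives zero (my own safeguard).
    return -math.log(max(likelihood, 1e-300))

# All-zero outputs give equal weights, sigma = 1 and mu = 0 for every mixture.
z = [0.0] * (3 * N_MIXTURES)
pi, sigma, mu = split_outputs(z)
print("sum of pi:", sum(pi))
print("loss at y = 0:", mdn_loss(z, y=0.0))
```

With all-zero outputs every component is a standard normal, so the loss at y = 0 is just the negative log density of a standard normal at 0, that is 0.5 * log(2 * pi). The training loss would then be the average of `mdn_loss` over all (x, y) points, with z being the network output for each x.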

My understanding so far (which may be wrong):

An MDN is used for the case where multiple y values are given for a single x.
The MDN is trained so as to minimise the loss function over each point (x, y).

The accuracy of the result may also depend on the resolution chosen for the coordinates.