
Goodfellow book, Chapter 5: Machine Learning Basics

Deep Learning

Posted at 2018-05-01

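At work we decided to hold a weekly study group on the Japanese translation (by the Matsuo Lab) of Goodfellow's "Deep Learning". We often got stuck on the equation manipulations, so I wrote them out step by step.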
http://www.deeplearningbook.org/

This post starts from Section 5.6.1.

5.6.1 Maximum a Posteriori (MAP) Estimation

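\begin{eqnarray}
\theta_{MAP} &=& \arg\max_\theta p(\theta \mid \mathbf{x}) \\
  &=& \arg\max_\theta \frac{p(\mathbf{x} \mid \theta) p(\theta)}{p(\mathbf{x})} \\
  && \text{($p(\mathbf{x})$ does not depend on $\theta$)} \\
  &=& \arg\max_\theta p(\mathbf{x} \mid \theta) p(\theta) \\
  && \text{($\log$ is monotonically increasing, so it does not change the argmax)} \\
  &=& \arg\max_\theta \log p(\mathbf{x} \mid \theta) + \log p(\theta) \tag{5.79}
\end{eqnarray}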

As an example, consider a Gaussian prior on the weights w...

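\begin{eqnarray}
\mathcal N(\mathbf w; \mathbf 0, \frac{1}{\lambda} \mathbf{I}) &=& \left(\frac{\lambda}{2\pi}\right)^{n/2} \exp\left(-\frac{\lambda}{2} \mathbf w^T \mathbf w\right) \\
\log \mathcal N(\mathbf w; \mathbf 0, \frac{1}{\lambda} \mathbf{I}) &=& -\frac{\lambda}{2} \mathbf w^T \mathbf w + \frac{n}{2} \log \frac{\lambda}{2\pi}
\end{eqnarray}

Here n is the dimension of w. The second term does not depend on w, so up to that constant the log-prior is proportional to \lambda w^T w (with a negative sign): maximizing it is the same as minimizing the weight decay penalty.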
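To see the weight decay effect numerically, here is a minimal sketch. It assumes a one-dimensional linear model y = w x with unit-variance Gaussian noise; the data, the function names and the value of lam are all made up for illustration and are not from the book.

```python
def log_posterior(w, xs, ys, lam):
    """Unnormalized log posterior of a scalar weight w.

    Assumes y = w * x + unit-variance Gaussian noise and the prior N(w; 0, 1/lam).
    """
    log_likelihood = -0.5 * sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    log_prior = -0.5 * lam * w * w  # the weight decay term
    return log_likelihood + log_prior


def map_weight(xs, ys, lam):
    """Closed-form maximizer of log_posterior (1-D ridge regression)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                  # hypothetical data generated by y = 2x
w_ml = map_weight(xs, ys, lam=0.0)    # lam = 0 recovers maximum likelihood
w_map = map_weight(xs, ys, lam=14.0)  # the prior shrinks the weight toward 0
print(w_ml, w_map)
```

With a larger lam the prior is narrower, so the MAP weight is pulled further toward 0.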

Full Bayesian estimation and MAP estimation
https://qiita.com/jyori112/items/80b21422b15753a1c5a4

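Suppose there is one player who has won 199 of 298 games and another who has won 3 of 4 games. Estimating each player's win probability by MAP with β=2, the veteran gets

\theta = \frac{199+2-1}{298+2(2-1)} = \frac{2}{3}

and the newcomer gets

\theta = \frac{3+2-1}{4+2(2-1)} = \frac{2}{3}

so the two win probabilities come out the same. The veteran's value seems more trustworthy, yet under MAP estimation the two are indistinguishable. This is because taking the argmax keeps only the single most probable value and discards how much confidence that value deserves.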
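The example can be checked with a few lines of Python. This is only a sketch: it assumes the prior is Beta(β, β) on a Bernoulli win probability, which is what the formula above corresponds to, and the function names are hypothetical. The posterior variance is added to show the information that the argmax throws away.

```python
from fractions import Fraction


def map_estimate(wins, games, beta):
    """Mode of the Beta(wins + beta, games - wins + beta) posterior."""
    return Fraction(wins + beta - 1, games + 2 * (beta - 1))


def posterior_variance(wins, games, beta):
    """Variance of the same Beta posterior; smaller means more certain."""
    a = wins + beta
    b = games - wins + beta
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))


veteran = (199, 298)
rookie = (3, 4)

# Both MAP estimates are 2/3.
print(map_estimate(*veteran, beta=2), map_estimate(*rookie, beta=2))

# The posterior spread differs a lot, which MAP alone does not show.
print(float(posterior_variance(*veteran, beta=2)),
      float(posterior_variance(*rookie, beta=2)))
```

The full posterior keeps this spread, which is the practical difference between full Bayesian estimation and a MAP point estimate.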

? It comes at the price of increased bias

When computing θ_MAP, the prior enters through the log p(θ) term, so the estimate is pulled toward the prior and its bias increases.

? The case where regularization adds an extra term, corresponding to log p(θ), to the objective function

[Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473-493][1]
[1]:http://www.cs.toronto.edu/~hinton/absps/sunspots.pdf

5.7 Supervised Learning Algorithms

5.7.1

(5.80) I do not understand this one.

(5.81)

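p(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = \sigma(\mathbf{\theta}^T \mathbf{x})

This means that, given the input x and the parameters θ, the probability that y=1 is σ(θ^T x).
[gihyo.jp: Logistic regression][2]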
[2]:http://gihyo.jp/dev/serial/01/machine-learning/0018
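As a small illustration of (5.81), the following sketch evaluates σ(θ^T x) directly. The parameter and input values are hypothetical and chosen so that θ^T x = 0.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def prob_y1(theta, x):
    """p(y = 1 | x; theta) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


theta = [0.5, -1.0]  # hypothetical parameters
x = [2.0, 1.0]       # hypothetical input; theta^T x = 0
print(prob_y1(theta, x))        # 0.5
print(1.0 - prob_y1(theta, x))  # p(y = 0 | x; theta)
```

Because y is binary, the probability of y=0 is simply one minus this value.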

5.7.2 Support Vector Machines

[Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers.][3]
[3]:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3818&rep=rep1&type=pdf

5.7.3 Other Simple Supervised Learning Algorithms

? As the number of training examples goes to infinity, 1-nearest neighbor converges to twice the Bayes error
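One way to make sense of this claim is the classical asymptotic result of Cover and Hart for binary classification, which is an outside fact and not something derived in these notes. Writing η(x) = p(y=1 | x), the Bayes error is the average of min(η, 1-η) and the limiting 1-nearest-neighbor error is the average of 2η(1-η). The latter is at most twice the former and gets close to twice when the Bayes error is small. The η values below are hypothetical.

```python
def bayes_error(eta_values):
    """Average of min(eta, 1 - eta): the error of the optimal classifier."""
    return sum(min(e, 1 - e) for e in eta_values) / len(eta_values)


def asymptotic_1nn_error(eta_values):
    """Average of 2 * eta * (1 - eta): the limiting 1-NN error."""
    return sum(2 * e * (1 - e) for e in eta_values) / len(eta_values)


# Hypothetical values of eta(x) = p(y = 1 | x) at equally likely inputs x.
etas = [0.01, 0.05, 0.9, 0.99]
r_star = bayes_error(etas)
r_nn = asymptotic_1nn_error(etas)
print(r_star, r_nn, 2 * r_star)
```

The intuition is that, with infinite data, the nearest neighbor sits at the same x but carries an independently drawn noisy label, so the classifier errs whenever the two labels disagree.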

5.8 Unsupervised Learning Algorithms

[Barlow (1989). Unsupervised learning][4]
[4]:http://mlg.eng.cam.ac.uk/zoubin/course02/lect7mcmc.pdf

[Olshausen, B. A. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images][5]
[5]:http://www.cns.nyu.edu/~tony/vns/readings/olshausen-field-1996.pdf

[Hinton and Ghahramani (1997). Generative models for discovering sparse distributed representations][6]
[6]:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1692002/pdf/9304685.pdf

5.8.1 Principal Components Analysis

[Learning a lower-dimensional representation][7]
[7]:https://qiita.com/NoriakiOshita/items/460247bb57c22973a5f0

\mathrm{Var}[\mathbf{x}] = \frac{1}{m-1} \mathbf{X}^T \mathbf{X}

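Why divide by m-1: when estimating a variance or covariance from a sample, dividing by m-1 instead of m gives an unbiased estimate, because one degree of freedom is already spent on estimating the mean.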
http://heycere.com/statistics/covariance/
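The formula can be checked against the standard library. This sketch uses a hypothetical 4x2 design matrix and centers it first, because the formula assumes zero-mean data. `statistics.variance` divides by m-1 and `statistics.pvariance` divides by m.

```python
import statistics

# Hypothetical design matrix with m = 4 rows (examples) and 2 columns (features).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [6.0, 4.0]]
m = len(X)

# Center each column, since the formula assumes zero-mean data.
means = [sum(row[j] for row in X) / m for j in range(2)]
Xc = [[row[j] - means[j] for j in range(2)] for row in X]

# cov[i][j] = (Xc^T Xc)[i][j] / (m - 1)
cov = [[sum(r[i] * r[j] for r in Xc) / (m - 1) for j in range(2)]
       for i in range(2)]

col0 = [row[0] for row in X]
# The diagonal matches statistics.variance, which also divides by m - 1.
print(cov[0][0], statistics.variance(col0))
# Dividing by m instead gives a smaller value.
print(statistics.pvariance(col0))
```

The off-diagonal entry `cov[0][1]` is the sample covariance between the two features.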

Diagonal: the covariances (the off-diagonal entries) are all 0, so the matrix is sparse.

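\mathbf{X}^T \mathbf{X} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T \tag{5.86}

The left-hand side is a symmetric matrix, so it has an eigendecomposition with an orthogonal matrix W and a diagonal matrix Λ.

If we write the singular value decomposition of X as

\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{W}^T

then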

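\begin{eqnarray}
\mathbf{X}^T \mathbf{X} &=& (\mathbf{U} \mathbf{\Sigma} \mathbf{W}^T)^T \mathbf{U} \mathbf{\Sigma} \mathbf{W}^T \\
&=& \mathbf{W} \mathbf{\Sigma}^T \mathbf{U}^T \mathbf{U} \mathbf{\Sigma} \mathbf{W}^T \\
&& \text{($\mathbf{U}$ is orthogonal, so $\mathbf{U}^T = \mathbf{U}^{-1}$ and $\mathbf{U}^T \mathbf{U} = \mathbf{U}^{-1} \mathbf{U} = \mathbf{I}$)} \\
&=& \mathbf{W} \mathbf{\Sigma}^2 \mathbf{W}^T \tag{5.87}
\end{eqnarray}

Here Σ^2 is shorthand for Σ^T Σ, which is diagonal. Comparing with (5.86) gives Λ = Σ^2.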

(5.88)-(5.95): these follow in a straightforward way.
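As a sketch of the part skipped here, assume the representation is z = W^T x, so that Z = XW stacks the z vectors as rows. Using X^T X = W Σ^2 W^T from (5.87) together with W^T W = I, the covariance of z works out as follows. The last line is presumably the result referred to as (5.95).

```latex
\begin{eqnarray}
\mathrm{Var}[\mathbf{z}] &=& \frac{1}{m-1} \mathbf{Z}^T \mathbf{Z} \\
&=& \frac{1}{m-1} \mathbf{W}^T \mathbf{X}^T \mathbf{X} \mathbf{W} \\
&=& \frac{1}{m-1} \mathbf{W}^T \mathbf{W} \mathbf{\Sigma}^2 \mathbf{W}^T \mathbf{W} \\
&=& \frac{1}{m-1} \mathbf{\Sigma}^2
\end{eqnarray}
```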

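Since Σ^2 is a diagonal matrix, the off-diagonal (covariance) entries of the covariance matrix in (5.95) are 0, so the individual elements of the representation z are mutually uncorrelated.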
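The decorrelation can be observed on a tiny example. This is a sketch with a hypothetical 2-D data set. To stay within the standard library it does not compute an SVD. Instead it uses the closed-form rotation angle that diagonalizes a symmetric 2x2 matrix, which yields the same eigenvector matrix W in two dimensions.

```python
import math

# Hypothetical 2-D data set with m = 5 examples (rows).
X = [[2.0, 1.0], [3.0, 4.0], [5.0, 4.0], [6.0, 7.0], [9.0, 9.0]]
m = len(X)
means = [sum(r[j] for r in X) / m for j in range(2)]
Xc = [[r[0] - means[0], r[1] - means[1]] for r in X]  # centered data


def covariance(rows):
    """Sample covariance rows^T rows / (m - 1) of centered 2-D data."""
    n = len(rows)
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(2)]
            for i in range(2)]


C = covariance(Xc)

# For a symmetric 2x2 matrix the eigenvectors form a rotation by this angle.
angle = 0.5 * math.atan2(2 * C[0][1], C[0][0] - C[1][1])
W = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle), math.cos(angle)]]  # columns are the eigenvectors

# z = W^T x for every example, i.e. Z = Xc W.
Z = [[r[0] * W[0][j] + r[1] * W[1][j] for j in range(2)] for r in Xc]
Cz = covariance(Z)
print(C[0][1], Cz[0][1])  # off-diagonal covariance before and after
```

The total variance (the trace of the covariance matrix) is unchanged by the rotation; only the correlation between the components is removed.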
