
Maximum Likelihood Estimation, MAP Estimation

Posted at 2024-04-02

maximum likelihood estimation

Basic consideration


The label $y_n$ of sample $n$ is given by

$$y_n=f(\boldsymbol{x}_n ; \boldsymbol{w}) + \epsilon_n$$

where,
$$\epsilon_n \sim \mathcal{N}(0,\sigma_y^2)$$

Therefore, $y_n$ is distributed as

$$y_n \sim \mathcal{N}(f(\boldsymbol{x}_n ; \boldsymbol{w}), \sigma_y^2)$$
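
As a concrete illustration, here is a minimal NumPy sketch that draws samples from this model; the feature map `phi`, the weights `w_true`, and `sigma_y` are hypothetical example choices, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical example: f(x; w) = w^T phi(x) with polynomial features
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=-1)   # shape (N, 3)

w_true = np.array([1.0, -2.0, 0.5])    # assumed "true" weights (example only)
sigma_y = 0.3                          # observation noise std sigma_y

x = rng.uniform(-1.0, 1.0, size=100)                   # inputs x_n
y = phi(x) @ w_true + rng.normal(0.0, sigma_y, 100)    # y_n = f(x_n; w) + eps_n
```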


Now, suppose training data $\mathcal{D} = \lbrace \boldsymbol{X}, \boldsymbol{Y} \rbrace$ is given. The likelihood function of this model is
$$
p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w}) = \prod_{n=1}^{N} p(y_n |\boldsymbol{x}_n, \boldsymbol{w})
$$

$$
= \prod_{n=1}^{N} \mathcal{N}(y_n | f(\boldsymbol{x}_n ; \boldsymbol{w}), \sigma_y^2)
$$

$$
= \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}))^2}{2\sigma_y^2}\right)
$$

In maximum likelihood estimation, we seek the $\boldsymbol{w}_{ML}$ that maximizes the likelihood.

$$
\boldsymbol{w}_{ML} = \text{argmax} _{\boldsymbol{w}} \ \ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})}
$$

where,
$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \sum_{n=1}^{N}\lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2 + c
$$

When $f(\boldsymbol{x}_n ; \boldsymbol{w})=\boldsymbol{w}^T\phi{(\boldsymbol{x}_n)}$,

$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \sum_{n=1}^{N}\lbrace y_n - \boldsymbol{w}^T\phi{(\boldsymbol{x}_n)}\rbrace ^2 + c
$$

This formulation matches the squared-error function.
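
Concretely, maximizing this log-likelihood amounts to ordinary least squares on the design matrix whose rows are $\phi(\boldsymbol{x}_n)^T$; a minimal sketch, reusing the hypothetical `phi`, `x`, and `y` defined above:

```python
Phi = phi(x)                 # design matrix, rows phi(x_n)^T, shape (N, 3)

# w_ML minimizes the squared error, i.e. solves the normal equations Phi^T Phi w = Phi^T y
w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_ml)                  # should be close to w_true
```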

Gradient descent

The optimal weight $\boldsymbol{w}_{ML}$ is given by the $\boldsymbol{w}$ that maximizes the log-likelihood.

$$
\nabla_{\boldsymbol{w}}\ln{p(\boldsymbol{Y}|\boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \nabla_{\boldsymbol{w}} \sum_{n=1}^{N}\lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace^2
$$

$$
=-\sigma_y^{-2}\nabla_{\boldsymbol{w}}E(\boldsymbol{w})
$$

$$
E(\boldsymbol{w})= \frac{1}{2}\sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2
$$

Using the gradient descent method, the parameter $\boldsymbol{w}$ is determined iteratively.

$$
\boldsymbol{w}_{new} =
\boldsymbol{w}_{old} - \alpha \sigma_y^{-2} \nabla _{\boldsymbol{w}} \ E(\boldsymbol{w})\big| _{\boldsymbol{w}=\boldsymbol{w} _{old}}
$$

where $\alpha$ is the learning rate.
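
A minimal sketch of this update rule for the linear-in-features model, continuing the least-squares example above; the learning rate and iteration count are arbitrary example values:

```python
def grad_E(w, Phi, y):
    # gradient of E(w) = 1/2 * sum_n (y_n - w^T phi(x_n))^2
    return -Phi.T @ (y - Phi @ w)

alpha = 1e-3                     # learning rate (arbitrary example value)
w = np.zeros(Phi.shape[1])       # initial weights

for _ in range(5000):
    w = w - alpha * sigma_y**-2 * grad_E(w, Phi, y)

print(w)                         # approaches w_ml
```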



maximum a posteriori estimation

Basic consideration

$$
p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X}) = \frac{p(\boldsymbol{Y} | \boldsymbol{X,w})p(\boldsymbol{w})}{p(\boldsymbol{Y}| \boldsymbol{X})}
$$

$$
\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} = \ln{p(\boldsymbol{Y} | \boldsymbol{X,w})} + \ln{p(\boldsymbol{w})} + c
$$

$$
= - \frac{1}{2} \sigma_y^{-2} \sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}) \rbrace ^2
-\frac{1}{2}\sigma _{\boldsymbol{w}}^{-2}\boldsymbol{w}^T\boldsymbol{w} + c
$$

where,

$$
p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} | \boldsymbol{0}, \sigma _{\boldsymbol{w}}^2\boldsymbol{I})
$$

In MAP estimation, we seek the $\boldsymbol{w}$ that maximizes $\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$.

$\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$ can also be written as

$$
\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} = -\sigma_y^{-2} \lbrace \frac{1}{2} \sum_{n=1}^N \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}) \rbrace ^2 +
\frac{\sigma _{\boldsymbol{w}}^{-2}}{\sigma_y^{-2}}\frac{1}{2}\boldsymbol{w}^T\boldsymbol{w}
\rbrace + c
$$

$$
-\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} \propto E(\boldsymbol{w}) +
\lambda \Omega _{L2}(\boldsymbol{w})
$$

where,
$$
E(\boldsymbol{w})= \frac{1}{2}\sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2
$$

$$
\Omega_{L2}(\boldsymbol{w}) = \frac{1}{2}\boldsymbol{w}^T\boldsymbol{w}
$$

$$
\lambda = \frac{\sigma_{\boldsymbol{w}}^{-2}}{\sigma_y^{-2}}
$$
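
For the linear-in-features model, minimizing $E(\boldsymbol{w}) + \lambda \Omega_{L2}(\boldsymbol{w})$ has the ridge-regression closed form $(\Phi^T\Phi + \lambda\boldsymbol{I})^{-1}\Phi^T\boldsymbol{y}$; a minimal sketch continuing the example above, where `sigma_w` is an assumed prior scale:

```python
sigma_w = 1.0                          # assumed prior std of w (example only)
lam = sigma_w**-2 / sigma_y**-2        # lambda = sigma_w^{-2} / sigma_y^{-2}

# MAP (ridge) solution: (Phi^T Phi + lambda I) w = Phi^T y
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(w_map)                           # shrunk toward zero compared with w_ml
```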

When we use a Laplace distribution as the prior, as follows, $\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$ contains an $L1$ regularization term instead.

$$
Lap(w|\mu,b) = \frac{1}{2b}\exp(-\frac{|w-\mu|}{b})
$$



Error Functions

Bernoulli distribution gives binary cross entropy loss

$$
y_n \sim Bern(\mu_n)
$$

$$
y_n \in \lbrace 0, 1 \rbrace
$$

In this case, the likelihood is

$$
p(\boldsymbol{Y}|\boldsymbol{\mu}) = \prod_{n=1}^N \mu_n^{y_n}(1- \mu_n)^{1-y_n}
$$

When $\mu_n = Sig(\eta_n)$ and $\eta_n = \boldsymbol{w}^T\phi{(\boldsymbol{x}_n)}$, this model is called the logistic regression model, and the log-likelihood becomes

$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})}
= \sum_{n=1}^N \lbrace y_n\ln{\mu_n} + (1-y_n)\ln{(1-\mu_n)} \rbrace
$$
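
The negative of this log-likelihood is the binary cross-entropy loss. A minimal self-contained sketch; the features, labels, and weights below are hypothetical example values:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def binary_cross_entropy(w, Phi, y):
    # mu_n = Sig(w^T phi(x_n)); BCE is the negative Bernoulli log-likelihood
    mu = sigmoid(Phi @ w)
    return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

# hypothetical example values
rng = np.random.default_rng(1)
Phi = rng.normal(size=(8, 3))          # feature vectors phi(x_n)
y = rng.integers(0, 2, size=8)         # binary labels y_n in {0, 1}
w = rng.normal(size=3)

print(binary_cross_entropy(w, Phi, y))
```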


Categorical distribution gives categorical cross entropy loss

$y_{n,d}$ satisfies

$$
\boldsymbol{y}_n \in \lbrace 0, 1 \rbrace^D
$$

$$
\sum_{d=1}^D y_{n,d} = 1
$$

$\boldsymbol{y}_{n}$ follows a categorical distribution and is a one-hot vector such as $\boldsymbol{y}_n = (0, 0, 1, 0)^T$.

$$
p(\boldsymbol{y} _n | \boldsymbol{\mu}) = \prod _{d=1}^D \mu _{n,d}^{y _{n,d}}
$$



When $\boldsymbol{\pi}_n$ is given by the softmax function,
$$
\pi _{n,d} = \frac{\exp(\eta _{n,d} )}{\sum _{d'=1}^D \exp(\eta _{n,d'})}
$$

$$
\boldsymbol{\eta}_n \sim \mathcal{N}(f(\boldsymbol{x}_n ; \boldsymbol{W}), \sigma _{\eta}^2\boldsymbol{I})
$$

$\boldsymbol{y}_n$ is described as

$$
\boldsymbol{y}_n \sim Cat(\boldsymbol{\pi}_n)
$$

In this case, the log-likelihood is

$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{W})} = \sum_{n=1}^N \ln{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{W})}
$$

$$
= \sum_{n=1}^N \sum_{d=1}^D y_{n,d} \ln{\pi_{n,d}}
$$
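
The negative of this log-likelihood is the categorical cross-entropy loss. A minimal self-contained sketch; the logits $\eta_n$ and one-hot labels below are hypothetical example values:

```python
import numpy as np

def softmax(eta):
    # subtract the row-wise max for numerical stability
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(eta, Y):
    # negative of sum_n sum_d y_{n,d} ln pi_{n,d}
    return -np.sum(Y * np.log(softmax(eta)))

# hypothetical example: N = 3 samples, D = 4 classes
eta = np.array([[ 2.0, 0.1, -1.0, 0.3],
                [ 0.0, 1.5,  0.2, 0.1],
                [-0.5, 0.0,  0.7, 2.2]])   # eta_n = f(x_n; W)
Y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])               # one-hot labels y_n

print(categorical_cross_entropy(eta, Y))
```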
