Maximum likelihood estimation
Basic consideration
The label $y_n$ of sample $n$ is given by
$$y_n=f(\boldsymbol{x}_n ; \boldsymbol{w}) + \epsilon_n$$
where,
$$\epsilon_n \sim \mathcal{N}(0,\sigma_y^2)$$
Therefore, $y_n$ is distributed as
$$y_n \sim \mathcal{N}(f(\boldsymbol{x}_n ; \boldsymbol{w}), \sigma_y^2)$$
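As a concrete illustration, here is a minimal sketch of this data-generating model in NumPy. The basis `phi`, the weights `w_true`, and the noise scale are assumed values for illustration, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Polynomial basis phi(x) = (1, x, x^2)^T, an assumed choice."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

w_true = np.array([0.5, -1.0, 2.0])  # assumed true weights
sigma_y = 0.3                        # assumed observation noise std

x = rng.uniform(-1.0, 1.0, size=100)
# y_n = f(x_n; w) + eps_n, i.e. y_n ~ N(f(x_n; w), sigma_y^2)
y = phi(x) @ w_true + rng.normal(0.0, sigma_y, size=x.shape)
```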
Now, suppose the training data $\mathcal{D} = \lbrace \boldsymbol{X}, \boldsymbol{Y} \rbrace $ is given. The likelihood function of this model is
$$
p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w}) = \prod_{n=1}^{N} p(y_n |\boldsymbol{x}_n, \boldsymbol{w})
$$
$$
= \prod_{n=1}^{N} \mathcal{N}(y_n | f(\boldsymbol{x}_n ; \boldsymbol{w}), \sigma_y^2)
$$
$$
= \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}))^2}{2\sigma_y^2}\right)
$$
In maximum likelihood estimation, we seek the $\boldsymbol{w}_{ML}$ that maximizes the likelihood:
$$
\boldsymbol{w}_{ML} = \text{argmax} _{\boldsymbol{w}} \ \ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})}
$$
where,
$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \sum_{n=1}^{N}\lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2 + c
$$
When $f(\boldsymbol{x}_n ; \boldsymbol{w})=\boldsymbol{w}^T\phi{(\boldsymbol{x}_n)}$,
$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \sum_{n=1}^{N}\lbrace y_n - \boldsymbol{w}^T\phi{(\boldsymbol{x}_n)}\rbrace ^2 + c
$$
This formulation matches the sum-of-squares error function.
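A quick numerical check of this equivalence, reusing `phi`, `x`, `y`, and `sigma_y` from the sketch above: the log-likelihood equals the negative squared error scaled by $\sigma_y^{-2}$, plus a constant.

```python
import numpy as np

def log_likelihood(w, x, y, sigma_y):
    r = y - phi(x) @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_y**2) - r**2 / (2 * sigma_y**2))

def squared_error(w, x, y):
    # E(w) = 1/2 * sum_n {y_n - f(x_n; w)}^2
    r = y - phi(x) @ w
    return 0.5 * np.sum(r**2)

w = np.array([0.1, 0.2, 0.3])                       # arbitrary test point
c = -0.5 * len(x) * np.log(2 * np.pi * sigma_y**2)  # the constant term
assert np.isclose(log_likelihood(w, x, y, sigma_y),
                  -squared_error(w, x, y) / sigma_y**2 + c)
```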
Gradient descent
The optimal weight $\boldsymbol{w}_{ML}$ is obtained by finding the $\boldsymbol{w}$ that maximizes the log-likelihood.
$$
\nabla_{\boldsymbol{w}}\ln{p(\boldsymbol{Y}|\boldsymbol{X}, \boldsymbol{w})} = -\frac{1}{2}\sigma_y^{-2} \nabla_{\boldsymbol{w}} \sum_{n=1}^{N}\lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace^2
$$
$$
=-\sigma_y^{-2}\nabla_{\boldsymbol{w}}E(\boldsymbol{w})
$$
where,
$$
E(\boldsymbol{w})= \frac{1}{2}\sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2
$$
Using the gradient descent method, the parameter $\boldsymbol{w}$ is determined iteratively:
$$
\boldsymbol{w}_{new} =
\boldsymbol{w}_{old} - \alpha \sigma_y^{-2} \nabla _{\boldsymbol{w}} \ E(\boldsymbol{w})| _{\boldsymbol{w}=\boldsymbol{w} _{old}}
$$
where $\alpha$ is the learning rate.
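A minimal sketch of this update for the linear model $f(\boldsymbol{x};\boldsymbol{w})=\boldsymbol{w}^T\phi(\boldsymbol{x})$, reusing `phi`, `x`, `y`, and `sigma_y` from the sketches above; the learning rate and iteration count are assumed values.

```python
import numpy as np

Phi = phi(x)                   # design matrix, shape (N, 3)
w = np.zeros(Phi.shape[1])     # initial weights
alpha = 1e-4                   # assumed learning rate

for _ in range(5000):
    grad_E = Phi.T @ (Phi @ w - y)         # nabla_w E(w) for the linear model
    w = w - alpha / sigma_y**2 * grad_E    # w_new = w_old - alpha * sigma_y^{-2} * nabla E
print(w)  # approaches w_true as the iteration converges
```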
Maximum a posteriori estimation
Basic consideration
By Bayes' theorem, the posterior over $\boldsymbol{w}$ is
$$
p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X}) = \frac{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})p(\boldsymbol{w})}{p(\boldsymbol{Y}| \boldsymbol{X})}
$$
$$
\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} = \ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})} + \ln{p(\boldsymbol{w})} + c
$$
$$
= - \frac{1}{2} \sigma_y^{-2} \sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}) \rbrace ^2
-\frac{1}{2}\sigma _{\boldsymbol{w}}^{-2}\boldsymbol{w}^T\boldsymbol{w} + c
$$
where,
$$
\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \sigma_{\boldsymbol{w}}^2\boldsymbol{I})
$$
In MAP estimation, we seek the $\boldsymbol{w}$ that maximizes $\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$.
$\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$ can also be written as
$$
\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} = -\sigma_y^{-2} \left\lbrace \frac{1}{2} \sum_{n=1}^N \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w}) \rbrace^2 +
\frac{\sigma _{\boldsymbol{w}}^{-2}}{\sigma_y^{-2}}\frac{1}{2}\boldsymbol{w}^T\boldsymbol{w}
\right\rbrace + c
$$
$$
-\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})} \propto E(\boldsymbol{w}) +
\lambda \Omega _{L2}(\boldsymbol{w})
$$
where,
$$
E(\boldsymbol{w})= \frac{1}{2}\sum_{n=1}^{N} \lbrace y_n - f(\boldsymbol{x}_n ; \boldsymbol{w})\rbrace ^2
$$
$$
\Omega_{L2}(\boldsymbol{w}) = \frac{1}{2}\boldsymbol{w}^T\boldsymbol{w}
$$
$$
\lambda = \frac{\sigma_{\boldsymbol{w}}^{-2}}{\sigma_y^{-2}}
$$
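For the linear model, maximizing this posterior is exactly ridge regression, and the MAP weights have a closed form. A sketch reusing the earlier setup; `sigma_w` is an assumed prior scale:

```python
import numpy as np

sigma_w = 1.0                          # assumed prior std of w
lam = (sigma_w**-2) / (sigma_y**-2)    # lambda = sigma_w^{-2} / sigma_y^{-2}

Phi = phi(x)
# Minimizer of E(w) + lam * Omega_L2(w), in closed form:
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(w_map)
```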
When we use a Laplace distribution as the prior, as follows, $\ln{p(\boldsymbol{w}|\boldsymbol{Y}, \boldsymbol{X})}$ contains an $L1$ regularization term.
$$
Lap(w|\mu,b) = \frac{1}{2b}\exp(-\frac{|w-\mu|}{b})
$$
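A short numerical check that the Laplace log-prior contributes an $L1$ term: up to an additive constant, the negative log-density of $Lap(w|0,b)$ summed over components is the $L1$ norm scaled by $1/b$. The scale `b` here is an assumed value.

```python
import numpy as np

def log_laplace_prior(w, b=1.0, mu=0.0):
    # sum_i ln Lap(w_i | mu, b)
    return np.sum(-np.log(2 * b) - np.abs(w - mu) / b)

w = np.array([0.5, -1.0, 2.0])
# -log prior = L1 norm / b + constant:
assert np.isclose(-log_laplace_prior(w, b=1.0),
                  np.sum(np.abs(w)) + len(w) * np.log(2.0))
```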
Error Functions
Bernoulli distribution gives binary cross entropy loss
$$
y_n \sim Bern(\mu_n)
$$
$$
y_n \in \lbrace 0, 1 \rbrace
$$
In this case, likelihood is
$$
p(\boldsymbol{Y}|\boldsymbol{\mu}) = \prod_{n=1}^N \mu_n^{y_n}(1- \mu_n)^{1-y_n}
$$
When $\mu_n = Sig(\eta_n)$ and $\eta_n = \boldsymbol{w}^T\phi{(\boldsymbol{x}_n)} $, this model is called the logistic regression model. The log-likelihood is
$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{w})}
= \sum_{n=1}^N y_n\ln{\mu_n} + (1-y_n)\ln{(1-\mu_n)}
$$
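The negative of this log-likelihood is the binary cross-entropy. A minimal sketch; `sigmoid` and the toy labels and logits are illustrative values, not from the text.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def binary_cross_entropy(y, eta):
    mu = sigmoid(eta)  # mu_n = Sig(eta_n)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

y = np.array([1.0, 0.0, 1.0, 1.0])     # binary labels y_n
eta = np.array([2.0, -1.0, 0.5, 3.0])  # eta_n = w^T phi(x_n), toy values
print(binary_cross_entropy(y, eta))
```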
Categorical distribution gives categorical cross entropy loss
$y_{n,d}$ satisfies
$$
\boldsymbol{y}_n \in \lbrace 0, 1 \rbrace^D
$$
$$
\sum_{d=1}^D y_{n,d} = 1
$$
$\boldsymbol{y}_{n}$ is a one-hot vector such as $\boldsymbol{y}_n = (0, 0, 1, 0)^T$, and its distribution is
$$
p(\boldsymbol{y} _n | \boldsymbol{\mu}) = \prod _{d=1}^D \mu _{n,d}^{y _{n,d}}
$$
When $\boldsymbol{\pi}_n$ is given by the softmax function,
$$
\pi _{n,d} = \frac{\exp(\eta _{n,d} )}{\sum _{d'=1}^D \exp(\eta _{n,d'})}
$$
$$
\boldsymbol{\eta}_n \sim \mathcal{N}(f(\boldsymbol{x}_n ; \boldsymbol{w}), \sigma _{\eta}^2\boldsymbol{I})
$$
$\boldsymbol{y}_n$ is described as
$$
\boldsymbol{y}_n \sim Cat(\boldsymbol{\pi}_n)
$$
In this case, the log-likelihood is
$$
\ln{p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{W})} = \sum_{n=1}^N \ln{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{W})}
$$
$$
= \sum_{n=1}^N \sum_{d=1}^D y_{n,d} \ln{\pi_{n,d}}
$$
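Likewise, the negative of this log-likelihood is the categorical cross-entropy. A minimal sketch with toy one-hot labels and logits (all values illustrative):

```python
import numpy as np

def softmax(eta):
    e = np.exp(eta - eta.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(Y, Eta):
    Pi = softmax(Eta)  # pi_{n,d}
    return -np.sum(Y * np.log(Pi))

Y = np.array([[0, 0, 1, 0],  # one-hot labels y_n
              [1, 0, 0, 0]])
Eta = np.array([[0.1, -0.3, 2.0, 0.0],
                [1.5, 0.2, -1.0, 0.3]])
print(categorical_cross_entropy(Y, Eta))
```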