# Anomaly Detection
## Algorithm (based on Gaussian Distribution)
1. Choose features $x_i$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1,\dots,\mu_n,\ \sigma_1^2,\dots,\sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)},\qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m\bigl(x_j^{(i)}-\mu_j\bigr)^2$$
3. Given a new example $x$, compute $p(x)$:
$$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\Bigl(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\Bigr)$$

Anomaly if $p(x)<\varepsilon$
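The steps above can be sketched in numpy (the training data and threshold below are made up for illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Fit per-feature mean and variance (X has shape (m, n))."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)   # 1/m, matching the formula above
    return mu, sigma2

def p(X, mu, sigma2):
    """Product of the univariate Gaussian densities for each row of X."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    dens = coef * np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return dens.prod(axis=1)

# Toy "normal" data clustered around (1, 2)
X_train = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.05]])
mu, sigma2 = fit_gaussian(X_train)
epsilon = 1e-3
# A point far from the training data gets a tiny density, so it is flagged
anomalous = p(np.array([[5.0, -3.0]]), mu, sigma2) < epsilon
```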
## Developing and Evaluating
Assume we have some labeled data of anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).
Training set (60%): $x^{(1)},x^{(2)},\dots,x^{(m)}$ (assumed normal / not anomalous)
Cross validation set (20%): $(x_{cv}^{(1)},y_{cv}^{(1)}),\dots,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$
Test set (20%): $(x_{test}^{(1)},y_{test}^{(1)}),\dots,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$
### Aircraft engines example
10000 good (normal) engines, 20 flawed engines (anomalous)
Training set: 6000 good engines
CV set: 2000 good engines, 10 anomalous
Test set: 2000 good engines, 10 anomalous
#### Evaluation
Fit model p(x) on the training set
On the cross validation set, predict:
- $y=1$ if $p(x) < \varepsilon$ (anomaly)
- $y=0$ if $p(x) \ge \varepsilon$ (normal)
Possible evaluation metrics:
- Precision/Recall
- F1 Score
The cross validation set can also be used to choose the parameter $\varepsilon$.
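A sketch of choosing $\varepsilon$ by scanning thresholds on the CV densities and keeping the best F1 (the `select_epsilon` name and the toy CV data are illustrative):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # y=1 means anomaly
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                             # precision/recall undefined
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1

p_cv = np.array([0.9, 0.8, 0.7, 0.01, 0.02])     # made-up densities on the CV set
y_cv = np.array([0, 0, 0, 1, 1])
eps, f1 = select_epsilon(p_cv, y_cv)             # here the classes separate cleanly
```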
## Anomaly Detection vs. Supervised Learning
Anomaly Detection | Supervised Learning |
---|---|
Very small number of positive examples ($y=1$); 0–20 is common. Large number of negative examples ($y=0$). | Large number of positive and negative examples. |
Many different "types" of anomalies; hard for any algorithm to learn from the positive examples what anomalies look like, and future anomalies may look nothing like any anomalous example seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to those in the training set. |
## Non-Gaussian Features
Transform the data so it looks more Gaussian: $\log(x),\ \log(x+c),\ x^{1/2},\ x^{1/3},\dots$
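For example (toy right-skewed data; the constant $c$ just has to keep $x + c > 0$):

```python
import numpy as np

# Hypothetical right-skewed feature values
x = np.array([0.1, 0.5, 2.0, 40.0, 300.0])

x_log = np.log(x + 1.0)       # log(x + c) with c = 1
x_sqrt = x ** 0.5             # square root
x_cbrt = x ** (1.0 / 3.0)     # cube root; all three compress the long right tail
```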
## Multivariate Gaussian Distribution (Optional)
Purpose: capture correlations between features, so that unusual *combinations* of individually normal feature values can be flagged.
### Anomaly detection with the multivariate Gaussian
1. Fit model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)},\qquad \Sigma = \frac{1}{m}\sum_{i=1}^m\bigl(x^{(i)}-\mu\bigr)\bigl(x^{(i)}-\mu\bigr)^T$$
$(\mu \in R^n,\ \Sigma \in R^{n\times n})$
2. Given a new example $x$, compute
$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Bigl(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Bigr)$$
$(|\Sigma| = \text{the determinant of }\Sigma)$
3. Flag an anomaly if $p(x) < \varepsilon$
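These steps in numpy (the training points are made up; `scipy.stats.multivariate_normal` would also work):

```python
import numpy as np

def fit_multivariate(X):
    """mu and Sigma from the formulas above (X has shape (m, n))."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]           # 1/m, as in the notes
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density for each row of X."""
    n = mu.size
    diff = X - mu
    expo = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(expo)

# Toy data where x2 tracks 2*x1; a point breaking that correlation scores low
X_train = np.array([[1.0, 2.0], [1.2, 2.4], [0.8, 1.7], [1.1, 2.1], [0.9, 1.9]])
mu, Sigma = fit_multivariate(X_train)
scores = p_multivariate(np.array([[1.0, 2.0], [2.0, 0.5]]), mu, Sigma)
```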
### Relationship with original model
- $\Sigma$ measures the correlations between features. If $\Sigma$ is diagonal (all off-diagonal entries are 0), the multivariate model is identical to the original Gaussian model.
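A quick numerical check of this claim, using synthetic independent features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])   # independent features

mu = X.mean(axis=0)
sigma2 = X.var(axis=0)              # per-feature variances (1/m)
Sigma = np.diag(sigma2)             # diagonal covariance matrix

x = np.array([0.3, -1.0, 0.2])      # arbitrary query point

# Original model: product of univariate Gaussian densities
p_orig = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

# Multivariate Gaussian with the diagonal Sigma
n = 3
diff = x - mu
p_multi = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
# The two densities agree whenever Sigma is diagonal
```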
### Original Model vs. Multivariate Gaussian

Original Model | Multivariate Gaussian |
---|---|
Manually create features to capture unusual combinations (e.g. $x_3 = x_1/x_2$). | Automatically captures correlations between features. |
Computationally cheap; scales to very large $n$. | Computationally more expensive (inverts the $n\times n$ matrix $\Sigma$). |
OK even if $m$ is small. | Must have $m > n$ (rule of thumb: $m \ge 10n$), or $\Sigma$ is non-invertible. |
# Recommender Systems
$r(i,j)=1$ if user j has rated movie i (0 otherwise)
$y^{(i,j)}=$ rating by user j on movie i (if defined)
$\theta^{(j)}=$ parameter vector for user j
$x^{(i)}=$ feature vector for movie i
For user j, movie i, predicted rating: $(\theta^{(j)})^T(x^{(i)})$
$n_u=$ no. of users
$n_m=$ no. of movies
$m^{(j)}=$ no. of movies rated by user j
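With this notation, a predicted rating is just a dot product (the vectors below are hypothetical learned values, including the intercept feature $x_0 = 1$):

```python
import numpy as np

x_i = np.array([1.0, 0.9, 0.1])       # movie i: x_0 = 1, mostly "romance"
theta_j = np.array([0.0, 5.0, 0.0])   # user j: likes romance, ignores the rest

predicted_rating = theta_j @ x_i      # (theta^(j))^T x^(i)  → 4.5
```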
## Content-based Recommendations
We already have features describing the content of each movie ($x^{(i)}$); we are trying to learn each user's preferences ($\theta^{(j)}$).
### Optimization algorithm
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\bigl(\theta_k^{(j)}\bigr)^2$$
$(\theta^{(j)} \in R^{n+1})$
Gradient descent update (for every $j$ and $k$; $\theta_0^{(j)}$ is not regularized):
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)} \quad (k=0)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\Bigl(\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)} + \lambda\theta_k^{(j)}\Bigr) \quad (k\neq 0)$$
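A minimal per-user gradient-descent sketch of this objective (the data, function name, and hyperparameters are illustrative):

```python
import numpy as np

def fit_user_theta(X, y, rated, lam=1.0, alpha=0.01, iters=2000):
    """Gradient descent on the content-based objective for a single user.
    X: (n_m, n+1) movie features with X[:, 0] == 1 (intercept),
    y: (n_m,) ratings, rated: boolean mask of movies this user rated."""
    theta = np.zeros(X.shape[1])
    Xr, yr = X[rated], y[rated]
    for _ in range(iters):
        err = Xr @ theta - yr           # (theta^T x - y) over rated movies
        grad = Xr.T @ err
        grad[1:] += lam * theta[1:]     # do not regularize theta_0
        theta -= alpha * grad
    return theta

# Toy noiseless data: ratings follow rating = 1 + 4 * feature exactly
X = np.array([[1, 1.0], [1, 0.0], [1, 0.5]])   # intercept + one feature per movie
y = np.array([5.0, 1.0, 3.0])
theta = fit_user_theta(X, y, np.array([True, True, True]),
                       lam=0.0, alpha=0.1, iters=5000)
```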
## Collaborative Filtering
We know neither the content features nor the users' preferences; instead we learn both from the ratings:
- $x^{(i)} \in R^n$ (movie features)
- $\theta^{(j)} \in R^n$ (user parameters)

In this formulation we don't need the intercept terms $x_0$ and $\theta_0$.
### Collaborative filtering algorithm
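The standard formulation combines the two per-side objectives into one, minimized over the features and the parameters simultaneously; a sketch:

$$\min_{x^{(1)},\dots,x^{(n_m)},\,\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2}\sum_{(i,j):r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\bigl(x_k^{(i)}\bigr)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\bigl(\theta_k^{(j)}\bigr)^2$$

Initialize all $x^{(i)}$ and $\theta^{(j)}$ to small random values (to break symmetry), then run gradient descent over both sets of variables at once.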
#### Vectorization: Low rank matrix factorization
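With all movie features stacked as rows of a matrix $X$ and all user parameters as rows of $\Theta$, every predicted rating is one matrix product $X\Theta^T$ (hence "low rank"); toy numbers:

```python
import numpy as np

# Hypothetical small example: n_m = 4 movies, n_u = 3 users, n = 2 features
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.1, 1.0],
              [0.0, 0.9]])            # rows: x^(i)
Theta = np.array([[5.0, 0.0],
                  [0.0, 5.0],
                  [2.5, 2.5]])        # rows: theta^(j)

# All predicted ratings at once: entry (i, j) is (theta^(j))^T x^(i)
Y_pred = X @ Theta.T                  # shape (n_m, n_u)
```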
#### Find related products (movies)
For each product i, we learn a feature vector $x^{(i)} \in R^n$.
Find the products (movies) j with the smallest $||x^{(i)}-x^{(j)}||$.
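A sketch of the nearest-feature-vector lookup (toy data; `most_related` is a hypothetical helper):

```python
import numpy as np

def most_related(X, i, k=2):
    """Indices of the k movies whose feature vectors are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                          # exclude the movie itself
    return np.argsort(dists)[:k]

X = np.array([[1.0, 0.0],
              [0.95, 0.05],
              [0.0, 1.0],
              [0.1, 0.9]])                     # learned movie features (made up)
most_related(X, 0, k=1)                        # → array([1])
```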
## Mean Normalization
For a user who has rated no movies, the regularized objective drives $\theta^{(j)}$ to all zeros, so every predicted rating $(\theta^{(j)})^Tx^{(i)}$ is 0. Fix: subtract each movie's mean rating $\mu_i$ from $Y$ before learning, then predict $(\theta^{(j)})^Tx^{(i)} + \mu_i$, so a brand-new user defaults to each movie's average rating.
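A sketch with `nan` marking unrated entries (toy ratings matrix):

```python
import numpy as np

# Y[i, j] = rating of movie i by user j; np.nan where r(i, j) = 0
Y = np.array([[5.0,    5.0,    np.nan],
              [4.0,    np.nan, np.nan],
              [np.nan, 1.0,    np.nan]])   # user 2 has rated nothing

mu = np.nanmean(Y, axis=1)                 # per-movie mean over rated entries
Y_norm = Y - mu[:, None]                   # learn x and theta on these residuals

# With theta = 0 for the new user, the prediction falls back to the movie mean:
# (theta^(j))^T x^(i) + mu_i  =  mu_i
pred_new_user = 0.0 + mu                   # → array([5., 4., 1.])
```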