# Anomaly Detection
## Algorithm (based on Gaussian Distribution)
1. Choose features $x_i$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1,\dots,\mu_n,\ \sigma_1^2,\dots,\sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)},\qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m\bigl(x_j^{(i)}-\mu_j\bigr)^2$$
3. Given a new example $x$, compute $p(x)$:
$$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\Bigl(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\Bigr)$$

Anomaly if $p(x)<\varepsilon$
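The steps above can be sketched in numpy (the training data and threshold below are made up for illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Fit per-feature mean and variance (X has shape (m, n))."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)   # 1/m, matching the formula above
    return mu, sigma2

def p(X, mu, sigma2):
    """Product of the univariate Gaussian densities for each row of X."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    dens = coef * np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return dens.prod(axis=1)

# Toy "normal" data clustered around (1, 2)
X_train = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.05]])
mu, sigma2 = fit_gaussian(X_train)
epsilon = 1e-3
# A point far from the training data gets a tiny density, so it is flagged
anomalous = p(np.array([[5.0, -3.0]]), mu, sigma2) < epsilon
```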
## Developing and Evaluating
Assume we have some labeled data of anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).
Training set (60%): $x^{(1)},x^{(2)},\dots,x^{(m)}$ (assumed normal / not anomalous)
Cross validation set (20%): $(x_{cv}^{(1)},y_{cv}^{(1)}),\dots,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$
Test set (20%): $(x_{test}^{(1)},y_{test}^{(1)}),\dots,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$
### Aircraft engines example
10000 good (normal) engines, 20 flawed engines (anomalous)
Training set: 6000 good engines
CV set: 2000 good engines, 10 anomalous
Test set: 2000 good engines, 10 anomalous
#### Evaluation
Fit model p(x) on the training set
On the cross validation set, predict:
- $y=1$ if $p(x) < \varepsilon$ (anomaly)
- $y=0$ if $p(x) \ge \varepsilon$ (normal)
Possible evaluation metrics:
- Precision/Recall
- F1 Score
The cross validation set can also be used to choose the parameter $\varepsilon$.
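A sketch of choosing $\varepsilon$ by scanning thresholds on the CV densities and keeping the best F1 (the `select_epsilon` name and the toy CV data are illustrative):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # y=1 means anomaly
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                             # precision/recall undefined
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1

p_cv = np.array([0.9, 0.8, 0.7, 0.01, 0.02])     # made-up densities on the CV set
y_cv = np.array([0, 0, 0, 1, 1])
eps, f1 = select_epsilon(p_cv, y_cv)             # here the classes separate cleanly
```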
## Anomaly Detection vs. Supervised Learning
Anomaly Detection | Supervised Learning |
---|---|
Very small number of positive examples ($y=1$); 0–20 is common. Large number of negative examples ($y=0$). | Large number of positive and negative examples. |
Many different "types" of anomalies; hard for any algorithm to learn from the positive examples what anomalies look like, and future anomalies may look nothing like any anomalous example seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to those in the training set. |
## Non-Gaussian Features
Transform the data so it looks more Gaussian: $\log(x),\ \log(x+c),\ x^{1/2},\ x^{1/3},\dots$
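For example (toy right-skewed data; the constant $c$ just has to keep $x + c > 0$):

```python
import numpy as np

# Hypothetical right-skewed feature values
x = np.array([0.1, 0.5, 2.0, 40.0, 300.0])

x_log = np.log(x + 1.0)       # log(x + c) with c = 1
x_sqrt = x ** 0.5             # square root
x_cbrt = x ** (1.0 / 3.0)     # cube root; all three compress the long right tail
```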
## Multivariate Gaussian Distribution (Optional)
Purpose: capture correlations between features, so that unusual *combinations* of individually normal feature values can be flagged.
### Anomaly detection with the multivariate Gaussian
1. Fit model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)},\qquad \Sigma = \frac{1}{m}\sum_{i=1}^m\bigl(x^{(i)}-\mu\bigr)\bigl(x^{(i)}-\mu\bigr)^T$$
$(\mu \in R^n,\ \Sigma \in R^{n\times n})$
2. Given a new example $x$, compute
$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Bigl(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Bigr)$$
$(|\Sigma| = \text{the determinant of }\Sigma)$
3. Flag an anomaly if $p(x) < \varepsilon$
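These steps in numpy (the training points are made up; `scipy.stats.multivariate_normal` would also work):

```python
import numpy as np

def fit_multivariate(X):
    """mu and Sigma from the formulas above (X has shape (m, n))."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]           # 1/m, as in the notes
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density for each row of X."""
    n = mu.size
    diff = X - mu
    expo = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(expo)

# Toy data where x2 tracks 2*x1; a point breaking that correlation scores low
X_train = np.array([[1.0, 2.0], [1.2, 2.4], [0.8, 1.7], [1.1, 2.1], [0.9, 1.9]])
mu, Sigma = fit_multivariate(X_train)
scores = p_multivariate(np.array([[1.0, 2.0], [2.0, 0.5]]), mu, Sigma)
```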
### Relationship with original model
- $\Sigma$ measures the correlations between features. If $\Sigma$ is diagonal (all off-diagonal entries are 0), the multivariate model is identical to the original Gaussian model.
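A quick numerical check of this claim, using synthetic independent features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])   # independent features

mu = X.mean(axis=0)
sigma2 = X.var(axis=0)              # per-feature variances (1/m)
Sigma = np.diag(sigma2)             # diagonal covariance matrix

x = np.array([0.3, -1.0, 0.2])      # arbitrary query point

# Original model: product of univariate Gaussian densities
p_orig = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

# Multivariate Gaussian with the diagonal Sigma
n = 3
diff = x - mu
p_multi = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
# The two densities agree whenever Sigma is diagonal
```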
### Original Model vs. Multivariate Gaussian

Original Model | Multivariate Gaussian |
---|---|
Manually create features to capture unusual combinations (e.g. $x_3 = x_1/x_2$). | Automatically captures correlations between features. |
Computationally cheap; scales to very large $n$. | Computationally more expensive (inverts the $n\times n$ matrix $\Sigma$). |
OK even if $m$ is small. | Must have $m > n$ (rule of thumb: $m \ge 10n$), or $\Sigma$ is non-invertible. |
# Recommender Systems
$r(i,j)=1$ if user j has rated movie i (0 otherwise)
$y^{(i,j)}=$ rating by user j on movie i (if defined)
$\theta^{(j)}=$ parameter vector for user j
$x^{(i)}=$ feature vector for movie i
For user j, movie i, predicted rating: $(\theta^{(j)})^T(x^{(i)})$
$n_u=$ no. of users
$n_m=$ no. of movies
$m^{(j)}=$ no. of movies rated by user j
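With this notation, a predicted rating is just a dot product (the vectors below are hypothetical learned values, including the intercept feature $x_0 = 1$):

```python
import numpy as np

x_i = np.array([1.0, 0.9, 0.1])       # movie i: x_0 = 1, mostly "romance"
theta_j = np.array([0.0, 5.0, 0.0])   # user j: likes romance, ignores the rest

predicted_rating = theta_j @ x_i      # (theta^(j))^T x^(i)  → 4.5
```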
## Content-based Recommendations
We already have features describing the content of each movie ($x^{(i)}$); we are trying to learn each user's preferences ($\theta^{(j)}$).
### Optimization algorithm
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\bigl(\theta_k^{(j)}\bigr)^2$$
$(\theta^{(j)} \in R^{n+1})$
Gradient descent update (for every $j$ and $k$; $\theta_0^{(j)}$ is not regularized):
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)} \quad (k=0)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\Bigl(\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)} + \lambda\theta_k^{(j)}\Bigr) \quad (k\neq 0)$$
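A minimal per-user gradient-descent sketch of this objective (the data, function name, and hyperparameters are illustrative):

```python
import numpy as np

def fit_user_theta(X, y, rated, lam=1.0, alpha=0.01, iters=2000):
    """Gradient descent on the content-based objective for a single user.
    X: (n_m, n+1) movie features with X[:, 0] == 1 (intercept),
    y: (n_m,) ratings, rated: boolean mask of movies this user rated."""
    theta = np.zeros(X.shape[1])
    Xr, yr = X[rated], y[rated]
    for _ in range(iters):
        err = Xr @ theta - yr           # (theta^T x - y) over rated movies
        grad = Xr.T @ err
        grad[1:] += lam * theta[1:]     # do not regularize theta_0
        theta -= alpha * grad
    return theta

# Toy noiseless data: ratings follow rating = 1 + 4 * feature exactly
X = np.array([[1, 1.0], [1, 0.0], [1, 0.5]])   # intercept + one feature per movie
y = np.array([5.0, 1.0, 3.0])
theta = fit_user_theta(X, y, np.array([True, True, True]),
                       lam=0.0, alpha=0.1, iters=5000)
```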
## Collaborative Filtering
We know neither the content features nor the users' preferences; instead we learn both from the ratings:
- $x^{(i)} \in R^n$ (movie features)
- $\theta^{(j)} \in R^n$ (user parameters)

In this formulation we don't need the intercept terms $x_0$ and $\theta_0$.
### Collaborative filtering algorithm
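The standard formulation combines the two per-side objectives into one, minimized over the features and the parameters simultaneously; a sketch:

$$\min_{x^{(1)},\dots,x^{(n_m)},\,\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2}\sum_{(i,j):r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\bigl(x_k^{(i)}\bigr)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\bigl(\theta_k^{(j)}\bigr)^2$$

Initialize all $x^{(i)}$ and $\theta^{(j)}$ to small random values (to break symmetry), then run gradient descent over both sets of variables at once.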
#### Vectorization: Low rank matrix factorization
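With all movie features stacked as rows of a matrix $X$ and all user parameters as rows of $\Theta$, every predicted rating is one matrix product $X\Theta^T$ (hence "low rank"); toy numbers:

```python
import numpy as np

# Hypothetical small example: n_m = 4 movies, n_u = 3 users, n = 2 features
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.1, 1.0],
              [0.0, 0.9]])            # rows: x^(i)
Theta = np.array([[5.0, 0.0],
                  [0.0, 5.0],
                  [2.5, 2.5]])        # rows: theta^(j)

# All predicted ratings at once: entry (i, j) is (theta^(j))^T x^(i)
Y_pred = X @ Theta.T                  # shape (n_m, n_u)
```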
#### Find related products (movies)
For each product i, we learn a feature vector $x^{(i)} \in R^n$.
Find the products (movies) j with the smallest $||x^{(i)}-x^{(j)}||$.
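A sketch of the nearest-feature-vector lookup (toy data; `most_related` is a hypothetical helper):

```python
import numpy as np

def most_related(X, i, k=2):
    """Indices of the k movies whose feature vectors are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                          # exclude the movie itself
    return np.argsort(dists)[:k]

X = np.array([[1.0, 0.0],
              [0.95, 0.05],
              [0.0, 1.0],
              [0.1, 0.9]])                     # learned movie features (made up)
most_related(X, 0, k=1)                        # → array([1])
```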
## Mean Normalization
For a user who has rated no movies, the regularized objective drives $\theta^{(j)}$ to all zeros, so every predicted rating $(\theta^{(j)})^Tx^{(i)}$ is 0. Fix: subtract each movie's mean rating $\mu_i$ from $Y$ before learning, then predict $(\theta^{(j)})^Tx^{(i)} + \mu_i$, so a brand-new user defaults to each movie's average rating.
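A sketch with `nan` marking unrated entries (toy ratings matrix):

```python
import numpy as np

# Y[i, j] = rating of movie i by user j; np.nan where r(i, j) = 0
Y = np.array([[5.0,    5.0,    np.nan],
              [4.0,    np.nan, np.nan],
              [np.nan, 1.0,    np.nan]])   # user 2 has rated nothing

mu = np.nanmean(Y, axis=1)                 # per-movie mean over rated entries
Y_norm = Y - mu[:, None]                   # learn x and theta on these residuals

# With theta = 0 for the new user, the prediction falls back to the movie mean:
# (theta^(j))^T x^(i) + mu_i  =  mu_i
pred_new_user = 0.0 + mu                   # → array([5., 4., 1.])
```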