1. Introduction
Thank you for reading!
I have recently been studying generative shape modeling methods for point clouds.
As part of my study, I read the paper “VAE with VampPrior”.
I wanted to compare the performance of MoG-VAE with the standard VAE.
So, I wrote this article in English to practice and improve my English.
2. Differences between VAE and MoG-VAE
VAE (Variational Autoencoder)
In VAE, the prior distribution $ p(z) $ is a simple Gaussian distribution.
$$
p(z) = \mathcal{N}(0, I) \tag1
$$
- The latent space follows a standard normal distribution.
- This makes the latent space continuous and smooth but less flexible.
MoG-VAE (Mixture of Gaussians VAE)
In MoG-VAE, the prior distribution $ p(z) $ is a Mixture of Gaussians (MoG).
$$
p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z | \mu_k, \Sigma_k) \tag2
$$
- $ K $ is the number of Gaussian components.
- $ \pi_k $ are the mixture weights ($ \sum_{k=1}^{K} \pi_k = 1 $).
- $ \mu_k $ and $ \Sigma_k $ are the mean and covariance of each Gaussian component.
- This makes the latent space more flexible and diverse.
Key Difference in Latent Space
- VAE: Single Gaussian → Limited capacity to capture complex structures.
- MoG-VAE: Mixture of Gaussians → Better at capturing multimodal and complex latent distributions.
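To make this concrete, here is a small PyTorch sketch (my own illustration, not code from the VampPrior paper) that draws samples from the two kinds of prior; the component means and standard deviations are arbitrary placeholders.
```python
import torch

latent_dim, K, n_samples = 3, 2, 1000   # same latent_dim and K as in Section 5

# VAE prior: a single standard Gaussian N(0, I), Eq. (1)
z_vae = torch.randn(n_samples, latent_dim)

# MoG prior, Eq. (2): pick a component k ~ Categorical(pi), then z ~ N(mu_k, sigma_k^2 I)
pi = torch.softmax(torch.zeros(K), dim=0)      # mixture weights (uniform placeholder)
mu = torch.randn(K, latent_dim)                # component means (placeholders)
sigma = 0.5 * torch.ones(K, latent_dim)        # component standard deviations (placeholders)

k = torch.multinomial(pi, num_samples=n_samples, replacement=True)
z_mog = mu[k] + sigma[k] * torch.randn(n_samples, latent_dim)
```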
3. KL Divergence in VAE and MoG-VAE
In both VAE and MoG-VAE, the KL divergence term in the loss function measures how close the approximate posterior $ q(z|x) $ is to the prior distribution $ p(z) $. However, the calculation differs due to the difference in the prior distributions.
3-1. VAE (Standard Gaussian Prior)
The prior $ p(z) $ is a standard Gaussian distribution $ \mathcal{N}(0, I) $.
Thus, the KL divergence has a closed-form solution:
$$
\text{KL}(q(z|x) \parallel p(z)) = \frac{1}{2} \sum_{i=1}^{z} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right) \tag{3}
$$
- $ z $ is the latent dimension.
- $ \mu_i $ and $ \sigma_i $ are the mean and standard deviation of $ q(z|x) $.
- This term penalizes large deviations of $ q(z|x) $ from the standard Gaussian.
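As a sanity check of Equation (3), the closed-form KL term fits in a few lines of PyTorch. This is a generic sketch; `mu` and `logvar` stand for the encoder outputs.
```python
import torch

def kl_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, Eq. (3).

    mu, logvar: encoder outputs of shape (batch, latent_dim).
    Returns the KL divergence averaged over the batch.
    """
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1).mean()
```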
3-2. MoG-VAE (Mixture of Gaussians Prior)
For MoG-VAE, the prior $ p(z) $ is a Mixture of Gaussians (MoG):
$$
p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z | \mu_k, \Sigma_k) \tag4
$$
In this case, the KL divergence between a Gaussian posterior and a MoG prior has no exact closed form, so the KL term is approximated as:
$$
\text{KL}(q(z|x) \parallel p(z)) = \frac{1}{2} \sum_{k=1}^{K} \pi_k \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right) \tag5
$$
- $ \mu_k $ and $ \sigma_k^2 $ are the mean and variance of the $ k $-th Gaussian component.
- The mixture weights $ \pi_k $ are computed using the Softmax function:
$$
\pi_k = \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)} \tag6
$$
- Monte Carlo sampling can be used to approximate the KL divergence in cases where a direct analytical solution is not available.
- Softmax normalization ensures that the mixture weights $ \pi_k $ sum to 1.
- This approach makes the latent space more flexible and expressive compared to VAE, allowing the model to capture more complex data distributions.
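Below is a direct transcription of Equations (5) and (6) into PyTorch. It is only a sketch: the shapes of `mu_k` and `logvar_k`, and the name `alpha` for the unnormalized mixture logits, are my assumptions rather than details given above.
```python
import torch

def kl_mog_term(mu_k: torch.Tensor, logvar_k: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Weighted regularization term of Eq. (5).

    mu_k, logvar_k: per-component means / log-variances, shape (K, latent_dim).
    alpha: unnormalized mixture logits, shape (K,); softmax gives pi_k, Eq. (6).
    """
    pi = torch.softmax(alpha, dim=0)                                        # Eq. (6)
    per_component = 0.5 * (mu_k.pow(2) + logvar_k.exp() - logvar_k - 1.0).sum(dim=1)
    return (pi * per_component).sum()                                       # Eq. (5)

# When an analytical form is not available, a Monte Carlo estimate of
# E_q[log q(z|x) - log p(z)] over samples z ~ q(z|x) is the usual fallback.
```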
4. Dataset for Evaluation
- I used the chair category of the ModelNet dataset.
- The number of points in each point cloud is 5,000.
5. Training Settings
5-1. VAE Settings
- Latent dimension $ z $: 3
- Learning rate: 1.0e-5
- Optimizer: Adam
- Total loss: Equation (7)
$$
L_\text{total} = \frac{1}{N} \sum_{i=1}^{N} | x_i - x_i^\text{reconst} |^2 + \frac{1}{2} \sum_{i=1}^{z} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right) \tag{7}
$$
5-2. MoG-VAE Settings
- Number of Gaussian components $ K $: 2
- Latent dimension $ z $: 3
- Learning rate: 1.0e-5
- Optimizer: Adam
- Total loss: Equation (8)
$$
L_\text{total} = \frac{1}{N} \sum_{i=1}^{N} | x_i - x_i^\text{reconst} |^2 + \frac{1}{2} \sum_{k=1}^{K} \pi_k \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right) \tag8
$$
In this study, I chose a three-dimensional latent space because it is easier to interpret. In my experience, three latent variables are enough to generate shapes, so I prioritized interpretability over a higher-dimensional latent space.
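For reference, a single training step following Equation (7) might look like the sketch below (Equation (8) only swaps the KL term for `kl_mog_term` from Section 3-2). The names `encoder`, `decoder`, and `train_step` are placeholders, and `kl_standard_normal` is the helper from the sketch in Section 3-1.
```python
import torch

def train_step(encoder, decoder, optimizer, x):
    """One VAE training step with the total loss of Eq. (7)."""
    mu, logvar = encoder(x)                                   # parameters of q(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    x_reconst = decoder(z)
    reconst_loss = torch.mean((x - x_reconst).pow(2))         # MSE reconstruction term
    loss = reconst_loss + kl_standard_normal(mu, logvar)      # Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer settings from this section (both models):
# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
```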
6. Architecture Shared by Both VAEs
I used a PointNet-style encoder (point-wise convolution followed by max pooling) and a transposed-convolution decoder. The same architecture is used for both VAE and MoG-VAE.
6-1. Encoder
INPUT (N × 5000) → Conv1d (N × 64) → Conv1d (N × 128) → Conv1d (N × 1024)
→ Adaptive Max Pooling → Linear (1024 → 512) → Linear (512 → 256) → Linear (256 → 9)
→ Linear (9 → z) for μ (mean) and log σ² (log variance)
6-2. Decoder
INPUT (z) → Linear (z → 1024) → Linear (1024 → 512)
→ ConvTranspose1d (512 → 1024) → ConvTranspose1d (1024 → 2048)
→ Linear (2048 → 5000)
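For concreteness, here is one possible PyTorch reading of the layers listed above. Several details are my assumptions: I assume 3-channel (x, y, z) input of shape (batch, 3, 5000), two separate heads for μ and log σ², and a decoder that outputs a flattened point cloud, since the listing above only gives Linear (2048 → 5000).
```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """PointNet-style encoder: point-wise Conv1d + max pooling, then an MLP."""
    def __init__(self, latent_dim: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.mlp = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(9, latent_dim)       # μ head
        self.fc_logvar = nn.Linear(9, latent_dim)   # log σ² head

    def forward(self, x):                           # x: (batch, 3, 5000)
        h = self.pool(self.conv(x)).squeeze(-1)     # (batch, 1024)
        h = self.mlp(h)                             # (batch, 9)
        return self.fc_mu(h), self.fc_logvar(h)

class PointCloudDecoder(nn.Module):
    """Decoder sketch: Linear layers, ConvTranspose1d upsampling, flattened xyz output."""
    def __init__(self, latent_dim: int = 3, num_points: int = 5000):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
        )
        # treat the 512-vector as a length-512 signal and double it twice: 512 -> 1024 -> 2048
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(1, 1, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose1d(1, 1, kernel_size=2, stride=2), nn.ReLU(),
        )
        self.out = nn.Linear(2048, num_points * 3)  # assumption: flattened (x, y, z)

    def forward(self, z):                           # z: (batch, latent_dim)
        h = self.fc(z).unsqueeze(1)                 # (batch, 1, 512)
        h = self.upsample(h).squeeze(1)             # (batch, 2048)
        return self.out(h).view(z.size(0), -1, 3)   # (batch, num_points, 3)
```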
7. Performance Evaluation
The reconstruction quality of VAE and MoG-VAE is evaluated using Chamfer Distance (CD) and Earth Mover's Distance (EMD).
7-1. Chamfer Distance (CD) Formula
The Chamfer Distance (CD) measures the distance between two point sets $ P $ and $ Q $. It is defined as:
$$
\text{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} | p - q |^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} | q - p |^2 \tag9
$$
Where:
- $ P $ and $ Q $ are two point sets
- $ p $ and $ q $ are points from sets $ P $ and $ Q $ respectively
- $ | \cdot | $ denotes the Euclidean distance
- $|P|$ and $|Q|$ are the total number of points in sets $ P $ and $ Q $
Chamfer Distance computes the average squared distance from each point in one set to its nearest neighbor in the other set.
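Here is a small PyTorch sketch of Equation (9) for illustration (not necessarily the implementation behind the numbers in Section 8); `torch.cdist` computes the pairwise Euclidean distances.
```python
import torch

def chamfer_distance(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance of Eq. (9) for point sets of shape (|P|, 3) and (|Q|, 3)."""
    d2 = torch.cdist(P, Q).pow(2)          # squared pairwise Euclidean distances, (|P|, |Q|)
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```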
7-2. Earth Mover's Distance (EMD) Formula
Earth Mover's Distance (EMD) measures the distance between two probability distributions over a region $ \mathbb{R}^d $. It is formulated as an optimization problem that finds the minimal cost of transforming one distribution into the other:
$$
\text{EMD}(P, Q) = \min_{\gamma \in \Gamma(P, Q)} \sum_{(p, q) \in P \times Q} \gamma(p, q) | p - q | \tag{10}
$$
Where:
- $ P $ and $ Q $ are two distributions or point sets.
- $ | p - q | $ represents the distance (typically Euclidean) between points $ p \in P $ and $ q \in Q $.
- $ \Gamma(P, Q) $ is the set of all valid transportation plans (flow functions) between distributions $ P $ and $ Q $.
- $ \gamma(p, q) $ is the amount of "mass" moved from $ p $ to $ q $.
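For two point sets of equal size and uniform mass, the transport plan in Equation (10) reduces to a one-to-one matching, which can be computed exactly with the Hungarian algorithm. The sketch below uses `scipy.optimize.linear_sum_assignment` for illustration; it is not necessarily the implementation used for the results in Section 8, and it becomes slow for large point clouds.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def earth_movers_distance(P: np.ndarray, Q: np.ndarray) -> float:
    """EMD of Eq. (10) for two equal-size point sets of shape (N, 3).

    With uniform mass 1/N per point, the optimal plan is a permutation,
    so the minimum-cost matching gives the average matched distance.
    """
    cost = cdist(P, Q)                          # pairwise Euclidean distances, (N, N)
    row, col = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return float(cost[row, col].mean())
```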
8. Results
8-1. VAE Reconstruction Quality
8-2. MoG-VAE Reconstruction Quality
I could not see any clear difference between VAE and MoG-VAE in the visualizations.
Therefore, I compared them quantitatively using the CD and EMD values below.
8-3. Comparison of VAE and MoG-VAE
Design | MoG-VAE CD | MoG-VAE EMD | VAE CD | VAE EMD |
---|---|---|---|---|
Design1 | 0.1929 | 0.2169 | 0.1935 | 0.2176 |
Design2 | 0.1278 | 0.1484 | 0.1286 | 0.1551 |
Design3 | 0.1570 | 0.1872 | 0.1567 | 0.1843 |
Design4 | 0.1422 | 0.1879 | 0.1462 | 0.1877 |
Design5 | 0.1516 | 0.1873 | 0.1497 | 0.1844 |
Design6 | 0.1993 | 0.2154 | 0.2031 | 0.2177 |
Design7 | 0.1823 | 0.2130 | 0.1761 | 0.2101 |
Design8 | 0.1371 | 0.1538 | 0.1336 | 0.1572 |
Design9 | 0.1607 | 0.2189 | 0.1714 | 0.2330 |
Average | 0.1612 | 0.1921 | 0.1621 | 0.1941 |
Based on these results, the reconstruction quality of MoG-VAE is slightly better than that of VAE (lower CD and EMD on average). In my opinion, this is because the MoG prior makes the latent space more flexible and expressive.
9. Conclusion
I compared the reconstruction quality of VAE and MoG-VAE.
As expected, MoG-VAE showed slightly better reconstruction quality than VAE.
Next, I will check the effect of changing the reconstruction loss function from MSE to CD.
Will the reconstruction quality improve after changing the loss function?
I'm excited to find out!
Thank you for reading.
If I have time, I will upload the code for this study to my GitHub!
References