1. Introduction
Thank you for reading!
I have recently been studying generative shape modeling methods for point clouds.
As part of my study, I read the paper “VAE with VampPrior”.
I wanted to compare the performance of MoG-VAE with the standard VAE.
So, I wrote this article in English to practice and improve my English.
2. Differences between VAE and MoG-VAE
VAE (Variational Autoencoder)
In VAE, the prior distribution $ p(z) $ is a simple Gaussian distribution.
$$
p(z) = \mathcal{N}(0, I) \tag1
$$
- The latent space follows a standard normal distribution.
- This makes the latent space continuous and smooth but less flexible.
MoG-VAE (Mixture of Gaussians VAE)
In MoG-VAE, the prior distribution $ p(z) $ is a Mixture of Gaussians (MoG).
$$
p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z | \mu_k, \Sigma_k) \tag2
$$
- $ K $ is the number of Gaussian components.
- $ \pi_k $ are the mixture weights ($ \sum_{k=1}^{K} \pi_k = 1 $).
- $ \mu_k $ and $ \Sigma_k $ are the mean and covariance of each Gaussian component.
- This makes the latent space more flexible and diverse.
Key Difference in Latent Space
- VAE: Single Gaussian → Limited capacity to capture complex structures.
- MoG-VAE: Mixture of Gaussians → Better at capturing multimodal and complex latent distributions.
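To make this concrete, here is a small PyTorch sketch (my own illustration, not code from the VampPrior paper) that draws samples from the two kinds of prior; the component means and standard deviations are arbitrary placeholders.
```python
import torch

latent_dim, K, n_samples = 3, 2, 1000   # same latent_dim and K as in Section 5

# VAE prior: a single standard Gaussian N(0, I), Eq. (1)
z_vae = torch.randn(n_samples, latent_dim)

# MoG prior, Eq. (2): pick a component k ~ Categorical(pi), then z ~ N(mu_k, sigma_k^2 I)
pi = torch.softmax(torch.zeros(K), dim=0)      # mixture weights (uniform placeholder)
mu = torch.randn(K, latent_dim)                # component means (placeholders)
sigma = 0.5 * torch.ones(K, latent_dim)        # component standard deviations (placeholders)

k = torch.multinomial(pi, num_samples=n_samples, replacement=True)
z_mog = mu[k] + sigma[k] * torch.randn(n_samples, latent_dim)
```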
3. KL Divergence in VAE and MoG-VAE
In both VAE and MoG-VAE, the KL divergence term in the loss function measures how close the approximate posterior $ q(z|x) $ is to the prior distribution $ p(z) $. However, the calculation differs due to the difference in the prior distributions.
3-1. VAE (Standard Gaussian Prior)
The prior $ p(z) $ is a standard Gaussian distribution $ \mathcal{N}(0, I) $.
Thus, the KL divergence has a closed-form solution:
$$
\text{KL}(q(z|x) \parallel p(z)) = \frac{1}{2} \sum_{i=1}^{z} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right) \tag{3}
$$
- $ z $ is the latent dimension.
- $ \mu_i $ and $ \sigma_i $ are the mean and standard deviation of $ q(z|x) $.
- This term penalizes large deviations of $ q(z|x) $ from the standard Gaussian.
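As a sanity check of Equation (3), the closed-form KL term fits in a few lines of PyTorch. This is a generic sketch; `mu` and `logvar` stand for the encoder outputs.
```python
import torch

def kl_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, Eq. (3).

    mu, logvar: encoder outputs of shape (batch, latent_dim).
    Returns the KL divergence averaged over the batch.
    """
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1).mean()
```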
3-2. MoG-VAE (Mixture of Gaussians Prior)
For MoG-VAE, the prior $ p(z) $ is a Mixture of Gaussians (MoG):
$$
p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z | \mu_k, \Sigma_k) \tag4
$$
In this case, the KL divergence between a Gaussian posterior and a MoG prior has no exact closed form, so the KL term is approximated as:
$$
\text{KL}(q(z|x) \parallel p(z)) = \frac{1}{2} \sum_{k=1}^{K} \pi_k \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right) \tag5
$$
- $ \mu_k $ and $ \sigma_k^2 $ are the mean and variance of the $ k $-th Gaussian component.
- The mixture weights $ \pi_k $ are computed using the Softmax function:
$$
\pi_k = \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)} \tag6
$$
- Monte Carlo sampling can be used to approximate the KL divergence in cases where a direct analytical solution is not available.
- Softmax normalization ensures that the mixture weights $ \pi_k $ sum to 1.
- This approach makes the latent space more flexible and expressive compared to VAE, allowing the model to capture more complex data distributions.
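Below is a direct transcription of Equations (5) and (6) into PyTorch. It is only a sketch: the shapes of `mu_k` and `logvar_k`, and the name `alpha` for the unnormalized mixture logits, are my assumptions rather than details given above.
```python
import torch

def kl_mog_term(mu_k: torch.Tensor, logvar_k: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Weighted regularization term of Eq. (5).

    mu_k, logvar_k: per-component means / log-variances, shape (K, latent_dim).
    alpha: unnormalized mixture logits, shape (K,); softmax gives pi_k, Eq. (6).
    """
    pi = torch.softmax(alpha, dim=0)                                        # Eq. (6)
    per_component = 0.5 * (mu_k.pow(2) + logvar_k.exp() - logvar_k - 1.0).sum(dim=1)
    return (pi * per_component).sum()                                       # Eq. (5)

# When an analytical form is not available, a Monte Carlo estimate of
# E_q[log q(z|x) - log p(z)] over samples z ~ q(z|x) is the usual fallback.
```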
4. Dataset for Evaluation
- I used the chair category of the ModelNet dataset.
- The number of points in each point cloud is 5,000.
5. Training Settings
5-1. VAE Settings
- Latent dimension $ z $: 3
- Learning rate: 1.0e-5
- Optimizer: Adam
- Total loss: Equation (7)
$$
L_\text{total} = \frac{1}{N} \sum_{i=1}^{N} | x_i - x_i^\text{reconst} |^2 + \frac{1}{2} \sum_{i=1}^{z} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right) \tag{7}
$$
5-2. MoG-VAE Settings
- Number of Gaussian components $ K $: 2
- Latent dimension $ z $: 3
- Learning rate: 1.0e-5
- Optimizer: Adam
- Total loss: Equation (8)
$$
L_\text{total} = \frac{1}{N} \sum_{i=1}^{N} | x_i - x_i^\text{reconst} |^2 + \frac{1}{2} \sum_{k=1}^{K} \pi_k \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right) \tag8
$$
In this study, I chose a three-dimensional latent space because it is easier to interpret. In my experience, three latent variables are enough to generate shapes, so I prioritized interpretability over a higher-dimensional latent space.
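For reference, a single training step following Equation (7) might look like the sketch below (Equation (8) only swaps the KL term for `kl_mog_term` from Section 3-2). The names `encoder`, `decoder`, and `train_step` are placeholders, and `kl_standard_normal` is the helper from the sketch in Section 3-1.
```python
import torch

def train_step(encoder, decoder, optimizer, x):
    """One VAE training step with the total loss of Eq. (7)."""
    mu, logvar = encoder(x)                                   # parameters of q(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    x_reconst = decoder(z)
    reconst_loss = torch.mean((x - x_reconst).pow(2))         # MSE reconstruction term
    loss = reconst_loss + kl_standard_normal(mu, logvar)      # Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer settings from this section (both models):
# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
```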
6. Architecture Shared by Both VAEs
I used a PointNet-style encoder (point-wise convolution followed by max pooling) and a transposed-convolution decoder. The same architecture is used for both VAE and MoG-VAE.
6-1. Encoder
INPUT (N × 5000) → Conv1d (N × 64) → Conv1d (N × 128) → Conv1d (N × 1024)
→ Adaptive Max Pooling → Linear (1024 → 512) → Linear (512 → 256) → Linear (256 → 9)
→ Linear (9 → z) for μ (mean) and log σ² (log variance)
6-2. Decoder
INPUT (z) → Linear (z → 1024) → Linear (1024 → 512)
→ ConvTranspose1d (512 → 1024) → ConvTranspose1d (1024 → 2048)
→ Linear (2048 → 5000)
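For concreteness, here is one possible PyTorch reading of the layers listed above. Several details are my assumptions: I assume 3-channel (x, y, z) input of shape (batch, 3, 5000), two separate heads for μ and log σ², and a decoder that outputs a flattened point cloud, since the listing above only gives Linear (2048 → 5000).
```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """PointNet-style encoder: point-wise Conv1d + max pooling, then an MLP."""
    def __init__(self, latent_dim: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.mlp = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(9, latent_dim)       # μ head
        self.fc_logvar = nn.Linear(9, latent_dim)   # log σ² head

    def forward(self, x):                           # x: (batch, 3, 5000)
        h = self.pool(self.conv(x)).squeeze(-1)     # (batch, 1024)
        h = self.mlp(h)                             # (batch, 9)
        return self.fc_mu(h), self.fc_logvar(h)

class PointCloudDecoder(nn.Module):
    """Decoder sketch: Linear layers, ConvTranspose1d upsampling, flattened xyz output."""
    def __init__(self, latent_dim: int = 3, num_points: int = 5000):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
        )
        # treat the 512-vector as a length-512 signal and double it twice: 512 -> 1024 -> 2048
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(1, 1, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose1d(1, 1, kernel_size=2, stride=2), nn.ReLU(),
        )
        self.out = nn.Linear(2048, num_points * 3)  # assumption: flattened (x, y, z)

    def forward(self, z):                           # z: (batch, latent_dim)
        h = self.fc(z).unsqueeze(1)                 # (batch, 1, 512)
        h = self.upsample(h).squeeze(1)             # (batch, 2048)
        return self.out(h).view(z.size(0), -1, 3)   # (batch, num_points, 3)
```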
7. Performance Evaluation
The reconstruction quality of VAE and MoG-VAE is evaluated using Chamfer Distance (CD) and Earth Mover's Distance (EMD).
7-1. Chamfer Distance (CD) Formula
The Chamfer Distance (CD) measures the distance between two point sets $ P $ and $ Q $. It is defined as:
$$
\text{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} | p - q |^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} | q - p |^2 \tag9
$$
Where:
- $ P $ and $ Q $ are two point sets
- $ p $ and $ q $ are points from sets $ P $ and $ Q $ respectively
- $ | \cdot | $ denotes the Euclidean distance
- $|P|$ and $|Q|$ are the total number of points in sets $ P $ and $ Q $
Chamfer Distance computes the average squared distance from each point in one set to its nearest neighbor in the other set.
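Here is a small PyTorch sketch of Equation (9) for illustration (not necessarily the implementation behind the numbers in Section 8); `torch.cdist` computes the pairwise Euclidean distances.
```python
import torch

def chamfer_distance(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance of Eq. (9) for point sets of shape (|P|, 3) and (|Q|, 3)."""
    d2 = torch.cdist(P, Q).pow(2)          # squared pairwise Euclidean distances, (|P|, |Q|)
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```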
7-2. Earth Mover's Distance (EMD) Formula
Earth Mover's Distance (EMD) measures the distance between two probability distributions over a region $ \mathbb{R}^d $. It is formulated as an optimization problem that finds the minimal cost of transforming one distribution into the other:
$$
\text{EMD}(P, Q) = \min_{\gamma \in \Gamma(P, Q)} \sum_{(p, q) \in P \times Q} \gamma(p, q) | p - q | \tag{10}
$$
Where:
- $ P $ and $ Q $ are two distributions or point sets.
- $ | p - q | $ represents the distance (typically Euclidean) between points $ p \in P $ and $ q \in Q $.
- $ \Gamma(P, Q) $ is the set of all valid transportation plans (flow functions) between distributions $ P $ and $ Q $.
- $ \gamma(p, q) $ is the amount of "mass" moved from $ p $ to $ q $.
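For two point sets of equal size and uniform mass, the transport plan in Equation (10) reduces to a one-to-one matching, which can be computed exactly with the Hungarian algorithm. The sketch below uses `scipy.optimize.linear_sum_assignment` for illustration; it is not necessarily the implementation used for the results in Section 8, and it becomes slow for large point clouds.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def earth_movers_distance(P: np.ndarray, Q: np.ndarray) -> float:
    """EMD of Eq. (10) for two equal-size point sets of shape (N, 3).

    With uniform mass 1/N per point, the optimal plan is a permutation,
    so the minimum-cost matching gives the average matched distance.
    """
    cost = cdist(P, Q)                          # pairwise Euclidean distances, (N, N)
    row, col = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return float(cost[row, col].mean())
```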
8. Results
8-1. VAE Reconstruction Quality
8-2. MoG-VAE Reconstruction Quality
I could not see any clear difference between VAE and MoG-VAE in the visualizations.
Therefore, I compared them quantitatively using the CD and EMD values below.
8-3. Comparison of VAE and MoG-VAE
Design | MoG-VAE CD | MoG-VAE EMD | VAE CD | VAE EMD |
---|---|---|---|---|
Design1 | 0.1929 | 0.2169 | 0.1935 | 0.2176 |
Design2 | 0.1278 | 0.1484 | 0.1286 | 0.1551 |
Design3 | 0.1570 | 0.1872 | 0.1567 | 0.1843 |
Design4 | 0.1422 | 0.1879 | 0.1462 | 0.1877 |
Design5 | 0.1516 | 0.1873 | 0.1497 | 0.1844 |
Design6 | 0.1993 | 0.2154 | 0.2031 | 0.2177 |
Design7 | 0.1823 | 0.2130 | 0.1761 | 0.2101 |
Design8 | 0.1371 | 0.1538 | 0.1336 | 0.1572 |
Design9 | 0.1607 | 0.2189 | 0.1714 | 0.2330 |
Average | 0.1612 | 0.1921 | 0.1621 | 0.1941 |
Based on these results, the reconstruction quality of MoG-VAE is slightly better than that of VAE (lower CD and EMD on average). In my opinion, this is because the MoG prior makes the latent space more flexible and expressive.
9. Conclusion
I compared the reconstruction quality of VAE and MoG-VAE.
As expected, MoG-VAE showed slightly better reconstruction quality than VAE.
Next, I will check the effect of changing the reconstruction loss function from MSE to CD.
Will the reconstruction quality improve after changing the loss function?
I'm excited to find out!
Thank you for reading.
If I have time, I will upload the code for this study to my GitHub!
References