KL Divergence and Breman Distance

MachineLearning

Posted at 2023-11-28

Kullback-Leibler (KL) Divergence is an important mathematical idea frequently used in Deep Learning literature. It is defined as the following:
$$
D_KL(P||Q) = \sum_{x\in \mathcal{X}} P(x) \log (\frac{P(x)}{Q(x)}) \tag{1}
$$

One of the famous usage is in the GAN (at least original) loss function, where the optimization function is defined as:

$$
\min_G \max_D \mathrm{E}_{x \sim p_r} \log [D(x)] + \mathrm{E}_{z \sim p_z} \log [1-D(G(z))]
$$

which equivalently optimizes KL divergence between $p_g$ and $p_r$.

By staring at eq. (1), one might ask if this value can be negative. Since sometimes the argument of the log could be very small and causing the result to be very negative.

But the answer is NO. KL divergence will never be negative. It can be proved using Gibbs' inequality whose proof is easy enough to be written here! First we need to assume 2 discrete probability distributions $P={p_1,…,p_n}$ and $Q={q_1…,q_n}$. Each element will represent a number between 0 and 1 (both inclusive) and sum in each distribution would be 1.

$$
- \sum_{i \in I} p_i \ln \frac{q_i}{p_i} \ge - \sum_{i \in I} p_i \big( \ln \frac{q_i}{p_i} - 1\big) = -\sum_{i\in I} q_i + \sum_{i\in I}p_i = - \sum_{i\in I} q_i + 1 \ge 0
$$

Equivalently, over the index set $I$, we will have

$$
- \sum_{i\in I}p_i \ln \frac{q_i}{p_i} \ge 0
$$

which finishes the proof.

Another way of proof is use the fact that KL divergence is a kind of Bregman distance.

What is Bregman distance? Well, it is a measure of distance between points defined in terms of a strictly convex function where points are interpreted as probability distributions.

I will provide the definition to Bregman distance to finish this essay.

Let $F:Σ→R$ be a continuously-differentiable, strictly convex function defined on a closed convex set $Σ$

The Bregman distance associated with F for points $p,q∈Σ$ is the difference between the value of F at point $p$ and the value of the first-order Taylor expansion of F around point $q$ evaluated at point $p$

$$
D_F(p,q) = F(p)-F(q)-\langle \nabla F(q), p-q \rangle
$$

Thanks for reading! Cheers!

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up