Entropy, cross entropy, KL divergence, JS divergence

Entropy measures the average amount of uncertainty in a random variable $X$ with distribution $P$:

\[H(X) = - \int_\mathcal{X} P(x) \log P(x) dx\]
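As a quick numerical illustration, here is a minimal NumPy sketch using the discrete form $H(X) = -\sum_x P(x) \log P(x)$ with natural logarithms; the example distributions are made up:

```python
import numpy as np

def entropy(p):
    """Discrete Shannon entropy H(X) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))           # log 2 ≈ 0.693: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))           # ≈ 0.325: a biased coin carries less uncertainty
```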

Cross entropy:

In information theory, the cross entropy between two probability distributions $P$ and $Q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $Q$, rather than the true $P$.

\[H(P, Q) = - \int_\mathcal{X} P(x) \log Q(x) dx\]

The definition may be formulated using the Kullback–Leibler divergence $D_{\text{KL}}(P || Q)$ of $Q$ from $P$ (also known as the relative entropy of $P$ with respect to $Q$):

\[H(P, Q) = H(P) + D_{\text{KL}}(P || Q)\]
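To see this decomposition concretely, here is a small NumPy check of $H(P, Q) = H(P) + D_{\text{KL}}(P || Q)$ on made-up discrete distributions, using the KL formula given just below:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # hypothetical "true" distribution P
q = np.array([0.5, 0.3, 0.2])        # hypothetical model distribution Q

cross_entropy = -np.sum(p * np.log(q))        # H(P, Q)
entropy_p     = -np.sum(p * np.log(p))        # H(P)
kl_pq         = np.sum(p * np.log(p / q))     # D_KL(P || Q)

print(cross_entropy)                 # ≈ 0.887
print(entropy_p + kl_pq)             # same value: H(P) + D_KL(P || Q)
```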

Kullback–Leibler divergence measures how one probability distribution $P$ diverges from a second distribution $Q$ (it is also known as the relative entropy of $P$ with respect to $Q$):

\[D_{\text{KL}}(P || Q) = \int_{\mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} dx\]

$D_{\text{KL}}$ achieves its minimum of zero when $P(x) = Q(x)$ everywhere.
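A short sketch of that minimum property (discrete form, made-up distributions):

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(kl(p, p))                      # 0.0: the minimum, reached when P = Q everywhere
print(kl(p, q))                      # ≈ 0.085: strictly positive once P and Q differ
```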

Note from the formula that KL divergence is asymmetric. In regions where $P(x)$ is close to zero but $Q(x)$ is significantly non-zero, $Q$’s effect is essentially disregarded, because the integrand is weighted by $P(x)$. This can give misleading results when we just want to measure the similarity between two equally important distributions.

However, suppose we are trying to approximate a complex (intractable) distribution $Q(x)$ with a (tractable) approximate distribution $P(x)$. Then we want to make sure that any $x$ that is very improbable under $Q(x)$ is also very improbable under $P(x)$. When $P(x)$ is small but $Q(x)$ is not, that is fine. But when $Q(x)$ is small, the integrand $P(x) \log \frac{P(x)}{Q(x)}$ grows very rapidly unless $P(x)$ is also small. So if we look for $P(x)$ by minimizing the KL divergence $D_{\text{KL}}(P || Q)$, it is very unlikely that $P(x)$ will assign much mass to regions where $Q(x)$ is near zero.
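The asymmetry is easy to see numerically. In the sketch below (made-up two-outcome distributions), $Q$ puts almost no mass on the second outcome while $P$ does, so $D_{\text{KL}}(P || Q)$ blows up while $D_{\text{KL}}(Q || P)$ barely notices:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])             # P spreads mass over both outcomes
q = np.array([1.0 - 1e-12, 1e-12])   # Q puts (almost) no mass on the second outcome

print(kl(p, q))                      # ≈ 13.1: P has mass where Q ≈ 0, heavily penalized
print(kl(q, p))                      # ≈ 0.69: the reverse direction ignores that region
```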


Jensen–Shannon divergence is another measure of similarity between two probability distributions. Unlike KL divergence, JS divergence is symmetric and smoother, and it is always finite (bounded above by $\log 2$ in nats). It is defined by:

\[\text{JSD}(P || Q) = \frac{1}{2}D_{\text{KL}}(P || \frac{P+Q}{2}) + \frac{1}{2}D_{\text{KL}} (Q || \frac{P+Q}{2})\]
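Continuing the same sketch with the two-outcome distributions above: the JS divergence is the same in both directions and stays finite even though $Q$ is near zero on one outcome.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """JSD(P || Q) = 1/2 D_KL(P || M) + 1/2 D_KL(Q || M), with M = (P + Q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5])
q = np.array([1.0 - 1e-12, 1e-12])

print(jsd(p, q), jsd(q, p))          # both ≈ 0.22: symmetric, finite, below log 2 ≈ 0.693
```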

Some believe (Huszar, 2015) that one reason behind GANs’ big success is switching the loss function from the asymmetric KL divergence of the traditional maximum-likelihood approach to the symmetric JS divergence.