Understanding Forward and Reverse KL Divergence
The Kullback-Leibler (KL) divergence is one of the most important concepts in machine learning, yet its asymmetry often leads to confusion. In this post, I’ll provide an intuitive and rigorous exploration of forward KL and reverse KL divergence, demonstrating why the choice between them fundamentally shapes how our models learn.
What is KL Divergence?
KL divergence measures how one probability distribution $P$ differs from another distribution $Q$. It’s defined as:
\[D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\]
For continuous distributions:
\[D_{KL}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx\]
A crucial property: KL divergence is not symmetric. That is:
\[D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\]
This asymmetry has profound implications for machine learning.
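To make the asymmetry concrete, here’s a minimal numerical sketch (plain NumPy; the two discrete distributions are made up for illustration) that computes the divergence in both directions and gets two different values.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])   # illustrative "true" distribution P
q = np.array([0.8, 0.1, 0.1])   # illustrative approximation Q

print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # D_KL(Q || P): a different value, confirming the asymmetry
```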
The Setup: Approximating a Target Distribution
Consider a common scenario: we have a true distribution $P$ (often complex and intractable) and we want to approximate it with a simpler distribution $Q_\theta$ parameterized by $\theta$.
The question becomes: which KL divergence should we minimize?
- Forward KL: $D_{KL}(P \| Q_\theta)$ — expectation under $P$
- Reverse KL: $D_{KL}(Q_\theta \| P)$ — expectation under $Q_\theta$
Forward KL: Mean-Seeking Behavior
The forward KL divergence is:
\[D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]\]
Since $\log P(x)$ doesn’t depend on $Q$, minimizing forward KL is equivalent to maximizing:
\[\mathbb{E}_{x \sim P}[\log Q(x)]\]
This is exactly maximum likelihood estimation! We’re maximizing the expected log-likelihood that $Q$ assigns to samples drawn from $P$.
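To see the MLE connection in code, here’s a small sketch (NumPy only; the bimodal $P$ is an illustrative mixture with modes near $\pm 2.5$, matching the setup used later in the post): given only samples from $P$, minimizing forward KL over a Gaussian family means maximizing the average log-likelihood, whose closed-form solution is just the sample mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from an illustrative bimodal P: equal mixture of N(-2.5, 0.6^2) and N(+2.5, 0.6^2).
component = rng.integers(0, 2, size=50_000)
samples = rng.normal(np.where(component == 0, -2.5, 2.5), 0.6)

# Minimizing D_KL(P || Q) over Q = N(mu, sigma^2) is the same as maximizing
# E_P[log Q(x)], i.e. ordinary maximum likelihood. For a Gaussian family the
# MLE is the sample mean and (biased) sample variance.
mu_hat = samples.mean()
sigma2_hat = samples.var()

print(mu_hat, sigma2_hat)   # roughly 0 and 2.5^2 + 0.6^2 = 6.61: a wide Gaussian between the modes
```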
Key Insight: Zero-Avoiding
Forward KL heavily penalizes cases where $P(x) > 0$ but $Q(x) \approx 0$. Why? Because:
\[P(x) \log \frac{P(x)}{Q(x)} \to +\infty \text{ as } Q(x) \to 0\]
This means $Q$ must cover all regions where $P$ has mass. The result is mean-seeking or mode-covering behavior: $Q$ spreads out to cover all modes of $P$, even if it means assigning probability to regions between modes where $P$ has little mass.
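Here’s a quick numeric illustration of the zero-avoiding penalty (a grid approximation with SciPy; the bimodal $P$ and both candidate $Q$s are made up for illustration): a narrow $Q$ that ignores one of $P$’s modes pays a huge forward KL, while a broad $Q$ that covers both modes does far better.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

# Illustrative bimodal target P: equal mixture of N(-2.5, 0.6^2) and N(+2.5, 0.6^2).
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def forward_kl(p, q):
    eps = 1e-300  # guards against log(0) on the grid
    return np.sum(p * np.log((p + eps) / (q + eps))) * dx

q_covering = norm.pdf(x, 0.0, 2.6)   # broad Q covering both modes
q_one_mode = norm.pdf(x, 2.5, 0.6)   # narrow Q sitting on a single mode

print(forward_kl(p, q_covering))   # modest
print(forward_kl(p, q_one_mode))   # huge: P has mass near -2.5 where this Q is essentially zero
```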
Reverse KL: Mode-Seeking Behavior
The reverse KL divergence is:
\[D_{KL}(Q \| P) = \mathbb{E}_{x \sim Q}\left[\log Q(x) - \log P(x)\right]\]
This is used in variational inference, where we optimize a variational distribution $Q$ to approximate an intractable posterior $P$.
Key Insight: Zero-Forcing
Reverse KL heavily penalizes cases where $Q(x) > 0$ but $P(x) \approx 0$:
\[Q(x) \log \frac{Q(x)}{P(x)} \to +\infty \text{ as } P(x) \to 0\]
This forces $Q$ to be zero wherever $P$ is zero. The result is mode-seeking behavior: $Q$ concentrates on a single mode of $P$ rather than spreading across all modes.
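And the mirror-image check for reverse KL (same grid approximation and illustrative distributions as the previous snippet): a $Q$ that spreads mass into the near-empty valley between $P$’s modes is punished, while a $Q$ tucked inside a single mode is cheap.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def reverse_kl(q, p):
    eps = 1e-300
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

q_wide     = norm.pdf(x, 0.0, 2.6)   # broad Q: puts mass in the valley where P is tiny
q_one_mode = norm.pdf(x, 2.5, 0.6)   # Q hugging a single mode

print(reverse_kl(q_wide, p))       # noticeably larger: Q has mass where P is nearly zero
print(reverse_kl(q_one_mode, p))   # about log 2 = 0.69, the cost of ignoring the other half of P
```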
Interactive Visualization
The following interactive visualization demonstrates these concepts. We have a bimodal target distribution $P$ (shown in blue), and we’re fitting a unimodal Gaussian $Q$ (shown in orange/green) by minimizing either forward or reverse KL.
Observing the Behavior
Try the interactive visualization above. Here’s what you should observe:
Forward KL Optimization
- Click “Forward KL (P‖Q)” then “Optimize”
- Watch how $Q$ spreads out to cover both modes of $P$
- The optimal $Q$ has high variance and is centered between the modes
- This is mean-seeking: $Q$ covers everywhere $P$ has mass
Reverse KL Optimization
- Click “Reverse KL (Q‖P)” then “Optimize”
- Watch how $Q$ collapses onto one mode of $P$
- Which mode depends on initialization (try different starting positions)
- This is mode-seeking: $Q$ concentrates where $P$ is highest
Mathematical Analysis
Let’s derive why this happens. Consider our bimodal $P$ and unimodal Gaussian $Q$.
Forward KL Analysis
\[D_{KL}(P \| Q) = \int P(x) \log P(x) dx - \int P(x) \log Q(x) dx\]
The first term is the negative entropy of $P$ and doesn’t depend on $Q$, so only the second term matters for optimization. For a Gaussian $Q$ with mean $\mu_Q$ and variance $\sigma_Q^2$:
\[-\int P(x) \log Q(x) dx = \frac{1}{2}\log(2\pi\sigma_Q^2) + \frac{1}{2\sigma_Q^2}\mathbb{E}_P[(x-\mu_Q)^2]\]
This is minimized when $\mu_Q = \mathbb{E}_P[x]$ (the mean of $P$) and $\sigma_Q^2 = \text{Var}_P(x)$ (the variance of $P$): in other words, $Q$ matches the first two moments of $P$.
For our bimodal $P$ with modes at $\pm 2.5$:
- $\mathbb{E}_P[x] = 0$ (between the modes)
- $\text{Var}_P(x)$ is large (spread across both modes)
Hence forward KL produces a wide Gaussian centered at 0.
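A quick grid-based check of this claim (SciPy; the mixture components are illustrative, with modes at $\pm 2.5$ and an assumed component standard deviation of 0.6): computing $P$’s first two moments numerically gives exactly the wide, zero-centered Gaussian described above.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

# Illustrative bimodal P with modes at +/- 2.5 (component std 0.6 is an assumption).
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

# Forward-KL-optimal Gaussian: match P's mean and variance (moment matching).
mu_q = np.sum(x * p) * dx                   # E_P[x] = 0 by symmetry
var_q = np.sum((x - mu_q) ** 2 * p) * dx    # Var_P(x) = 2.5^2 + 0.6^2 = 6.61

print(mu_q, var_q)   # close to 0.0 and 6.61: a wide Gaussian centered between the modes
```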
Reverse KL Analysis
\[D_{KL}(Q \| P) = \int Q(x) \log Q(x) dx - \int Q(x) \log P(x) dx\]
The critical insight: if $Q$ places mass where $P \approx 0$ (between the modes), the second term becomes enormous, diverging in the limit:
\[-\int Q(x) \log P(x) dx \to +\infty\]
To avoid this penalty, $Q$ must concentrate entirely within one mode of $P$. The optimization landscape has two local minima—one at each mode—and gradient descent falls into whichever is closer.
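Here’s a rough sketch of that optimization (a grid-based reverse KL objective minimized with `scipy.optimize.minimize`; the bimodal $P$ is the same illustrative mixture as before): starting the Gaussian on either side of zero sends it into the nearer mode.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def reverse_kl(params):
    mu, log_sigma = params
    q = norm.pdf(x, mu, np.exp(log_sigma))
    eps = 1e-300
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

# Two initializations, two different local minima: Q collapses onto the nearer mode.
for mu0 in (-1.0, 1.0):
    result = minimize(reverse_kl, x0=[mu0, 0.0], method="Nelder-Mead")
    mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
    print(f"init mu={mu0:+.1f}  ->  Q ends up near N({mu_opt:.2f}, {sigma_opt:.2f}^2)")
```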
Practical Implications
When to Use Forward KL
- Maximum Likelihood Estimation: Training generative models on data
- Density Estimation: When you need to cover all possibilities
- Conservative Approximations: When missing any mode is costly
Example: Training a language model. You want $Q$ (your model) to assign non-zero probability to every sentence that has appreciable probability under $P$ (the true language distribution).
When to Use Reverse KL
- Variational Inference: Approximating intractable posteriors
- Mode-Finding: When you only need one good solution
- Computational Tractability: When sampling from $Q$ must be easy
Example: Variational autoencoders (VAEs). The variational posterior $Q$ approximates the true posterior $P$ using reverse KL, naturally focusing on the most probable latent codes.
The Trade-off Visualized
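A short numeric sketch of the trade-off (same illustrative grid setup as the earlier snippets): the wide, moment-matched $Q$ wins on forward KL, the single-mode $Q$ wins on reverse KL, and neither wins both.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)
eps = 1e-300

def kl(a, b):
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

q_forward = norm.pdf(x, 0.0, np.sqrt(6.61))   # moment-matched (forward-KL-style) fit
q_reverse = norm.pdf(x, 2.5, 0.6)             # single-mode (reverse-KL-style) fit

for name, q in [("wide/centered", q_forward), ("single-mode", q_reverse)]:
    print(f"{name:>13}:  forward KL = {kl(p, q):.3f}   reverse KL = {kl(q, p):.3f}")
```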
Connection to Other Concepts
Information Theory Perspective
Forward KL measures the expected number of extra nats (the excess code length) incurred when encoding samples from $P$ with a code optimized for $Q$ rather than for $P$:
\[D_{KL}(P \| Q) = H(P, Q) - H(P)\]
where $H(P, Q)$ is the cross-entropy and $H(P)$ is the entropy of $P$.
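A quick numerical check of this identity on discrete distributions (illustrative values, in nats):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy       = -np.sum(p * np.log(p))   # H(P)
kl            =  np.sum(p * np.log(p / q))

print(kl, cross_entropy - entropy)   # equal up to floating point: D_KL(P||Q) = H(P,Q) - H(P)
```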
Variational Inference
The Evidence Lower Bound (ELBO) in variational inference:
\[\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))\]
Maximizing the ELBO is equivalent to minimizing the reverse KL between the variational posterior $q(z|x)$ and the true posterior $p(z|x)$, because $\log p(x) = \text{ELBO} + D_{KL}(q(z|x) \| p(z|x))$ and $\log p(x)$ doesn’t depend on $q$.
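As a toy illustration (a made-up conjugate model where everything is Gaussian, so the exact posterior and marginal likelihood are known in closed form): a Monte Carlo estimate of the ELBO is tight when $q$ equals the true posterior and drops by exactly $D_{KL}(q \| p(z|x))$ otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): p(z) = N(0, 1), p(x|z) = N(z, 1),
# so p(x) = N(0, 2) and the exact posterior is p(z|x) = N(x/2, 1/2).
def elbo(x, mu, sigma, n_samples=200_000):
    z = rng.normal(mu, sigma, size=n_samples)                                        # z ~ q(z|x)
    log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2                        # log p(x|z)
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2                              # log p(z)
    log_q     = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((z - mu) / sigma) ** 2  # log q(z|x)
    return np.mean(log_lik + log_prior - log_q)

x = 1.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)   # exact log p(x) = log N(1; 0, 2)

print(log_px)                                  # about -1.516
print(elbo(x, mu=x / 2, sigma=np.sqrt(0.5)))   # q = exact posterior: ELBO matches log p(x) (bound is tight)
print(elbo(x, mu=0.0, sigma=1.0))              # mismatched q: strictly smaller; the gap is KL(q || p(z|x))
```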
Expectation Maximization
In the variational (ELBO) view of EM, the E-step minimizes the reverse KL $D_{KL}(q(z) \| p(z|x, \theta))$ over $q$, driving it to zero by setting $q$ equal to the exact posterior, while the M-step maximizes the expected complete-data log-likelihood, a maximum-likelihood (forward-KL-style) update of the parameters $\theta$.
Summary
| Property | Forward KL $D_{KL}(P \Vert Q)$ | Reverse KL $D_{KL}(Q \Vert P)$ |
|---|---|---|
| Expectation under | $P$ (target) | $Q$ (approximation) |
| Behavior | Mean-seeking / Mode-covering | Mode-seeking / Zero-forcing |
| Penalty | $Q(x) \to 0$ where $P(x) > 0$ | $Q(x) > 0$ where $P(x) \to 0$ |
| Result | Over-dispersed $Q$ | Under-dispersed $Q$ |
| Use case | MLE, density estimation | Variational inference |
The choice between forward and reverse KL is not merely technical—it reflects a fundamental decision about what kind of errors we’re willing to tolerate. Forward KL says “never miss anything important,” while reverse KL says “never be confidently wrong.”
Understanding this distinction is essential for anyone working with probabilistic models, variational inference, or generative AI.
This post uses interactive visualizations built with React and Recharts. The mathematical notation is rendered with MathJax.
