Understanding Forward and Reverse KL Divergence
The Kullback-Leibler (KL) divergence is one of the most important concepts in machine learning, yet its asymmetry often leads to confusion. In this post, I’ll provide an intuitive and rigorous exploration of forward KL and reverse KL divergence, demonstrating why the choice between them fundamentally shapes how our models learn.
What is KL Divergence?
KL divergence measures how one probability distribution $P$ differs from another distribution $Q$. It’s defined as:
\[D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\]
For continuous distributions:
\[D_{KL}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx\]
A crucial property: KL divergence is not symmetric. That is:
\[D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\]
This asymmetry has profound implications for machine learning.
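To make the asymmetry concrete, here’s a minimal numerical sketch (plain NumPy; the two discrete distributions are made up for illustration) that computes the divergence in both directions and gets two different values.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])   # illustrative "true" distribution P
q = np.array([0.8, 0.1, 0.1])   # illustrative approximation Q

print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # D_KL(Q || P): a different value, confirming the asymmetry
```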
The Setup: Approximating a Target Distribution
Consider a common scenario: we have a true distribution $P$ (often complex and intractable) and we want to approximate it with a simpler distribution $Q_\theta$ parameterized by $\theta$.
The question becomes: which KL divergence should we minimize?
- Forward KL: $D_{KL}(P \| Q_\theta)$ — expectation under $P$
- Reverse KL: $D_{KL}(Q_\theta \| P)$ — expectation under $Q_\theta$
Forward KL: Mean-Seeking Behavior
The forward KL divergence is:
\[D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]\]
Since $\log P(x)$ doesn’t depend on $Q$, minimizing forward KL is equivalent to maximizing:
\[\mathbb{E}_{x \sim P}[\log Q(x)]\]
This is exactly maximum likelihood estimation! We’re maximizing the expected log-likelihood that $Q$ assigns to samples drawn from $P$.
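To see the MLE connection in code, here’s a small sketch (NumPy only; the bimodal $P$ is an illustrative mixture with modes near $\pm 2.5$, matching the setup used later in the post): given only samples from $P$, minimizing forward KL over a Gaussian family means maximizing the average log-likelihood, whose closed-form solution is just the sample mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from an illustrative bimodal P: equal mixture of N(-2.5, 0.6^2) and N(+2.5, 0.6^2).
component = rng.integers(0, 2, size=50_000)
samples = rng.normal(np.where(component == 0, -2.5, 2.5), 0.6)

# Minimizing D_KL(P || Q) over Q = N(mu, sigma^2) is the same as maximizing
# E_P[log Q(x)], i.e. ordinary maximum likelihood. For a Gaussian family the
# MLE is the sample mean and (biased) sample variance.
mu_hat = samples.mean()
sigma2_hat = samples.var()

print(mu_hat, sigma2_hat)   # roughly 0 and 2.5^2 + 0.6^2 = 6.61: a wide Gaussian between the modes
```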
Key Insight: Zero-Avoiding
Forward KL heavily penalizes cases where $P(x) > 0$ but $Q(x) \approx 0$. Why? Because:
\[P(x) \log \frac{P(x)}{Q(x)} \to +\infty \text{ as } Q(x) \to 0\]
This means $Q$ must cover all regions where $P$ has mass. The result is mean-seeking or mode-covering behavior: $Q$ spreads out to cover all modes of $P$, even if it means assigning probability to regions between modes where $P$ has little mass.
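Here’s a quick numeric illustration of the zero-avoiding penalty (a grid approximation with SciPy; the bimodal $P$ and both candidate $Q$s are made up for illustration): a narrow $Q$ that ignores one of $P$’s modes pays a huge forward KL, while a broad $Q$ that covers both modes does far better.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

# Illustrative bimodal target P: equal mixture of N(-2.5, 0.6^2) and N(+2.5, 0.6^2).
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def forward_kl(p, q):
    eps = 1e-300  # guards against log(0) on the grid
    return np.sum(p * np.log((p + eps) / (q + eps))) * dx

q_covering = norm.pdf(x, 0.0, 2.6)   # broad Q covering both modes
q_one_mode = norm.pdf(x, 2.5, 0.6)   # narrow Q sitting on a single mode

print(forward_kl(p, q_covering))   # modest
print(forward_kl(p, q_one_mode))   # huge: P has mass near -2.5 where this Q is essentially zero
```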
Reverse KL: Mode-Seeking Behavior
The reverse KL divergence is:
\[D_{KL}(Q \| P) = \mathbb{E}_{x \sim Q}\left[\log Q(x) - \log P(x)\right]\]
This is used in variational inference, where we optimize a variational distribution $Q$ to approximate an intractable posterior $P$.
Key Insight: Zero-Forcing
Reverse KL heavily penalizes cases where $Q(x) > 0$ but $P(x) \approx 0$:
\[Q(x) \log \frac{Q(x)}{P(x)} \to +\infty \text{ as } P(x) \to 0\]
This forces $Q$ to be zero wherever $P$ is zero. The result is mode-seeking behavior: $Q$ concentrates on a single mode of $P$ rather than spreading across all modes.
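And the mirror-image check for reverse KL (same grid approximation and illustrative distributions as the previous snippet): a $Q$ that spreads mass into the near-empty valley between $P$’s modes is punished, while a $Q$ tucked inside a single mode is cheap.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def reverse_kl(q, p):
    eps = 1e-300
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

q_wide     = norm.pdf(x, 0.0, 2.6)   # broad Q: puts mass in the valley where P is tiny
q_one_mode = norm.pdf(x, 2.5, 0.6)   # Q hugging a single mode

print(reverse_kl(q_wide, p))       # noticeably larger: Q has mass where P is nearly zero
print(reverse_kl(q_one_mode, p))   # about log 2 = 0.69, the cost of ignoring the other half of P
```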
Interactive Visualization
The following interactive visualization demonstrates these concepts. We have a bimodal target distribution $P$ (shown in blue), and we’re fitting a unimodal Gaussian $Q$ (shown in orange/green) by minimizing either forward or reverse KL.
Observing the Behavior
Try the interactive visualization above. Here’s what you should observe:
Forward KL Optimization
- Click “Forward KL (P‖Q)” then “Optimize”
- Watch how $Q$ spreads out to cover both modes of $P$
- The optimal $Q$ has high variance and is centered between the modes
- This is mean-seeking: $Q$ covers everywhere $P$ has mass
Reverse KL Optimization
- Click “Reverse KL (Q‖P)” then “Optimize”
- Watch how $Q$ collapses onto one mode of $P$
- Which mode depends on initialization (try different starting positions)
- This is mode-seeking: $Q$ concentrates where $P$ is highest
Mathematical Analysis
Let’s derive why this happens. Consider our bimodal $P$ and unimodal Gaussian $Q$.
Forward KL Analysis
\[D_{KL}(P \| Q) = \int P(x) \log P(x) dx - \int P(x) \log Q(x) dx\]
The first term is the negative entropy of $P$ and doesn’t depend on $Q$, so only the second term matters for optimization. For a Gaussian $Q$ with mean $\mu_Q$ and variance $\sigma_Q^2$:
\[-\int P(x) \log Q(x) dx = \frac{1}{2}\log(2\pi\sigma_Q^2) + \frac{1}{2\sigma_Q^2}\mathbb{E}_P[(x-\mu_Q)^2]\]
This is minimized when $\mu_Q = \mathbb{E}_P[x]$ (the mean of $P$) and $\sigma_Q^2 = \text{Var}_P(x)$ (the variance of $P$): in other words, $Q$ matches the first two moments of $P$.
For our bimodal $P$ with modes at $\pm 2.5$:
- $\mathbb{E}_P[x] = 0$ (between the modes)
- $\text{Var}_P(x)$ is large (spread across both modes)
Hence forward KL produces a wide Gaussian centered at 0.
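A quick grid-based check of this claim (SciPy; the mixture components are illustrative, with modes at $\pm 2.5$ and an assumed component standard deviation of 0.6): computing $P$’s first two moments numerically gives exactly the wide, zero-centered Gaussian described above.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

# Illustrative bimodal P with modes at +/- 2.5 (component std 0.6 is an assumption).
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

# Forward-KL-optimal Gaussian: match P's mean and variance (moment matching).
mu_q = np.sum(x * p) * dx                   # E_P[x] = 0 by symmetry
var_q = np.sum((x - mu_q) ** 2 * p) * dx    # Var_P(x) = 2.5^2 + 0.6^2 = 6.61

print(mu_q, var_q)   # close to 0.0 and 6.61: a wide Gaussian centered between the modes
```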
Reverse KL Analysis
\[D_{KL}(Q \| P) = \int Q(x) \log Q(x) dx - \int Q(x) \log P(x) dx\]
The critical insight: if $Q$ places mass where $P \approx 0$ (between the modes), the second term becomes enormous, diverging in the limit:
\[-\int Q(x) \log P(x) dx \to +\infty\]
To avoid this penalty, $Q$ must concentrate entirely within one mode of $P$. The optimization landscape has two local minima—one at each mode—and gradient descent falls into whichever is closer.
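Here’s a rough sketch of that optimization (a grid-based reverse KL objective minimized with `scipy.optimize.minimize`; the bimodal $P$ is the same illustrative mixture as before): starting the Gaussian on either side of zero sends it into the nearer mode.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)

def reverse_kl(params):
    mu, log_sigma = params
    q = norm.pdf(x, mu, np.exp(log_sigma))
    eps = 1e-300
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

# Two initializations, two different local minima: Q collapses onto the nearer mode.
for mu0 in (-1.0, 1.0):
    result = minimize(reverse_kl, x0=[mu0, 0.0], method="Nelder-Mead")
    mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
    print(f"init mu={mu0:+.1f}  ->  Q ends up near N({mu_opt:.2f}, {sigma_opt:.2f}^2)")
```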
Practical Implications
When to Use Forward KL
- Maximum Likelihood Estimation: Training generative models on data
- Density Estimation: When you need to cover all possibilities
- Conservative Approximations: When missing any mode is costly
Example: Training a language model. You want $Q$ (your model) to assign non-zero probability to every sentence that has appreciable probability under $P$ (the true language distribution).
When to Use Reverse KL
- Variational Inference: Approximating intractable posteriors
- Mode-Finding: When you only need one good solution
- Computational Tractability: When sampling from $Q$ must be easy
Example: Variational autoencoders (VAEs). The variational posterior $Q$ approximates the true posterior $P$ using reverse KL, naturally focusing on the most probable latent codes.
The Trade-off Visualized
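A short numeric sketch of the trade-off (same illustrative grid setup as the earlier snippets): the wide, moment-matched $Q$ wins on forward KL, the single-mode $Q$ wins on reverse KL, and neither wins both.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -2.5, 0.6) + 0.5 * norm.pdf(x, 2.5, 0.6)
eps = 1e-300

def kl(a, b):
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

q_forward = norm.pdf(x, 0.0, np.sqrt(6.61))   # moment-matched (forward-KL-style) fit
q_reverse = norm.pdf(x, 2.5, 0.6)             # single-mode (reverse-KL-style) fit

for name, q in [("wide/centered", q_forward), ("single-mode", q_reverse)]:
    print(f"{name:>13}:  forward KL = {kl(p, q):.3f}   reverse KL = {kl(q, p):.3f}")
```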
Connection to Other Concepts
Information Theory Perspective
Forward KL measures the expected number of extra nats (the excess code length) incurred when encoding samples from $P$ with a code optimized for $Q$ rather than for $P$:
\[D_{KL}(P \| Q) = H(P, Q) - H(P)\]
where $H(P, Q)$ is the cross-entropy and $H(P)$ is the entropy of $P$.
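A quick numerical check of this identity on discrete distributions (illustrative values, in nats):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy       = -np.sum(p * np.log(p))   # H(P)
kl            =  np.sum(p * np.log(p / q))

print(kl, cross_entropy - entropy)   # equal up to floating point: D_KL(P||Q) = H(P,Q) - H(P)
```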
Variational Inference
The Evidence Lower Bound (ELBO) in variational inference:
\[\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))\]
Maximizing the ELBO is equivalent to minimizing the reverse KL between the variational posterior $q(z|x)$ and the true posterior $p(z|x)$, because $\log p(x) = \text{ELBO} + D_{KL}(q(z|x) \| p(z|x))$ and $\log p(x)$ doesn’t depend on $q$.
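As a toy illustration (a made-up conjugate model where everything is Gaussian, so the exact posterior and marginal likelihood are known in closed form): a Monte Carlo estimate of the ELBO is tight when $q$ equals the true posterior and drops by exactly $D_{KL}(q \| p(z|x))$ otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): p(z) = N(0, 1), p(x|z) = N(z, 1),
# so p(x) = N(0, 2) and the exact posterior is p(z|x) = N(x/2, 1/2).
def elbo(x, mu, sigma, n_samples=200_000):
    z = rng.normal(mu, sigma, size=n_samples)                                        # z ~ q(z|x)
    log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2                        # log p(x|z)
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2                              # log p(z)
    log_q     = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((z - mu) / sigma) ** 2  # log q(z|x)
    return np.mean(log_lik + log_prior - log_q)

x = 1.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)   # exact log p(x) = log N(1; 0, 2)

print(log_px)                                  # about -1.516
print(elbo(x, mu=x / 2, sigma=np.sqrt(0.5)))   # q = exact posterior: ELBO matches log p(x) (bound is tight)
print(elbo(x, mu=0.0, sigma=1.0))              # mismatched q: strictly smaller; the gap is KL(q || p(z|x))
```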
Expectation Maximization
In the variational (ELBO) view of EM, the E-step minimizes the reverse KL $D_{KL}(q(z) \| p(z|x, \theta))$ over $q$, driving it to zero by setting $q$ equal to the exact posterior, while the M-step maximizes the expected complete-data log-likelihood, a maximum-likelihood (forward-KL-style) update of the parameters $\theta$.
Summary
| Property | Forward KL $D_{KL}(P \Vert Q)$ | Reverse KL $D_{KL}(Q \Vert P)$ |
|---|---|---|
| Expectation under | $P$ (target) | $Q$ (approximation) |
| Behavior | Mean-seeking / Mode-covering | Mode-seeking / Zero-forcing |
| Penalty | $Q(x) \to 0$ where $P(x) > 0$ | $Q(x) > 0$ where $P(x) \to 0$ |
| Result | Over-dispersed $Q$ | Under-dispersed $Q$ |
| Use case | MLE, density estimation | Variational inference |
The choice between forward and reverse KL is not merely technical—it reflects a fundamental decision about what kind of errors we’re willing to tolerate. Forward KL says “never miss anything important,” while reverse KL says “never be confidently wrong.”
Understanding this distinction is essential for anyone working with probabilistic models, variational inference, or generative AI.
This post uses interactive visualizations built with React and Recharts. The mathematical notation is rendered with MathJax.
