Token RL — the KL anchor pulls the optimum back toward base

KL coefficient β 0.40

shift |x*−x₀|–

–

Steel-blue = the reward landscape R (a near peak at the base, a higher peak far away). Lower β and the optimum π_θ* climbs toward the far peak — but past a threshold it enters the red off-distribution band, where the critic extrapolates badly and training collapses. Raise β and it gets pinned near the base. Stable vs. surpassing the demos are two ends of the same tension — and the support bias isn't flow-specific, it comes from the KL anchor everyone adds for stability.