0.40
shift |x*−x₀|

Steel-blue = the reward landscape R (a near peak at the base, a higher peak far away). Lower β and the optimum πθ* climbs toward the far peak — but past a threshold it enters the red off-distribution band, where the critic extrapolates badly and training collapses. Raise β and it gets pinned near the base. Stable vs. surpassing the demos are two ends of the same tension — and the support bias isn't flow-specific, it comes from the KL anchor everyone adds for stability.