Q-guided flow denoising · 1-D demo

The data distribution has three peaks (a=−2,0,1), and the true optimum is a*=1. The critic Q has a "spurious bump" near a≈3, an out-of-distribution (OOD) overestimate. Switch the guidance method, adjust the weight, and watch where the particles land.

Critic Q (the value the model believes)

True return −(a−1)²

Guidance weight 1/β: 2.5

Landed samples

Mean true return (higher is better)

—

Mean critic Q (what the model believes)

—

Hint: with no guidance, particles land at random across the three peaks. Switch to QGF and raise the weight, and they concentrate at a*=1, where the true return is highest. Switch to OOD and raise the weight, and the particles are drawn toward the "spurious bump" at a≈3: the critic Q is very high there, but the true return is very low. This is guidance exploiting the critic's OOD error.
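The gap between the two readouts can be checked numerically. The sketch below is only an illustration under stated assumptions: the true return −(a−1)² comes from the legend, but the critic's shape (true return plus a Gaussian bump at a=3, with hypothetical height and width) and the idealised landing positions are made up here. It does not reproduce the demo's QGF or OOD update rules, which are not given on this page.

```python
import numpy as np


def true_return(a):
    """True return from the demo's legend: -(a - 1)^2, maximised at a* = 1."""
    return -(a - 1.0) ** 2


def critic_q(a, bump_height=8.0, bump_center=3.0, bump_width=0.5):
    """Hypothetical critic: the true return plus a Gaussian bump near a = 3.

    The bump parameters are illustrative assumptions, not the demo's values.
    """
    bump = bump_height * np.exp(-0.5 * ((a - bump_center) / bump_width) ** 2)
    return true_return(a) + bump


def readouts(samples):
    """Return (mean true return, mean critic Q) for a set of landed samples."""
    samples = np.asarray(samples, dtype=float)
    return float(true_return(samples).mean()), float(critic_q(samples).mean())


rng = np.random.default_rng(0)

# Idealised landing positions for the three cases described in the hint.
landed_unguided = rng.choice([-2.0, 0.0, 1.0], size=300)  # spread over the peaks
landed_at_optimum = np.full(300, 1.0)                     # concentrated at a* = 1
landed_at_bump = np.full(300, 3.0)                        # pulled to the bump

ret_none, q_none = readouts(landed_unguided)
ret_opt, q_opt = readouts(landed_at_optimum)
ret_bump, q_bump = readouts(landed_at_bump)

for name, ret, q in [
    ("unguided", ret_none, q_none),
    ("at a*=1", ret_opt, q_opt),
    ("at a≈3", ret_bump, q_bump),
]:
    print(f"{name:10s} mean true return = {ret:7.3f}   mean critic Q = {q:7.3f}")
```

Under these assumptions, the samples at a≈3 score highest on mean critic Q and lowest on mean true return, which is the mismatch the hint describes: a readout based on the critic alone would rank the worst outcome as the best.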