The same denoising trajectory, with each of the four panels staging its own "way of asking Q."
① Behavior flow policy. Pure flow denoising: start from z ~ N(0, I) and follow the velocity field v_t step by step to the clean action a₁. No Q is involved; this is the "chassis" that the other panels guide.
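A minimal sketch of panel ①, assuming a 1-D action and a hypothetical toy velocity field (`velocity` and `MU` are illustrative names, not the article's model). It Euler-integrates from noise z to a₁ and never consults Q.

```python
import math
import random

MU = 0.8  # hypothetical: the toy flow drifts toward actions near 0.8


def velocity(a: float, t: float) -> float:
    """Hypothetical toy velocity field v_t(a) for a 1-D action.

    A real behavior policy would use a trained network here; this toy
    ignores t and just pulls a toward MU with a small nonlinear wiggle.
    """
    return (MU - a) + 0.5 * math.sin(3.0 * a)


def denoise(z: float, n_steps: int = 10) -> list[float]:
    """Euler-integrate da/dt = v_t(a) from t = 0 (noise z) to t = 1.

    Returns the whole trajectory [a_0 = z, ..., a_1]; Q is never consulted.
    """
    dt = 1.0 / n_steps
    traj = [z]
    for k in range(n_steps):
        a = traj[-1]
        traj.append(a + dt * velocity(a, k * dt))
    return traj


random.seed(0)
z = random.gauss(0.0, 1.0)  # z ~ N(0, 1) in one dimension
traj = denoise(z)
print(f"z = {z:.3f}  ->  a1 = {traj[-1]:.3f}")
```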
② BPTT. First roll out (dashed) to a₁ and take ∇Q at a₁ (trustworthy), then multiply it by a chain of per-step Jacobians ∂a_{k+1}/∂a_k to backpropagate hop by hop to a_t. As the chain grows longer, the returned arrow jitters violently (high variance), and every query is expensive.
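The backward chain in panel ② can be sketched on the same kind of toy problem. `velocity`, `dvelocity_da`, `q_value`, `dq_da`, `MU` and `A_STAR` are hypothetical stand-ins; what matters is the structure: one rollout from a_t to a₁, one ∇Q at a₁, then one Jacobian factor per Euler step on the way back.

```python
import math

MU, A_STAR = 0.8, 1.2  # hypothetical constants for the toy flow and toy critic


def velocity(a: float, t: float) -> float:
    # Hypothetical toy velocity field v_t(a) (time-independent for simplicity).
    return (MU - a) + 0.5 * math.sin(3.0 * a)


def dvelocity_da(a: float, t: float) -> float:
    # Analytic derivative of the toy velocity with respect to the action.
    return -1.0 + 1.5 * math.cos(3.0 * a)


def q_value(a: float) -> float:
    # Hypothetical critic peaked at A_STAR; assumed accurate on clean actions.
    return -((a - A_STAR) ** 2)


def dq_da(a: float) -> float:
    return -2.0 * (a - A_STAR)


def rollout(a_t: float, t: float, n_steps: int) -> tuple[list[float], float]:
    """Euler rollout from time t to 1; returns every intermediate state and dt."""
    dt = (1.0 - t) / n_steps
    states = [a_t]
    for k in range(n_steps):
        a = states[-1]
        states.append(a + dt * velocity(a, t + k * dt))
    return states, dt


def bptt_grad(a_t: float, t: float, n_steps: int) -> float:
    """dQ(a_1)/da_t by backpropagating through every Euler step."""
    states, dt = rollout(a_t, t, n_steps)
    grad = dq_da(states[-1])  # trustworthy: Q is queried at the clean a_1
    # Walk the chain backwards: each step a_{k+1} = a_k + dt * v(a_k)
    # contributes the Jacobian da_{k+1}/da_k = 1 + dt * dv/da(a_k).
    for k in reversed(range(n_steps)):
        grad *= 1.0 + dt * dvelocity_da(states[k], t + k * dt)
    return grad


print(bptt_grad(a_t=0.3, t=0.5, n_steps=20))
```

The loop makes the cost visible: both the work and the number of Jacobian factors multiplied into the gradient grow linearly with the number of steps between a_t and a₁. A central finite difference through `rollout` is a quick way to confirm the chain rule is wired correctly.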
③ Noisy-point gradient (OOD). Query Q directly at the half-noisy a_t. But Q has never seen such half-finished inputs (they are out of distribution), so the gradient direction is untrustworthy; the panel marks it with a question mark and arrows pointing every which way.
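Panel ③ is the cheapest query of all: a single derivative of Q at a_t. To make "never seen such an input" concrete, the sketch below assumes a hypothetical critic `q_hat` that was fit only on clean actions in an assumed range [LO, HI] and extrapolates arbitrarily outside it. The wiggle term is an invented stand-in for unconstrained behavior, not a model of any real critic.

```python
import math

A_STAR = 1.2       # hypothetical: the critic's preferred clean action
LO, HI = 0.5, 1.5  # hypothetical: range of clean actions the critic was trained on


def q_hat(a: float) -> float:
    """Hypothetical learned critic.

    Inside [LO, HI] it matches the toy target -(a - A_STAR)**2. Outside, nothing
    constrained it during training, so its extrapolation is modelled as an
    arbitrary wiggle whose size grows with the distance from the data.
    """
    dist = max(0.0, LO - a) + max(0.0, a - HI)
    return -((a - A_STAR) ** 2) + 5.0 * dist * math.sin(7.0 * a)


def noisy_point_grad(a_t: float, h: float = 1e-5) -> float:
    """Panel 3: differentiate Q directly at the half-noisy a_t (central difference)."""
    return (q_hat(a_t + h) - q_hat(a_t - h)) / (2.0 * h)


for a_t in (1.0, -1.0, 2.5):
    in_range_fit = -2.0 * (a_t - A_STAR)  # slope of the critic's clean-data fit
    print(f"a_t={a_t:+.1f}  noisy-point grad={noisy_point_grad(a_t):+8.3f}  "
          f"in-range fit={in_range_fit:+.3f}")
```

Inside the training range the two columns agree; outside it they do not, and in this toy at a_t = 2.5 the noisy-point gradient even has the opposite sign to the in-range fit.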
④ QGF (ours). Take a one-step (dashed) jump along v_t to the clean estimate â₁, take ∇Q at â₁ (trustworthy), and set the Jacobian J = ∂â₁/∂a_t to the identity I, carrying the gradient back to a_t unchanged. This is cheap, stable, and points in the right direction.
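Panel ④ in the same toy setting, assuming the usual one-step clean estimate for a flow, â₁ = a_t + (1 − t)·v_t(a_t). `velocity`, `dq_da`, `clean_estimate`, `MU` and `A_STAR` are hypothetical names. How the returned gradient is then used to steer the flow is left out, since this passage does not specify it.

```python
import math

MU, A_STAR = 0.8, 1.2  # hypothetical constants for the toy flow and toy critic


def velocity(a: float, t: float) -> float:
    # Hypothetical toy velocity field v_t(a) (time-independent for simplicity).
    return (MU - a) + 0.5 * math.sin(3.0 * a)


def dq_da(a: float) -> float:
    # Gradient of the hypothetical critic Q(a) = -(a - A_STAR)**2.
    return -2.0 * (a - A_STAR)


def clean_estimate(a_t: float, t: float) -> float:
    """One jump over the remaining time (1 - t): a_hat_1 = a_t + (1 - t) * v_t(a_t)."""
    return a_t + (1.0 - t) * velocity(a_t, t)


def qgf_grad(a_t: float, t: float) -> float:
    """Panel 4 sketch: grad Q at the clean estimate, carried back with J = I."""
    a_hat_1 = clean_estimate(a_t, t)
    # The exact chain rule would multiply by d a_hat_1 / d a_t; here that
    # Jacobian is replaced by the identity, so the gradient is returned as-is.
    return dq_da(a_hat_1)


print(qgf_grad(a_t=0.3, t=0.5))
```

Whatever t is, this costs one velocity evaluation and one critic-gradient evaluation, with no Jacobian products, in contrast to the per-step chain of the BPTT panel.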