RQL stepwise actor update

[Demo controls: flow steps F = 8; value readout V(xᶠ); policy updates = 0]

The teal curve is one flow trajectory: it carries noise x⁰ to an action xᶠ over the value landscape (bright = high value). Pressing train runs the RQL actor update: at every flow step f we take one cheap pathwise gradient of the next step's value, V(s, ·, f+1), and nudge that step's velocity uphill (the small amber arrows). There is no backpropagation through time (BPTT) through the chain, just one hop per step. Iterate, and the whole trajectory bends so that the final action lands on a high-value mode and the value readout climbs. (In the real method, a behavior-cloning (BC) term also keeps the policy near the data.)
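The one-hop update described above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the method's actual implementation: the velocities are a free table with one vector per flow step instead of a network, a single quadratic value stands in for the step-dependent V(s, ·, f+1), the flow is integrated with Euler steps of size 1/F, and the BC term is left out. The names `MODE`, `value`, `value_grad`, `rollout` and `stepwise_update` are hypothetical.

```python
import numpy as np

F = 8                          # number of flow steps, as in the demo
DT = 1.0 / F                   # Euler step size (assumption)
MODE = np.array([2.0, 1.0])    # hypothetical location of the high-value mode


def value(x):
    """Toy value: a quadratic bowl peaking at MODE (assumption; shared by all steps)."""
    return float(-0.5 * np.sum((x - MODE) ** 2))


def value_grad(x):
    """Analytic gradient of `value` with respect to x."""
    return -(x - MODE)


def rollout(x0, velocities):
    """Integrate x_{f+1} = x_f + DT * v_f and return all F + 1 points."""
    xs = [x0]
    for v in velocities:
        xs.append(xs[-1] + DT * v)
    return xs


def stepwise_update(x0, velocities, lr=4.0):
    """One actor update: each velocity gets a one-hop pathwise gradient."""
    xs = rollout(x0, velocities)
    new_v = velocities.copy()
    for f in range(F):
        # Treat x_f as a constant (no gradient flows back through earlier steps).
        # Then x_{f+1} = x_f + DT * v_f gives dV(x_{f+1})/dv_f = DT * grad V(x_{f+1}).
        new_v[f] += lr * DT * value_grad(xs[f + 1])
    return new_v


x0 = np.array([-1.0, -0.5])        # the "noise" starting point
velocities = np.zeros((F, 2))      # start with a trajectory that does not move
before = value(rollout(x0, velocities)[-1])
for _ in range(1000):
    velocities = stepwise_update(x0, velocities)
final_action = rollout(x0, velocities)[-1]
after = value(final_action)
print(f"value before: {before:.3f}, after: {after:.3f}")
```

Because this toy uses the same value at every step, the trajectory ends up jumping to the mode as early as it can; in the setting the caption describes, the step-dependent value and the BC term would shape the intermediate points differently.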