5

Route B "unrolls" the policy: the single action node at in the original MDP (top) splits into the F+1 denoise nodes x⁰→…→xF (bottom), so each environment step now contains F decision steps. Crucially, the environment chain is unchanged — same states, same reward, same γ on the real transition. The extra structure lives only inside one action. That is exactly why it buys stepwise gradients but threatens to inflate the value-learning horizon.