The extended MDP: one env step becomes F decision steps

[Interactive figure: original MDP (top) and extended MDP (bottom); slider: denoise steps F = 5]

Route B "unrolls" the policy: the single action node a_t in the original MDP (top) splits into the F+1 denoise nodes x⁰→…→x^F (bottom), so each environment step now contains F decision steps, one per denoise transition. Crucially, the environment chain is unchanged: same states, same reward, same γ on the real transition. The extra structure lives only inside one action. Treating each denoise transition as its own decision is what buys stepwise gradients; it is also what threatens to inflate the value-learning horizon, because an episode of T environment steps now spans T·F decision steps.
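To make the bookkeeping concrete, here is a minimal sketch of the unrolled chain as a list of (reward, discount) pairs. It rests on two assumptions that the text above suggests but does not spell out: the F-1 inner denoise steps carry zero reward and a discount of 1, and the environment reward and γ are attached to the last denoise step, which triggers the real transition. The names `extended_schedule` and `discounted_return` are hypothetical helpers for illustration only.

```python
from typing import List, Tuple

Step = Tuple[float, float]  # (reward, discount applied to what follows)


def extended_schedule(rewards: List[float], gamma: float, F: int) -> List[Step]:
    """Unroll each environment step into F decision steps.

    Assumption: inner denoise steps have reward 0 and discount 1;
    only the final denoise step carries the real reward and gamma.
    """
    steps: List[Step] = []
    for r in rewards:
        steps.extend([(0.0, 1.0)] * (F - 1))  # inner denoise transitions
        steps.append((r, gamma))              # real environment transition
    return steps


def discounted_return(steps: List[Step]) -> float:
    """Backward recursion g = r + d * g over a chain of (r, d) pairs."""
    g = 0.0
    for r, d in reversed(steps):
        g = r + d * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative rewards for T = 3 environment steps
gamma = 0.9
F = 5

original = [(r, gamma) for r in rewards]
extended = extended_schedule(rewards, gamma, F)

print(len(original), len(extended))  # 3 15
print(discounted_return(original) == discounted_return(extended))  # True
```

Under these assumptions the return is identical in both chains, which matches the claim that the environment is unchanged. What differs is the chain length: a value target now has to be propagated through T·F decision steps rather than T.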