Curse of horizon: naive ×F vs RQL

Figure controls: denoise steps F = 10, discount γ = 0.970. Readouts: naive horizon H·F, RQL horizon H.

Off-policy value error compounds along the bootstrap chain, so it scales with the effective horizon H ≈ 1/(1−γ). If you naively bootstrap through every denoise step, the chain is F times longer, so the error envelope grows by a factor of F (red curve). RQL keeps the curve flat in F (teal curve): the denoise steps are deterministic and undiscounted, so they add length without adding variance, and the effective horizon stays H.
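
To make the scaling concrete, here is a minimal sketch that computes the two horizons from the quantities named above. It assumes only the approximation H ≈ 1/(1−γ) and the ×F chain-length argument; the function names are hypothetical and are not taken from any RQL implementation, and the numbers are horizon lengths, not measured value errors.

```python
def effective_horizon(gamma: float) -> float:
    """Effective horizon H ~ 1 / (1 - gamma) for a discount in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


def naive_horizon(gamma: float, denoise_steps: int) -> float:
    """Hypothetical helper: bootstrapping through every denoise step
    multiplies the chain length by F, giving H * F."""
    if denoise_steps < 1:
        raise ValueError("denoise_steps must be at least 1")
    return effective_horizon(gamma) * denoise_steps


def rql_horizon(gamma: float, denoise_steps: int) -> float:
    """Hypothetical helper: the denoise steps are treated as deterministic
    and undiscounted, so the horizon stays H regardless of F."""
    if denoise_steps < 1:
        raise ValueError("denoise_steps must be at least 1")
    return effective_horizon(gamma)


# Sweep F at the discount shown in the figure controls (gamma = 0.970).
gamma = 0.97
for f in (1, 5, 10, 20):
    print(f"F={f:2d}  naive={naive_horizon(gamma, f):7.1f}  "
          f"RQL={rql_horizon(gamma, f):5.1f}")
```

Under these assumptions the naive column grows linearly in F while the RQL column is constant, which is the red-versus-teal contrast the figure describes.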