Reparameterization vs score-function gradients

[Interactive demo: μ slider (shown at -0.70) and gradient-estimate readout]

The Q landscape is the hill. The policy draws an action a = tanh(μ + σε). Reparam freezes the die ε, so a is a smooth function of μ: read the slope ∇_aQ at that one point, chain it through ∂a/∂μ, and push μ to slide the ball uphill. The result is a clean, low-variance gradient. Score-fn re-rolls ε every step and only looks at where the points land and how high Q is there, so the gradient estimate jitters with every roll (watch the arrow wobble). It is high variance, and it never uses Q's slope. Drag μ and hit re-roll ε.
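The contrast can be sketched numerically. The snippet below is a minimal illustration and not the demo's actual code: the hill `q`, its slope `dq_da`, and the value σ = 0.3 are assumptions made up for the example, and μ = -0.70 is simply the slider value shown. Both estimators target the same quantity, the derivative of E[Q(a)] with respect to μ. The score-function form uses ∇_μ log p(a) = ε/σ, which holds because the tanh Jacobian does not depend on μ.

```python
import numpy as np

def q(a):
    # Hypothetical Q landscape (an assumption): one hill peaked at a = 0.5.
    return -(a - 0.5) ** 2

def dq_da(a):
    # Slope of the hypothetical hill with respect to the action.
    return -2.0 * (a - 0.5)

def reparam_grad(mu, sigma, eps):
    # Pathwise estimate: dQ/da times da/dmu, where da/dmu = 1 - tanh(u)^2.
    a = np.tanh(mu + sigma * eps)
    return dq_da(a) * (1.0 - a ** 2)

def score_fn_grad(mu, sigma, eps):
    # Score-function estimate: Q(a) times d log p(a) / dmu = eps / sigma.
    # Only the value of Q is used, never its slope.
    a = np.tanh(mu + sigma * eps)
    return q(a) * eps / sigma

rng = np.random.default_rng(0)
mu, sigma = -0.70, 0.3          # sigma is an assumed value
eps = rng.standard_normal(200_000)

g_rp = reparam_grad(mu, sigma, eps)
g_sf = score_fn_grad(mu, sigma, eps)

print("reparam  mean, variance:", g_rp.mean(), g_rp.var())
print("score-fn mean, variance:", g_sf.mean(), g_sf.var())
```

With many samples the two means should roughly agree, while the per-sample variance of the score-function estimate should be far larger. That spread is the wobble in the arrow.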