Q-Guided Flow, From the Ground Up: Guiding a Flow Policy with a Value Function at Test Time

A from-scratch, step-by-step tutorial whose goal is to fully unpack the Q-Guided Flow (QGF) paper. We first build the background you need (flow models, the critic), then state the core problem, then take the equations apart one by one — with interactive geometric visualizations woven throughout so you don’t just follow the idea, you see it. Every demo is draggable and playable, so play as you read.

Q-guided flow in a 2-D action space: noise particles drift along the reference velocity field while the gradient of $Q$ gently nudges them toward high-value regions. This is exactly the mechanism the whole post unpacks — play with it first to build intuition, then read on.

1. The big picture: what is this paper actually doing?

The setting is reinforcement learning (RL) for continuous control (e.g. robotics): we want a policy $\pi(a \mid s)$ that, given the current state $s$ (what the robot perceives), outputs a good action $a$ (how to drive the motors).

The most expressive continuous-control policies today are diffusion / flow models: they can fit very complex, multi-modal action distributions and are stable to train by imitation. But shoehorning them into RL to “chase high reward” is painful — you either need a bespoke training objective, or you have to backpropagate through the entire denoising process, which is unstable and hard to scale.

QGF’s core claim is: decouple the two things completely, train each separately with the most stable standard method, then stitch them together at test time.

  1. A reference flow policy $v_\theta$, trained by behavior cloning (BC) — it only learns “in this state, which actions are reasonable / look like the data,” with no guarantee of being “best.”
  2. A critic $Q_\phi(s,a)$ (a value function), trained by standard TD learning — it scores actions, telling you “how high is this action’s expected future return.”

Then at test time (inference), the gradient of $Q$ is used to “guide” the flow policy’s sampling so that it generates high-value actions — with no further policy training at all. This is what we mean by “test-time policy improvement.”

Below is the paper’s “one-figure summary,” which I’ve turned into an auto-looping animated version: one denoising trajectory, four panels each performing its own “way of asking $Q$.” Skim it now for a first impression; we’ll return at the end and explain every detail. Try dragging the center of the blue contours in any panel (that’s the peak of $Q$) — every gradient arrow recomputes live.

Four ways of taking the gradient, side by side. ① Pure flow denoising (no $Q$). ② BPTT backprop: the arrow jitters violently by the time it reaches $a_t$, which is what high variance looks like. ③ OOD queries $Q$ directly at the noisy point, so its arrow points every which way. ④ QGF jumps one step to $\hat a_1$, takes the gradient there, and carries it back unchanged. Don’t worry if it’s opaque now; this is exactly what the post explains.

2. Background: what is a “flow model”?

A flow model (flow / diffusion) is a generative model: it “sculpts” simple noise into a complex distribution. Here, that target distribution is “the good actions that appeared in the data, given state $s$.”

It works by denoising: start from pure Gaussian noise $a_0 \sim \mathcal{N}(0, I)$ and walk along time $t$ from $0$ to $1$, gradually turning it into a clean action sample $a_1$. Which direction to push at each step is decided by a learned velocity field $v_\theta(s, a_t, t)$. The whole process is solving an ordinary differential equation (ODE):

\[\frac{da_t}{dt} = v_\theta(s, a_t, t)\]

In practice we approximate it with small Euler steps:

\[a_{t+\Delta t} = a_t + v_\theta(s, a_t, t)\cdot \Delta t\]

Think of the velocity field as a body of “flowing water”: every location has an arrow saying “push this way,” and noise particles drift along those arrows, eventually pooling at the data distribution’s “peaks” (each peak = one kind of reasonable action). That is the entire mechanism of flow-policy sampling — keep the picture of “stepping along the velocity field” in mind, because everything below is built on it.
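To make “stepping along the velocity field” concrete, here is a minimal 1-D sketch. Everything in it is an assumption for illustration rather than anything from the paper: the data is a toy mixture with peaks at $-2, 0, 1$ (matching the 1-D demo later in the post), the peak width `SIGMA` is made up, and instead of a trained network $v_\theta$ we use the closed-form ideal velocity of a linear-interpolation flow for that mixture.

```python
import math

MODES = (-2.0, 0.0, 1.0)  # toy data peaks (assumed, equal weights)
SIGMA = 0.1               # assumed width of each peak


def posterior_mean(a_t, t):
    """E[a_1 | a_t] when a_t = (1 - t) * a_0 + t * a_1 and a_0 ~ N(0, 1)."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    logits = [-((a_t - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)  # subtract the max before exp() for numerical safety
    weights = [math.exp(l - top) for l in logits]
    shrink = t * SIGMA ** 2 / var_t
    means = [m + shrink * (a_t - t * m) for m in MODES]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)


def velocity(a_t, t):
    """Ideal velocity field of the toy flow; only valid for t < 1."""
    return (posterior_mean(a_t, t) - a_t) / (1.0 - t)


def sample(a0, num_steps=50):
    """Integrate da/dt = v(a, t) from t = 0 to t = 1 with Euler steps."""
    a, dt = a0, 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(a, k * dt)
    return a


for a0 in (-1.5, 0.0, 1.5):
    print(f"a_0 = {a0:+.1f}  ->  a_1 = {sample(a0):+.3f}")
```

In this toy, each noise sample should drift to one of the three peaks, and which peak it reaches depends only on where it started.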

3. The core idea: how to fold $Q$ into the policy

We want an improved policy $\pi$: it generates higher-value actions, but it mustn’t stray too far — otherwise it would find weird actions that $Q$ mistakenly scores as “high.” This “high value, but don’t stray” requirement is written formally as a KL-regularized reward-maximization problem:

\[\max_{\pi}\ \mathbb{E}_{a\sim \pi(\cdot\mid s)}\big[Q(s,a)\big]\ -\ \beta\, D_{\mathrm{KL}}\big(\pi(\cdot\mid s)\,\|\,\hat\pi(\cdot\mid s)\big)\]

The first term wants high value; the second ($\beta$ times the KL divergence) penalizes “straying from the reference policy $\hat\pi$.” This problem has a beautiful closed-form solution:

\[\pi(a\mid s)\ \propto\ \hat\pi(a\mid s)\cdot \exp\!\left(\tfrac{1}{\beta}\, Q(s,a)\right)\]

Intuitively: take the reference distribution and multiply it by a re-weighting factor $\exp(Q/\beta)$ — high-$Q$ actions are amplified, low-$Q$ ones are suppressed. Here $\beta$ is a “temperature”: smaller $\beta$ means more aggressive re-weighting (almost only the highest-$Q$ action survives); larger $\beta$ stays closer to the original reference. Geometrically, it just “pinches” the reference density according to the height of $Q$.
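The re-weighting is easy to check numerically on a finite action set. The sketch below is only an illustration: the three reference probabilities and $Q$ values are made-up numbers, and `tilted_policy` is a hypothetical helper name.

```python
import math


def tilted_policy(ref_probs, q_values, beta):
    """Return pi(a) proportional to ref(a) * exp(Q(a) / beta)."""
    logits = [math.log(p) + q / beta for p, q in zip(ref_probs, q_values)]
    top = max(logits)  # subtract the max before exp() for numerical safety
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


ref = [0.5, 0.3, 0.2]  # reference policy over three actions (made up)
q = [0.0, 1.0, 2.0]    # critic scores for the same actions (made up)

sharp = tilted_policy(ref, q, beta=0.1)   # aggressive re-weighting
soft = tilted_policy(ref, q, beta=100.0)  # stays close to the reference
print("beta = 0.1  :", [round(p, 4) for p in sharp])
print("beta = 100  :", [round(p, 4) for p in soft])
```

With a small $\beta$ almost all mass lands on the highest-$Q$ action even though the reference gave it the least probability; with a large $\beta$ the result is nearly the reference itself.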

The key step: diffusion models learn the score function $\nabla_a \log p(a)$ directly, and for Gaussian noise a flow model’s velocity field can be converted into it. Taking $\nabla_a \log$ of both sides above (the normalizer $Z(s)$ hidden in the $\propto$ doesn’t depend on $a$, so it vanishes under the derivative):

\[\nabla_a \log \pi(a\mid s)\ =\ \underbrace{\nabla_a \log \hat\pi(a\mid s)}_{\text{reference policy, }v_\theta\text{ already knows it}}\ +\ \underbrace{\tfrac{1}{\beta}\,\nabla_a Q(s,a)}_{\text{guidance term}}\]

This is exactly classifier guidance, except the “classifier” is replaced by the learned $Q$ function. The conclusion is clean: to sample from the improved policy $\pi$, run the reference flow’s denoising as usual, but add the guidance term $\tfrac1\beta\nabla Q$ at every step.
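As a sanity check, here is a 1-D worked example under assumptions that are ours, not the paper’s: a standard Gaussian reference $\hat\pi(a\mid s)=\mathcal N(0,1)$ and a quadratic critic $Q(s,a)=-(a-1)^2$. The tilted policy is then itself Gaussian:

\[\pi(a\mid s)\ \propto\ \exp\!\Big(-\tfrac12 a^2-\tfrac1\beta (a-1)^2\Big)\quad\Longrightarrow\quad \pi(a\mid s)=\mathcal N\!\Big(\tfrac{2}{\beta+2},\ \tfrac{\beta}{\beta+2}\Big)\]

and its score splits into exactly the two terms above:

\[\nabla_a\log\pi(a\mid s)\ =\ -\Big(1+\tfrac2\beta\Big)a+\tfrac2\beta\ =\ \underbrace{-a}_{\nabla_a\log\hat\pi(a\mid s)}\ +\ \tfrac1\beta\,\underbrace{\big(-2(a-1)\big)}_{\nabla_a Q(s,a)}\]

As $\beta\to 0$ the mean $\tfrac{2}{\beta+2}$ moves to the critic’s optimum $a=1$ and the variance shrinks to $0$; as $\beta\to\infty$ we recover the reference $\mathcal N(0,1)$.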

Generalizing it to the noisy action $a_t$ during denoising (those half-finished intermediate states):

\[\nabla_{a_t}\log \pi(a_t\mid s)\ \approx\ \nabla_{a_t}\log\hat\pi(a_t\mid s)\ +\ \tfrac{1}{\beta}\,\nabla_{a_t} Q(s,a_t)\]

The reference-score term is already provided by $v_\theta$. So the only question becomes: where, and how, should the guidance term $\nabla_{a_t}Q$ be computed? This is the crux of the whole paper.

4. The real difficulty: where should $\nabla_a Q$ be computed?

Here’s the trap: $Q$ was only ever trained on the “clean, complete actions” in the data. But the intermediate $a_t$ during denoising (especially early, near pure noise) is very far from that training data — it is out-of-distribution (OOD). On inputs it has never seen, the neural network $Q$ gives unreliable values and gradients. The paper compares three approaches:

  • Approach 1 (most direct, but it fails): the OOD gradient $\nabla_{a_t}Q(s,a_t)$. Query $Q$’s gradient directly at the noisy point $a_t$. The problem is the one above: at an OOD noisy point, $Q$’s gradient can point in an arbitrarily wrong direction, and may even “exploit” the critic — finding a region $Q$ mistakenly scores high but is actually bad.

  • Approach 2 (more principled, but too expensive / unstable): the BPTT gradient $\nabla_{a_t}Q\big(s,\mathrm{ODE}(a_t)\big)$. Since the flow deterministically maps $a_t$ to a clean action $a_1=\mathrm{ODE}(a_t)$, define $Q$ to always be queried on the clean action (in-distribution, trustworthy). The cost: this gradient needs backpropagation through the entire denoising chain (BPTT = backpropagation through time) — expensive, and extremely sensitive to noise (a tiny perturbation of $a_t$ makes the gradient direction lurch wildly, like a “butterfly effect,” with huge variance).

  • Approach 3 (QGF): extrapolate one step to estimate a clean action, and query $Q$ there. Neither run the whole chain (avoid BPTT’s cost and jitter) nor query at the raw noisy point (avoid OOD’s unreliability). Details next section.

Before going further, let’s see “where each of the three approaches queries $Q$.” Drag time $t$ and the noisy point $a_t$ and you’ll see three markers: orange $a_t$ (OOD’s query point), cyan $\hat a_1$ (QGF’s query point, next section), and purple $\mathrm{ODE}(a_t)$ (BPTT’s query point). Notice: early in denoising (small $t$), the orange point often lands in the “untrustworthy OOD region,” while the cyan / purple points already sit on the data peak; only as $t\to 1$ do all three coincide.

The three guidance gradients’ query points, compared. QGF’s trick: use a one-step extrapolation to get an estimate of the clean action, and query $Q$ there — cheap, and inside the trusted region.
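The three query points can also be computed by hand. The sketch below is a toy under loudly stated assumptions: the same made-up three-peak mixture and closed-form ideal velocity as in the 1-D demo setting (not a trained $v_\theta$), and a hypothetical critic $Q(a)=-(a-1)^2$ plus a spurious Gaussian bump at $a=3$ whose height and width are invented. It evaluates $Q$’s derivative at the noisy point $a_t$ (OOD), at the one-step extrapolation $\hat a_1=a_t+(1-t)\,v$ (QGF), and at the Euler-integrated endpoint $\mathrm{ODE}(a_t)$ (the point BPTT queries, before any backpropagation).

```python
import math

MODES = (-2.0, 0.0, 1.0)  # toy data peaks (assumed, equal weights)
SIGMA = 0.1               # assumed width of each peak


def posterior_mean(a_t, t):
    """E[a_1 | a_t] for the linear-interpolation toy flow."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    logits = [-((a_t - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    shrink = t * SIGMA ** 2 / var_t
    means = [m + shrink * (a_t - t * m) for m in MODES]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)


def velocity(a_t, t):
    """Ideal velocity field of the toy flow; only valid for t < 1."""
    return (posterior_mean(a_t, t) - a_t) / (1.0 - t)


def q_grad(a):
    """dQ/da for Q(a) = -(a - 1)^2 + 4 * exp(-(a - 3)^2 / (2 * 0.5^2))."""
    bump = 4.0 * math.exp(-((a - 3.0) ** 2) / (2.0 * 0.5 ** 2))
    return -2.0 * (a - 1.0) - bump * (a - 3.0) / 0.5 ** 2


def ode_endpoint(a_t, t, num_steps=20):
    """Euler-integrate the remaining flow from time t to time 1."""
    dt = (1.0 - t) / num_steps
    a = a_t
    for k in range(num_steps):
        a = a + dt * velocity(a, t + k * dt)
    return a


a_t, t = 2.5, 0.6                         # a noisy point far to the right
a_hat = a_t + (1.0 - t) * velocity(a_t, t)  # QGF's query point
a_ode = ode_endpoint(a_t, t)                # BPTT's query point

print(f"OOD  queries Q at a_t   = {a_t:.3f}, dQ/da = {q_grad(a_t):+.3f}")
print(f"QGF  queries Q at a_hat = {a_hat:.3f}, dQ/da = {q_grad(a_hat):+.3f}")
print(f"BPTT queries Q at ODE   = {a_ode:.3f}, dQ/da = {q_grad(a_ode):+.3f}")
```

In this toy the OOD derivative is positive (it points at the spurious bump), while the derivatives at the two clean query points are negative (they point back toward $a=1$).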

5. QGF’s solution: one-step estimate + drop the Jacobian

Step 1: approximate the clean action with a single Euler step. Instead of integrating the whole ODE, stand at $a_t$ and “jump” all the way to $t=1$ in one shot along the reference velocity field $v_\theta$:

\[\hat a_1\ =\ a_t\ +\ v_\theta(s,a_t,t)\cdot(1-t)\]

(Geometrically, this is the cyan arrow in the panel above: from $a_t$, follow the velocity direction for the remaining time $1-t$ and jump to the endpoint estimate.) There’s a lovely property here: for a standard linear-interpolation flow with an ideally trained velocity field, this one-step estimate exactly equals the posterior mean $\mathbb{E}[a_1\mid a_t]$, i.e. “the flow model’s best guess of the clean action.” Scoring it with $Q$ is cheap, and the query lands near the data distribution, so it is a reasonable place to ask.
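The posterior-mean property follows in two lines, assuming the usual linear-interpolation path $a_t=(1-t)\,a_0+t\,a_1$ and a velocity field at its regression optimum $v^\star(s,a_t,t)=\mathbb E[a_1-a_0\mid a_t]$ (an idealization; a trained $v_\theta$ only approximates it). First rewrite the displacement in terms of $a_t$:

\[(1-t)\,(a_1-a_0)\ =\ a_1-\big((1-t)\,a_0+t\,a_1\big)\ =\ a_1-a_t\]

Then plug it into the one-step estimate:

\[\hat a_1\ =\ a_t+(1-t)\,\mathbb E[a_1-a_0\mid a_t]\ =\ a_t+\mathbb E[a_1-a_t\mid a_t]\ =\ \mathbb E[a_1\mid a_t]\]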

Step 2: write the gradient with the chain rule, then replace the Jacobian with the identity. Expand the gradient of $Q(s,a_1)$ through $\hat a_1$:

\[\nabla_{a_t} Q(s,a_1)\ \approx\ \nabla_{a_t} Q(s,\hat a_1)\ =\ \underbrace{\left(\tfrac{\partial \hat a_1}{\partial a_t}\right)^{\!\top}}_{J^{\top}}\ \nabla_{\hat a_1} Q(s,\hat a_1)\]

Here $J=\partial\hat a_1/\partial a_t$ is “the Jacobian of the clean estimate with respect to the noisy point.” But it requires differentiating through the neural network $v_\theta$, which in practice is often ill-conditioned and amplifies noise. QGF simply replaces $J$ with the identity matrix $I$:

\[\boxed{\ \nabla_{a_t} Q(s,a_1)\ \approx\ \hat J^{\top}\,\nabla_{\hat a_1} Q(s,\hat a_1)\,,\quad \hat J=I,\ \ \hat a_1=a_t+v_\theta(s,a_t,t)\,(1-t)\ }\]

In other words: the guidance term is just the gradient of $Q$ at the “one-step clean estimate $\hat a_1$,” used directly. Cheap (one extra $Q$-gradient), trustworthy ($\hat a_1$ is a sensible action), and stable (no long chain, no Jacobian).

Adding this guidance term to every denoising step gives the full inference algorithm:

\[\begin{aligned} &a_0\sim\mathcal N(0,I)\\ &\textbf{for } t=0,\tfrac1T,\dots,1-\tfrac1T:\\ &\qquad \hat a_1 \leftarrow a_t+(1-t)\,v_\theta(s,a_t,t) &&\text{// one Euler step: estimate the clean action}\\ &\qquad g \leftarrow \nabla_{\hat a_1} Q_\phi(s,\hat a_1) &&\text{// gradient of }Q\text{ at the clean estimate}\\ &\qquad a_{t+1/T} \leftarrow a_t+\tfrac1T\big(v_\theta(s,a_t,t)+\tfrac1\beta\,g\big) &&\text{// velocity + guidance, one step together}\\ &\textbf{return } a_1 \end{aligned}\]

At each step, the flow velocity $v_\theta$ is responsible for “staying on the data manifold,” and the guidance term $\tfrac1\beta g$ for “bending a little toward high value”; the two are added, and $\beta$ controls how much. The whole pipeline looks like this:
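The loop above translates almost line for line into code. This is a minimal 1-D sketch, not the paper’s implementation: the reference flow is the closed-form ideal velocity of a made-up three-peak mixture (standing in for a trained $v_\theta$), the critic is a hypothetical $Q(a)=-(a-1)^2$ with an analytic derivative (standing in for autodiff through $Q_\phi$), and `guidance_weight` plays the role of $1/\beta$.

```python
import math

MODES = (-2.0, 0.0, 1.0)  # toy data peaks (assumed, equal weights)
SIGMA = 0.1               # assumed width of each peak


def posterior_mean(a_t, t):
    """E[a_1 | a_t] for the linear-interpolation toy flow."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    logits = [-((a_t - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    shrink = t * SIGMA ** 2 / var_t
    means = [m + shrink * (a_t - t * m) for m in MODES]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)


def velocity(a_t, t):
    """Ideal velocity field of the toy flow; only valid for t < 1."""
    return (posterior_mean(a_t, t) - a_t) / (1.0 - t)


def q_grad(a):
    """dQ/da for the hypothetical critic Q(a) = -(a - 1)^2."""
    return -2.0 * (a - 1.0)


def qgf_sample(a0, guidance_weight, num_steps=50):
    """Guided Euler sampling; guidance_weight = 0 recovers the plain flow."""
    a, dt = a0, 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(a, t)
        a_hat = a + (1.0 - t) * v  # one Euler step: clean-action estimate
        g = q_grad(a_hat)          # gradient of Q at the clean estimate
        a = a + dt * (v + guidance_weight * g)  # velocity + guidance
    return a


plain = qgf_sample(0.0, guidance_weight=0.0)
guided = qgf_sample(0.0, guidance_weight=1.0)
print(f"plain flow : a_1 = {plain:+.3f}")
print(f"QGF guided : a_1 = {guided:+.3f}")
```

Starting from the same noise sample, the plain flow should settle on the peak at $0$, while the guided run is bent toward the higher-value peak at $1$.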

QGF pipeline overview

This interactive page lets you drive the mechanism by hand. The setting is 1-D: the data has three peaks (at $a=-2,0,1$), and the true optimal action is $a^*=1$ (true return $=-(a-1)^2$). But the critic $Q$ has learned a flaw — far from the data, around $a\approx 3$, there is a “spuriously high” bump (a classic critic pathology in the OOD region). You can switch among the three guidances, drag the guidance weight $1/\beta$, and watch where the particles end up — tracking their true return and their critic score.

Key observation: crank the weight up and OOD “fools” the samples toward that spurious bump (high critic score, low true return — this is “exploiting the critic”); QGF always delivers samples to the true optimal peak $a^*=1$, because it only queries $Q$ at a sensible clean estimate and never steps into the bump.

Next, let’s draw a whole denoising trajectory frame by frame. The page below plots the trajectory in the “action $a$ (horizontal) × time $t$ (vertical)” plane: it climbs from the bottom (pure noise) step by step to the top (clean action). Note the two dashed lines — the cyan diagonal dashed line’s landing point at the top is $\hat a_1$ (QGF’s query point, which is itself the “one-step approximation”); the top of the orange vertical dashed line is $a_t$ itself (OOD’s query point). Hit “step” or “play” and watch how $\hat a_1$ moves with the steps and how $Q$’s guidance arrow bends the trajectory toward the optimal peak.

Switch guidance to OOD, raise the weight, hit play: the orange query point “fools” the trajectory all the way toward the spurious bump at $a\approx 3$. Switch back to QGF and the cyan query point $\hat a_1$ stays glued to the data peak, and the trajectory converges cleanly to the true optimum $a^*=1$.

6. Why the “approximation” beats the “exact” version

QGF makes two seemingly lazy approximations: dropping the Jacobian, and replacing the whole ODE with a single step. Counterintuitively, the paper finds they are not compromises — each one beats its more “exact” counterpart:

First, one-step estimate vs full ODE denoising. Running the whole denoising chain pins the estimated clean action to the reference data distribution, which has to cover every mode in the data. The one-step approximation allows small deviations, so guidance can “pick out” one high-value mode instead of being dragged back toward the whole dataset. In that sense the approximation gives the optimization more freedom. (Interestingly, the paper notes that EDP, the best-performing train-time baseline, also uses a one-step approximation.)

Second, identity vs true Jacobian. The true $J$ is easily ill-conditioned (especially early in denoising, when the one-step approximation is already coarse), amplifying noise and making the gradient variance explode; replacing it with $I$ gives a low-variance, clean direction. In practice, a low-variance gradient estimate simply makes a better “optimizer.”

The page below is dedicated to “why imprecise is actually better.” The thing to watch is the stability of the guidance gradient: plot the guidance gradient $g$ as a function of $a_t$ — horizontal axis is the noisy point $a_t$, vertical axis is the guidance gradient computed there. Three curves:

  • QGF (the approximation we use): take $Q'$ at $\hat a_1=$ the posterior mean, and treat the Jacobian as $I$. The curve is smooth.
  • BPTT (exact, run the whole chain): $\tfrac{d}{da_t}Q(\mathrm{ODE}(a_t))$. The curve jitters violently, with spikes. The spikes occur at “watersheds”: once $a_t$ crosses some boundary, the whole denoising chain flips to a different peak, the endpoint changes abruptly, and the gradient blows up.
  • OOD: $Q'(a_t)$. Smooth, but pointing at the “spurious bump” (wrong direction).

Drag time $t$: the smaller $t$ (the longer the chain), the wilder BPTT’s spikes; as $t\to1$ the three converge. Then drag the probe $a_t$: it shows how much each gradient changes when you nudge the probe by a tiny $\epsilon$ — smaller change = more stable. BPTT jumps at the slightest touch near a watershed (high variance); QGF barely moves (low variance).

The paper quantifies this with two metrics: (1) the gradient’s noise sensitivity — compare the cosine similarity of the gradient at $a_t$ vs $a_t+\epsilon$; closer to 1 is more stable, and QGF has the lowest variance; (2) the ability to optimize $Q$ — treat each gradient as an “optimizer” and look at the final action’s $Q$ value; QGF is best, approaching the best-of-$N$ upper bound, while OOD does poorly because it exploits the critic.
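The first metric is simple to sketch. The code below is one plausible reading of it, not the paper’s exact protocol: the perturbation distribution (isotropic Gaussian with scale `eps_scale`), the number of trials, and the two toy gradient fields are all assumptions chosen to show what a stable and an unstable gradient look like under this score.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def noise_sensitivity(grad_fn, a, eps_scale, num_trials=200, seed=0):
    """Mean cosine similarity between grad(a) and grad(a + eps)."""
    rng = random.Random(seed)
    base = grad_fn(a)
    total = 0.0
    for _ in range(num_trials):
        noisy = [x + rng.gauss(0.0, eps_scale) for x in a]
        total += cosine_similarity(base, grad_fn(noisy))
    return total / num_trials


def smooth_grad(a):
    """Gradient of the toy critic Q(a) = -|a - (1, 1)|^2."""
    return [-2.0 * (a[0] - 1.0), -2.0 * (a[1] - 1.0)]


def jittery_grad(a):
    """A made-up field whose direction spins quickly as a[0] changes."""
    return [math.cos(200.0 * a[0]), math.sin(200.0 * a[0])]


point = [0.0, 0.0]
stable = noise_sensitivity(smooth_grad, point, eps_scale=0.01)
unstable = noise_sensitivity(jittery_grad, point, eps_scale=0.01)
print(f"smooth gradient : mean cosine similarity = {stable:.3f}")
print(f"jittery gradient: mean cosine similarity = {unstable:.3f}")
```

A score near 1 means a tiny nudge of $a_t$ barely turns the gradient; a score well below 1 means the direction is dominated by noise.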

7. Down to the metal: three rings that take BPTT and the Jacobian apart

If you want to really understand “why BPTT jitters, and why QGF’s two cuts work,” let’s take it apart to the lowest level. This section goes ring by ring, from shallow to deep, each ring paired with a hands-on, genuinely-simulated interactive page — all from zero background.

Ring 1: what is a Jacobian?

Start in 1-D. The derivative $f'(x)$ means something simple: push the input $x$ by a tiny $\Delta$ and the output changes by about $f'(x)\cdot\Delta$. It’s the slope of the curve at that point, the “local linear approximation.”

In higher dimensions (input and output are both vectors), this “slope” upgrades to a matrix, the Jacobian $J$:

\[\Delta(\text{output})\ \approx\ J\cdot \Delta(\text{input}),\qquad J_{ij}=\frac{\partial(\text{output}_i)}{\partial(\text{input}_j)}\]

Geometrically, what $J$ does is: turn a tiny “circle” at the input into an “ellipse” at the output — stretching + rotating. The lengths of the ellipse’s two semi-axes are the singular values — the magnification along each direction. If one direction is magnified a lot while the other is nearly squashed flat (a long, thin ellipse), $J$ is said to be ill-conditioned: a tiny wobble of the input gets amplified into a huge swing along some direction — this is exactly the source of “noise being amplified.”
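The circle-to-ellipse picture can be reproduced numerically for a 2-D map. In this sketch the “curvy” function is invented for illustration (it is not $v_\theta$), the Jacobian is estimated by central finite differences, and the two singular values come from the closed form for a $2\times 2$ matrix.

```python
import math


def curvy_map(x, y, curviness):
    """A made-up 2-D function that bends more as curviness grows."""
    return (x + curviness * math.sin(3.0 * y),
            y + curviness * math.sin(3.0 * x))


def jacobian(f, x, y, h=1e-5):
    """2x2 Jacobian of f at (x, y) by central finite differences."""
    fxp, fxm = f(x + h, y), f(x - h, y)
    fyp, fym = f(x, y + h), f(x, y - h)
    return [[(fxp[0] - fxm[0]) / (2.0 * h), (fyp[0] - fym[0]) / (2.0 * h)],
            [(fxp[1] - fxm[1]) / (2.0 * h), (fyp[1] - fym[1]) / (2.0 * h)]]


def singular_values(jac):
    """Semi-axes of the ellipse that jac maps the unit circle to."""
    (a, b), (c, d) = jac
    frob = a * a + b * b + c * c + d * d  # equals s_max^2 + s_min^2
    det = a * d - b * c                   # |det| equals s_max * s_min
    disc = math.sqrt(max(frob * frob - 4.0 * det * det, 0.0))
    s_max = math.sqrt((frob + disc) / 2.0)
    s_min = abs(det) / s_max if s_max > 0.0 else 0.0
    return s_max, s_min


def linear_map(x, y):
    """Stretch x by 3 and shrink y by half; Jacobian is diag(3, 0.5)."""
    return (3.0 * x, 0.5 * y)


print("linear map :", singular_values(jacobian(linear_map, 0.3, -0.2)))
for k in (0.0, 0.2, 1.0 / 3.0):
    jac = jacobian(lambda x, y: curvy_map(x, y, k), 0.0, 0.0)
    print(f"curviness {k:.3f}:", singular_values(jac))
```

At curviness $1/3$ the Jacobian at the origin has two identical rows, so the smaller singular value collapses to (numerically) zero: the ellipse is squashed into a line, which is the ill-conditioned case described above.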

Drag the point on the left; the right side shows in real time the ellipse the little circle maps to. Turn up the “curviness” (mimicking a neural-net-like function $v_\theta$ that bends everywhere) and the ellipse changes shape drastically from place to place, sometimes squashed into a line (ill-conditioned). Remember this picture: $v_\theta$ is a very curvy function, and its Jacobian wanders around and goes ill-conditioned just like this.

Ring 2: the chain rule = a product of Jacobians, so BPTT jitters

Denoising is a chain: $a_t \to a_{t+\Delta}\to \cdots \to a_1$, where each arrow is one Euler step (one function). So the final clean action is a long stack of nested functions: $a_1=\mathrm{ODE}(a_t)$.

BPTT wants $\nabla_{a_t}Q(a_1)$. By the chain rule, that equals multiplying every step’s Jacobian together and then by $\nabla Q$. In 1-D:

\[\frac{d\,\mathrm{ODE}(a_t)}{d a_t}\ =\ \prod_{\text{each step}}\Big(1+\Delta t\,\frac{\partial v}{\partial a}\Big)\]

A long product of numbers (or matrices) is dangerous: if each factor is slightly above 1, the product explodes exponentially; slightly below 1 and it vanishes; worst of all, near a “watershed” (the boundary that decides which peak you finally fall into), a hair’s difference in the start lands the endpoint in a completely different peak — the derivative there is astronomical. This is exactly the source of the spikes on BPTT’s red curve in the previous section. It is both expensive (you must store and backprop the whole chain) and a high-variance, unstable gradient.
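The product formula and the watershed blow-up can both be measured on a toy flow. As before, the setup is an assumption for illustration: a made-up three-peak mixture with its closed-form ideal velocity in place of a trained $v_\theta$, and finite differences in place of autodiff. `chain_jacobian` multiplies the per-step factors $1+\Delta t\,\partial v/\partial a$ along one trajectory; the scan measures endpoint gap ÷ start gap for neighbouring starts.

```python
import math

MODES = (-2.0, 0.0, 1.0)  # toy data peaks (assumed, equal weights)
SIGMA = 0.1               # assumed width of each peak


def posterior_mean(a_t, t):
    """E[a_1 | a_t] for the linear-interpolation toy flow."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    logits = [-((a_t - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    shrink = t * SIGMA ** 2 / var_t
    means = [m + shrink * (a_t - t * m) for m in MODES]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)


def velocity(a_t, t):
    """Ideal velocity field of the toy flow; only valid for t < 1."""
    return (posterior_mean(a_t, t) - a_t) / (1.0 - t)


def flow_endpoint(a0, num_steps=50):
    """Plain Euler denoising from t = 0 to t = 1 (no guidance)."""
    a, dt = a0, 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(a, k * dt)
    return a


def chain_jacobian(a0, num_steps=50, h=1e-5):
    """d(endpoint)/d(a0) as the product of per-step Euler factors."""
    a, dt, product = a0, 1.0 / num_steps, 1.0
    for k in range(num_steps):
        t = k * dt
        dv_da = (velocity(a + h, t) - velocity(a - h, t)) / (2.0 * h)
        product *= 1.0 + dt * dv_da
        a = a + dt * velocity(a, t)
    return product


GAP = 0.005
starts = [i * GAP for i in range(181)]  # a_0 from 0.0 to 0.9
ends = [flow_endpoint(a) for a in starts]
amplification = [(ends[i + 1] - ends[i]) / GAP for i in range(180)]

print(f"chain Jacobian at a_0 = 0 : {chain_jacobian(0.0):.3f}")
print(f"amplification near a_0 = 0: {amplification[0]:.3f}")
print(f"largest amplification     : {max(amplification):.1f}")
```

Inside a basin the chain contracts (amplification below 1); somewhere between the starts that end at the peak at $0$ and those that end at the peak at $1$ there is a watershed where the same ratio is far larger.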

Two trajectories start a tiny $\epsilon$ apart, both flowing from bottom to top under the pure flow (no guidance). Most places they stick together (the chain is contracting, the gradient small and stable); but drag the start to the watershed between two peaks and they split off toward different peaks — endpoint gap ÷ start gap is the “amplification,” i.e. the chain Jacobian $\big|\tfrac{d\,\mathrm{ODE}}{da_t}\big|$. Once it explodes, the BPTT gradient explodes.

Ring 3: QGF’s two cuts — one-step approximation + replace the Jacobian with $I$

QGF takes two cuts to remove both problems above at once.

Cut 1: replace the whole chain with “one step.” Instead of running all the denoising steps, use a single Euler step to estimate the clean action $\hat a_1 = a_t+(1-t)\,v_\theta(a_t,t)$. No long chain, so no “product explosion.” Its corresponding Jacobian is:

\[J=\frac{\partial \hat a_1}{\partial a_t}=I+(1-t)\frac{\partial v_\theta}{\partial a_t}\qquad(\text{1-D: } 1+(1-t)\,v')\]

Note that $J$ still requires one differentiation through the velocity field $v_\theta$. And Ring 1 told us: $v_\theta$ is very curvy, so this Jacobian still wanders around and even goes ill-conditioned. So “one step” cut off the worst chain explosion, but $J$ still carries the residual “curviness” of $v_\theta$. The gradient using a one-step estimate + the exact $J$ is $g=J^\top\nabla_{\hat a_1}Q$, which the paper calls QGF-Jacobian.

Cut 2: replace $J$ directly with the identity $I$. The guidance gradient then reduces to $g=\nabla_{\hat a_1}Q(s,\hat a_1)$ — just the direction of “where to get better” of $Q$ at the clean estimate $\hat a_1$, no longer distorted by $J$. Why is this allowed? Because $J$ carries $\partial v_\theta/\partial a_t$ — the very “curviness / ill-conditioning / high variance” we just escaped would sneak back in through it; whereas $I$ is a constant, clean, low-variance direction. The paper finds that dropping $J$ not only saves compute, the resulting gradient is lower-variance and actually better at optimizing $Q$.
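Cut 2 can be checked on the same kind of toy. The sketch below compares how much the two one-step gradients change between neighbouring probe points $a_t$ at a fixed time: $g=J^\top\nabla_{\hat a_1}Q$ (the Jacobian-keeping variant) versus $g=\nabla_{\hat a_1}Q$ (QGF). The mixture, the hypothetical critic $Q(a)=-(a-1)^2$, the probe time, and the finite-difference Jacobian are all assumptions for illustration, so this says nothing about real networks beyond the mechanism.

```python
import math

MODES = (-2.0, 0.0, 1.0)  # toy data peaks (assumed, equal weights)
SIGMA = 0.1               # assumed width of each peak
T_PROBE = 0.8             # time at which the gradients are compared
STEP = 0.01               # spacing between neighbouring probe points


def posterior_mean(a_t, t):
    """E[a_1 | a_t] for the linear-interpolation toy flow."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    logits = [-((a_t - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    shrink = t * SIGMA ** 2 / var_t
    means = [m + shrink * (a_t - t * m) for m in MODES]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)


def velocity(a_t, t):
    """Ideal velocity field of the toy flow; only valid for t < 1."""
    return (posterior_mean(a_t, t) - a_t) / (1.0 - t)


def one_step_estimate(a_t, t):
    """Cut 1: a_hat = a_t + (1 - t) * v(a_t, t)."""
    return a_t + (1.0 - t) * velocity(a_t, t)


def q_grad(a):
    """dQ/da for the hypothetical critic Q(a) = -(a - 1)^2."""
    return -2.0 * (a - 1.0)


def qgf_gradient(a_t, t):
    """Cut 2 applied: treat the Jacobian as the identity."""
    return q_grad(one_step_estimate(a_t, t))


def qgf_jacobian_gradient(a_t, t, h=1e-4):
    """One-step estimate, but keep J = d(a_hat)/d(a_t) (1-D, so J^T = J)."""
    jac = (one_step_estimate(a_t + h, t)
           - one_step_estimate(a_t - h, t)) / (2.0 * h)
    return jac * q_grad(one_step_estimate(a_t, t))


def jitter(grad_fn):
    """Largest change of the gradient between neighbouring probes."""
    probes = [i * STEP for i in range(81)]  # a_t from 0.0 to 0.8
    values = [grad_fn(a, T_PROBE) for a in probes]
    return max(abs(values[i + 1] - values[i]) for i in range(80))


print(f"jitter with exact J : {jitter(qgf_jacobian_gradient):.3f}")
print(f"jitter with J = I   : {jitter(qgf_gradient):.3f}")
```

In this toy the Jacobian-keeping gradient swings more between neighbouring probes, because $J$ itself peaks at the boundary between the two peaks and multiplies that variation into the gradient.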

All three gradients drawn together; you can peel them back ring by ring: ① full-chain BPTT (red, the most spikes — Ring 2’s chain explosion) → ② cut to one step but keep the exact $J$ (purple, fewer spikes, but $J$’s curviness remains) → ③ drop $J$ = QGF (cyan, smoothest). Drag the probe to watch each one’s jitter $|\Delta g|$; the cyan line is always the steadiest.

Stringing the three rings together gives the complete logic of QGF’s design:

  • Ring 1: taking the Jacobian of a very curvy function $v_\theta$ gives a matrix that wanders around and is sometimes squashed into a thin line (ill-conditioned) — it amplifies noise along some direction.
  • Ring 2: BPTT must differentiate through the whole denoising chain, i.e. multiply many such Jacobians — so it explodes exponentially at watersheds — expensive and high-variance.
  • Ring 3: QGF makes two cuts. Cut 1, “take just one step,” directly kills the chain-product explosion; Cut 2, “$J\to I$,” kills the residual curviness / ill-conditioning from “differentiating $v_\theta$ once.” What remains, $g=\nabla_{\hat a_1}Q$, is cheap, lands near the trustworthy data manifold, and is low-variance.

So, back to that inference algorithm, each line now has its place:

\[\begin{aligned} &\hat a_1 \leftarrow a_t+(1-t)\,v_\theta(s,a_t,t) &&\text{// Cut 1: one-step clean estimate (avoid BPTT's chain explosion)}\\ &g \leftarrow \nabla_{\hat a_1} Q_\phi(s,\hat a_1) &&\text{// Cut 2: }J{=}I\text{, take only }Q\text{'s direction (avoid the ill-conditioned Jacobian)}\\ &a_{t+1/T}\leftarrow a_t+\tfrac1T\big(v_\theta(s,a_t,t)+\tfrac1\beta\,g\big) &&\text{// velocity + guidance, one step} \end{aligned}\]

The key insight: these two “approximations” are not compromises — each precisely plugs one source of noise — one cut for BPTT’s chain explosion (Ring 2), one for a single Jacobian’s ill-conditioning (Ring 1). With both noise sources removed, what’s left is a gradient that is lower-variance and better at optimizing $Q$ than the “exact” version. That’s why “imprecise” wins.

8. Back to the teaser figure

Now that you have all the background, you can return to the animated teaser at the top and read it again — this time every detail should make sense:

  • ① Behavior flow policy: the “chassis” of pure flow denoising, no $Q$ involved, walking from noise along the velocity field $v_t$ all the way to the clean action $a_1$.
  • ② BPTT: first roll out (dashed) to $a_1$, take the trustworthy $\nabla Q$ at $a_1$, then multiply a stack of Jacobians to backprop step by step back to $a_t$ — the arrow that finally lands at $a_t$ keeps jittering, the geometric picture of “high variance” (the static original figure can’t show it; animated, it’s obvious at a glance).
  • ③ OOD: no rollout at all. The red arrow is drawn directly at the half-noisy $a_t$, topped with a question mark and pointing every which way, because $Q$ gives no reliable direction on inputs it has never seen.
  • ④ QGF: a dashed line jumps one step to $\hat a_1$, takes the blue $\nabla Q$ pointing at the $Q$ peak there, then slides it back to $a_t$ unchanged — this “translate-and-carry” animation is the geometric meaning of $\hat J=I$: no stretching or rotating by any Jacobian, the gradient is copied verbatim.

Drag the center of the contours (the peak of $Q$) and you’ll see BPTT’s endpoint gradient and QGF’s gradient both obediently turn to follow the peak, while OOD’s question-mark arrow does its own thing regardless — “who listens to $Q$, and where” becomes instantly clear.

9. Headline experimental results

  • QGF substantially outperforms all prior test-time RL methods, and matches the strongest train-time methods.
  • It beats its own Jacobian-keeping variant (QGF-Jacobian) — confirming that dropping $J$ genuinely helps, not just saves compute.
  • It gets better as the model grows (the paper reports ~$4\times$ improvement at 3.2M parameters), whereas train-time baselines saturate or even collapse.
  • It is robust to the choice of critic: it works across critic types, and swapping in a stronger TD critic improves it further.
  • It is orders of magnitude cheaper than best-of-$N$ sampling: on its own it already beats $N{=}4$, and combined it matches $N{=}16$ at far lower compute.
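
For readers who have not met the best-of-$N$ baseline mentioned in the last bullet, here is a minimal sketch of it. The Gaussian “policy” and quadratic critic are hypothetical stand-ins; in a real flow policy each call to `sample_action` would be a full $T$-step denoising pass, which is where the baseline's cost comes from.

```python
import numpy as np

def best_of_n(sample_action, q_value, n, rng):
    """Draw n candidate actions and return the one the critic scores highest."""
    candidates = [sample_action(rng) for _ in range(n)]
    scores = [q_value(a) for a in candidates]
    return candidates[int(np.argmax(scores))]


# Hypothetical stand-ins: a Gaussian "policy" and a quadratic critic.
peak = np.array([1.0, 1.0])
sample_action = lambda rng: rng.normal(loc=[1.0, 0.0], scale=0.5)
q_value = lambda a: -0.5 * np.sum((a - peak) ** 2)

rng = np.random.default_rng(0)
a4 = best_of_n(sample_action, q_value, n=4, rng=rng)
a16 = best_of_n(sample_action, q_value, n=16, rng=rng)
print("N=4:", a4, "N=16:", a16)
```

One way to combine the two ideas, in the spirit of the bullet above, would be to make `sample_action` a Q-guided sampler, so that each of the $N$ candidates is already nudged toward high value before the critic ranks them.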

In one sentence

Train a flow policy $v_\theta$ that “generates reasonable actions” by behavior cloning, and a critic $Q_\phi$ that “scores actions” by TD; at test time, at each denoising step, first extrapolate one step to guess the clean action $\hat a_1$, take the gradient of $Q$ there as the guidance term $\tfrac1\beta\nabla Q(\hat a_1)$, and add it to the velocity field before taking the step. You stay on the data manifold (don’t run off) while bending toward high value (get better), with no further policy training at all. The two “coarse approximations” happen to buy a guidance gradient that is low-variance, stable, and better at optimizing $Q$, and that is why QGF works.



Paper homepage and code: https://q-guided-flow.github.io/. All interactive demos here are 1-D / 2-D pedagogical simulations, meant to help you “see” the principle. The failure they illustrate carries over to real systems: a critic extrapolates wildly on OOD inputs, and guidance that trusts it gets fooled by a spurious bump in $Q$, just as in the demos.
