The Reachability of Steering: One Distribution-Level Law Behind Every VLA Trick

CFGRL, DSRL, test-time search, token-level RL — it looks like a zoo of methods for “steering” a frozen robot policy toward higher reward. Underneath, they’re all the same move: take the base policy’s distribution, multiply it by a non-negative weight, and shove mass toward high value. That move has a hard mathematical ceiling, and it doesn’t care whether you built your policy out of a flow model or an autoregressive token model. This post drops each method onto that one law, and lands on the uncomfortable conclusion: the only knob you actually get to turn is how rich and wide the pretrained base distribution is. Every demo is live — drag the sliders as you read.

The whole post is one inequality, so let me put it up front:

\[\operatorname{supp}\pi^{\star}(\cdot\mid s)\ \subseteq\ \operatorname{supp}\pi_{\mathrm{base}}(\cdot\mid s)\]

The steered policy’s support can never be larger than the base policy’s support. The demo below is just this inequality made draggable. Pull steering strength all the way up and the achieved value only ever creeps up to the in-support ceiling — the gap Δ to the true optimum doesn’t move. The only thing that closes Δ is widening the base.

Reweighting moves mass around inside the support — it can’t manufacture mass where the base put none. The optimum behind the right wall is simply unreachable until the base learns to put a little probability there.

1. Pin the problem down first

Your base policy $\pi_{\mathrm{base}}(a\mid s)$ comes from imitation / pretraining. You want to do RL on top of it to chase reward. Almost every method that promises “improve it without breaking it” is really optimizing the same KL-regularized RL objective:

\[\max_{\pi}\ \mathbb{E}_{a\sim\pi(\cdot\mid s)}\!\big[Q(s,a)\big]\;-\;\alpha\, D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\,\|\,\pi_{\mathrm{base}}(\cdot\mid s)\big)\]

That KL term isn’t decoration — it’s where the stability comes from. Drop it and the policy drifts onto state–action pairs the base never visited, the critic extrapolates garbage there, and training falls over. So whether you’re flow or token, everyone keeps this anchor.

Here’s the part people misdiagnose. “Flow matching has no good online-RL method” is a computational complaint — policy gradients want $\log\pi(a\mid s)$ or an importance ratio $\pi_{\text{new}}/\pi_{\text{old}}$, and a flow’s density needs you to integrate the velocity-field divergence along the ODE, which you can’t get in closed form. But DSRL (move RL into noise space), CFGRL (use a guidance weight), and test-time (just pick) all sidestep that density. So the computational wall isn’t the real wall. The real wall is the one below, and it holds for any generative class:
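To make the density complaint concrete, here is the standard instantaneous change-of-variables identity for a continuous-time flow. The notation is my assumption, not something fixed earlier in the post: $x_t$ is the ODE state, $v_t$ the learned velocity field, $p_0$ the noise density, and the action is the endpoint $a=x_1$.

\[\frac{dx_t}{dt}=v_t(x_t,s),\quad x_0=z\sim p_0,\qquad \log\pi(a\mid s)=\log p_0(z)-\int_0^1\nabla\!\cdot v_t(x_t,s)\,dt\]

Every evaluation of $\log\pi$ therefore needs a full ODE solve plus the trace of the velocity Jacobian at each solver step, which is why policy-gradient-style updates are awkward here and why the methods above prefer to route around the density.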

The central question. Without breaking stability (i.e. keeping the KL anchor, or staying inside the base’s support), what is the highest value steering can reach, and what sets it?

2. The core theorem: reweighting can’t leave the support

That objective has a closed-form solution — the familiar Gibbs / Boltzmann form (the same $\pi^\star\propto\pi_{\text{ref}}e^{r/\beta}$ you’ve seen in RLHF / DPO):

\[\pi^\star(a\mid s)=\frac{1}{Z(s)}\,\pi_{\mathrm{base}}(a\mid s)\,\exp\!\Big(\tfrac{1}{\alpha}Q(s,a)\Big)\]

The exponential factor is always positive. It can only reweight mass that’s already there — it can’t conjure mass out of nothing. So:

\[\pi_{\mathrm{base}}(a\mid s)=0\ \Rightarrow\ \pi^\star(a\mid s)=0\quad\Rightarrow\quad \operatorname{supp}\pi^\star\ \subseteq\ \operatorname{supp}\pi_{\mathrm{base}}\]

Take $\alpha\to 0$ (pure exploitation) and the steered policy collapses onto the single best action that the base actually covers:

\[V_{\mathrm{steer}}(s)=\!\!\max_{a\in\operatorname{supp}\pi_{\mathrm{base}}}\!\!Q(s,a)\ \le\ \max_{a}Q(s,a)=V^\star(s),\qquad \Delta(s)=V^\star(s)-V_{\mathrm{steer}}(s)\ge 0\]

and $\Delta(s)=0$ iff the true optimal action $a^\star(s)$ lives in the base support. This is the precise meaning of “the base model matters a lot.” Note the ceiling is set by the support, not by the base’s average performance. You don’t need the base to be good on average — you need it to occasionally emit that good action.
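A minimal numerical sketch of the closed form above, on a made-up five-action state (the numbers, and the helper name `gibbs_steer`, are illustrative assumptions). The true optimum is given zero base mass on purpose, so you can watch the steered value converge to the in-support ceiling and stop there.

```python
import numpy as np

def gibbs_steer(pi_base, q, alpha):
    """Closed-form KL-regularized optimum: pi* proportional to pi_base * exp(Q / alpha)."""
    pi_base = np.asarray(pi_base, dtype=float)
    q = np.asarray(q, dtype=float)
    in_support = pi_base > 0
    # Work in log space on the support only; subtract the max for stability.
    logits = np.log(pi_base[in_support]) + q[in_support] / alpha
    logits -= logits.max()
    weights = np.exp(logits)
    pi_star = np.zeros_like(pi_base)
    pi_star[in_support] = weights / weights.sum()
    return pi_star

# Hypothetical state with 5 actions. The true optimum (index 4) has zero base mass.
pi_base = np.array([0.50, 0.30, 0.15, 0.05, 0.00])
q = np.array([0.1, 0.4, 0.2, 0.7, 1.0])

for alpha in (1.0, 0.1, 0.01):
    pi_star = gibbs_steer(pi_base, q, alpha)
    print(f"alpha={alpha:<5} pi*={np.round(pi_star, 3)} value={pi_star @ q:.3f}")

v_steer = q[pi_base > 0].max()   # in-support ceiling
v_star = q.max()                 # true optimum
delta = v_star - v_steer
print(f"V_steer={v_steer:.1f}  V*={v_star:.1f}  Delta={delta:.1f}")
```

Shrinking `alpha` moves essentially all the mass onto index 3, the best action the base covers; index 4 stays at exactly zero for every `alpha`.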

And support-containment is only the asymptotic story. The finite-sample version bites harder. Steering (DSRL and friends) finds good actions by sampling from $\pi_{\mathrm{base}}$ and getting feedback. The chance of hitting a good action of density $\epsilon$ within $N$ draws is about $1-(1-\epsilon)^N$, so

\[N\ \sim\ \frac{1}{\pi_{\mathrm{base}}(a^\star\mid s)},\qquad C^\star=\mathbb{E}_{s\sim d^\star}\!\left[\frac{1}{\pi_{\mathrm{base}}\!\big(a^\star(s)\mid s\big)}\right]\]

$C^\star$ is the same kind of coverage / concentrability coefficient that shows up in offline RL. The more density the base puts on good actions (the wider and richer it is), the smaller $C^\star$ and the cheaper steering gets; as that density → 0, the number of draws you need grows like $1/\pi_{\mathrm{base}}(a^\star\mid s)$, without bound (and if a rare action is needed at every step of a long horizon, the per-step factors multiply, which is where exponential blow-up comes from). So the good action doesn’t just have to be “in the support”: it needs non-negligible density. That’s the precise form of “the distribution has to be wide enough.”
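The $N\sim 1/\epsilon$ scaling is easy to check by hand. The sketch below just evaluates the hit probability $1-(1-\epsilon)^N$ from the text; the 95% target and the function names are my own illustrative choices.

```python
import math

def hit_probability(eps, n):
    """P(at least one of n i.i.d. base samples is a good action of mass eps)."""
    return 1.0 - (1.0 - eps) ** n

def draws_needed(eps, target=0.95):
    """Smallest n with hit_probability(eps, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - eps))

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    n = draws_needed(eps)
    print(f"eps={eps:<7g} 1/eps={1 / eps:>8.0f}  draws for 95%: {n:>6d}  "
          f"(check: {hit_probability(eps, n):.3f})")
```

For small $\epsilon$ the required number of draws is roughly $3/\epsilon$: every factor of 10 you lose in good-action density costs a factor of 10 in samples.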

3. Every method is one special case of this law

What follows are five methods. The first four are all just different ways to write $\pi=\frac1Z\pi_{\mathrm{base}}\,g(s,a)$ with $g\ge 0$ — only the form of $g$ and the algorithm differ. The fifth, Q-chunking, is orthogonal: it doesn’t touch $g$, it makes $Q$ itself learnable.

CFGRL — classifier-free guidance, ported from image diffusion

Train the flow/diffusion to condition on an “optimality / high-reward” signal $c$; at sampling time crank the guidance weight $w$ and extrapolate along $v_w=v_\varnothing+w(v_c-v_\varnothing)$, pushing toward high-reward modes.

\[\pi_w(a)\ \propto\ \pi_{\varnothing}(a)\Big(\tfrac{\pi_c(a)}{\pi_\varnothing(a)}\Big)^{\!w}\ \propto\ \pi_{\mathrm{base}}(a)\,e^{\,wQ(a)/\beta}\]

On the law: $g=(\pi_c/\pi_{\mathrm{base}})^w\propto e^{wQ/\beta}$, so the guidance weight plays the role of the inverse temperature: $w/\beta=1/\alpha$. No matter how big $w$ gets, it’s still just reweighting the unconditional base, and the support doesn’t change.

Drag $w$ up: mass piles onto the high-reward mode, entropy drops — but the result is locked inside the base support. Guidance is a knob, not an escape hatch.
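Here is a discrete toy version of the guidance formula, as a sketch under one explicit assumption: the conditional model is the Bayes-tilted base, $\pi_c\propto\pi_\varnothing e^{Q/\beta}$. The action values and the helper name `guided` are made up for illustration. It checks that the guided distribution matches the Gibbs form with exponent $wQ/\beta$ and never leaks mass outside the base.

```python
import numpy as np

def guided(pi_uncond, pi_cond, w):
    """pi_w proportional to pi_uncond * (pi_cond / pi_uncond)**w, on the base support."""
    pi_uncond = np.asarray(pi_uncond, dtype=float)
    pi_cond = np.asarray(pi_cond, dtype=float)
    out = np.zeros_like(pi_uncond)
    m = pi_uncond > 0
    log_p = np.log(pi_uncond[m]) + w * (np.log(pi_cond[m]) - np.log(pi_uncond[m]))
    p = np.exp(log_p - log_p.max())
    out[m] = p / p.sum()
    return out

beta = 0.5
pi_uncond = np.array([0.50, 0.30, 0.15, 0.05, 0.00])   # hypothetical base
q = np.array([0.1, 0.4, 0.2, 0.7, 1.0])                 # index 4 is the true optimum

# Assumed conditional model: the base tilted by exp(Q / beta).
pi_cond = pi_uncond * np.exp(q / beta)
pi_cond /= pi_cond.sum()

for w in (0.0, 1.0, 4.0, 16.0):
    pw = guided(pi_uncond, pi_cond, w)
    outside = pw[pi_uncond == 0].sum()
    print(f"w={w:<5} pi_w={np.round(pw, 3)} "
          f"P(best in-support)={pw[3]:.3f} mass outside base={outside:.1f}")
```

At `w=0` you get the base back; as `w` grows the mass concentrates on index 3, the best covered action, while index 4 stays at zero.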

DSRL — diffusion steering by doing RL in noise space

Freeze the base. The action is a deterministic function of the input noise, $a=f_{\mathrm{base}}(z,s)$ (given the frozen ODE). So instead of doing RL in action space (density unavailable), learn a policy $\pi_z(z\mid s)$ over the noise $z$ with plain off-policy RL (SAC), treating the base as a black-box decoder. Extremely sample-efficient, forward-pass only.

\[\max_{\pi_z}\ \mathbb{E}_{z\sim\pi_z(\cdot\mid s)}\big[Q\big(s,\,f_{\mathrm{base}}(z,s)\big)\big],\qquad a=f_{\mathrm{base}}(z,s)\]

On the law: with $z$ ranging over the noise prior’s support, the reachable action set is the image of $f_{\mathrm{base}}(\cdot,s)$, which (up to closure) is $\operatorname{supp}\pi_{\mathrm{base}}$. Changing $z$ is an efficient search inside the support, not an escape from it.

Drag the noise handle around the whole disk and the action stays trapped inside the teal image. The red $a^\star$ outside it is unreachable for every $z$ — DSRL’s optimization can’t go where the base never maps.
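The same trap, in a few lines of code. Everything here is a toy assumption: the decoder `f_base` is just `tanh`, so its image is the open square $(-1,1)^2$, and a greedy random search over $z$ stands in for the noise-space RL policy (it is not SAC and not the DSRL algorithm). The optimum sits at $(1.5, 0)$, outside the image.

```python
import numpy as np

A_STAR = np.array([1.5, 0.0])   # true optimum, deliberately outside the decoder's image

def f_base(z):
    """Hypothetical frozen decoder: maps any noise z in R^2 into (-1, 1)^2."""
    return np.tanh(z)

def q_value(a):
    """Toy value: negative squared distance to the true optimum."""
    return -np.sum((a - A_STAR) ** 2, axis=-1)

def noise_space_search(steps=2000, sigma=0.5, seed=0):
    """Greedy random search over z; a crude stand-in for a learned noise policy."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(2)
    best = q_value(f_base(z))
    for _ in range(steps):
        cand = z + sigma * rng.standard_normal(2)
        val = q_value(f_base(cand))
        if val > best:          # accept only improvements
            z, best = cand, val
    return z, f_base(z), best

z, a, best = noise_space_search()
ceiling = -0.25   # sup of Q over the image, approached as a -> (1, 0)
print(f"z={np.round(z, 2)}  a={np.round(a, 3)}  Q={best:.4f}  ceiling={ceiling}")
```

The search is free to push $z$ as far as it likes, and the decoded action still presses up against the edge of the image: the value approaches the ceiling from below and never crosses it.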

Test-time search / Best-of-N

Don’t train a policy at all. At inference, sample $N$ candidate actions from the base, score them with a value / reward model, keep the best (or run a small search / MPC). It’s empirical reweighting by reject-and-select.

\[a^{(N)}=\arg\max_{i}\,Q(s,a_i),\quad a_i\sim\pi_{\mathrm{base}}(\cdot\mid s),\qquad \mathbb{E}\big[Q(a^{(N)})\big]\ \xrightarrow[N\to\infty]{}\ \!\!\max_{a\in\operatorname{supp}\pi_{\mathrm{base}}}\!\!Q\]

On the law: $g\propto \mathbb{1}[a=\arg\max Q]$, selecting only among base samples. It’s the cleanest demonstration of “ceiling + coverage” there is.

The best-of-N curve climbs and then flattens onto the in-support ceiling. It never reaches the global optimum, no matter how big you make N. To even sample a good action of density ε once, you need N ∼ 1/ε draws: there’s your $C^\star$ again.
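If you want to reproduce that plateau offline, here is a Monte Carlo sketch. The setup is assumed for illustration only: a 1-D action, a base that is uniform on $[-1,1)$, and a value function peaking at $a=1.5$, outside the base support.

```python
import numpy as np

def q_value(a):
    """Hypothetical 1-D value landscape with its global optimum at a = 1.5."""
    return -(a - 1.5) ** 2

def best_of_n_value(n, trials=2000, seed=0):
    """Monte Carlo estimate of E[Q(a^(N))] with base samples a_i ~ Uniform(-1, 1)."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(-1.0, 1.0, size=(trials, n))
    return float(q_value(samples).max(axis=1).mean())

ceiling = q_value(1.0)   # in-support ceiling
v_star = q_value(1.5)    # global optimum
for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3d}  E[Q(best-of-N)]={best_of_n_value(n):.4f}")
print(f"in-support ceiling={ceiling:.2f}  global optimum={v_star:.2f}")
```

The estimates rise quickly and then crowd up under the ceiling; the gap between ceiling and global optimum is the $\Delta$ that no amount of sampling removes.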

Token-level / autoregressive RL

For autoregressive VLAs that discretize actions into tokens (the OpenVLA family), you can do PPO / GRPO directly, RLHF-style, with reward at the end of the sequence. This is the “easy” class — token policies have an analytic softmax likelihood, so standard policy gradients just work (unlike flow). In principle it can move mass to any token combination — the full support of the discrete action space.

\[\max_{\theta}\ \mathbb{E}_{a\sim\pi_\theta}\!\big[R(s,a)\big]\;-\;\beta\, D_{\mathrm{KL}}\!\big(\pi_\theta\,\|\,\pi_{\mathrm{base}}\big)\quad\Longrightarrow\quad \pi_\theta^\star\propto\pi_{\mathrm{base}}\,e^{\,Q/\beta}\]

The nuance that matters: to not collapse, everyone adds the $\beta\cdot$KL anchor — which lands you right back on the same Gibbs form and the same near-support bias. Discretization caps precision on top of that. The support bias isn’t a flow thing — it comes from the KL anchor everyone bolts on for stability.

Lower β and the optimum wants to climb to the far, higher peak — but cross the line and you’re in the off-distribution band where the critic lies and training collapses. Raise β and you’re pinned safely near base. Stable vs. surpassing-the-demos is one tension, two ends.
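A small sketch of that tension for a single token decision. The four-token vocabulary, rewards, and base probabilities are invented for illustration; the only thing taken from the post is the closed form $\pi^\star\propto\pi_{\mathrm{base}}e^{R/\beta}$. The base has full support here, but the best token is rare.

```python
import numpy as np

def kl_regularized_optimum(pi_base, reward, beta):
    """Maximizer of E[R] - beta * KL(pi || pi_base) over distributions on tokens."""
    logits = np.log(pi_base) + reward / beta
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    m = p > 0
    return float(np.sum(p[m] * (np.log(p[m]) - np.log(q[m]))))

pi_base = np.array([0.70, 0.25, 0.049, 0.001])   # full support, best token is rare
reward = np.array([0.2, 0.5, 0.3, 1.0])

for beta in (10.0, 1.0, 0.3, 0.1, 0.03):
    p = kl_regularized_optimum(pi_base, reward, beta)
    print(f"beta={beta:<5} E[R]={p @ reward:.3f}  "
          f"KL to base={kl(p, pi_base):.3f}  P(best token)={p[3]:.3f}")
```

Large `beta` keeps the policy glued to the base and the rare best token stays rare; you only reach it by paying a large KL, which is exactly the regime where the critic is least trustworthy.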

Q-chunking — RL on chunked actions (orthogonal)

Do TD learning in a chunked action space: the policy emits $k$ steps at once and the critic does a $k$-step backup over the chunk. Two payoffs: (1) the effective horizon shrinks by a factor of $k$, so credit assignment gets shorter, and the $k$-step return is unbiased because the critic conditions on the whole chunk that was actually executed; (2) committing a whole chunk gives temporally coherent exploration, instead of per-step random kicks that cancel out into jitter in place.

\[\hat Q\big(s_t,\mathbf a_{t:t+k}\big)=\sum_{i=0}^{k-1}\gamma^i r_{t+i}+\gamma^k\max_{\mathbf a'}Q\big(s_{t+k},\mathbf a'\big),\qquad H\ \to\ \lceil H/k\rceil\]

On the law: it doesn’t change $g$, and it doesn’t change the support. It fixes the other half of the problem, “under long-horizon sparse reward, $Q$ won’t train and exploration is useless”, which the steering methods above leave untouched: they hand you a controllable knob but assume a usable $Q$.
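The chunked target above is short enough to write out. This is a sketch of the formula only, not of the Q-chunking codebase: `next_q_values` is assumed to hold critic values for a finite set of candidate next chunks, standing in for the max over $\mathbf a'$.

```python
import numpy as np

def chunk_td_target(rewards, next_q_values, gamma):
    """k-step target for one chunk:
    sum_i gamma^i * r_{t+i}  +  gamma^k * max over candidate next chunks of Q."""
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    discounts = gamma ** np.arange(k)
    return float(discounts @ rewards + gamma ** k * np.max(next_q_values))

def num_backups(horizon, k):
    """Effective horizon in chunked decisions: ceil(H / k)."""
    return -(-horizon // k)

# Sparse reward arriving on the last step of a 4-step chunk.
target = chunk_td_target(rewards=[0.0, 0.0, 0.0, 1.0],
                         next_q_values=[0.5, 2.0],
                         gamma=0.9)
print(f"chunk target = {target:.4f}")
print(f"H=100, k=4 -> {num_backups(100, 4)} backups instead of 100")
```

One backup carries the sparse reward across all four steps at once, which is the shortened credit assignment the text describes.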

Same start, same step budget. Per-step resampling (steel-blue, H=1) wanders near the origin; committing chunks (teal) sweeps out far bigger loops and usually ends much farther out. Hit “re-roll” a few times to watch the trend — coherent exploration covers ground, per-step noise cancels itself.
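A rough offline version of that comparison, under my own assumptions about the setup (unit-length 2-D steps, a uniformly random heading, redrawn every `chunk` steps): it measures how far the walker ends up from the origin on average.

```python
import numpy as np

def final_distance(steps, chunk, trials=2000, seed=0):
    """Mean distance from the origin after `steps` unit moves in 2-D, where a
    fresh random heading is drawn every `chunk` steps and held in between."""
    rng = np.random.default_rng(seed)
    n_chunks = -(-steps // chunk)   # ceil(steps / chunk)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=(trials, n_chunks))
    angles = np.repeat(angles, chunk, axis=1)[:, :steps]
    pos = np.stack([np.cos(angles).sum(axis=1),
                    np.sin(angles).sum(axis=1)], axis=1)
    return float(np.linalg.norm(pos, axis=1).mean())

per_step = final_distance(steps=200, chunk=1)
chunked = final_distance(steps=200, chunk=10)
print(f"per-step resampling: {per_step:.1f}")
print(f"chunks of 10:        {chunked:.1f}")
print(f"ratio:               {chunked / per_step:.2f}")
```

For this kind of walk the typical displacement scales like $\sqrt{k}$ times the per-step case, so holding a heading for 10 steps covers roughly three times the ground with the same step budget.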

4. Tie it together: the only lever is base richness

Write every steering method in one unified form — keep the expressive base, multiply by a non-negative weight that tilts the output toward high reward:

\[\pi(a\mid s)=\frac{1}{Z(s)}\,\pi_{\mathrm{base}}(a\mid s)\,g(s,a),\qquad g(s,a)\ge 0\ \ \Longrightarrow\ \ \operatorname{supp}\pi\subseteq\operatorname{supp}\pi_{\mathrm{base}}\]

| Method | Weight $g(s,a)$ / mechanism | Bound by the support ceiling? |
| --- | --- | --- |
| CFGRL | $(\pi_c/\pi_{\mathrm{base}})^w\propto e^{wQ/\beta}$ | Yes |
| DSRL | implicit; RL over $z$, $a=f_{\mathrm{base}}(z)$, reachable = flow image | Yes |
| test-time | $\mathbb{1}[a=\arg\max_i Q]$, $a_i\sim\pi_{\mathrm{base}}$ | Yes |
| RL-token | $e^{Q/\beta}$ (KL anchor; discrete full support but pulled back) | Yes (in practice) |
| Q-chunking | doesn’t change $g$; makes $Q$ learnable (unbiased n-step, $H\to H/k$, coherent exploration) | Orthogonal |

So here’s the whole thing, formalized:

Proposition (reachability of steering). Given a frozen base $\pi_{\mathrm{base}}$ and the true optimum $a^\star(s)=\arg\max_a Q^\star(s,a)$, any steering policy of the form $\pi(a\mid s)=\frac1Z\pi_{\mathrm{base}}(a\mid s)\,g(s,a)$ with $g\ge 0$ satisfies $\operatorname{supp}\pi\subseteq\operatorname{supp}\pi_{\mathrm{base}}$. Hence the best value any such policy can reach is

\[V_{\mathrm{steer}}(s)=\!\!\max_{a\in\operatorname{supp}\pi_{\mathrm{base}}}\!\!Q^\star(s,a)\le V^\star(s),\qquad \Delta(s)=V^\star(s)-V_{\mathrm{steer}}(s)\ge0,\]

with

\[\Delta(s)=0\iff a^\star(s)\in\operatorname{supp}\pi_{\mathrm{base}};\qquad \text{sample complexity}\ \propto\ C^\star=\mathbb{E}_{s\sim d^\star}\!\Big[\tfrac{1}{\pi_{\mathrm{base}}(a^\star(s)\mid s)}\Big].\]
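To see both quantities in the proposition move, here is a tabular sketch with three states and four actions. All numbers are invented; `d_star` is an assumed state distribution under the optimal policy. It computes $\Delta(s)$ and $C^\star$ for a base that misses the optimum in one state, then again after widening the base by a 1% sliver.

```python
import numpy as np

def reachability_report(pi_base, q_star, d_star):
    """Return (Delta per state, C*) for a tabular base policy and Q*."""
    v_star = q_star.max(axis=1)
    # Ceiling: best Q* among actions the base gives nonzero mass.
    v_steer = np.where(pi_base > 0, q_star, -np.inf).max(axis=1)
    delta = v_star - v_steer
    a_star = q_star.argmax(axis=1)
    p_star = pi_base[np.arange(len(a_star)), a_star]
    with np.errstate(divide="ignore"):      # 1/0 -> inf is the intended answer
        c_star = float(d_star @ (1.0 / p_star))
    return delta, c_star

pi_base = np.array([
    [0.70, 0.20, 0.10, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.09, 0.01, 0.00],   # optimum (action 3) has zero mass here
])
q_star = np.array([
    [0.2, 0.9, 0.5, 0.1],
    [0.1, 0.3, 0.2, 0.8],
    [0.3, 0.4, 0.6, 1.0],
])
d_star = np.array([0.5, 0.3, 0.2])

delta, c_star = reachability_report(pi_base, q_star, d_star)
print("narrow base: Delta =", np.round(delta, 2), " C* =", c_star)

# Widen the base: move 1% of mass onto the missing optimum in state 2.
wider = pi_base.copy()
wider[2] = [0.89, 0.09, 0.01, 0.01]
delta_w, c_star_w = reachability_report(wider, q_star, d_star)
print("wider base:  Delta =", np.round(delta_w, 2), " C* =", round(c_star_w, 2))
```

The 1% sliver takes $\Delta$ to zero everywhere and turns an infinite $C^\star$ into a finite one, though that one rare action still dominates the sample cost.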

The takeaway. Under the anchored / regularized paradigm, the only lever that moves $\Delta$ and $C^\star$ is how much coverage density $\pi_{\mathrm{base}}$ puts on useful-but-rare actions. So the goal isn’t to invent yet another flow-RL algorithm (the DSRL line is already enough) — it’s two things:

  1. Maximize the base’s coverage width and good-action density — more diverse pretraining data, deliberately higher-entropy / more varied demos. This shoves the hard part back from RL into pretraining.
  2. Build a closed loop that can safely, autonomously grow the support — steer to bootstrap → autonomously roll out and harvest the slightly-off-distribution successes → fold them back into the base and grow it → repeat. The prerequisite is an oracle that can stand in for a human on failure detection / safe recovery.

Get those two right, and only then do the five methods above have a ceiling worth talking about.

References

The methods this post collapses onto one law, and the pieces it leans on:

  • DSRL — A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, S. Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning. arXiv:2506.15799 · project
  • CFGRL — K. Frans et al. Diffusion Guidance Is a Controllable Policy Improvement Operator. arXiv:2505.23458 · code
  • Q-chunking — Q. Li, Z. Zhou, S. Levine. Reinforcement Learning with Action Chunking. arXiv:2507.07969 (NeurIPS 2025)
  • Token / autoregressive VLA — M. J. Kim et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246 — the OpenVLA-class policies you fine-tune with PPO / GRPO.
  • The KL-regularized closed form $\pi^\star\propto\pi_{\mathrm{ref}}\,e^{r/\beta}$ — R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. Manning, C. Finn. Direct Preference Optimization. arXiv:2305.18290 — the same Gibbs form as RLHF / control-as-inference.
  • Coverage / concentrability $C^\star$ — J. Chen, N. Jiang. Information-Theoretic Considerations in Batch Reinforcement Learning. ICML 2019, arXiv:1905.00360
  • Related: Q-Guided Flow, the test-time value-guided flow I walk through elsewhere on this site: my QGF post · q-guided-flow.github.io

$\pi^\star\propto\pi_{\mathrm{base}}\,e^{Q/\alpha}$ · support-containment · the coverage coefficient $C^\star$ — the one line every steering method shares.

