
Reversal Q-Learning (RQL) from Scratch: Making Denoising an RL Action Without the Horizon Tax

A long read that ties flow policies, the extended MDP, flow reversal, the reparameterization policy gradient, DDPG/SAC, and RQL’s value function into a single thread. Every key idea is paired with a live, interactive visualization (eleven in total), so drag and play as you read.

Note: this is a teaching-style reconstruction based on the RQL project page (Oberai, Park, Levine, 2026). The code is illustrative PyTorch-flavored pseudocode meant to explain the idea — it is not the official implementation; some engineering details (discount bookkeeping, network shapes) favor clarity and may differ from the original.

TL;DR

RQL’s problem: do reinforcement learning on a flow (diffusion-style) policy using offline data.

  • A flow policy doesn’t emit an action in one shot — it starts from noise and refines it over $F$ “denoise” steps. Expressive, but awkward to put under RL.
  • A natural idea is to treat each denoise step as an RL action. That lets you train with stable, local, one-step gradients — but it multiplies the value-learning horizon by $F$ (off-policy RL’s “curse of horizon”), and the offline data has no intermediate steps in it.
  • RQL’s key move is flow reversal: integrate the ODE backwards from the dataset’s final action to fill in the missing steps, building “virtual trajectories.” Because those intermediate transitions are deterministic, multi-step returns across them are zero-variance and unbiased — so the stretched horizon is free.
  • In practice: learn a value $V(s, x^f, f)$ defined on partial actions, then nudge each denoise step with a reparameterization (pathwise) gradient, plus a BC term to anchor the policy to the data.

Let’s take it apart, one piece at a time.

1. The setting: a flow policy + offline RL

We’re in a standard MDP (state $s$, action $a$, reward $r$, next state $s'$, discount $\gamma$) with a batch of offline data: transitions $(s, a, r, s')$. We want the best policy we can get, and the policy is a flow model.

Why a flow policy? Because the distribution of good actions is often multi-modal and complex. A plain Gaussian policy can only represent one mode; a flow/diffusion model can fit arbitrarily complex distributions. The cost is that generating an action takes many steps — which is the headache the rest of the post resolves.

2. What a flow policy actually is

A flow policy turns noise into an action through a chain of deterministic refinement steps. Start from noise $x^0 \sim \mathcal{N}(0, I)$ and integrate an ODE given by a network $v_\theta$, where $u \in [0, F]$ is the flow time (written $u$ to keep it distinct from the state $s$):

\[\frac{dx_u}{du} = v_\theta(s, x_u, u), \qquad x^0 \to x^1 \to \cdots \to x^F\]

The final $x^F$ is the action $a$. Discretized into $F$ Euler steps, each step is

\[x^{f+1} = x^f + v_\theta(s, x^f, f)\]

Here is the fact that runs through the whole post: given the velocity field, the path from any $x^f$ to $x^F$ is fully determined. The only randomness is the initial noise $x^0$. In the interactive demo, the noise cloud flows to the modes; re-rolling the noise sends different $x^0$ to different modes, but each individual path is deterministic.

# flow policy forward pass (generate one action), conceptual
import torch

def sample_action(state, v_net, F, action_dim):
    x = torch.randn(action_dim)        # x^0 ~ N(0, I); action_dim is the action's dimensionality
    for f in range(F):                 # F denoise steps
        x = x + v_net(state, x, f)     # x^{f+1} = x^f + v(s, x^f, f)
    return x                           # x^F, executed in the environment

Note the execution flow: sample noise → run $F$ denoise steps → get $x^F$ → execute once. In the environment this is one action; the environment-level horizon doesn’t change at all. We’ll use this repeatedly.
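To make the determinism concrete, here is a minimal pure-Python sketch. The 1-D velocity field `toy_velocity`, its two modes at $\pm 1$, and the step count are hypothetical stand-ins chosen for illustration; nothing here is taken from RQL itself.

```python
import random

def toy_velocity(x, f, num_steps):
    """Hypothetical stand-in for v_theta: pull x toward the nearest of two modes (+1 or -1)."""
    mode = 1.0 if x >= 0.0 else -1.0
    # Dividing by the number of remaining steps makes the final step land on the mode.
    return (mode - x) / (num_steps - f)

def sample_action_1d(x0, num_steps=8):
    """Run the Euler chain x^{f+1} = x^f + v(x^f, f) from a given noise sample x0."""
    x = x0
    for f in range(num_steps):
        x = x + toy_velocity(x, f, num_steps)
    return x

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5)]      # the only source of randomness
actions = [sample_action_1d(x0) for x0 in noise]
replayed = [sample_action_1d(x0) for x0 in noise]    # same noise in, same actions out

for x0, a in zip(noise, actions):
    print("x0 = %+.3f  ->  x_F = %+.3f" % (x0, a))
```

Different noise samples can end at different modes, but replaying the same `x0` always reproduces the same action, because the chain itself adds no randomness.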

3. Two routes: treat the flow as a black box, or treat each step as an action

To do RL on this policy, the core difficulty is: how do you take the gradient w.r.t. $\theta$ that pushes the action toward higher value? There are two fundamentally different routes.

Route A: treat the whole flow as a black box

Learn one value $Q(s, x^F)$ on the final action only. The MDP is unchanged, the horizon is unchanged. Sounds great — until policy improvement, where you need

\[\nabla_\theta\, Q(s, x^F), \qquad x^F = x^0 + \int_0^F v_\theta(s, x_u, u)\, du\]

Here $u$ is the flow time, and $x^F$ is the output of $F$ integration steps. Differentiating through it means backpropagating through all $F$ steps (BPTT). Each step multiplies the gradient by its Jacobian, so if $\lambda$ denotes the typical per-step gain of that Jacobian, the gradient reaching the input scales roughly like $\lambda^F$: exploding if $\lambda>1$, vanishing if $\lambda<1$. The interactive demo lets you vary $F$ and $\lambda$ to see this.
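The $\lambda^F$ scaling can be checked with a few lines of arithmetic. This sketch assumes the simplest possible case, a scalar linear chain $x^{f+1} = \lambda x^f$, so that every per-step Jacobian is the same number `per_step_gain`; real flows have state-dependent Jacobians, so treat this as a caricature.

```python
def bptt_gradient_scale(per_step_gain, num_steps):
    """d x^F / d x^0 for the scalar linear chain x^{f+1} = per_step_gain * x^f."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= per_step_gain      # chain rule: one Jacobian factor per denoise step
    return grad

for gain in (0.8, 1.0, 1.2):
    scales = [bptt_gradient_scale(gain, num_steps) for num_steps in (1, 10, 50)]
    print("gain", gain, "->", ["%.3g" % s for s in scales])
```

Even gains as mild as 0.8 or 1.2 move the gradient by several orders of magnitude over 50 steps, which is the instability the black-box route has to live with.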

The black-box route therefore leaves three options for policy improvement, each with a flaw: run BPTT anyway (unstable), one-step distillation (squashes the flow to one step, killing expressiveness), or weighted regression (no gradient at all, just value-weighted regression, which is empirically weak).

Route B: treat each denoise step as an RL action

“Unroll” the generation into an extended MDP: one environment step becomes $F$ decision steps. The single action node $a_t$ splits into $x^0 \to x^1 \to \cdots \to x^F (= a_t) \to s'$. Now you can learn a value $V(s, x^f, f)$ at every intermediate step and update each step with a local one-step gradient: no BPTT, and the full expressiveness of the flow is retained. The “unroll” button in the interactive demo shows this split.

4. The core tension: horizon inflation (training ≠ execution) + no data for intermediate steps

A common confusion (mine, at first): “but at execution time you denoise once and execute $x^F$ — how can the horizon change?” The answer: the horizon problem isn’t in execution, it’s in training-time value learning.

Once you adopt route B and bootstrap value through the intermediate steps, the bootstrap chain length gets multiplied by $F$:

\[\text{effective horizon}: \quad H \longrightarrow H \times F\]

Off-policy RL is very sensitive to long horizons: value error compounds along the bootstrap chain (the curse of horizon), so multiplying the chain length by $F$ makes value learning correspondingly harder. Note this is a virtual horizon inside value learning, not the environment horizon. The interactive demo lets you vary $F$ and $\gamma$ to see the effect.
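A rough way to put numbers on the inflation, under two assumptions that are mine rather than the source’s: the common rule of thumb that the effective bootstrap horizon is about $1/(1-\gamma)$, and a naive extended MDP that discounts every denoise step by $\gamma^{1/F}$ so that one full environment step is still discounted by $\gamma$.

```python
def effective_horizon(discount):
    """Rule-of-thumb bootstrap horizon 1 / (1 - discount)."""
    return 1.0 / (1.0 - discount)

def extended_horizon(gamma, num_flow_steps):
    """Horizon when every denoise step is its own discounted step (naive extended MDP).

    The per-step discount gamma ** (1 / F) keeps the discount per environment step at gamma.
    """
    per_step_discount = gamma ** (1.0 / num_flow_steps)
    return effective_horizon(per_step_discount)

gamma = 0.99
for num_flow_steps in (1, 4, 10):
    print(num_flow_steps, round(extended_horizon(gamma, num_flow_steps), 1))
```

With these assumptions the horizon grows almost exactly in proportion to the number of flow steps, matching the $H \to H \times F$ picture.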

A second problem piles on: the offline data only has $(s, x^F, r, s')$; it never recorded the intermediate denoise steps $x^0, \dots, x^{F-1}$. You don’t even have samples to train the extended MDP. The tension:

| | Horizon | Policy improvement | Data |
|---|---|---|---|
| Route A (black box) | unchanged (good) | needs BPTT / distillation / weighted regression (bad) | enough |
| Route B (extended) | ×$F$ (bad) | local stepwise gradient (good) | missing intermediate steps |

RQL keeps Route B’s policy-improvement upside while killing its horizon and data problems.

5. Solution part 1: flow reversal — manufacture virtual trajectories from the data

Recall the key fact: denoise transitions are deterministic. So even though the data has no intermediate steps, we can take the dataset’s final action $x^F$ and integrate the velocity field backwards to recover any intermediate step:

\[\underbrace{x^f}_{\text{partial action}} = \underbrace{x^F}_{\text{dataset action}} - \int_f^F v_\theta(s, x_u, u)\, du\]

Here $u$ is the flow time. This builds a “virtual trajectory” $x^0 \to \cdots \to x^F$ that fits the extended framework exactly. These trajectories are deterministic (ODE-defined, no randomness) and on-policy (the current flow field, run in reverse). In the interactive demo, dragging the dataset action $x^F$ moves the reversed trajectory with it.

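# flow reversal: recover intermediate steps from a dataset action x_F
@torch.no_grad()                       # partial actions are treated as constants by both losses
def flow_reversal(state, x_F, v_net, F):
    xs = [None] * (F + 1)
    xs[F] = x_F
    x = x_F
    for f in reversed(range(F)):       # F-1, ..., 0
        # Explicit reverse Euler step: v is evaluated at x^{f+1}, so this only
        # approximates the exact inverse of the forward step x^{f+1} = x^f + v(s, x^f, f).
        x = x - v_net(state, x, f)
        xs[f] = x
    return xs                          # [x^0, x^1, ..., x^F]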

Because the intermediate transitions are deterministic, multi-step returns across them are unbiased and zero-variance — you pay no off-policy cost for that extra length. That’s what makes stable off-policy RL possible in the extended framework.
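One discretization detail is worth seeing in numbers. The sketch below uses a hypothetical smooth 1-D field `toy_velocity` and compares two ways of undoing a forward Euler chain: the explicit reverse step (velocity evaluated at the later point) and a fixed-point solve of $x^f = x^{f+1} - v(x^f, f)$. The fixed-point variant is my own illustration of an exact discrete inverse, not something the source describes.

```python
def toy_velocity(x, f):
    """Hypothetical smooth 1-D velocity field standing in for v_theta."""
    return 0.1 * (1.0 - x)

def forward(x0, num_steps):
    """Forward Euler chain; returns [x^0, ..., x^F]."""
    xs = [x0]
    for f in range(num_steps):
        xs.append(xs[-1] + toy_velocity(xs[-1], f))
    return xs

def reverse_explicit(x_final, num_steps):
    """Explicit reversal: evaluate v at the later point x^{f+1}."""
    xs = [None] * (num_steps + 1)
    xs[num_steps] = x = x_final
    for f in reversed(range(num_steps)):
        x = x - toy_velocity(x, f)
        xs[f] = x
    return xs

def reverse_fixed_point(x_final, num_steps, iters=50):
    """Solve x^f = x^{f+1} - v(x^f, f) by fixed-point iteration (inverts the Euler step)."""
    xs = [None] * (num_steps + 1)
    xs[num_steps] = x_next = x_final
    for f in reversed(range(num_steps)):
        x = x_next
        for _ in range(iters):
            x = x_next - toy_velocity(x, f)
        xs[f] = x_next = x
    return xs

truth = forward(-0.5, 8)
err_explicit = abs(reverse_explicit(truth[-1], 8)[0] - truth[0])
err_fixed = abs(reverse_fixed_point(truth[-1], 8)[0] - truth[0])
print("explicit reversal error:   ", err_explicit)
print("fixed-point reversal error:", err_fixed)
```

Both reversals are deterministic, which is the property the argument needs. The explicit one simply does not land exactly on the forward chain’s $x^0$ at coarse step sizes; the gap shrinks as the steps get finer.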

6. The insight: within one action, every step has the same value

This is RQL’s most elegant — and most overlooked — point, and it answers “where do the intermediate-step values even come from.”

The value $V$ is one shared network that takes the step index $f$ as input: $V(s, x^f, f)$. Its TD target is:

\[\mathcal{L}(V) = \mathbb{E}_{\widetilde{\tau}}\Big[\ell_2^\kappa\big(V(s, x^f, f) - (\underbrace{r + \gamma\, V(s', x'^0, 0)}_{\text{target — no } f \text{!}})\big)\Big]\]

The target contains no $f$. Whether you ask about $V(s, x^0, 0)$, $V(s, x^5, 5)$, or $V(s, x^F, F)$, the regression target is the same number. Why? Because the denoise steps between $x^f$ and $x^F$ are (1) deterministic, (2) reward-free, and crucially (3) un-discounted — $\gamma$ fires only on the real environment transition, never inside denoising. So the value is constant along the deterministic chain:

\[V(s, x^0, 0) = V(s, x^1, 1) = \cdots = V(s, x^F, F) = \underbrace{r + \gamma V(s', x'^0, 0)}_{=\; Q(s, x^F)}\]

In words: a partial action’s value is just the $Q$-value of the complete action it will become. The wrong version, which the interactive demo lets you toggle, discounts every denoise step: the value then decays $F\times$ faster and the effective horizon inflates.
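The two bookkeeping choices are easy to compare numerically. In this sketch the reward, next-state value, and step count are made-up numbers, and `chain_values` is a hypothetical helper that backs values up along one denoise chain, either leaving denoise steps undiscounted or (the wrong version) discounting each of them by $\gamma$.

```python
def chain_values(reward, next_value, gamma, num_flow_steps, discount_denoise_steps):
    """Values V(x^0), ..., V(x^F) along one deterministic denoise chain.

    Denoise steps carry no reward; only the final environment transition pays `reward`.
    """
    values = [0.0] * (num_flow_steps + 1)
    values[num_flow_steps] = reward + gamma * next_value      # the real environment transition
    step_discount = gamma if discount_denoise_steps else 1.0
    for f in reversed(range(num_flow_steps)):
        values[f] = step_discount * values[f + 1]             # deterministic, reward-free hop
    return values

correct = chain_values(1.0, 10.0, 0.9, 4, discount_denoise_steps=False)
wrong = chain_values(1.0, 10.0, 0.9, 4, discount_denoise_steps=True)
print("undiscounted denoise steps:", correct)
print("discounted denoise steps:  ", [round(v, 3) for v in wrong])
```

In the undiscounted version every entry equals $r + \gamma V(s', x'^0, 0)$; in the discounted version the value at $x^0$ has already shrunk by $\gamma^F$ before the environment has moved at all.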

So if the values are equal, why does $V$ still take $f$?

Because the policy update needs not the scalar but its gradient $\nabla_{x^f} V$. The objects $x^f$ at different $f$ are very different — near $f=0$ almost pure noise, near $f=F$ nearly a finished action. The same coordinate means something different at each noise level, so even though the on-trajectory scalar is equal, the shape/slope of $V(s, \cdot, f)$ differs across $f$. The network needs $f$ to point the gradient correctly at each level. Slide $f$: the dot’s height (the value) barely moves; the slope changes a lot.

# value training: inputs from reversal, target independent of f
from torch.nn.functional import huber_loss

def value_loss(batch, V_net, V_target, v_net, F, gamma):
    s, a, r, s_next = batch.s, batch.a, batch.r, batch.s_next
    xs = flow_reversal(s, a, v_net, F)              # [x^0, ..., x^F]
    x_next_0 = torch.randn_like(a)                  # x'^0 ~ N(0, I)
    with torch.no_grad():
        # no f in the target; r is assumed to have the same shape as V's output
        target = r + gamma * V_target(s_next, x_next_0, 0)
    loss = 0.0
    for f in range(F + 1):
        loss = loss + huber_loss(V_net(s, xs[f], f), target)   # regress every step to the SAME target
    return loss / (F + 1)

7. The tool: the reparameterization (pathwise) policy gradient

RQL’s policy update is essentially the “stepwise” version of this classic trick, so let’s nail it down. Policy improvement maximizes $J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[Q(s, a)]$. The snag: $a$ is sampled, and sampling isn’t differentiable. Two routes:

  • Score-function (REINFORCE): $\nabla_\theta J = \mathbb{E}[Q(s,a)\nabla_\theta \log \pi_\theta(a \mid s)]$. Uses only the value of $Q$ (black box), but is high variance and ignores $Q$’s slope.
  • Reparameterization (pathwise): write $a = g_\theta(s, \epsilon)$ with $\epsilon$ fixed noise. Now $g_\theta$ is deterministic and differentiable, and the gradient flows straight through: $\nabla_\theta J = \mathbb{E}_\epsilon[\nabla_a Q \cdot \nabla_\theta g_\theta]$. Low variance, uses the first-order slope.

The mental model: freeze the dice ($\epsilon$), then push a single ball along a known smooth rail. Score-function keeps re-rolling the dice and only checks where points land (jittery); reparameterization freezes one die, reads $\nabla_a Q$ at that one point, and asks how to move the policy parameters (for a Gaussian policy, its mean $\mu$) to slide that point uphill. The interactive demo lets you switch modes and watch the gradient arrow.
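The variance gap between the two estimators shows up even in a one-dimensional toy. The sketch below assumes a Gaussian policy $a = \mu + \sigma\epsilon$ and a made-up quadratic critic `q_value` peaking at $a = 2$; both estimators target the same quantity, $\partial J / \partial \mu$, whose true value in this setup is $-2(\mu - 2)$.

```python
import random
import statistics

def q_value(a):
    """Toy critic with a single peak at a = 2 (an assumption for illustration)."""
    return -(a - 2.0) ** 2

def dq_da(a):
    """Slope of the toy critic."""
    return -2.0 * (a - 2.0)

def gradient_samples(mu, sigma, num_samples, seed=0):
    """Per-sample estimates of dJ/dmu for a ~ N(mu, sigma^2), with J = E[Q(a)]."""
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps                                  # reparameterized sample
        score.append(q_value(a) * (a - mu) / sigma ** 2)      # Q * d log pi / d mu
        pathwise.append(dq_da(a) * 1.0)                       # dQ/da * da/dmu, with da/dmu = 1
    return score, pathwise

score, pathwise = gradient_samples(mu=0.0, sigma=1.0, num_samples=20000)
print("score-function: mean %.2f, variance %.1f"
      % (statistics.mean(score), statistics.variance(score)))
print("pathwise:       mean %.2f, variance %.1f"
      % (statistics.mean(pathwise), statistics.variance(pathwise)))
```

Both sample means sit near the true gradient, but the per-sample variance of the score-function estimator is far larger, which is why the pathwise form is preferred whenever $Q$ is differentiable in the action.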

8. What DDPG and SAC do

Both are off-policy actor-critics whose actor is updated by $\nabla_a Q$, the pathwise gradient. DDPG’s actor is deterministic, $\mu_\theta(s)$; its update is $\nabla_\theta J = \mathbb{E}[\nabla_a Q|_{a=\mu_\theta(s)}\cdot \nabla_\theta \mu_\theta]$. SAC makes the policy stochastic with an entropy bonus and writes the action as a reparameterized squashed Gaussian $a = \tanh(\mu_\theta + \sigma_\theta \odot \epsilon)$, then uses the same $\nabla_a Q$ through it. The crucial shared trait: the actor is one forward pass, so the gradient reaches the parameters in a single cheap hop. The interactive demo shows that single hop: drag the action, hit ascend, and it walks uphill along $\nabla_a Q$.
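The single hop can be written out by hand for a scalar actor. This is a deliberately tiny DDPG-flavored sketch, not either algorithm in full: the actor is $a = \tanh(\mu)$ with $\mu$ as its only parameter, the critic `q_value` is a made-up quadratic peaking at $a = 0.5$, and there is no critic learning, noise, or entropy term.

```python
import math

def q_value(a):
    """Toy critic peaking at a = 0.5 (illustrative assumption)."""
    return -(a - 0.5) ** 2

def dq_da(a):
    """Slope of the toy critic with respect to the action."""
    return -2.0 * (a - 0.5)

def actor(mu):
    """Deterministic squashed actor a = tanh(mu), with mu as the only parameter."""
    return math.tanh(mu)

def ascend(mu, learning_rate=0.5, num_updates=200):
    """Gradient ascent on Q(actor(mu)) using the one-hop chain rule."""
    for _ in range(num_updates):
        a = actor(mu)
        da_dmu = 1.0 - a * a                          # derivative of tanh
        mu += learning_rate * dq_da(a) * da_dmu       # dQ/dmu = dQ/da * da/dmu
    return mu

mu_final = ascend(mu=-1.0)
print("final action:", actor(mu_final), "Q:", q_value(actor(mu_final)))
```

Each update multiplies exactly two factors, the critic’s slope and the actor’s Jacobian. RQL’s stepwise update has the same shape, applied once per denoise step.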

9. RQL’s actor update: make the SAC/DDPG trick “stepwise”

Now assemble everything. RQL’s policy loss is:

\[\mathcal{L}(v) = \underbrace{-\,\mathbb{E}_{\widetilde{\tau}}\big[V(s,\; x^f + v(s, x^f, f),\; f+1)\big]}_{\text{value maximization}} + \underbrace{\alpha\, \mathcal{L}^{\text{BC}}(v)}_{\text{behavior regularizer}}\]

Read it: at each partial action $x^f$, take one step $v$ to get $x^{f+1}$, and raise the value after that step, $V(\cdot, f+1)$. The gradient passes through that one call to $v$, which makes it a single-step pathwise gradient. The BC term anchors the policy near the data (essential offline). The punchline is the gradient flow: DDPG/SAC actors are one forward pass (1 hop); RQL is $F$ steps but one hop per step, never BPTT through the chain. The backprop button in the interactive demo traces these hops.

And here is that update actually running — the stepwise pathwise gradient bending a flow trajectory toward a high-value mode, one cheap hop per step:

# RQL policy update: stepwise pathwise gradient + BC
def policy_loss(batch, V_net, v_net, F, bc_coef):
    s, a = batch.s, batch.a
    xs = flow_reversal(s, a, v_net, F)        # partial actions via reversal
    value_term = 0.0
    for f in range(F):
        x_f = xs[f].detach()                  # gradient flows only through this one v
        x_next = x_f + v_net(s, x_f, f)       # x^{f+1}
        value_term = value_term - V_net(s, x_next, f + 1).mean()
    bc_term = 0.0
    for f in range(F):
        v_target = (xs[f + 1] - xs[f]).detach()             # the data's own step
        bc_term = bc_term + ((v_net(s, xs[f].detach(), f) - v_target) ** 2).mean()
    return value_term / F + bc_coef * bc_term / F

10. The full training loop

def rql_train_step(batch, V_net, V_target, v_net, opt_V, opt_v, F, gamma, bc_coef, tau):
    loss_V = value_loss(batch, V_net, V_target, v_net, F, gamma)
    opt_V.zero_grad(); loss_V.backward(); opt_V.step()
    loss_v = policy_loss(batch, V_net, v_net, F, bc_coef)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    with torch.no_grad():                      # Polyak target update
        for p, p_t in zip(V_net.parameters(), V_target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)

for step in range(num_steps):
    batch = replay.sample(batch_size)          # offline, off-policy
    rql_train_step(batch, V_net, V_target, v_net, opt_V, opt_v, F, gamma, bc_coef, tau)

Everything is fully offline and off-policy; the intermediate steps are conjured by reversal on the fly — the dataset never stores denoise sequences.

11. Why it beats BPTT / distillation / weighted regression

RQL gets both sides at once: like DDPG/SAC it pushes the policy with stable first-order $\nabla V$ gradients, stepwise (no BPTT, no distillation); and via reversal + “equal values within an action,” the stretched horizon is free (zero-variance, unbiased, no effective-horizon inflation). The only cost is needing a value defined on partial actions — which is exactly what can be trained unbiasedly. The authors report the best offline-RL performance against 19 SOTA flow-RL algorithms across 50 tasks.

12. One line to remember

Treating denoise steps as RL actions stretches the value-learning horizon; but because the flow is deterministic, you can reverse-solve the intermediate steps from offline data, building zero-variance multi-step returns that cancel the horizon penalty; then a value function on partial actions lets you apply a cheap, stable, stepwise reparameterization gradient — that is RQL.

References

Reminder: the code here is a teaching reconstruction for explaining the mechanism, not the official implementation; for the real discount bookkeeping, architecture, and regularizers, see the original paper and code.
