Reversal Q-Learning (RQL) from Scratch: Making Denoising an RL Action Without the Horizon Tax

A long read that ties flow policies, the extended MDP, flow reversal, the reparameterization policy gradient, DDPG/SAC, and RQL’s value function into a single thread. Every key idea is paired with a live, interactive visualization (eleven in total), so drag and play as you read.

Note: this is a teaching-style reconstruction based on the RQL project page (Oberai, Park, Levine, 2026). The code is illustrative PyTorch-flavored pseudocode meant to explain the idea — it is not the official implementation; some engineering details (discount bookkeeping, network shapes) favor clarity and may differ from the original.

TL;DR

RQL’s problem: do reinforcement learning on a flow (diffusion-style) policy using offline data.

A flow policy doesn’t emit an action in one shot — it starts from noise and refines it over $F$ “denoise” steps. Expressive, but awkward to put under RL.
A natural idea is to treat each denoise step as an RL action. That lets you train with stable, local, one-step gradients — but it multiplies the value-learning horizon by $F$ (off-policy RL’s “curse of horizon”), and the offline data has no intermediate steps in it.
RQL’s key move is flow reversal: integrate the ODE backwards from the dataset’s final action to fill in the missing steps, building “virtual trajectories.” Because those intermediate transitions are deterministic, multi-step returns across them are zero-variance and unbiased — so the stretched horizon is free.
In practice: learn a value $V(s, x^f, f)$ defined on partial actions, then nudge each denoise step with a reparameterization (pathwise) gradient, plus a BC term to anchor the policy to the data.

Let’s take it apart, one piece at a time.

1. The setting: a flow policy + offline RL

We’re in a standard MDP (state $s$, action $a$, reward $r$, next state $s'$, discount $\gamma$) with a batch of offline data: transitions $(s, a, r, s')$. We want the best policy we can get, and the policy is a flow model.

Why a flow policy? Because the distribution of good actions is often multi-modal and complex. A plain Gaussian policy can only represent one mode; a flow/diffusion model can fit arbitrarily complex distributions. The cost is that generating an action takes many steps — which is the headache the rest of the post resolves.

2. What a flow policy actually is

A flow policy turns noise into an action through a chain of deterministic refinement steps. Start from noise $x^0 \sim \mathcal{N}(0, I)$ and integrate an ODE given by a network $v_\theta$ over flow time $u \in [0, F]$ (written $u$ to keep it distinct from the state $s$):

\[\frac{dx_u}{du} = v_\theta(s, x_u, u), \qquad x^0 \to x^1 \to \cdots \to x^F\]

The final $x^F$ is the action $a$. Discretized into $F$ Euler steps of unit size, each step is

\[x^{f+1} = x^f + v_\theta(s, x^f, f)\]

Here is the fact that runs through the whole post: given the velocity field, the path from any $x^f$ to $x^F$ is fully determined. The only randomness is the initial noise $x^0$. Drag the demo: the noise cloud flows to the modes; re-roll the noise and different $x^0$ flow to different modes — but each path is deterministic.

# flow policy forward (generate one action), conceptual
import torch

def sample_action(state, v_net, F, action_dim):
    x = torch.randn(action_dim)        # x^0 ~ N(0, I)
    for f in range(F):                 # F denoise steps
        x = x + v_net(state, x, f)     # x^{f+1} = x^f + v(s, x^f, f)
    return x                           # x^F, executed in the environment

Note the execution flow: sample noise → run $F$ denoise steps → get $x^F$ → execute once. In the environment this is one action; the environment-level horizon doesn’t change at all. We’ll use this repeatedly.

3. Two routes: treat the flow as a black box, or treat each step as an action

To do RL on this policy, the core difficulty is: how do you take the gradient w.r.t. $\theta$ that pushes the action toward higher value? There are two fundamentally different routes.

Route A: treat the whole flow as a black box

Learn one value $Q(s, x^F)$ on the final action only. The MDP is unchanged, the horizon is unchanged. Sounds great — until policy improvement, where you need

\[\nabla_\theta\, Q(s, x^F), \qquad x^F = x^0 + \int_0^F v_\theta(s, x_u, u)\, du\]

and $x^F$ is the output of $F$ integration steps. Differentiating through it means backpropagating through all $F$ steps (BPTT). Each step multiplies the gradient by its Jacobian, so if a typical step scales the gradient by a factor $\lambda$, the gradient reaching the input scales roughly as $\lambda^F$: exploding if $\lambda>1$, vanishing if $\lambda<1$. Drag $F$ and $\lambda$:
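
The same scaling can also be checked numerically. The sketch below is a deliberately crude model, not anything from the RQL code: it assumes every denoise step multiplies the backpropagated gradient by the same scalar gain `lam` (a stand-in for the per-step Jacobian), so the gradient reaching the input after `F` steps is `lam ** F`. The function name `bptt_gradient_scale` is hypothetical.

```python
def bptt_gradient_scale(lam: float, F: int) -> float:
    """Gradient magnitude after backpropagating through F identical steps.

    Assumption: each step scales the gradient by the same scalar gain `lam`.
    """
    grad = 1.0
    for _ in range(F):
        grad *= lam  # one Jacobian multiplication per denoise step
    return grad


for lam in (0.8, 1.0, 1.2):
    scales = [bptt_gradient_scale(lam, F) for F in (1, 4, 16, 64)]
    print(f"lam={lam}: " + ", ".join(f"{g:.3g}" for g in scales))
```

Even a modest gain of 1.2 grows by roughly five orders of magnitude at $F = 64$, while a gain of 0.8 shrinks the gradient below $10^{-6}$.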

The black-box route therefore has three known options, each with a drawback: run BPTT anyway (unstable), one-step distillation (squashes the flow to one step, killing expressiveness), or weighted regression (no gradient through $Q$ at all, just value-weighted regression, which is empirically weak).

Route B: treat each denoise step as an RL action

“Unroll” the generation into an extended MDP: one environment step becomes $F$ decision steps. The single action node $a_t$ splits into $x^0 \to x^1 \to \cdots \to x^F (= a_t) \to s'$. Now you can learn a value $V(s, x^f, f)$ at every intermediate step and update each step with a local one-step gradient, with no BPTT and the flow’s full expressiveness retained. Hit “unroll”:

4. The core tension: horizon inflation (training ≠ execution) + no data for intermediate steps

A common confusion (mine, at first): “but at execution time you denoise once and execute $x^F$ — how can the horizon change?” The answer: the horizon problem isn’t in execution, it’s in training-time value learning.

Once you adopt route B and bootstrap value through the intermediate steps, the bootstrap chain length gets multiplied by $F$:

\[\text{effective horizon}: \quad H \longrightarrow H \times F\]

Off-policy RL is very sensitive to long horizons — value error compounds along the bootstrap chain (the curse of horizon). Multiplying it by $F$ is bad. Note this is a virtual horizon inside value learning, not the environment horizon. Slide $F$ and $\gamma$:
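
One rough way to put numbers on this, under assumptions that are mine rather than the paper's: take the effective horizon of a discount $\gamma$ to be $1/(1-\gamma)$, and suppose a naive extended MDP discounts every denoise step by $\gamma^{1/F}$ so that one full environment step is still discounted by $\gamma$ overall. The helper names are hypothetical.

```python
def effective_horizon(gamma: float) -> float:
    """Common rule of thumb: a discount gamma 'sees' about 1 / (1 - gamma) steps."""
    return 1.0 / (1.0 - gamma)


def per_denoise_step_discount(gamma: float, F: int) -> float:
    """Assumed naive scheme: spread one environment-step discount over F micro-steps."""
    return gamma ** (1.0 / F)


gamma, F = 0.99, 10
h_env = effective_horizon(gamma)
h_ext = effective_horizon(per_denoise_step_discount(gamma, F))
print(f"environment horizon ~ {h_env:.0f} steps")
print(f"extended horizon    ~ {h_ext:.0f} bootstrap steps")
print(f"inflation factor    ~ {h_ext / h_env:.2f} (F = {F})")
```

Under these assumptions the bootstrap chain is close to $F$ times longer, which matches the $H \to H \times F$ picture above.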

A second problem piles on: the offline data only has $(s, x^F, r, s')$; it never recorded the intermediate denoise steps $x^0, \dots, x^{F-1}$. You don’t even have samples to train the extended MDP. The tension:

Route	Horizon	Policy improvement	Data
Route A (black box)	unchanged (good)	needs BPTT / distillation / weighted regression (bad)	sufficient
Route B (extended)	×$F$ (bad)	local stepwise gradient (good)	missing intermediate steps

RQL keeps Route B’s policy-improvement upside while killing its horizon and data problems.

5. Solution part 1: flow reversal — manufacture virtual trajectories from the data

Recall the key fact: denoise transitions are deterministic. So even though the data has no intermediate steps, we can take the dataset’s final action $x^F$ and integrate the velocity field backwards to recover any intermediate step:

\[\underbrace{x^f}_{\text{partial action}} = \underbrace{x^F}_{\text{dataset action}} - \int_f^F v_\theta(s, x_u, u)\, du\]

This builds a “virtual trajectory” $x^0 \to \cdots \to x^F$, perfectly fitting the extended framework. These trajectories are deterministic (ODE-defined, no randomness) and on-policy (the current flow field, run in reverse). Drag the dataset action $x^F$; the reversal follows:

# flow reversal: recover intermediate steps from a dataset action x_F
def flow_reversal(state, x_F, v_net, F):
    xs = [None] * (F + 1)
    xs[F] = x_F
    x = x_F
    for f in reversed(range(F)):       # F-1, ..., 0
        # Explicit reverse step: v is evaluated at x^{f+1}, so this inverts the
        # forward Euler step only approximately (exact in the small-step limit).
        x = x - v_net(state, x, f)     # x^f ≈ x^{f+1} - v(s, x^{f+1}, f)
        xs[f] = x
    return xs                          # [x^0, x^1, ..., x^F]

Because the intermediate transitions are deterministic, multi-step returns across them are unbiased and zero-variance — you pay no off-policy cost for that extra length. That’s what makes stable off-policy RL possible in the extended framework.
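
To make “deterministic and reversible” concrete, here is a small scalar sketch. Everything in it is a hypothetical stand-in: `toy_velocity` replaces the learned $v_\theta$, and the fixed-point solve is just one possible way to invert an Euler step exactly; the source does not say how RQL discretizes the reverse ODE. The script reverses from a dataset action with each method, then runs the forward flow again and reports how far the round trip lands from where it started.

```python
def toy_velocity(x: float, f: int) -> float:
    """Hypothetical velocity field: drift toward 2.0 (identical for every step f)."""
    return 0.1 * (2.0 - x)


def forward_flow(x0: float, F: int) -> list:
    xs = [x0]
    for f in range(F):
        xs.append(xs[-1] + toy_velocity(xs[-1], f))  # x^{f+1} = x^f + v(x^f, f)
    return xs


def reverse_explicit(x_F: float, F: int) -> list:
    """One explicit backward step per f, with v evaluated at x^{f+1}."""
    xs = [0.0] * (F + 1)
    xs[F] = x_F
    for f in reversed(range(F)):
        xs[f] = xs[f + 1] - toy_velocity(xs[f + 1], f)
    return xs


def reverse_fixed_point(x_F: float, F: int, iters: int = 60) -> list:
    """Solve x^f + v(x^f, f) = x^{f+1} for x^f by fixed-point iteration."""
    xs = [0.0] * (F + 1)
    xs[F] = x_F
    for f in reversed(range(F)):
        x = xs[f + 1]                      # initial guess
        for _ in range(iters):
            x = xs[f + 1] - toy_velocity(x, f)
        xs[f] = x
    return xs


x_F, F = 1.0, 8
for name, reverse in (("explicit", reverse_explicit), ("fixed-point", reverse_fixed_point)):
    x0 = reverse(x_F, F)[0]
    round_trip = forward_flow(x0, F)[-1]
    print(f"{name:12s} x^0 = {x0:+.4f}   round-trip error = {abs(round_trip - x_F):.2e}")
```

In this toy, the explicit backward step leaves a visible round-trip error at unit step size, while the fixed-point inverse returns to the dataset action up to floating-point precision. Either way there is no sampling anywhere in the chain: the same $x^F$ always yields the same virtual trajectory.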

6. The insight: within one action, every step has the same value

This is RQL’s most elegant — and most overlooked — point, and it answers “where do the intermediate-step values even come from.”

The value $V$ is one shared network that takes the step index $f$ as input: $V(s, x^f, f)$. Its TD target is:

\[\mathcal{L}(V) = \mathbb{E}_{\widetilde{\tau}}\Big[\ell_2^\kappa\big(V(s, x^f, f) - (\underbrace{r + \gamma\, V(s', x'^0, 0)}_{\text{target — no } f \text{!}})\big)\Big]\]

The target contains no $f$. Whether you ask about $V(s, x^0, 0)$, $V(s, x^5, 5)$, or $V(s, x^F, F)$, the regression target is the same number. Why? Because the denoise steps between $x^f$ and $x^F$ are (1) deterministic, (2) reward-free, and crucially (3) un-discounted — $\gamma$ fires only on the real environment transition, never inside denoising. So the value is constant along the deterministic chain:

\[V(s, x^0, 0) = V(s, x^1, 1) = \cdots = V(s, x^F, F) = \underbrace{r + \gamma V(s', x'^0, 0)}_{=\; Q(s, x^F)}\]

In words: a partial action’s value is just the $Q$-value of the complete action it will become. Flip the toggle to see the wrong version (discounting every denoise step decays the value $F\times$ faster and inflates the horizon):
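
A few lines of arithmetic show the difference between the two versions. This is a sketch with made-up numbers (`r`, `gamma`, `v_next` and the helper `chain_values` are all hypothetical): it lists the value assigned to each partial action $x^0, \dots, x^F$, once with denoise steps left un-discounted and once with $\gamma$ applied at every denoise step.

```python
def chain_values(r: float, gamma: float, v_next: float, F: int,
                 discount_denoise_steps: bool) -> list:
    """Values of x^0..x^F along one deterministic denoise chain."""
    q_final = r + gamma * v_next  # value of the completed action x^F
    step_gamma = gamma if discount_denoise_steps else 1.0
    # x^f still has F - f denoise steps to go before the environment transition
    return [step_gamma ** (F - f) * q_final for f in range(F + 1)]


correct = chain_values(r=1.0, gamma=0.99, v_next=10.0, F=5, discount_denoise_steps=False)
wrong = chain_values(r=1.0, gamma=0.99, v_next=10.0, F=5, discount_denoise_steps=True)

print("un-discounted denoising:", [round(v, 3) for v in correct])
print("discounted denoising:   ", [round(v, 3) for v in wrong])
```

With un-discounted denoising every entry equals $r + \gamma V(s', x'^0, 0)$; with per-step discounting the earliest partial actions are worth the least, which is exactly the extra decay the correct formulation avoids.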

So if the values are equal, why does $V$ still take $f$?

Because the policy update needs not the scalar but its gradient $\nabla_{x^f} V$. The objects $x^f$ at different $f$ are very different — near $f=0$ almost pure noise, near $f=F$ nearly a finished action. The same coordinate means something different at each noise level, so even though the on-trajectory scalar is equal, the shape/slope of $V(s, \cdot, f)$ differs across $f$. The network needs $f$ to point the gradient correctly at each level. Slide $f$: the dot’s height (the value) barely moves; the slope changes a lot.
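
The “same height, different slope” claim can be reproduced in a scalar toy. The setup is an assumption for illustration only: a fixed, hypothetical flow (`rollout_to_action`) and a hypothetical critic on finished actions (`q_of_action`), with the partial-action value defined as the $Q$-value of the action the flow will finish at, as described above. Slopes are estimated with central finite differences.

```python
def rollout_to_action(x: float, f: int, F: int) -> float:
    """Hypothetical deterministic flow: run the remaining F - f Euler steps."""
    for _ in range(f, F):
        x = x + 0.1 * (2.0 - x)
    return x


def q_of_action(a: float) -> float:
    """Hypothetical critic on finished actions, peaked at a = 1.5."""
    return -(a - 1.5) ** 2


def v_partial(x: float, f: int, F: int) -> float:
    """Value of a partial action = Q of the action it deterministically becomes."""
    return q_of_action(rollout_to_action(x, f, F))


def slope(x: float, f: int, F: int, eps: float = 1e-6) -> float:
    return (v_partial(x + eps, f, F) - v_partial(x - eps, f, F)) / (2 * eps)


F = 8
trajectory = [-1.0]                                   # x^0
for f in range(F):
    trajectory.append(rollout_to_action(trajectory[-1], f, f + 1))  # one Euler step

values = [v_partial(x, f, F) for f, x in enumerate(trajectory)]
slopes = [slope(x, f, F) for f, x in enumerate(trajectory)]
for f in (0, F // 2, F):
    print(f"f={f}: x^f={trajectory[f]:+.3f}  V={values[f]:+.4f}  dV/dx={slopes[f]:+.4f}")
```

Along the trajectory the value stays constant, but the slope grows with $f$ because fewer contracting flow steps sit between $x^f$ and the finished action. A value network that ignored $f$ could not represent both slopes at once.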

# value training: inputs from reversal, target independent of f
def value_loss(batch, V_net, V_target, v_net, F, gamma):
    s, a, r, s_next = batch.s, batch.a, batch.r, batch.s_next
    x_next_0 = torch.randn_like(a)                  # x'^0 ~ N(0, I)
    with torch.no_grad():                           # no gradient into v_net or the target net
        xs = flow_reversal(s, a, v_net, F)          # [x^0, ..., x^F]
        target = r + gamma * V_target(s_next, x_next_0, 0)   # no f
    loss = 0.0
    for f in range(F + 1):
        # Huber regression of every step onto the SAME target
        loss = loss + torch.nn.functional.huber_loss(V_net(s, xs[f], f), target)
    return loss / (F + 1)

7. The tool: the reparameterization (pathwise) policy gradient

RQL’s policy update is essentially the “stepwise” version of this classic trick, so let’s nail it down. Policy improvement maximizes $J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[Q(s, a)]$. The snag: $a$ is sampled, and sampling isn’t differentiable. Two routes:

Score-function (REINFORCE): $\nabla_\theta J = \mathbb{E}[Q(s,a)\nabla_\theta \log \pi_\theta(a \mid s)]$. Uses only the value of $Q$ (black box), but is high variance and ignores $Q$’s slope.
Reparameterization (pathwise): write $a = g_\theta(s, \epsilon)$ with $\epsilon$ fixed noise. Now $g_\theta$ is deterministic and differentiable, and the gradient flows straight through: $\nabla_\theta J = \mathbb{E}_\epsilon[\nabla_a Q \cdot \nabla_\theta g_\theta]$. Low variance, uses the first-order slope.

The mental model: freeze the dice ($\epsilon$), then push a single ball along a known smooth rail. Score-function keeps re-rolling the dice and only checks where points land (jittery); reparameterization freezes one die, reads $\nabla_a Q$ at that one point, and asks how to move the policy parameters (for a Gaussian policy, the mean $\mu$) to slide it uphill. Switch modes and watch the gradient arrow:
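
The variance gap is easy to measure on a one-dimensional example. The sketch below assumes a Gaussian policy $a = \mu + \sigma\epsilon$ and a hypothetical critic $Q(a) = -(a-2)^2$ (neither comes from the RQL paper), and draws per-sample estimates of $\partial J / \partial \mu$ with both estimators from the same noise.

```python
import random
import statistics


def q(a: float) -> float:
    """Hypothetical critic, peaked at a = 2."""
    return -(a - 2.0) ** 2


def dq_da(a: float) -> float:
    return -2.0 * (a - 2.0)


def gradient_samples(mu: float, sigma: float, n: int, seed: int = 0):
    """Per-sample estimates of dJ/dmu from both estimators, sharing the same noise."""
    rng = random.Random(seed)
    score_fn, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps                              # reparameterized sample
        score_fn.append(q(a) * (a - mu) / sigma ** 2)     # Q(a) * d log pi / d mu
        pathwise.append(dq_da(a) * 1.0)                   # dQ/da * da/dmu, with da/dmu = 1
    return score_fn, pathwise


mu, sigma = 0.0, 1.0
score_fn, pathwise = gradient_samples(mu, sigma, n=20000)
true_grad = -2.0 * (mu - 2.0)  # analytic gradient of E[Q] for this toy

print(f"true gradient         : {true_grad:.3f}")
print(f"score-function  mean  : {statistics.mean(score_fn):.3f}  var: {statistics.pvariance(score_fn):.1f}")
print(f"pathwise        mean  : {statistics.mean(pathwise):.3f}  var: {statistics.pvariance(pathwise):.1f}")
```

Both estimators agree with the analytic gradient on average, but the score-function samples are far more spread out, because they only see the value of $Q$ and never its slope.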

8. What DDPG and SAC do

Both are off-policy actor-critics whose actor is updated by $\nabla_a Q$, the pathwise gradient. DDPG’s actor is deterministic, $\mu_\theta(s)$; its update is $\nabla_\theta J = \mathbb{E}\big[\nabla_a Q|_{a=\mu_\theta(s)}\cdot \nabla_\theta \mu_\theta(s)\big]$. SAC makes the policy stochastic with an entropy bonus and writes the action as a reparameterized squashed Gaussian $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)$, then uses the same $\nabla_a Q$ through it. The crucial shared trait: the actor is one forward pass, so the gradient reaches the parameters in a single cheap hop. That single hop is the picture below: drag the action, hit ascend, and it walks uphill along $\nabla_a Q$:
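
As a minimal sketch of that single hop, here is a DDPG-style actor update written out by hand for a scalar problem. The linear actor $\mu_\theta(s) = \theta s$ and the critic $Q(s, a) = -(a - 2s)^2$ are hypothetical choices made so the chain rule can be read off directly; a real implementation would use networks and autograd.

```python
def q_value(s: float, a: float) -> float:
    """Hypothetical critic: the best action in state s is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s: float, a: float) -> float:
    return -2.0 * (a - 2.0 * s)


def actor(theta: float, s: float) -> float:
    """Deterministic linear actor mu_theta(s) = theta * s."""
    return theta * s


def ddpg_actor_step(theta: float, states: list, lr: float) -> float:
    """One gradient-ascent step on the mean of Q(s, mu_theta(s)) over the batch."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        grad += dq_da(s, a) * s      # chain rule: dQ/da * dmu/dtheta, with dmu/dtheta = s
    return theta + lr * grad / len(states)


theta = 0.0
states = [-1.0, 0.5, 1.0]
for _ in range(200):
    theta = ddpg_actor_step(theta, states, lr=0.1)

mean_q = sum(q_value(s, actor(theta, s)) for s in states) / len(states)
print(f"theta = {theta:.6f}, mean Q = {mean_q:.2e}")
```

The parameter moves toward the critic's preferred slope of 2 using nothing but $\nabla_a Q$ and one derivative of the actor, which is the pattern RQL repeats at every denoise step.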

9. RQL’s actor update: make the SAC/DDPG trick “stepwise”

Now assemble everything. RQL’s policy loss is:

\[\mathcal{L}(v) = \underbrace{-\,\mathbb{E}_{\widetilde{\tau}}\big[V(s,\; x^f + v(s, x^f, f),\; f+1)\big]}_{\text{value maximization}} + \underbrace{\alpha\, \mathcal{L}^{\text{BC}}(v)}_{\text{behavior regularizer}}\]

Read it: at each partial action $x^f$, take one step $v$ to get $x^{f+1}$, and raise the value after that step, $V(\cdot, f+1)$. The gradient passes through that one call to $v$ — a single-step pathwise gradient. The BC term anchors the policy near the data (essential offline). The punchline is the gradient flow: DDPG/SAC actors are one forward pass (1 hop); RQL is $F$ steps but one hop per step, never BPTT through the chain. Hit backprop:
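
The per-step structure of this loss can be sketched without any networks. The toy below is an illustration under strong assumptions, not the RQL implementation: each step has its own free scalar velocity `theta[f]`, the critic is a frozen hypothetical function $V(x, f) = -(x - \tfrac{f}{F} a^\star)^2$, and the partial actions `xs` are a fixed stand-in for a reversed trajectory. What it does preserve is the key property: the gradient for step $f$ touches only that step's velocity, with $x^f$ treated as a constant.

```python
def dv_dx(x: float, f: int, F: int, a_star: float = 2.0) -> float:
    """Slope of a hypothetical frozen critic V(x, f) = -(x - (f / F) * a_star)^2."""
    return -2.0 * (x - (f / F) * a_star)


def stepwise_update(theta: list, xs: list, F: int, alpha: float, lr: float) -> list:
    """One gradient-descent step on the per-step loss, one hop per denoise step."""
    new_theta = []
    for f in range(F):
        x_f = xs[f]                      # treated as a constant (the "detach")
        x_next = x_f + theta[f]          # the single velocity call for this step
        bc_target = xs[f + 1] - xs[f]    # step implied by the reversed trajectory
        # d/dtheta_f of  -V(x_next, f + 1) + alpha * (theta_f - bc_target)^2
        grad = -dv_dx(x_next, f + 1, F) + 2.0 * alpha * (theta[f] - bc_target)
        new_theta.append(theta[f] - lr * grad)
    return new_theta


F, alpha = 4, 1.0
a_data = 1.0
xs = [(f / F) * a_data for f in range(F + 1)]   # stand-in reversed trajectory, x^0 = 0
theta = [0.0] * F
for _ in range(200):
    theta = stepwise_update(theta, xs, F, alpha, lr=0.1)

print("learned per-step velocities:", [round(t, 4) for t in theta])
print("BC-only velocities:         ", [round(xs[f + 1] - xs[f], 4) for f in range(F)])
```

Each learned velocity settles between the behavior-cloning step and the step the critic alone would prefer, with `alpha` setting the balance. No gradient ever travels from one denoise step to another.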

And here is that update actually running — the stepwise pathwise gradient bending a flow trajectory toward a high-value mode, one cheap hop per step:

# RQL policy update: stepwise pathwise gradient + BC
def policy_loss(batch, V_net, v_net, F, bc_coef):
    s, a = batch.s, batch.a
    with torch.no_grad():                     # reversal only supplies inputs; no gradient through it
        xs = flow_reversal(s, a, v_net, F)    # partial actions via reversal
    value_term = 0.0
    bc_term = 0.0
    for f in range(F):
        v_f = v_net(s, xs[f], f)              # the only call that receives gradient at step f
        x_next = xs[f] + v_f                  # x^{f+1}
        value_term = value_term - V_net(s, x_next, f + 1).mean()
        v_target = xs[f + 1] - xs[f]          # step along the reversed trajectory toward the dataset action
        bc_term = bc_term + ((v_f - v_target) ** 2).mean()
    return value_term / F + bc_coef * bc_term / F

10. The full training loop

def rql_train_step(batch, V_net, V_target, v_net, opt_V, opt_v, F, gamma, bc_coef, tau):
    loss_V = value_loss(batch, V_net, V_target, v_net, F, gamma)
    opt_V.zero_grad(); loss_V.backward(); opt_V.step()
    loss_v = policy_loss(batch, V_net, v_net, F, bc_coef)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    with torch.no_grad():                      # Polyak target update
        for p, p_t in zip(V_net.parameters(), V_target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)

for step in range(num_steps):
    batch = replay.sample(batch_size)          # offline, off-policy
    rql_train_step(batch, V_net, V_target, v_net, opt_V, opt_v, F, gamma, bc_coef, tau)

Everything is fully offline and off-policy; the intermediate steps are conjured by reversal on the fly — the dataset never stores denoise sequences.

11. Why it beats BPTT / distillation / weighted regression

RQL gets both sides at once: like DDPG/SAC it pushes the policy with stable first-order $\nabla V$ gradients, stepwise (no BPTT, no distillation); and via reversal + “equal values within an action,” the stretched horizon is free (zero-variance, unbiased, no effective-horizon inflation). The only cost is needing a value defined on partial actions — which is exactly what can be trained unbiasedly. The authors report the best offline-RL performance against 19 SOTA flow-RL algorithms across 50 tasks.

12. One line to remember

Treating denoise steps as RL actions stretches the value-learning horizon; but because the flow is deterministic, you can reverse-solve the intermediate steps from offline data, building zero-variance multi-step returns that cancel the horizon penalty; then a value function on partial actions lets you apply a cheap, stable, stepwise reparameterization gradient — that is RQL.

References

Aditya Oberai, Seohong Park, Sergey Levine. Reversal Q-Learning. 2026. Project page: aober.ai/rql
Background: DDPG (Lillicrap et al., 2016, arXiv:1509.02971); TD3 (Fujimoto et al., 2018, arXiv:1802.09477); SAC (Haarnoja et al., 2018, arXiv:1801.01290); Flow Matching (Lipman et al., 2023, arXiv:2210.02747).

Reminder: the code here is a teaching reconstruction for explaining the mechanism, not the official implementation; for the real discount bookkeeping, architecture, and regularizers, see the original paper and code.
