Discarded Predictions: Rewriting Action-Chunking's Waste as Latent Distributions, Reality-Grounded Supervision, and Lipschitz Geometry
A research note — an idea, not a result. The thread: action chunking throws away a huge amount of information on every inference. If we treat the future latent as a distribution, supervise it against real observations, and then bolt on Lipschitz / contraction geometry, we can upgrade “prediction error” from a scalar loss into a bounded, falsifiable certificate — and, with that, recast control as a constraint-satisfaction problem. Every figure below is a live p5.js demo; drag the sliders and the 3-D scenes as you read.
1. The starting point: action chunking, and what it quietly throws away
Modern VLA policies — ACT, Diffusion Policy, the π0 / π0.5 flow-matching family — mostly run on action chunking: predict a whole future action sequence $a_{t:t+H}$ at once (π0.5 uses $H=50$, about a second), execute only the first $s \le H$ steps, then re-observe and re-predict. Two hard-nosed reasons this became the default:
First, reality itself is the dynamics model. During the open-loop $s$ steps, the policy needs no learned forward model — it isn’t simulating a rollout, it’s actually moving the robot and letting the next real frame report what happened. The error is physical, and physics corrects it for free.
Second, actions are a privileged interface. They’re the one thing the robot can consume, and demonstrations come with action labels, so predicting actions is the most direct, best-grounded supervision you can get.
But precisely because it works so well, the paradigm quietly drops two kinds of information. The first is the unexecuted tail of every chunk — those $H-s$ steps. In a receding-horizon sense, discarding the tail is correct for control: it was computed from the stale observation $o_t$; by the time you reach $t+s$ you already have the fresher $o_{t+s}$, which should replace it. This “waste” is not a bug — drag $s$ below to see why.
The second kind is subtler and bigger: the policy’s implicit belief about the future. To predict a good $a_{t:t+H}$ at all, the model must have implicitly modeled “what the world will look like after these actions.” That belief is never exposed and never supervised — and it’s the genuinely large signal being wasted. The discarded tail, paired with the trajectory reality actually went on to take, is a free self-supervised label: “you predicted this, the world did that.” Scrub time below to harvest those pairs.
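The harvesting step can be sketched in a few lines. This is a toy under loud assumptions: `toy_policy`, `toy_world`, the scalar observation, and the values of `H` and `S` are all hypothetical stand-ins, and the policy's "belief" is taken to be its own nominal rollout of the chunk. The point is only the bookkeeping: every unexecuted tail step leaves behind a prediction for a time that reality later fills in.

```python
import math

H, S = 8, 3  # chunk length and executed prefix (toy values, not from any real policy)

def toy_policy(obs, horizon):
    """Hypothetical chunking policy: steer the scalar observation toward 0 in equal steps."""
    return [-obs / horizon for _ in range(horizon)]

def toy_world(obs, action, t):
    """Hypothetical true dynamics: the action plus a disturbance the policy does not model."""
    return obs + action + 0.05 * math.sin(t)

def harvest(obs0, n_chunks):
    real = {0: obs0}     # time -> observation that reality actually produced
    tail_preds = []      # (time, predicted observation) for unexecuted tail steps
    obs, t = obs0, 0
    for _ in range(n_chunks):
        chunk = toy_policy(obs, H)
        belief = obs
        for k, a in enumerate(chunk):
            belief += a                          # the policy's implied nominal rollout
            if k >= S:                           # tail step: predicted but never executed
                tail_preds.append((t + k + 1, belief))
        for a in chunk[:S]:                      # execute only the first S actions
            obs = toy_world(obs, a, t)
            t += 1
            real[t] = obs
    # Keep only predictions whose target time has been observed by now.
    return [(pred, real[tau]) for tau, pred in tail_preds if tau in real]

pairs = harvest(obs0=1.0, n_chunks=4)
print(len(pairs), "self-supervised (prediction, reality) pairs")
```

Predictions made for times beyond the end of the run stay unpaired, which is why the last chunk contributes nothing here.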
2. Make the implicit belief explicit — and ground it in reality
So let’s not throw the belief away. Have a latent forward predictor $g$ roll the current latent $z_t$ and the known actions into a future latent, and supervise it against the latent encoded from the real future observation:
\[\mathcal{L}_{\text{pred}}=\sum_{k=1}^{H} d\Big(\,g_k\big(z_t,\,a_{t:t+k}\big),\ \ \mathrm{sg}\!\big[\mathrm{enc}_{\text{target}}(o_{t+k})\big]\,\Big)\]
with $\mathrm{sg}[\cdot]$ a stop-gradient, $\mathrm{enc}_{\text{target}}$ an EMA target encoder, and $d$ usually a cosine distance. This is the same family of self-predictive / latent-consistency machinery used in model-based RL (SPR, TD-MPC2, DreamerV3), and it can be bolted straight onto a π0.5-style policy as a co-training head.
Plain version: predict the future in latent space, then grade the prediction against what reality actually became — never against your own guess.
That last clause is the trap you must not step in. If the target is the model’s own rolled-out latent, the objective is self-referential and the gradient can drag every latent onto a single constant point: classic representation collapse. The whole JEPA stack of stop-gradient / EMA / variance-covariance regularizers exists to prevent exactly this. Let the target come from the real $o_{t+k}$, and the prediction-vs-reality discrepancy becomes a gradient anchored in something the model did not generate. Because the target encoder is itself learned, the stop-gradient and EMA in $\mathcal{L}_{\text{pred}}$ still matter; grounding makes collapse much harder, not impossible. Flip the toggle below and watch it happen.
This is a training-time auxiliary loss (the target only exists after you’ve executed), good for relabeling already-collected trajectories offline or for online continual learning — it is not a way to speed up a single inference.
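A forward-only sketch of $\mathcal{L}_{\text{pred}}$ may make the grading rule concrete. Everything named here is an assumption for illustration: `encode` is a toy elementwise-linear encoder, `predict` is a toy predictor that shifts the latent by the summed actions, and there is no autograd, so the stop-gradient is represented simply by treating the target latent as a constant.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + 1e-12)

def ema_update(target_params, online_params, tau=0.99):
    """Target encoder parameters trail the online ones; no gradient flows through this."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def encode(params, obs):
    """Hypothetical encoder: elementwise scaling of the observation."""
    return [w * x for w, x in zip(params, obs)]

def predict(z, actions):
    """Hypothetical predictor g_k: shift the latent by the sum of the actions so far."""
    shift = [sum(a[i] for a in actions) for i in range(len(z))]
    return [zi + si for zi, si in zip(z, shift)]

def pred_loss(online, target, obs_seq, act_seq):
    """Sum over k of d(g_k(z_t, a_{t:t+k}), enc_target(o_{t+k}))."""
    z_t = encode(online, obs_seq[0])
    total = 0.0
    for k in range(1, len(obs_seq)):
        z_hat = predict(z_t, act_seq[:k])
        z_real = encode(target, obs_seq[k])   # built from the REAL future observation
        total += cosine_distance(z_hat, z_real)
    return total

actions = [[0.0, 1.0], [0.0, 1.0]]
as_predicted = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # world followed the nominal rollout
surprised = [[1.0, 0.0], [1.0, 1.0], [1.0, -2.0]]     # world did something else at k = 2
print(pred_loss([1.0, 1.0], [1.0, 1.0], as_predicted, actions))
print(pred_loss([1.0, 1.0], [1.0, 1.0], surprised, actions))
```

The second call is larger only because the recorded observation disagrees with the rollout, which is the signal the section argues should not be thrown away.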
3. Treat the latent as a distribution, not a point
A flow / diffusion policy already outputs a distribution, not a point estimate — so the future latent should be one too. The immediate payoff is keeping multimodality. “Go left” and “go right” around an obstacle are both valid; a point estimate, or naive time-averaging across chunks, crushes them into the mean — the straight line through the obstacle, the single most dangerous action, one nobody proposed. Drag the slider from “distribution” to “point estimate” and watch the collision risk climb.
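The averaging failure is easy to reproduce numerically. The sketch below assumes a one-dimensional lateral offset, an obstacle of radius 0.5 on the centerline, and a hypothetical bimodal policy (`sample_lateral_offset`); none of these values come from a real system.

```python
import random

OBSTACLE_Y, OBSTACLE_R = 0.0, 0.5   # toy obstacle sitting on the centerline

def sample_lateral_offset(rng):
    """Hypothetical bimodal policy: pass on the left (+1) or on the right (-1)."""
    mode = rng.choice([+1.0, -1.0])
    return mode + rng.gauss(0.0, 0.1)

def collides(y):
    return abs(y - OBSTACLE_Y) < OBSTACLE_R

rng = random.Random(0)
samples = [sample_lateral_offset(rng) for _ in range(1000)]

sample_collision_rate = sum(collides(y) for y in samples) / len(samples)
mean_action = sum(samples) / len(samples)   # what a point estimate would execute

print("collision rate when sampling a mode:", sample_collision_rate)
print("mean action:", round(mean_action, 3), "collides:", collides(mean_action))
```

Each sample commits to one side and clears the obstacle; the mean lands near zero, inside it.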
The same point bites at chunk boundaries. Splice two perfectly-valid chunks together naively — a hard cut, or a blind average — and the executed action jumps, which means a spike in acceleration: the kind of thing that snaps a wrist or trips a safety stop. RTC’s fix is to condition the new chunk on the already-committed one (“inpainting”), i.e. match the distribution across the seam, not the point. Widen the blend below and watch the spike vanish.
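The acceleration spike at a seam can be measured with a second difference. Note the swap: the code below uses a plain linear cross-fade, not RTC's inpainting, and a cross-fade still averages points, so it is only reasonable when both chunks sit in the same mode. The chunk shapes, the 0.2 offset, and the `splice` helper are illustrative assumptions.

```python
def second_diff_peak(xs):
    """Largest absolute discrete acceleration along the executed trajectory."""
    return max(abs(xs[i + 1] - 2 * xs[i] + xs[i - 1]) for i in range(1, len(xs) - 1))

def splice(old, new, seam, width):
    """Follow `old` up to the seam, then cross-fade to `new` over `width` steps (0 = hard cut)."""
    out = []
    for i in range(len(old)):
        if i < seam:
            w = 0.0
        elif width == 0 or i >= seam + width:
            w = 1.0
        else:
            w = (i - seam + 1) / (width + 1)
        out.append((1.0 - w) * old[i] + w * new[i])
    return out

n = 40
old = [0.01 * i for i in range(n)]          # previous chunk's plan
new = [0.01 * i + 0.2 for i in range(n)]    # fresh chunk, shifted after re-observing

hard = splice(old, new, seam=20, width=0)
soft = splice(old, new, seam=20, width=10)
print("hard cut peak:", second_diff_peak(hard))
print("cross-fade peak:", second_diff_peak(soft))
```

Widening the blend spreads the same 0.2 offset over more steps, so the peak falls roughly as one over the blend width.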
To be honest about the catch: distribution matching (KL / optimal transport / score matching) is harder and less stable to train than a consistency loss. No free lunch, but the multimodality is worth paying for.
4. Lipschitz & contraction — making the latent route trustworthy
Here’s the question that’s been hanging over everything: rolling out in latent space compounds error, and there’s no ground truth in latent space to measure that error against. This is the central difficulty of model-based control, and it’s the prerequisite for steps 2–3 to mean anything.
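Contraction is the key that pins it down. If the latent dynamics $F(\cdot,a)$ is $\kappa$-Lipschitz in some metric with $\kappa<1$ (a contraction), then the $k$-step rollout error stays bounded instead of exploding: the initial error decays geometrically, and the per-step model errors $\delta_j$ add up to at most a finite floor: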
\[\|e_k\|\le\kappa^{k}\,\|e_0\|+\frac{1}{1-\kappa}\,\sup_j\|\delta_j\|\]
Plain version: each step multiplies yesterday’s error by $\kappa$ and adds fresh noise. If $\kappa<1$, old mistakes fade away and the new noise can only pile into a finite puddle of height $\sup_j\|\delta_j\|/(1-\kappa)$. At $\kappa=1$ nothing fades and the noise accumulates step after step; for $\kappa>1$ errors feed on themselves and avalanche. Only when the dynamics is contractive is “correct the latent once, then roll” well-posed. Drag $\kappa$ across 1 below.
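The bound follows from unrolling the worst-case recursion $e_{k+1}\le\kappa\,e_k+\delta_k$ and summing the geometric series. A small numerical check, with the noise range and the two values of $\kappa$ chosen arbitrarily for illustration:

```python
import random

def rollout_error(kappa, e0, deltas):
    """Worst-case scalar recursion e_{k+1} = kappa * e_k + delta_k."""
    errs = [e0]
    for d in deltas:
        errs.append(kappa * errs[-1] + d)
    return errs

def bound(kappa, e0, k, delta_sup):
    """kappa^k * e0 + delta_sup / (1 - kappa); only meaningful for kappa < 1."""
    return kappa ** k * e0 + delta_sup / (1.0 - kappa)

rng = random.Random(0)
deltas = [rng.uniform(0.0, 0.1) for _ in range(50)]   # per-step model error, sup = 0.1

contracting = rollout_error(0.8, 1.0, deltas)
expanding = rollout_error(1.1, 1.0, deltas)

print("kappa=0.8 final error:", contracting[-1], "floor:", 0.1 / (1 - 0.8))
print("kappa=1.1 final error:", expanding[-1])
```

With $\kappa=0.8$ the error settles under the floor of $0.1/0.2=0.5$ regardless of the initial error; with $\kappa=1.1$ the same noise sequence is amplified without limit.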
Once the latent is a distribution, the natural way to measure “prediction vs reality diverged” is the Wasserstein distance — and Wasserstein and Lipschitz are dual (Kantorovich–Rubinstein):
\[W_1(\mu,\nu)=\sup_{\mathrm{Lip}(g)\le 1}\Big(\mathbb{E}_{\mu}[g]-\mathbb{E}_{\nu}[g]\Big)\]
Plain version: the cost of shovelling one pile of sand onto another equals the best score any “gentle ruler” (a function whose slope never exceeds 1) can give by rewarding $\mu$ and docking $\nu$. Drag the ruler below; it can never beat the true $W_1$, and the best one touches it exactly.
And if the observation map $f$ is $L$-Lipschitz, pushing both distributions through it can’t blow their distance up by more than $L$:
\[W_1(f_{\#}\mu,\,f_{\#}\nu)\ \le\ L\,W_1(\mu,\nu)\]
That’s a one-way exchange rate between “distribution drift in latent space” and “divergence I see in observation space”: the second can be at most $L$ times the first. The error stops being a bare scalar loss and becomes a bounded, geometric quantity.
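Both inequalities can be checked on one-dimensional samples, where $W_1$ between two equal-size empirical distributions is computed exactly by matching sorted samples. The Gaussians, the candidate rulers, and the map `f` below are arbitrary choices for illustration.

```python
import math
import random

def w1_empirical(xs, ys):
    """Exact W1 between two equal-size 1-D empirical distributions (sorted matching)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def witness_gap(g, xs, ys):
    """E_mu[g] - E_nu[g] for a candidate ruler g."""
    return sum(map(g, xs)) / len(xs) - sum(map(g, ys)) / len(ys)

rng = random.Random(0)
mu = [rng.gauss(0.0, 1.0) for _ in range(500)]
nu = [rng.gauss(1.5, 0.5) for _ in range(500)]
w1 = w1_empirical(mu, nu)

# Four 1-Lipschitz rulers; by the duality none of them can score above W1.
rulers = {
    "identity": lambda x: x,
    "negated": lambda x: -x,
    "abs": abs,
    "clipped": lambda x: max(-1.0, min(1.0, x)),
}
gaps = {name: witness_gap(g, mu, nu) for name, g in rulers.items()}

L = 3.0
def f(x):
    """An L-Lipschitz map: the slope of L * tanh never exceeds L."""
    return L * math.tanh(x)

w1_pushed = w1_empirical([f(x) for x in mu], [f(y) for y in nu])
print("W1:", w1, "best ruler score:", max(gaps.values()))
print("W1 after pushforward:", w1_pushed, "cap L*W1:", L * w1)
```

The rulers here are hand-picked, so the best of them only lower-bounds $W_1$; the supremum in the duality is over all 1-Lipschitz functions.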
The 1-D error bound is easier to believe once you can walk around it. Below, twenty corrected latents are rolled forward by the same map: with $\kappa<1$ they all spiral into one fixed point (the initial spread is forgotten — so it doesn’t matter exactly where you corrected to); push $\kappa$ past 1 and the same map flings them apart. Drag to orbit.
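A two-dimensional version of that scene fits in a few lines. The map below is an assumed toy (rotate about a fixed point, then scale the offset by $\kappa$), chosen because the spread of the point cloud then changes by exactly a factor $\kappa$ per step.

```python
import math
import random

def step(z, kappa, theta=0.4, fixed=(1.0, -0.5)):
    """Rotate about `fixed` by theta and scale the offset from it by kappa."""
    dx, dy = z[0] - fixed[0], z[1] - fixed[1]
    c, s = math.cos(theta), math.sin(theta)
    return (fixed[0] + kappa * (c * dx - s * dy), fixed[1] + kappa * (s * dx + c * dy))

def roll(points, kappa, steps):
    for _ in range(steps):
        points = [step(p, kappa) for p in points]
    return points

def spread(points):
    """Largest pairwise distance in the cloud."""
    return max(math.dist(p, q) for p in points for q in points)

rng = random.Random(0)
latents = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(20)]

print("initial spread:", spread(latents))
print("kappa=0.9, 30 steps:", spread(roll(latents, 0.9, 30)))
print("kappa=1.1, 30 steps:", spread(roll(latents, 1.1, 30)))
```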
5. The inverse projection — the crux
The genuinely interesting direction is backwards: from a divergence you see in observation space (that “swerve”), infer how much the latent should be corrected. That needs not the forward Lipschitz (an upper bound) but a co-Lipschitz lower bound:
\[\ell\,\|z_1-z_2\|\le\|f(z_1)-f(z_2)\|\quad\Longrightarrow\quad\|z_1-z_2\|\le\tfrac{1}{\ell}\,\|f(z_1)-f(z_2)\|\]
Both directions together make $f$ bi-Lipschitz:
\[\ell\,\|z_1-z_2\|\ \le\ \|f(z_1)-f(z_2)\|\ \le\ L\,\|z_1-z_2\|\]
Plain version: $f$ is a fair currency exchange that never shrinks a distance below $\ell\times$ or grows it above $L\times$. The lower rail $\ell$ is the one that lets you read a latent gap off an observation gap. Drag a point below; the output stays sandwiched in the ring between the $\ell$ and $L$ circles, and the inner ring dies as $\ell\to 0$.
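The sandwich and the read-off can be checked on the simplest map that has both rails. The diagonal map `f` and the values of `ELL` and `BIG_L` are assumptions for illustration; a learned $f$ would only have such rails locally, if at all.

```python
import math
import random

ELL, BIG_L = 0.25, 2.0   # toy lower and upper rails

def f(z):
    """Toy observation map: stretch axis 0 by L, squash axis 1 by ell."""
    return (BIG_L * z[0], ELL * z[1])

def latent_gap_bound(obs_gap, ell):
    """Co-Lipschitz read-off: ||z1 - z2|| <= obs_gap / ell."""
    return obs_gap / ell

rng = random.Random(0)
sandwiched = []
for _ in range(1000):
    z1 = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    z2 = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    dz = math.dist(z1, z2)
    do = math.dist(f(z1), f(z2))
    ok = ELL * dz - 1e-12 <= do <= BIG_L * dz + 1e-12
    sandwiched.append(ok and dz <= latent_gap_bound(do, ELL) + 1e-12)

print("all pairs inside the ring:", all(sandwiched))
for ell in (0.5, 0.1, 0.001):
    print("ell =", ell, "-> latent gap bound for obs gap 0.01:", latent_gap_bound(0.01, ell))
```

The last loop shows why the lower rail is the expensive one: the same observation gap certifies a latent gap a thousand times looser once $\ell$ drops to 0.001.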
I want to be honest: this step is the crux of the whole idea, and the hardest part. Forward Lipschitz is cheap (a dab of spectral normalization). The lower bound $\ell$ is brutal: deep nets collapse information by default (many latent directions map to the same output), and without a lower bound a tiny observation change can correspond to an enormous latent change — the inverse projection just fails. The 3-D scene below is observation space: $f$ squashes one axis by $\ell$, so a fixed little observation gap pulls back to a latent needle of length $d/\ell$. Drag $\ell$ toward 0 and the needle explodes.
Getting bi-Lipschitz essentially needs an invertible architecture, which dovetails with “treat the latent as a distribution and sample it”: a normalizing flow is bijective by construction, bi-Lipschitz-ish, and gives exact densities. That’s not a coincidence; the constraint is choosing the architecture for us. One more engineering reality: global Lipschitz constants in deep nets are almost always vacuous. What’s usable is a local, per-region certificate, not one global number.
5.5 Why a normalizing flow actually is bi-Lipschitz
I claimed a normalizing flow hands you bi-Lipschitz almost for free. Since that’s the one load-bearing assumption, it deserves a derivation.
Step 1 — bi-Lipschitz is a statement about the Jacobian’s singular values. Zoom in near a point $z$ and the map looks linear, $f(z+\delta)\approx f(z)+J(z)\delta$, with $J$ the Jacobian. The local stretch in direction $\delta$ is $\|J\delta\|/\|\delta\|$, and over all directions that ratio sweeps exactly the interval between the smallest and largest singular values of $J$:
\[\sigma_{\min}(J)\ \le\ \frac{\|J\delta\|}{\|\delta\|}\ \le\ \sigma_{\max}(J)\]
So the local upper rail is $L=\sigma_{\max}$ and the local lower rail is $\ell=\sigma_{\min}$. If $\sigma_{\min}\ge\ell>0$ and $\sigma_{\max}\le L$ hold across a region, the map is locally bi-Lipschitz there with those rails; turning that into a bound between two far-apart points additionally needs the map to be injective and the region to be well-behaved (for example convex). Geometrically a tiny circle maps to an ellipse with semi-axes $\sigma_{\max},\sigma_{\min}$, and bi-Lipschitz means that ellipse never collapses to a line.
Plain version: the little ellipse you get by pushing a little circle through the map is never a pancake and never a black hole — its fattest and thinnest radii both stay finite and above zero. Drag the probe and watch it.
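For a $2\times 2$ Jacobian the two singular values have a closed form, so the claim can be checked by sweeping the direction of $\delta$ around the circle. The matrix `J` is an arbitrary example.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]] from the eigenvalues of J J^T."""
    s1 = a * a + b * b + c * c + d * d
    s2 = math.sqrt((a * a + b * b - c * c - d * d) ** 2 + 4.0 * (a * c + b * d) ** 2)
    return math.sqrt((s1 + s2) / 2.0), math.sqrt(max(s1 - s2, 0.0) / 2.0)

def stretch(J, theta):
    """||J delta|| / ||delta|| for the unit direction at angle theta."""
    (a, b), (c, d) = J
    dx, dy = math.cos(theta), math.sin(theta)
    return math.hypot(a * dx + b * dy, c * dx + d * dy)

J = ((2.0, 1.0), (0.0, 0.5))
sig_max, sig_min = singular_values_2x2(2.0, 1.0, 0.0, 0.5)
ratios = [stretch(J, 2.0 * math.pi * i / 3600) for i in range(3600)]

print("sigma_max, sigma_min:", sig_max, sig_min)
print("observed stretch range:", min(ratios), max(ratios))
```

This `J` is triangular with diagonal entries 2 and 0.5, yet its singular values are about 2.25 and 0.44: the diagonal of a triangular Jacobian is not its pair of rails.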
A coupling-layer flow warps the grid without ever folding it (a bijection), and the unit circle at the probe becomes an ellipse whose axes are exactly the local $\sigma_{\max}$ and $\sigma_{\min}$ — the local $L$ and $\ell$. Drag the probe; raise the layers and the clamp.
Step 2 — invertible ⟺ no fold ⟺ the lower rail survives. $\det J=\prod_i\sigma_i\neq 0$ at a point means no direction is crushed to zero there, i.e. $\sigma_{\min}>0$, which is the same statement as “the warped grid does not fold at that point.” Strictly, a nonzero determinant everywhere only gives local invertibility (the inverse function theorem); a flow adds global invertibility by construction. Where $f$ is invertible, the inverse has Jacobian $J^{-1}$, whose singular values are $1/\sigma_i$, so the worst point in the region sets the constant:
\[\mathrm{Lip}(f^{-1})=\frac{1}{\inf_z\sigma_{\min}(J(z))}=\frac{1}{\ell}\]
That $1/\ell$ is literally the factor in the inverse-projection bound from §5, the one that sets the radius of the latent trust region. A flow lets you read a latent gap off an observation gap precisely because it keeps $\sigma_{\min}$ off the floor.
Step 3 — two constructions that guarantee it on purpose.
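(a) Coupling layers (RealNVP / NICE). Split the latent in two, $z=(z_a,z_b)$, and update one half conditioned on the other: $z_b'=z_b\odot e^{s(z_a)}+t(z_a)$. The Jacobian is block-triangular, so $\det=\prod e^{s}=e^{\sum s}$, and the layer is invertible in closed form whatever $s$ and $t$ are. Clamp $s\in[-c,c]$ and the diagonal entries stay in $[e^{-c},e^{c}]$, which pins the determinant: explicit, cheap, exact. The singular values are not just the diagonal, though: the off-diagonal block contains the derivatives of $s$ and $t$ with respect to $z_a$, so honest bi-Lipschitz rails also need $s$ and $t$ to be Lipschitz-bounded (and $z_b$ bounded). With that in place the rails are controlled, if no longer exactly $[e^{-c},e^{c}]$. That’s the flow in the demo above.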
(b) Residual flows (i-ResNet). Take $f(z)=z+g(z)$ and force $\mathrm{Lip}(g)=\kappa<1$ with spectral normalization. The reverse triangle inequality gives
\[(1-\kappa)\,\|\delta\|\ \le\ \|f(z+\delta)-f(z)\|\ \le\ (1+\kappa)\,\|\delta\|\]
so $\ell=1-\kappa$, $L=1+\kappa$: bi-Lipschitz, and invertible by the Banach fixed-point theorem precisely because $\kappa<1$. Look at what that is: the same condition $\kappa<1$ that made the latent rollout converge back in §4 is what makes the flow invertible here, now applied to the residual branch $g$ rather than to the dynamics $F$. One inequality, third appearance. Push $\kappa$ past 1 below and the grid folds.
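Both constructions in Step 3 can be sketched without any deep-learning library. The networks `s_net`, `t_net`, and `g` are hypothetical one-line stand-ins, the clamp is implemented with `tanh` (one common choice, assumed here), and the residual inverse uses the plain fixed-point iteration that the Banach argument justifies.

```python
import math
import random

# (a) Affine coupling layer on z = (z_a, z_b).
C = 1.0  # clamp on the log-scale s

def s_net(z_a):
    return [C * math.tanh(0.7 * v) for v in z_a]       # always inside [-C, C]

def t_net(z_a):
    return [0.3 * v * v for v in z_a]

def coupling_forward(z_a, z_b):
    s, t = s_net(z_a), t_net(z_a)
    out_b = [zb * math.exp(si) + ti for zb, si, ti in zip(z_b, s, t)]
    return z_a, out_b, sum(s)                           # log|det J| = sum of s

def coupling_inverse(z_a, out_b):
    s, t = s_net(z_a), t_net(z_a)
    return z_a, [(ob - ti) * math.exp(-si) for ob, si, ti in zip(out_b, s, t)]

# (b) Residual layer f(z) = z + g(z) in 2-D with Lip(g) <= KAPPA < 1.
KAPPA = 0.6

def g(z):
    return (KAPPA * math.tanh(z[1]), -KAPPA * math.tanh(z[0]))

def residual_forward(z):
    r = g(z)
    return (z[0] + r[0], z[1] + r[1])

def residual_inverse(x, iters=100):
    """Fixed-point iteration z <- x - g(z); it converges because KAPPA < 1."""
    z = x
    for _ in range(iters):
        r = g(z)
        z = (x[0] - r[0], x[1] - r[1])
    return z

z_a, z_b = [0.5, -2.0], [1.0, 3.0]
_, y_b, log_det = coupling_forward(z_a, z_b)
_, z_b_back = coupling_inverse(z_a, y_b)

z = (0.3, -1.2)
z_back = residual_inverse(residual_forward(z))

rng = random.Random(0)
stretches = []
for _ in range(200):
    p = (rng.uniform(-3, 3), rng.uniform(-3, 3))
    q = (rng.uniform(-3, 3), rng.uniform(-3, 3))
    stretches.append(math.dist(residual_forward(p), residual_forward(q)) / math.dist(p, q))

print("coupling round trip:", z_b_back, "log|det J|:", log_det)
print("residual round trip:", z_back)
print("residual stretch range:", min(stretches), max(stretches))
```

The coupling layer inverts in one closed-form pass, while the residual layer pays for its simpler rails with an iterative inverse.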
Residual flow $f=z+g$: while $\mathrm{Lip}(g)=\kappa<1$ the grid is a clean bijection (rails $[1-\kappa,1+\kappa]$); cross $\kappa=1$ and red folds appear where $\det J$ flips sign — invertibility is lost. The same $\kappa<1$ as the rollout.
Step 4 — stacking, and the density you get for free. Composition is submultiplicative: for $f=f_n\circ\cdots\circ f_1$ with per-layer rails $\ell_i, L_i$ we get $\ell\ge\prod_i\ell_i$ and $L\le\prod_i L_i$, so a deep stack stays bi-Lipschitz; the rails just compound (that’s the “layers” slider in the first demo). And the same nonzero Jacobian that buys invertibility also buys an exact density through change-of-variables, $\log p(x)=\log p(z)-\log\lvert\det J\rvert$ for $x=f(z)$. The lower bound that makes the inverse projection possible is the very same one that makes the likelihood well-defined, which is why “treat the latent as a distribution” (§3) and “invert observation gaps into latent corrections” (§5) end up wanting the same architecture. The constraint isn’t a nuisance; it’s quietly choosing the model for us.
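The change-of-variables formula is easiest to trust on a case with a known answer. The sketch below stacks three scalar affine layers (the coefficients are arbitrary), for which each layer's two rails coincide at $|a|$ and the composed map is $x=3z-7.5$, so a standard normal $z$ must come out as a normal with mean $-7.5$ and standard deviation 3.

```python
import math

def normal_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

# Each layer is x = a * z + b; its lower and upper rails are both |a|.
layers = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.0)]

def flow(z):
    """Push z through the stack, accumulating log|det J| layer by layer."""
    log_det = 0.0
    for a, b in layers:
        z = a * z + b
        log_det += math.log(abs(a))
    return z, log_det

def log_px(z):
    """log p(x) = log p(z) - log|det J| with z ~ N(0, 1) and x = flow(z)."""
    x, log_det = flow(z)
    return x, normal_logpdf(z, 0.0, 1.0) - log_det

stacked_rail = math.prod(abs(a) for a, _ in layers)   # rails multiply across layers

for z in (-1.0, 0.0, 2.5):
    x, lp = log_px(z)
    print(x, lp, normal_logpdf(x, -7.5, 3.0))
```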
6. Redefining the problem: from loss to certificate
Put it together and the lower bound pays off. You measure an observation gap $d$ between prediction and reality; the co-Lipschitz bound guarantees the true latent correction lives inside a ball of radius $d/\ell$ around your current latent. Not a guess — a certificate. With a decent $\ell$ it’s tight and you fix the latent without searching; shrink $\ell$ and the ball swallows the space and goes vacuous. Drag both sliders below.
And here’s the upgrade that makes the whole detour worth it. The model’s dynamics asserts that the next latent must land inside a $\kappa$-reachable set. Encode the real observation and check:
- Inside the set → an ordinary belief update, plus a free reality-grounded training target (exactly §2).
- Outside the set (more than the Lipschitz step allows) → this is not a big prediction error to average away. It’s the world doing something the model swore was impossible: out-of-distribution, or an unmodelled disturbance.
Control becomes a constraint-satisfaction / certificate problem: a boundary violation is itself a high-value event — it can drive a surprise / curiosity signal, trigger full replanning, or mark the state as somewhere to go collect data. A plain MSE blurs “small wobble where the model is confident” and “the model was just falsified” into one number; this framing keeps them apart and gives the second one a crisp, actionable meaning. Drag “reality” in and out of the ring.
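As a sketch of the two-way check above: the note does not pin down the shape of the reachable set, so the code assumes the simplest one, a ball around the predicted next latent with radius $\kappa r+\text{slack}$, where $r$ is the current belief radius and the slack is a per-step model-error budget. `reachable_radius`, `check_step`, and the event names are hypothetical.

```python
import math

def reachable_radius(kappa, belief_radius, model_slack):
    """Assumed form of the kappa-reachable set: a ball of radius kappa * r + slack."""
    return kappa * belief_radius + model_slack

def check_step(z_pred, z_real, kappa, belief_radius, model_slack):
    """Compare the encoded real observation with what the dynamics allowed."""
    gap = math.dist(z_pred, z_real)
    radius = reachable_radius(kappa, belief_radius, model_slack)
    if gap <= radius:
        # Ordinary case: update the belief and keep the pair as a training target.
        return {"event": "belief_update", "gap": gap, "radius": radius,
                "actions": ["update_belief", "store_training_pair"]}
    # The model's own bound was violated: treat it as a falsification event.
    return {"event": "violation", "gap": gap, "radius": radius,
            "actions": ["raise_surprise", "replan", "flag_for_data_collection"]}

inside = check_step((0.0, 0.0), (0.1, 0.0), kappa=0.8, belief_radius=0.2, model_slack=0.05)
outside = check_step((0.0, 0.0), (1.0, 0.0), kappa=0.8, belief_radius=0.2, model_slack=0.05)
print(inside["event"], inside["gap"], inside["radius"])
print(outside["event"], outside["gap"], outside["radius"])
```

An MSE loss would report 0.01 and 1.0 for these two steps; the check reports two different kinds of event.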
7. One thread running through all of it
Worth stressing: one family of Lipschitz conditions, with contraction as its $\kappa<1$ case, holds up three different things at once.
- It makes latent-rollout error converge geometrically → it answers “won’t rolling out in latent space drift?” (§4, the 3-D field).
- It makes the distributional supervision signal bounded and measurable, via the Wasserstein–Lipschitz duality → it makes §2–§3 interpretable.
- It makes an observation-space divergence invertible into a latent trust region, via bi-Lipschitz → it makes the §5–§6 certificate exist.
Three apparently separate ideas, one shared mathematical premise. That’s also why “betting on which map satisfies which property” is the whole ballgame.
8. The design choices that decide everything
Whether any of this stands up depends on a few not-yet-pinned-down choices:
- Lipschitz of which map, in which metric? Encoder $o\to z$, decoder $z\to o$, dynamics $(z,a)\to z'$, policy $z\to a$: they mean entirely different things. My bet: bi-Lipschitz encoder + contractive dynamics, and the metric is probably not Euclidean but a learned Riemannian / contraction metric.
- Local vs global bounds. Global is vacuous; everything has to be local, and getting reliable local $L$ and $\ell$ efficiently is the core difficulty.
- The cost of invertibility. Normalizing flows hand you bi-Lipschitz and exact density, but at a price in capacity and training stability, to be balanced against the action objective.
- The benefit is context-dependent. Prediction / consistency objectives sometimes help little in-distribution and more on generalization and long horizons. “The signal is being wasted” is true; “using it is always a net win” needs experiments.
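To make the local-versus-global point concrete, here is a sampling probe for local rails. It is a sketch, not a certificate: random pairs can only witness stretch ratios, so the true local $\ell$ may be smaller and the true local $L$ larger than what it returns. The map `f` (saturating along one axis) and `local_rails` are hypothetical.

```python
import math
import random

def f(z):
    """Hypothetical map: saturates along axis 0, identity along axis 1."""
    return (math.tanh(z[0]), z[1])

def local_rails(center, radius, n_pairs=2000, seed=0):
    """Empirical (min, max) stretch ratio over random pairs in a box around `center`."""
    rng = random.Random(seed)

    def draw():
        return (center[0] + rng.uniform(-radius, radius),
                center[1] + rng.uniform(-radius, radius))

    lo, hi = float("inf"), 0.0
    for _ in range(n_pairs):
        p, q = draw(), draw()
        d = math.dist(p, q)
        if d < 1e-12:
            continue                      # skip coincident points
        ratio = math.dist(f(p), f(q)) / d
        lo, hi = min(lo, ratio), max(hi, ratio)
    return lo, hi

near = local_rails(center=(0.0, 0.0), radius=0.1)
far = local_rails(center=(3.0, 0.0), radius=0.1)
print("near the origin (ell, L) estimate:", near)
print("in the saturated region (ell, L) estimate:", far)
```

Near the origin the lower rail is close to 1; in the saturated region it drops by roughly two orders of magnitude, so a single global $\ell$ would be set by the worst region and say almost nothing about the good one.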
9. Closing
Action chunking wins as the default because it outsources the hardest job — dynamics — to the real world, and the real world speaks in actions. I’m not arguing to overthrow it. I’m arguing to scoop back the implicit future belief it discards: treat it as a distribution, ground it in reality, and harden it with Lipschitz / contraction geometry.
The reward of that route isn’t “save one inference.” It’s a more structured problem definition — prediction error upgraded to a bounded, falsifiable certificate, control recast as constraint satisfaction. The cost is concentrated in one place: the bi-Lipschitz / local-bound bone. Whether that bone can be chewed pretty much decides whether this is an elegant idea on paper or an actual method.
References
Pointers, not a precise bibliography — read these for the real thing.
- ACT / ALOHA — Zhao et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv:2304.13705
- Diffusion Policy — Chi et al. arXiv:2303.04137 · π0 arXiv:2410.24164 · π0.5 arXiv:2504.16054
- RTC (real-time / async chunking via inpainting) — Black et al. Real-Time Execution of Action Chunking Flow Policies. arXiv:2506.07339
- Self-predictive / latent consistency — SPR (Schwarzer et al., arXiv:2007.05929) · TD-MPC2 (arXiv:2310.16828) · DreamerV3 (arXiv:2301.04104)
- Predict & plan in latent — V-JEPA 2 / V-JEPA 2-AC (Assran et al., arXiv:2506.09985); JEPA (LeCun, 2022)
- Geometry — Kantorovich–Rubinstein duality (Villani, Optimal Transport) · contraction / incremental stability (Lohmiller & Slotine, Automatica 1998) · normalizing flows (Rezende & Mohamed, arXiv:1505.05770; Papamakarios et al. survey, arXiv:1912.02762)