Wind Tunnel Theory: The Engineering Endgame of Robot Learning

Why the data → model manifold can’t be crossed by intuition, and why iteration speed is the next battlefield. With animated diagrams and real-robot footage from frontier teams woven in.

Real-robot operation — Mobile ALOHA cooking, cleaning, and manipulating in real homes (Fu, Zhao, Finn, Stanford, arXiv:2401.02117). This is the world a VLA policy is ultimately graded in — and the only place its grade can be read.

A field with no north star

In most engineering problems, you know the destination before you set out. How much load the bridge must carry, how high the chip must clock — the target is a number you can write down and compute ahead of time. You walk toward that number; how fast you walk is a question of skill, but the direction was clear from the start.

Robot learning is not like that.

When you train a VLA (Vision-Language-Action) model to operate in the real world, you face a night sky with no north star. You cannot say “this batch of data, trained into this model, will perform this well on the real robot” — because between (data distribution, architecture, recipe) and (downstream real-robot performance) there is no closed form, no formula you can evaluate in advance. The map objectively exists, but the only tool humanity currently has to evaluate it is to actually run the training, then actually put the model on the machine and run it.

This isn’t because we aren’t clever enough. It is an intrinsic property of the map. Admitting that is step one to doing robot learning right.

The wind tunnel: when theory gives no answer

A hundred-odd years ago, aeronautical engineers faced the same predicament.

Why does a wing generate lift? The theoretical argument around that question ran for decades. But the Wright brothers didn’t wait for theory to settle it. In 1901 they built a wind tunnel: a box that blows air over an airfoil at a controlled speed and measures the lift and drag. Inside that box, they systematically tested about two hundred airfoils, recorded how each behaved, and picked the best.

The value of the wind tunnel is not that it “picked a good airfoil.” If someone says “a wind tunnel just judges which airfoil is better — low value,” they’ve missed the entire point. The whole value of the wind tunnel is this: for a function theory cannot predict, it is the only reliable measuring instrument. Without it you can only guess; with it you can measure.

Robot learning needs its own wind tunnel: the engineering system that closes the loop between training and real-robot evaluation, that can systematically measure “this data → this model → this real-robot performance.” Its reason to exist is identical to the Wright brothers’ wooden box: on this problem, experiment is the only evaluator, and theory is not.

SOP shapes the input, but never touches the function

A common misreading: as long as you make the data-collection SOP (standard operating procedure) detailed and standardized enough, performance will follow.

That treats the model as a system that “executes a procedure.” But the model isn’t written, it’s learned. What an SOP can decide is which region of data space you sampled — i.e. what you fed in. What it cannot decide is the shape of the map from “the distribution you fed in” to “the policy behavior you learned.” In between sits an opaque learning process. You can shape the input; you cannot shape the function from input to behavior. That is the heart of the manifold problem: between data distribution and model behavior lies a high-dimensional, curved manifold whose shape your intuition cannot trace, and the SOP — that chisel — simply can’t reach it.

Some know-how genuinely can be front-loaded: data ratios, baseline quality filtering, recipe priors known to work. Put those in before you start and you’ll waste fewer turns. But they act on the input end. The stretch from input to model performance — no prior walks it for you.

Simulation is biased, exactly where it matters

So can simulation route around it?

Sim is a cheaper evaluator. It’s useful, and it’s biased. The sim-to-real gap isn’t noise; it’s systematic bias, and it’s biased in the deadliest places: contact dynamics, sensor noise, materials and lighting, the long tail of real operating conditions — precisely the parts a model must learn and a simulator struggles most to reproduce.

There is correlation between sim and real, true. But the exploitable part of that correlation is something you can already capture with SOP and priors. The residual that’s left is exactly the part that decides real-robot success, and it can only be measured on hardware. You cannot bootstrap a guarantee about real-world performance purely from sim.

So what blows through the wind tunnel must, in the end, be real wind.
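
The difference between noise and systematic bias can be made concrete with a toy simulation. The sketch below assumes a hypothetical policy whose true real-robot success rate is 0.60 and a simulator that overstates it by a fixed 0.15; both numbers are invented for illustration and come from no real system. Running more simulated trials shrinks the noise, but the estimate settles on the biased value, not the real one.

```python
import random

TRUE_REAL_SUCCESS = 0.60  # hypothetical real-robot success rate
SIM_BIAS = 0.15           # hypothetical, fixed optimism of the simulator


def sim_estimate(num_trials: int, rng: random.Random) -> float:
    """Estimate success rate from simulated rollouts drawn at the biased rate."""
    p_sim = TRUE_REAL_SUCCESS + SIM_BIAS
    successes = sum(rng.random() < p_sim for _ in range(num_trials))
    return successes / num_trials


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    est = sim_estimate(n, rng)
    # The gap stops shrinking once noise is averaged out: what remains is bias.
    print(f"{n:>7} sim trials: estimate={est:.3f}, "
          f"gap to real={est - TRUE_REAL_SUCCESS:+.3f}")
```

With enough trials the gap hovers near the assumed 0.15 no matter how much simulation you buy. Only a measurement taken on hardware can remove it.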

Iteration isn’t rework — it’s how the field breathes

If the destination is unknowable, the SOP can’t reach the function, and sim is biased, only one path remains: train → evaluate on the real robot → locate the failure → collect data to target it → retrain. Around once, then around again.
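
As a rough sketch, the loop can be written down in a few lines. Everything here is a stand-in: `ToyPolicy`, `train`, `evaluate_on_robot`, and the saturating success curve are hypothetical placeholders for a real training run and real hardware trials, chosen only so the control flow is runnable.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Stand-in for a trained model: success depends on per-scenario data."""
    episodes: dict[str, int] = field(default_factory=dict)

    def success_rate(self, scenario: str) -> float:
        # Hypothetical saturating curve: more episodes help, with diminishing returns.
        n = self.episodes.get(scenario, 0)
        return n / (n + 50)


def train(dataset: dict[str, int]) -> ToyPolicy:
    return ToyPolicy(dict(dataset))  # placeholder for a real training run


def evaluate_on_robot(policy: ToyPolicy, scenarios: list[str]) -> dict[str, float]:
    return {s: policy.success_rate(s) for s in scenarios}  # placeholder for hardware trials


def run_loop(dataset, scenarios, target=0.75, batch=100, max_rounds=10):
    """train -> evaluate -> locate the failure -> collect for it -> retrain."""
    dataset = dict(dataset)  # do not mutate the caller's dataset
    history = []
    for round_idx in range(1, max_rounds + 1):
        policy = train(dataset)
        results = evaluate_on_robot(policy, scenarios)
        worst = min(results, key=results.get)  # the measured failure mode
        history.append((round_idx, worst, results[worst]))
        if results[worst] >= target:
            break
        dataset[worst] = dataset.get(worst, 0) + batch  # targeted collection
    return history


start = {"pick": 200, "pour": 50, "open_drawer": 0}
for round_idx, worst, rate in run_loop(start, list(start)):
    print(f"round {round_idx}: weakest scenario={worst}, success={rate:.2f}")
```

Note what the sketch cannot do: it cannot name the weakest scenario before the evaluation step has run. The next collection target is an output of measurement, not an input to planning.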

This gets misread two ways, both wrong.

One misreading is “rework” — as if needing to retrain means you botched the last round. No. The first time you train, you have no idea where the model will fail, because failure modes are only exposed after you train and test. You can’t pre-collect problems you don’t yet know exist. Each iteration isn’t patching holes; it’s using the last round’s measured results to illuminate the next blind spot. This is convergence, not remediation.

The other misreading cuts deeper: “great teams don’t need to iterate; needing iteration means you’re weak.”

That’s half true. Capability genuinely compresses iteration — a team with deep priors knows which architectures work, which ratios are good starting points, which hyperparameters not to bother trying, so it converges fast with few mistakes. A weak team might take ten turns; a strong team three. But no team takes zero turns.

Swapping “iteration count can be compressed by capability” for “iteration can be eliminated by capability” is a sleight of hand, not an argument. It also hides an unfalsifiable trap: if you succeed with little data, that was “as expected”; if you need more iteration, “you’re weak.” An argument that’s right no matter the outcome isn’t deep. It’s empty, because it makes no checkable prediction.

The hard counter-evidence: the most advanced VLA and robot-learning teams run the largest, densest train-eval-iterate loops and ablation studies there are. They don’t run experiments because they’re weak; they run them because they actually understand how this field works. An ablation is a controlled experiment — the scientific method itself. Calling it “a sign you don’t understand the algorithm” is like saying every scientist who runs controlled experiments doesn’t understand their field.
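
To make "controlled experiment" concrete, here is a minimal sketch of how an ablation set can be generated: start from a baseline and change exactly one factor per run. The factor names and values (`action_head`, `sim_data_ratio`, `image_aug`) are hypothetical examples, not a recipe from any of the teams mentioned.

```python
def ablation_configs(baseline: dict, alternatives: dict) -> list[dict]:
    """Return the baseline plus one config per single-factor change."""
    configs = [dict(baseline, name="baseline")]
    for factor, values in alternatives.items():
        for value in values:
            if value == baseline[factor]:
                continue  # identical to the baseline, nothing to ablate
            cfg = dict(baseline)
            cfg[factor] = value  # change one factor, hold the rest fixed
            cfg["name"] = f"{factor}={value}"
            configs.append(cfg)
    return configs


# Hypothetical factors, for illustration only.
BASELINE = {"action_head": "diffusion", "sim_data_ratio": 0.2, "image_aug": True}
ALTERNATIVES = {
    "action_head": ["diffusion", "autoregressive"],
    "sim_data_ratio": [0.0, 0.2, 0.5],
    "image_aug": [True, False],
}

for cfg in ablation_configs(BASELINE, ALTERNATIVES):
    print(cfg["name"])
```

Because every run differs from the baseline in one factor only, any change in measured real-robot performance can be attributed to that factor. That attribution is the whole point of the exercise.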

The frontier runs on exactly this loop: RT-1 → RT-2 → Open X-Embodiment/RT-X at Google DeepMind, π0 → π0.5 → π0.7 at Physical Intelligence, Octo and OpenVLA from Berkeley/Stanford, Gemini Robotics, NVIDIA’s GR00T. None of them skipped the turns — they ran more of them, faster. Below: a learned policy executing dexterous, long-horizon manipulation on real hardware.

10 can be compressed to 3, but never to 1

So can a smarter algorithm push data efficiency to the limit — so little iteration you barely need any?

That’s a direction in good taste, worth chasing forever. But state the boundary clearly.

Think of “wasted” exploration like this: a team that flails by intuition might need 10 units of cost before it finds the path. With solid engineering — closing the collect↔eval loop, making every round produce reusable conclusions, letting know-how truly accumulate — you can compress that 10 to 3, even 2. That’s the real value of engineering capability.

But you can’t compress it to 1, the cost of getting there in a single pass with no exploration at all. That isn’t for lack of effort. It’s because that last stretch, the map from data to model performance, has no analytic solution under current science. It can only be measured. You can make measurement extremely fast, cheap, and disciplined, pushing toward the physical limit of the problem; but you cannot make the act of measuring itself disappear. The set of engineering methods that pushes exploration cost toward that physical limit is the answer, and its name is the wind tunnel.

(Aside: the “generalize from little data” abilities — few-shot in large models, fast adaptation of foundation models — are themselves products of first scaling on massive data, not shortcuts around scaling. Few-shot generality is a prize of scaling, not a substitute for it. You can’t skip the foundation and move straight into the penthouse.)

The real divide: collection and inference live in different teams

Here the biggest engineering problem in robot learning today surfaces — not an algorithm, but an org chart.

In many teams, data collection and model inference are split. One group collects data by SOP; another trains models and watches the results. Between them sit handoffs, processes, and separate KPIs. The collection team doesn’t know what consequences its data caused inside the model; the model team can’t quickly reach back and adjust where the data came from.

But a wind tunnel is, in essence, a closed loop. Blow, measure, adjust the airfoil, blow again — it has to happen fast, in one circuit, by one pair of hands. If the person designing the airfoil and the person measuring lift belong to two departments and every handoff takes three days, the wind tunnel is dead — its entire value is in iteration speed, and the split kills speed outright.

So the fix isn’t only technical, it’s organizational design: integrate data collection and model inference, through an engineering system, into one team and one loop. Compress the latency from “we measured the model failing in scenario X” to “we collected data targeting scenario X” from weeks to days. That latency is your iteration speed; and iteration speed is the next battlefield.
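
A back-of-the-envelope calculation shows why that latency dominates. The sketch below counts how many complete loop turns fit in a fixed window; the 90-day window, the 4 days of training plus evaluation, and the 21-day versus 3-day handoff are assumed numbers for illustration, not measurements from any team.

```python
def loop_turns(horizon_days: float, train_eval_days: float, handoff_days: float) -> int:
    """Number of complete collect-train-evaluate turns that fit in the horizon."""
    return int(horizon_days // (train_eval_days + handoff_days))


# Hypothetical numbers: a 90-day window, 4 days to train and evaluate per turn.
split_teams = loop_turns(90, train_eval_days=4, handoff_days=21)  # weeks per handoff
one_loop = loop_turns(90, train_eval_days=4, handoff_days=3)      # days per handoff

print(f"split teams: {split_teams} turns, integrated loop: {one_loop} turns")
```

Under these assumptions the integrated team gets four times as many measurements of the data-to-performance map in the same window, without training any faster.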

Iteration speed is the battlefield

Data will keep growing, and its quality will keep getting more uneven. Judging “which data is useful, which isn’t” by algorithmic intuition won’t work, because the link from data to model behavior lives on a manifold intuition can’t trace; it can only be measured by iterating.

As both data scale and noise rise, whoever can complete the collect → train → real-robot validate → targeted re-collect loop fastest crosses that starless sky fastest. The competition in robot learning is shifting from “whose algorithm is cleverer” to “whose iteration loop is shorter.” The former still matters; the latter decides the endgame.

The wind tunnel can’t hand you a map with the destination written on it. What it hands you is an instrument that, in a place with no map, measures the direction one step at a time. In robot learning — a field with no north star — that is probably the closest thing to truth we get to have.


No north star, so we build the wind tunnel.

References

The frontier teams whose published work runs exactly this train → real-eval → iterate loop, at scale:

Real-robot footage embedded above: Mobile ALOHA (Stanford).
