Ran Cheng

Robotics, Vision, Learning.

Language-Conditioned Value Functions for Robot Policy: A Modular Approach with V-JEPA 2-AC and Flow Matching

Recent advances in vision foundation models like V-JEPA 2-AC have demonstrated remarkable capabilities in learning latent world models for robot manipulation. While these models typically rely on image goals for planning, this post explores an alternative architecture: using language-conditioned value functions to guide action selection, combined with Flow Matching for action chunk generation.

The key insight is simple but powerful: instead of forcing language instructions into the latent representation space (which requires expensive alignment training), we can let language define what futures are desirable through a value function, while the world model defines what futures are possible.

Table of Contents

  1. Problem Formulation
  2. From Energy Planning to Language Value
  3. Flow Matching for Action Chunks
  4. Learning the Language Value Function
  5. Complete Algorithm
  6. Experimental Design

1. Problem Formulation and Notation

Consider a robot manipulation setup with:

  • Observations: Image frames $x_t$, proprioceptive state $s_t$ (end-effector pose, gripper state)
  • Actions: Continuous control $a_t \in \mathbb{R}^m$ (e.g., delta position/orientation)
  • Language instruction: Natural language command $c$ (e.g., “pick up the red block and place it in the blue bowl”)

V-JEPA 2-AC provides two critical components:

  1. Encoder $E$ (frozen): Maps images to latent representations $z_t = E(x_t)$
  2. Action-conditioned predictor $P_\phi$: Predicts future latents given actions
\[\hat{z}_{t+1} = P_\phi(\hat{z}_{\leq t}, s_{\leq t}, a_{\leq t})\]

The predictor is trained with both teacher forcing (using ground-truth latents) and rollout loss (using predicted latents), which is crucial for reducing error accumulation during multi-step imagination.

We consider action chunks (macro-actions) of horizon $H$:

\[u_t := (a_t, a_{t+1}, \ldots, a_{t+H-1}) \in \mathbb{R}^{H \times m}\]
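
To pin down the notation, here is a toy sketch of the shapes involved. The ToyPredictor below is an illustrative stand-in, not V-JEPA 2-AC's actual predictor (which conditions on the full latent and proprioceptive history), and the dimensions are made up.

import torch
import torch.nn as nn

# Illustrative dimensions only (not taken from the paper):
# latent dim D_Z, proprio dim D_S, action dim M, chunk horizon H
D_Z, D_S, M, H = 256, 7, 7, 10

class ToyPredictor(nn.Module):
    """Simplified stand-in for P_phi: one latent step per action, current latent only."""
    def __init__(self):
        super().__init__()
        self.step = nn.Linear(D_Z + D_S + M, D_Z)

    def rollout(self, z_t, s_t, u):
        """z_t: (D_Z,), s_t: (D_S,), u: (H, M) -> predicted latents (H, D_Z)."""
        z, outs = z_t, []
        for a in u:                              # autoregressive latent rollout
            z = self.step(torch.cat([z, s_t, a]))
            outs.append(z)
        return torch.stack(outs)

u_t = torch.randn(H, M)                          # an action chunk u_t = (a_t, ..., a_{t+H-1})
z_hat = ToyPredictor().rollout(torch.randn(D_Z), torch.randn(D_S), u_t)
print(z_hat.shape)                               # torch.Size([10, 256])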

2. From Energy Planning to Language Value

2.1 V-JEPA 2-AC’s Original Planning Mechanism

The original V-JEPA 2-AC uses goal images for planning. Given a goal image $x_g$, the planner encodes it as $z_g = E(x_g)$ and optimizes action sequences by minimizing an energy function:

\[E(\hat{a}_{1:T}; z_k, s_k, z_g) := \|P_\phi(\hat{a}_{1:T}; s_k, z_k) - z_g\|_1\]

This is solved using Cross-Entropy Method (CEM): iteratively sampling action sequences from a Gaussian, evaluating their energies, and refitting the Gaussian to the best samples.
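
For reference, a minimal CEM sketch over action sequences might look like the following. The population size, elite count, and the energy callable are illustrative choices, not the actual V-JEPA 2-AC planner settings.

import torch

def cem_plan(energy, T, m, iters=6, pop=256, elites=32):
    """Minimize energy(actions) over action sequences of shape (T, m).

    energy: callable mapping a (T, m) tensor to a scalar cost, e.g. the L1
    distance between the predicted final latent and the goal latent.
    """
    mu, sigma = torch.zeros(T, m), torch.ones(T, m)
    for _ in range(iters):
        samples = mu + sigma * torch.randn(pop, T, m)           # sample candidates
        costs = torch.stack([energy(a) for a in samples])       # evaluate energies
        elite = samples[costs.topk(elites, largest=False).indices]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6           # refit the Gaussian
    return mu                                                    # planned action sequence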

2.2 The Limitation: Requiring Goal Images

While effective, this approach requires:

  1. A goal image that precisely specifies the desired outcome
  2. The goal image to be meaningful in the same latent space as observations

For language-conditioned policies, this would require language-to-latent alignment — mapping language to $z_g$ in V-JEPA’s space. This is challenging because V-JEPA’s latent space wasn’t trained to be language-aligned.

2.3 Our Alternative: Language-Conditioned Value Functions

Instead of mapping language to latent goals, we propose using language to define a value function that scores futures:

Terminal Success (Simplest): \(J_\psi(u_t; z_t, s_t, c) := -V_\psi(\hat{z}_{t+H}, c)\)

Trajectory Scoring (More Robust): \(J_\psi(u_t; z_t, s_t, c) := -\sum_{i=1}^{H} \gamma^{i-1} r_\psi(\hat{z}_{t+i}, c) - \gamma^H V_\psi(\hat{z}_{t+H}, c)\)

where $\hat{z}_{t+i}$ comes from world model rollout, and $V_\psi$ evaluates how well the predicted future aligns with the language instruction.

This reformulation has a crucial property: the world model is instruction-agnostic. It generates all physically plausible futures; the language value function then selects among them. This is more modular and doesn’t require retraining the world model for new language.
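
As a concrete sketch, the trajectory-scoring variant can be computed directly from the predicted latents of a rollout. Here reward_model and value_model stand for $r_\psi$ and $V_\psi$; their call signatures are assumptions.

import torch

def chunk_score(z_hat, c_emb, reward_model, value_model, gamma=0.99):
    """Score one action chunk from its predicted latents z_hat: (H, D_Z).

    Returns sum_i gamma^(i-1) * r_psi(z_hat[i], c) + gamma^H * V_psi(z_hat[-1], c),
    i.e. the negative of the planning objective J_psi. Higher is better.
    """
    H = z_hat.shape[0]
    discounts = gamma ** torch.arange(H, dtype=z_hat.dtype)
    rewards = torch.stack([reward_model(z, c_emb) for z in z_hat])   # (H,)
    return (discounts * rewards).sum() + gamma ** H * value_model(z_hat[-1], c_emb)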


3. Flow Matching for Action Chunk Generation

3.1 Why Flow Matching?

In the original V-JEPA 2-AC, CEM generates action candidates by sampling from a Gaussian and iteratively refining it. This works but has limitations:

  • Poor coverage: Gaussian sampling may miss multimodal action distributions
  • Expensive iteration: CEM needs multiple rounds of sampling and evaluation
  • No learning: The proposal distribution doesn’t learn from data

Flow Matching offers an alternative: learn a generative model of “physically plausible action chunks” from demonstration data.

3.2 Task-Agnostic Action Chunk Prior

We train a task-agnostic Flow Matching model that learns the distribution of action chunks conditioned on the current state:

\[u_t \sim p_\theta(u \mid z_t, s_t)\]

Critically, language is not a condition for the generator. This keeps the generator general-purpose.

The flow is defined by a velocity field $v_\theta$ that transforms noise to actions:

\[\frac{du(\tau)}{d\tau} = v_\theta(u(\tau), \tau \mid z_t, s_t)\]

Starting from $u(0) \sim \mathcal{N}(0, I)$ and integrating to $\tau = 1$ gives samples from the learned action distribution.
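
A minimal sketch of both training and sampling for such a generator, assuming a velocity network v_theta(u, tau, z_t, s_t) that returns a tensor shaped like u. The linear interpolation path (with target velocity u_data - noise) is the standard conditional flow-matching choice, not something fixed by this post.

import torch

def fm_training_loss(v_theta, u_data, z_t, s_t):
    """Conditional flow-matching loss with the linear path
    u(tau) = (1 - tau) * noise + tau * u_data, whose target velocity is (u_data - noise).
    u_data: batch of demonstrated action chunks, shape (B, H, m)."""
    noise = torch.randn_like(u_data)
    tau = torch.rand(u_data.shape[0], 1, 1)                    # one tau per batch element
    u_tau = (1 - tau) * noise + tau * u_data
    target_v = u_data - noise
    return ((v_theta(u_tau, tau, z_t, s_t) - target_v) ** 2).mean()

def fm_sample(v_theta, z_t, s_t, H, m, steps=10):
    """Draw one action chunk by Euler-integrating du/dtau = v_theta from tau=0 to 1."""
    u = torch.randn(1, H, m)                                   # u(0) ~ N(0, I)
    for i in range(steps):
        tau = torch.full((1, 1, 1), i / steps)
        u = u + v_theta(u, tau, z_t, s_t) / steps              # Euler step, d_tau = 1/steps
    return u.squeeze(0)                                        # (H, m) action chunk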

3.3 Two Ways to Use Language Value with Flow Matching

Method A: Sample-then-Select (Simpler, matches V-JEPA 2-AC structure)

  1. Sample $K$ candidate chunks: $u^{(k)} \sim p_\theta(\cdot \mid z_t, s_t)$
  2. Roll out each candidate through the world model: $\hat{z}^{(k)}_{t+H} = P_\phi(u^{(k)}; z_t, s_t)$
  3. Select best: $k^* = \arg\max_k V_\psi(\hat{z}^{(k)}_{t+H}, c)$
  4. Execute first $h$ steps, repeat (MPC)

Method B: Guided Sampling (More efficient, uses gradients)

Define the posterior with an “optimality” variable $O$:

\[p(u \mid z_t, s_t, c, O=1) \propto p_\theta(u \mid z_t, s_t) \cdot \exp\big(\beta \cdot Q_\psi(z_t, s_t, u, c)\big)\]

This can be implemented by adding gradient guidance during sampling:

\[u(\tau + d\tau) = u(\tau) + v_\theta(\cdot) \cdot d\tau + \alpha \cdot \nabla_u Q_\psi(\cdot) \cdot d\tau\]
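
A sketch of what that guided Euler step could look like in practice. Here q_value is an assumed callable that rolls the chunk through the world model and scores the result with $V_\psi$; the guidance relies on the rollout being differentiable with respect to the actions.

import torch

def guided_fm_sample(v_theta, q_value, z_t, s_t, H, m, steps=10, alpha=0.1):
    """Flow-matching sampling with gradient guidance toward high language value.

    q_value(u, z_t, s_t) should return a scalar score, e.g. V_psi of the predicted
    terminal latent after rolling u through the world model (assumed differentiable in u).
    """
    u = torch.randn(1, H, m)
    for i in range(steps):
        tau = torch.full((1, 1, 1), i / steps)
        with torch.enable_grad():
            u_req = u.detach().requires_grad_(True)
            grad = torch.autograd.grad(q_value(u_req, z_t, s_t), u_req)[0]
        u = u + (v_theta(u, tau, z_t, s_t) + alpha * grad) / steps   # guided Euler step
    return u.squeeze(0)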

4. Learning the Language Value Function

The language value function $V_\psi(z, c)$ maps latent states and language instructions to scalar scores. Unlike language-to-latent alignment, this doesn’t require the language to produce a point in JEPA’s representation space — it only needs to define a preference ordering over states.

4.1 Binary Success Classification (Minimal Approach)

Given trajectories $(\tau, c, y)$ where $y \in \{0, 1\}$ indicates task success:

\[\mathcal{L}_V(\psi) = \mathbb{E}\left[\text{BCE}(V_\psi(z_t, c), y)\right]\]

The model learns to predict: “Given this visual state and this instruction, will the task succeed?”
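
One possible instantiation, with a hypothetical architecture: concatenate the frozen visual latent with a text embedding from a separate sentence encoder and train a small MLP head with binary cross-entropy. The dimensions and module names below are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SuccessValue(nn.Module):
    """V_psi(z, c): scores how likely instruction c succeeds from latent state z.

    Hypothetical architecture: visual latent concatenated with a text embedding
    (e.g. from a frozen sentence encoder), followed by a small MLP head."""
    def __init__(self, d_z=256, d_text=384, d_hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_z + d_text, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, z, c_emb):
        return self.net(torch.cat([z, c_emb], dim=-1)).squeeze(-1)   # success logit

def success_loss(model, z, c_emb, y):
    """Binary cross-entropy on (state, instruction, success-label) triples."""
    return F.binary_cross_entropy_with_logits(model(z, c_emb), y.float())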

4.2 Preference Learning (Stronger Approach)

Given paired trajectories or chunks with preferences $(u^+, u^-)$ under instruction $c$:

\[\mathcal{L}_{\text{rank}}(\psi) = -\log \sigma\left(Q_\psi(z, s, u^+, c) - Q_\psi(z, s, u^-, c)\right)\]

Preferences can come from:

  • Human annotations
  • Automatic success/failure signals
  • Reward model outputs
  • Heuristic rules (e.g., closer to target object is better)
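
A minimal sketch of the ranking loss above; the q_model interface is an assumption.

import torch.nn.functional as F

def preference_loss(q_model, z, s, u_pos, u_neg, c_emb):
    """Bradley-Terry ranking loss: prefer chunk u_pos over u_neg under instruction c.

    q_model(z, s, u, c_emb) is an assumed interface returning a scalar score per example.
    """
    margin = q_model(z, s, u_pos, c_emb) - q_model(z, s, u_neg, c_emb)
    return -F.logsigmoid(margin).mean()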

4.3 TD Learning with World Model (Optional)

For longer-horizon reasoning, we can use the world model to bootstrap value estimates:

\[Q_\psi(z, s, u, c) \approx \sum_{i=1}^{H} \gamma^{i-1} r_\psi(\hat{z}_{t+i}, c) + \gamma^H V_\psi(\hat{z}_{t+H}, c)\]

where $r_\psi(\hat{z}, c)$ is a learned reward model.

Caution: Model-based TD can suffer from value hacking (exploiting model errors). Mitigations include:

  • Conservative Q-learning (penalize out-of-distribution actions)
  • Ensemble disagreement as an uncertainty penalty (sketched after this list)
  • Short rollout horizons
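
As one example, the ensemble-disagreement mitigation can be sketched by scoring with the ensemble mean minus a disagreement penalty; value_models and the penalty weight kappa are illustrative.

import torch

def pessimistic_score(value_models, z_terminal, c_emb, kappa=1.0):
    """Mean score across an ensemble of independently trained value heads,
    penalized by their standard deviation (disagreement) as an uncertainty proxy."""
    scores = torch.stack([V(z_terminal, c_emb) for V in value_models])
    return scores.mean() - kappa * scores.std()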

5. Complete Algorithm

Putting it all together, here’s the full pipeline:

Pseudocode

# Components (pre-trained)
E = VJEPAEncoder()           # Frozen visual encoder
P = ActionConditionedPredictor()  # V-JEPA 2-AC world model
π_prior = FlowMatchingPolicy()    # Task-agnostic action chunk generator
V = LanguageValueFunction()       # Language-conditioned value

def policy(x_t, s_t, instruction_c, K=64, H=10, h=3):
    """MPC with language value guidance.

    K: number of candidate action chunks
    H: world-model rollout horizon
    h: number of actions executed before replanning
    """

    # 1. Encode current observation
    z_t = E(x_t)

    # 2. Sample K action chunk candidates
    candidates = [π_prior.sample(z_t, s_t) for _ in range(K)]

    # 3. Rollout each candidate through world model
    predicted_futures = [P.rollout(z_t, s_t, u, horizon=H) for u in candidates]

    # 4. Score with language value function
    scores = [V(z_future[-1], instruction_c) for z_future in predicted_futures]

    # 5. Select best candidate
    best_idx = max(range(K), key=lambda k: scores[k])
    best_chunk = candidates[best_idx]

    # 6. Execute first h actions
    return best_chunk[:h]
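
For completeness, a hypothetical closed-loop driver around policy might look like this; the env interface (reset/step returning the image, proprioception, and a done flag) is an assumption.

def run_episode(env, instruction_c, max_steps=200):
    """Receding-horizon control: plan, execute h actions, then replan."""
    x_t, s_t = env.reset()                        # assumed env interface
    for _ in range(max_steps):
        chunk = policy(x_t, s_t, instruction_c)   # best chunk's first h actions
        for a in chunk:
            x_t, s_t, done = env.step(a)          # execute one low-level action
            if done:
                return True
    return False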

6. Experimental Design

6.1 Research Questions

  1. RQ1: Can language value functions effectively guide policy without language-latent alignment?
  2. RQ2: Does Flow Matching provide better action candidates than CEM under equal compute budget?
  3. RQ3: How does world model rollout error affect performance? Does teacher forcing + rollout loss help?

6.2 Environments

Simulation (fast iteration):

  • RLBench / ManiSkill tasks: pick, place, stack, open drawer
  • Template + paraphrased language instructions

Real Robot (validation):

  • Single-arm manipulation with camera observation
  • Zero-shot transfer: new objects, layouts, lighting

6.3 Baselines

| Method | Description |
| --- | --- |
| B1: V-JEPA 2-AC (image goal) | Original CEM planning with goal images |
| B2: Language→Latent + CEM | Align language to JEPA latent, then CEM |
| B3: Language-conditioned BC/DP | End-to-end behavioral cloning / diffusion policy |
| Ours-A: FM + Select | Flow Matching candidates + value selection |
| Ours-B: FM + Guided | Flow Matching with gradient guidance |

6.4 Key Ablations

  • Action proposals: Flow Matching prior vs. CEM under a matched candidate budget (ties to RQ2)
  • Value target: terminal success only vs. discounted trajectory scoring (Section 2.3)
  • Guidance: sample-then-select (Method A) vs. gradient-guided sampling (Method B)
  • Planning parameters: number of candidates $K$, rollout horizon $H$, replanning interval $h$

6.5 Evaluation Metrics

  • Success Rate: Primary metric
  • Average Steps: Efficiency measure
  • Language Robustness: Performance on paraphrased/compositional instructions
  • Generalization: New objects, layouts, lighting conditions
  • Compute: Planning latency per step

Summary: Why This Architecture?

The key insight of this approach is separation of concerns:

| Component | Responsibility | Language-aware? |
| --- | --- | --- |
| V-JEPA Encoder | Visual understanding | No |
| World Model | Physical prediction | No |
| Flow Matching | Action diversity | No |
| Value Function | Task semantics | Yes |

Only the value function needs to understand language. This means:

  1. Easier training: Don’t need to align language to JEPA’s learned representation
  2. Better modularity: Swap components independently
  3. Reusable priors: Same world model and action generator for all tasks
  4. Interpretable: Can visualize what futures the value function prefers

The price we pay is needing to evaluate multiple candidates at inference time — but with a fast world model, this is tractable.


References

  • Meta AI. (2025). V-JEPA 2-AC: Learning Manipulation Policies from Latent World Models
  • Lipman et al. (2023). Flow Matching for Generative Modeling
  • Janner et al. (2022). Planning with Diffusion for Flexible Behavior Synthesis
  • Kim et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model

This post presents a theoretical framework. Implementation details may vary based on specific hardware and task requirements.