
Generative Modeling via Drifting: One-Step Generation Through Training-Time Evolution

Traditional generative models like diffusion models achieve impressive results but require many iterative steps during inference. What if we could move this iterative process from inference time to training time? This is exactly what “Generative Modeling via Drifting” by Deng et al. proposes: an elegant framework that enables single-step generation by evolving distributions during training.

The Core Idea

Instead of learning to reverse a diffusion process step-by-step at inference time, this paper proposes learning a drift field that tells generated samples how to move to match the data distribution. The key insight: we can iterate this drifting process during training, so that at inference time, a single forward pass through the network produces samples from the target distribution.

The Drift Field Framework

The paper introduces a drift field $V_{p,q}(x)$ that describes how a sample $x$ from distribution $q$ should move to better match distribution $p$. The critical property is antisymmetry:

\[V_{p,q}(x) = -V_{q,p}(x)\]

This ensures that when $q = p$ (distributions match), we have $V_{p,q}(x) = 0$ (no drift needed).

The Drift Field Design

The drift field uses a kernel-based formulation combining attraction and repulsion:

\[V_{p,q}(x) = \frac{1}{Z_p Z_q} \mathbb{E}_{y^+ \sim p, y^- \sim q}\left[k(x, y^+) k(x, y^-) (y^+ - y^-)\right]\]

where:

  • $y^+$ are positive samples from the data distribution $p$ (attract)
  • $y^-$ are negative samples from the generated distribution $q$ (repel)
  • $k(x, y)$ is a kernel function measuring similarity
  • $Z_p, Z_q$ are normalization constants
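
To make this concrete, here is a minimal PyTorch sketch of the kernel-based drift estimate. It assumes an RBF kernel and simple per-sample kernel-sum normalization in place of $Z_p Z_q$; the names (`rbf_kernel`, `drift_field`) and these choices are mine for illustration, not the paper's reference implementation.

```python
# Minimal sketch of the kernel-based drift field (illustrative, not paper code).
import torch

def rbf_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), computed pairwise."""
    sq_dist = torch.cdist(x, y) ** 2                  # (N, M)
    return torch.exp(-sq_dist / (2 * bandwidth ** 2))

def drift_field(x, y_pos, y_neg, bandwidth=1.0):
    """Estimate V_{p,q}(x): attraction toward data samples y_pos, repulsion
    from generated samples y_neg, weighted by kernel similarity to x."""
    k_pos = rbf_kernel(x, y_pos, bandwidth)           # (N, P)
    k_neg = rbf_kernel(x, y_neg, bandwidth)           # (N, Q)
    # Per-sample kernel sums stand in for the 1 / (Z_p Z_q) normalization.
    z_pos = k_pos.sum(dim=1, keepdim=True) + 1e-8
    z_neg = k_neg.sum(dim=1, keepdim=True) + 1e-8
    w = (k_pos / z_pos).unsqueeze(2) * (k_neg / z_neg).unsqueeze(1)  # (N, P, Q)
    diff = y_pos.unsqueeze(1) - y_neg.unsqueeze(0)                   # (P, Q, D)
    return torch.einsum('npq,pqd->nd', w, diff)                      # (N, D)
```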

Training Objective

The training loss encourages the network output to follow the drift field:

\[\mathcal{L} = \mathbb{E}_{\epsilon}\left[\|f_\theta(\epsilon) - \text{stopgrad}(f_\theta(\epsilon) + V_{p,q}(f_\theta(\epsilon)))\|^2\right]\]

where:

  • $\epsilon \sim \mathcal{N}(0, I)$ is the input noise
  • $f_\theta$ is the generator network
  • $\text{stopgrad}$ prevents gradients from flowing through the drift computation

The key insight: the drift field provides a moving target that guides generated samples toward the data distribution. As training progresses, $q$ approaches $p$, and the drift diminishes.
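
Here is a rough sketch of a single training step under this objective, reusing the hypothetical `drift_field` helper from above. In PyTorch, $\text{stopgrad}$ corresponds to `.detach()`; device handling and batching are simplified.

```python
# Hypothetical training step (assumes the `drift_field` sketch above and a
# generator network `f_theta`); stopgrad becomes `.detach()` in PyTorch.
import torch

def training_step(f_theta, optimizer, data_batch, noise_dim, bandwidth=1.0):
    eps = torch.randn(data_batch.size(0), noise_dim)   # epsilon ~ N(0, I)
    x_gen = f_theta(eps)                               # samples from current q

    # Drift toward the data batch (positives) and away from the generated
    # batch itself (negatives drawn from q).
    with torch.no_grad():
        v = drift_field(x_gen, data_batch, x_gen, bandwidth)

    # Moving target: the sample plus its drift, with gradients blocked.
    target = (x_gen + v).detach()
    loss = ((x_gen - target) ** 2).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```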

Feature Space Drifting

A crucial innovation is computing the drift in feature space rather than pixel space. Using a pretrained encoder $\phi$ (like MAE or MoCo):

\[V_{p,q}^{\text{feat}}(x) = \mathbb{E}\left[k(\phi(x), \phi(y^+)) k(\phi(x), \phi(y^-)) (\phi(y^+) - \phi(y^-))\right]\]

This provides semantically meaningful similarity measures, enabling the model to learn high-level structure rather than just pixel-level details.
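
A brief sketch of this variant, assuming a frozen pretrained encoder `phi` that maps images to feature vectors and reusing the `drift_field` helper from earlier; under this reading, the matching loss is then applied to encoder features rather than pixels.

```python
# Sketch of feature-space drifting. Assumes `phi` is a frozen pretrained
# encoder (e.g. MAE or MoCo) returning feature vectors, and reuses the
# `drift_field` helper defined earlier.
import torch

@torch.no_grad()
def feature_drift(x_gen, y_pos, y_neg, phi, bandwidth=1.0):
    """Compute the drift between encoder features instead of raw pixels."""
    f_x = phi(x_gen)      # (N, F) features of generated samples
    f_pos = phi(y_pos)    # (P, F) features of real data samples
    f_neg = phi(y_neg)    # (Q, F) features of generated (negative) samples
    return drift_field(f_x, f_pos, f_neg, bandwidth)
```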

Flow Matching vs Drifting: A Visual Comparison

To understand what makes drifting special, let’s compare it directly with Flow Matching—another popular approach for generative modeling.

Flow Matching learns a velocity field $v_t(x)$ that transports samples from noise to data. At inference time, you must integrate this field over multiple steps:

\[x_{t+\Delta t} = x_t + v_t(x_t) \cdot \Delta t\]

Drifting instead evolves the generator during training. At inference, the network directly outputs samples—no integration needed.
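
The contrast is easiest to see at inference time. In the illustrative sketch below, `v_theta` is a time-conditioned velocity network and `f_theta` is the drifting generator; both names are placeholders.

```python
# Illustrative inference-time contrast (placeholder networks, not paper code).
import torch

def sample_flow_matching(v_theta, noise, num_steps=50):
    """Euler integration of a learned velocity field v_theta(x, t)."""
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.size(0),), i * dt)
        x = x + v_theta(x, t) * dt
    return x

def sample_drifting(f_theta, noise):
    """Single forward pass: the iteration was spent during training."""
    return f_theta(noise)
```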

The key difference is where the iteration happens:

| Aspect | Flow Matching | Drifting |
| --- | --- | --- |
| Training | Learn velocity field $v_t(x)$ | Iteratively drift generator |
| Inference | Integrate ODE (20-50 steps) | Single forward pass |
| Learned function | $v_\theta(x, t)$ (time-dependent) | $f_\theta(\epsilon)$ (direct mapping) |
| Sample quality | High (with enough steps) | High (after enough training) |

Training Ground Truth: How Each Method Computes Supervision

The core difference lies in how training supervision is computed.

Key insight: Flow Matching computes a ground-truth velocity for each (noise, data) pair and trains across all timesteps $t \in [0, 1]$. Drifting computes a ground-truth drift from the current generated distribution versus the data distribution: no timestep is needed, but training must be iterated so that the generator evolves.

Single-Step vs Multi-Step Generation

A major advantage of this approach: inference requires only a single forward pass. Compare this to diffusion models that need 20-1000 steps:

| Method | Inference Steps | ImageNet 256×256 FID |
| --- | --- | --- |
| Diffusion (DDPM) | 1000 | ~3-4 |
| Diffusion (DDIM) | 50-100 | ~4-5 |
| Flow Matching | 20-50 | ~2-3 |
| Consistency Models | 1-2 | ~3-4 |
| Drifting (this paper) | 1 | 1.54 |

The paper achieves state-of-the-art single-step generation, competitive with multi-step methods.

Classifier-Free Guidance

The framework naturally supports conditional generation. The key idea: use class-specific positive samples while treating both generated samples AND other-class samples as negatives.

\[V_{p,q}^{c}(x) = \mathbb{E}\left[k(\phi(x), \phi(y^+_c)) k(\phi(x), \phi(y^-)) (\phi(y^+_c) - \phi(y^-))\right]\]

where:

  • $y^+_c$ are positive samples from target class $c$ (attract toward this class)
  • $y^-$ includes generated samples + samples from other classes (repel)

This creates a class-conditional drift that pulls samples toward the target class while pushing away from others.
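
A small sketch of how the positive and negative sets might be assembled, assuming samples have already been mapped to feature space and reusing the `drift_field` helper; the dictionary layout and helper name are hypothetical.

```python
# Hypothetical assembly of class-conditional positives/negatives in feature
# space, reusing the `drift_field` helper from earlier.
import torch

def conditional_drift(x_feat, data_feat_by_class, gen_feat, target_class,
                      bandwidth=1.0):
    """Attract toward features of the target class; repel from generated
    samples and from data features of all other classes."""
    y_pos = data_feat_by_class[target_class]                        # (P, F)
    others = [f for c, f in data_feat_by_class.items() if c != target_class]
    y_neg = torch.cat([gen_feat] + others, dim=0)                   # (Q, F)
    return drift_field(x_feat, y_pos, y_neg, bandwidth)
```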

Key Results

The paper demonstrates impressive results across multiple domains:

Image Generation

  • ImageNet 256×256 (Latent): FID 1.54 (single step)
  • ImageNet 256×256 (Pixel): FID 1.61 (single step)
  • Outperforms all previous single-step methods
  • Competitive with 50-step diffusion models

Robotics

  • Applied to robot manipulation policies
  • Achieves a success rate comparable to 100-step diffusion policies
  • Enables real-time robot control

Why Does It Work?

The elegance of this approach lies in three key insights:

  1. Training absorbs iteration: By iterating the drift during training, the network learns to directly map noise to data in one step.

  2. Antisymmetric drift ensures convergence: The mathematical structure guarantees that when $q = p$, the drift vanishes.

  3. Feature-space similarity: Using pretrained encoders provides semantically meaningful gradients that guide high-level structure learning.

Limitations and Open Questions

  • Requires a high-quality feature encoder (fails without it on ImageNet)
  • A theoretical guarantee that $V \approx 0 \Rightarrow q \approx p$ has not been established
  • Many design choices (kernel functions, architectures) may not be optimal

Conclusion

“Generative Modeling via Drifting” presents a compelling new paradigm for generative modeling. By moving the iterative distribution evolution from inference to training time, it achieves single-step generation with state-of-the-art quality. This has significant implications for real-time applications like robotics, interactive content creation, and resource-constrained deployment.

The key takeaway: iteration is a resource that can be spent during training instead of inference. This perspective opens new directions for efficient generative modeling.


This post features interactive visualizations built with React. Based on the paper “Generative Modeling via Drifting” by Deng et al.