Representation Distribution Matching

One step from real.

Representation Distribution Matching for One-Step Visual Generation

We train a one-step image generator by matching generated and real feature distributions under frozen pretrained encoders. No online teacher, no adversary, no trajectory. Estimate the distance correctly and refuse to trust any single encoder, and a single network evaluation lands closer to real than any result reported to date.

Lan Feng1,  Wuyang Li1,  Éloi Zablocki2,  Matthieu Cord2,3,  Alexandre Alahi1
1 EPFL 2 Valeo.ai 3 Sorbonne Université
1.30 · SW r14 distance to real; real validation data scores 1.00. One-step state of the art.
63.6% · of samples preferred over real photographs by PickScore, a learned human-preference model.
90h · H200 GPU-hours to post-train four-step FLUX.2 into a single step.
Post-training · our 1-step vs the 4-step FLUX.2 teacher
[Plot: GenEval (0.5 to 0.8) over H200 GPU-hours (0 to 90) · keeper 0.826 · 4-step 0.794 · markers: keeper, untrained]
[Plot: PickScore (20 to 23) over H200 GPU-hours (0 to 90) · keeper 22.76 · 4-step 22.58 · markers: keeper, untrained]

Each dot is a checkpoint over 90 H200 GPU-hours; the dashed line is the four-step FLUX.2 teacher. Our one step clears it on GenEval within ~10 GPU-hours and on PickScore by ~30, reaching 0.826 and 22.76 at the keeper.

Each frame is one network evaluation  ·  1-step FLUX.2 after iRDM
One-step ImageNet · Distance to real

How close can a single step get?

Generative quality is a distance between distributions. We measure it with SW r14, a Sliced-Wasserstein distance averaged over fourteen frozen encoders, scaled so a fresh draw of real validation data scores 1.00. It shares no machinery with the training loss, so a low score cannot be gamed by matching the objective. Lower is closer. iRDM sits nearest the real line, below every released generator, including multi-step diffusion.
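
The scoring idea above can be sketched in a few lines. This is a minimal illustration, not the released evaluation code: the function names, the number of projections, the use of the 1-Wasserstein order, and the choice of a real training split as the reference set are all assumptions made for the example.

```python
import numpy as np

def sliced_wasserstein(a, b, n_proj=128, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between two equal-size feature sets."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(a.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-length directions
    # In 1-D, optimal transport between equal-size samples pairs the sorted values.
    pa = np.sort(a @ dirs, axis=0)
    pb = np.sort(b @ dirs, axis=0)
    return float(np.mean(np.abs(pa - pb)))

def normalized_score(gen_feats, ref_feats, val_feats):
    """Average over encoders of SW(generated, ref) / SW(real validation, ref).

    Each argument is a list with one (n_samples, dim) array per encoder
    (hypothetical layout chosen for this sketch).
    """
    ratios = [
        sliced_wasserstein(g, r) / sliced_wasserstein(v, r)
        for g, r, v in zip(gen_feats, ref_feats, val_feats)
    ]
    return float(np.mean(ratios))

# Toy stand-ins for encoder features: three "encoders", 512 samples each.
rng = np.random.default_rng(0)
ref = [rng.normal(size=(512, 16)) for _ in range(3)]
val = [rng.normal(size=(512, 16)) for _ in range(3)]
fake = [rng.normal(loc=0.5, size=(512, 16)) for _ in range(3)]

print(normalized_score(val, ref, val))   # real validation data: 1.0 by construction
print(normalized_score(fake, ref, val))  # a shifted distribution lands above the real line
```

Dividing by the real-versus-real distance is what puts every encoder on the same scale before averaging, so an encoder with large raw feature norms does not dominate the mean.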

Model                     SW r14
iRDM · ours · 1-NFE       1.30
pMF-H FD-SIM · 1-NFE      2.05
REPA-E SiT-XL             2.40
RAE-XL                    2.43
LightningDiT-XL           3.10
SiT-XL/2 + REPA           3.61
MAR-H                     3.87
SiT-XL/2                  4.27
Drifting-L · 1-NFE        5.93
And do humans agree?

A preference model we never train against.

PickScore is a learned human-preference proxy, and our objective never optimizes it. It prefers iRDM to every prior one-step generator, and for the first time to held-out real photographs.

63.6% · preferred over real photographs · first one-step model to pass
71.2% · preferred over pMF-H FD-SIM · the prior best one-step generator
75.7% · preferred over RAE-XL · a recent multi-step model
73.2% · preferred over REPA-E SiT-XL · a recent multi-step model
The method

Two axes fix every instance.

Every teacher-free distribution-matching generator is set by two choices, and prior methods fixed both at once. We vary one at a time. The first is how the distributions are compared. The second is which representations they are compared in. Getting each right is what closes the gap.

Axis 01 · Comparison

How the distributions are compared

An exact within-batch repulsion, paired with a Nyström attraction to a reference frozen once over the full data.

[Diagram labels: manifold · attraction · repulsion]
  • MMD · Estimated right. The classical MMD, once dismissed as too weak, becomes a strong objective with an exact within-batch repulsion and a Nyström attraction toward a frozen full-data reference.
  • BATCH · Large and fresh. The generated batch is the operative variable. Quality climbs to an optimum above 2048, an order of magnitude beyond common practice, with gradient caching absorbing the memory.
  • JOINT · Match the joint, not the marginal. On conditional tasks we match the joint image-text law, so prompt fidelity becomes part of the objective.
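
To make the comparison axis concrete, here is a hedged sketch of a squared-MMD loss split into the two terms named above. It assumes a Gaussian kernel, randomly chosen landmarks, and toy 2-D features; `fit_reference`, `mmd_loss`, and every parameter value are hypothetical names and settings for illustration, not the authors' implementation. The constant real-versus-real term is dropped because it does not depend on the generator.

```python
import numpy as np

def rbf(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_reference(real, landmarks, bandwidth=1.0, eps=1e-6):
    """Run once over the full real set: Nystrom feature map and mean real embedding."""
    k_mm = rbf(landmarks, landmarks, bandwidth)
    evals, evecs = np.linalg.eigh(k_mm + eps * np.eye(len(landmarks)))
    whiten = evecs / np.sqrt(evals)  # columns scaled so phi(x).phi(y) = k_xZ K_mm^-1 k_Zy
    mean_real = (rbf(real, landmarks, bandwidth) @ whiten).mean(axis=0)
    return landmarks, whiten, mean_real

def mmd_loss(gen, reference, bandwidth=1.0):
    """Squared MMD up to a constant: exact repulsion minus twice the Nystrom attraction."""
    landmarks, whiten, mean_real = reference
    n = len(gen)
    k_gg = rbf(gen, gen, bandwidth)
    repulsion = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))  # within batch, no self-pairs
    phi_gen = rbf(gen, landmarks, bandwidth) @ whiten
    attraction = (phi_gen @ mean_real).mean()                  # approximates E k(gen, real)
    return repulsion - 2.0 * attraction

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 2))                      # stand-in for full-data features
reference = fit_reference(real, real[rng.choice(2000, 64, replace=False)])

on_target = rng.normal(size=(256, 2))                  # batch drawn from the real law
off_target = rng.normal(loc=3.0, size=(256, 2))        # batch far from the real law
print(mmd_loss(on_target, reference), mmd_loss(off_target, reference))
```

Because the reference is reduced to one mean vector, the attraction costs the same per step however large the real set is, while the repulsion stays exact over the generated batch, which is where a large batch helps.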
Axis 02 · Representation

Which spaces they are compared in

Any single encoder can be gamed. A diverse battery, held in balance, cannot.

The fourteen-encoder panel
Inception, ConvNeXt, DINOv2*, MAE, SigLIP2, CLIP, DINOv3, SigLIP v1*, PE-Core, RADIO*, WebSSL, AIMv2, DreamSim, FLUX VAE*
* four encoders held out from training, a generalization check
  • GAME · One encoder is never enough. Matched alone, even DINOv2 is driven below the real score while samples stay visibly fake. The limitation is single-encoder matching itself, not the choice of encoder.
  • BALANCE · A battery under constrained optimization. A proportional Lagrangian controller upweights whichever encoder is hardest to satisfy and drops those already at their floor, so no space can be gamed.
Single-encoder gaming. Matching only DINOv2 drives its distance to the real floor yet improves quality unevenly: a lizard becomes photoreal while a typewriter keeps clear artifacts. A saturated single-encoder score does not imply realism.
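
The balancing rule for the encoder battery can be sketched as a purely proportional weighting. This is one plausible reading, assumed for illustration: each encoder's weight is proportional to how far its distance sits above its floor, and encoders at or below the floor get zero weight. The function name, the floors, and the example distances are hypothetical.

```python
import numpy as np

def encoder_weights(distances, floors):
    """Weights proportional to each encoder's constraint violation (distance above floor)."""
    violation = np.maximum(np.asarray(distances, float) - np.asarray(floors, float), 0.0)
    total = violation.sum()
    if total == 0.0:
        return np.zeros_like(violation)  # every encoder already sits at its floor
    return violation / total

# Four encoders with a common floor of 1.0 (the real-data score in normalized units).
distances = [1.8, 1.0, 1.2, 0.95]
weights = encoder_weights(distances, floors=[1.0, 1.0, 1.0, 1.0])
print(weights)  # the first encoder dominates; the second and fourth are dropped

# The training loss for the step would then be a weighted sum of per-encoder losses.
per_encoder_loss = np.array([0.40, 0.10, 0.15, 0.05])
print(float(weights @ per_encoder_loss))
```

Under this rule, pushing one encoder below its floor earns nothing, since its weight is already zero, so the optimizer's effort moves to whichever space still disagrees with real data.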

Text-to-image post-training

Four-step FLUX.2, in a single step.

The same recipe carries to text-to-image. With the joint image-text objective, we post-train the four-step FLUX.2 [klein] into a one-step model that surpasses the four-step teacher on both GenEval and PickScore, in 90 H200 GPU-hours.
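
A small experiment shows why matching the joint law matters. The sketch below assumes a product of Gaussian kernels on image and text features as the joint kernel; that choice, the function names, and the toy features are illustrative assumptions rather than the paper's formulation. A "generator" that returns real images but ignores the prompt matches the image marginal perfectly, and only the joint distance exposes it.

```python
import numpy as np

def rbf(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2(k_xx, k_xy, k_yy):
    """Biased (V-statistic) squared MMD from three kernel matrices."""
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())

def marginal_mmd2(gen_img, real_img):
    """Compare image features only, ignoring the prompts."""
    return mmd2(rbf(gen_img, gen_img), rbf(gen_img, real_img), rbf(real_img, real_img))

def joint_mmd2(gen_img, gen_txt, real_img, real_txt):
    """Product kernel on (image, text) pairs: similar only if both parts are similar."""
    k_xx = rbf(gen_img, gen_img) * rbf(gen_txt, gen_txt)
    k_xy = rbf(gen_img, real_img) * rbf(gen_txt, real_txt)
    k_yy = rbf(real_img, real_img) * rbf(real_txt, real_txt)
    return mmd2(k_xx, k_xy, k_yy)

rng = np.random.default_rng(0)
text = rng.normal(size=(256, 2))                       # toy prompt features
real_img = text + 0.1 * rng.normal(size=(256, 2))      # real images follow their prompts
gen_img = real_img[rng.permutation(256)]               # realistic images, wrong prompts

print(marginal_mmd2(gen_img, real_img))                # essentially zero: images look real
print(joint_mmd2(gen_img, text, real_img, text))       # clearly positive: pairing is broken
```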

Four-step FLUX.2 [klein] compared with one-step iRDM at matched quality, and GenEval and PickScore over post-training compute.
GenEval overall · 1-step iRDM vs 4-step base · 0.826 vs 0.794
PickScore · 1-step iRDM vs 4-step base · 22.76 vs 22.58
Joint vs marginal · GenEval overall · 0.826 vs 0.801
Compute · single run · 90h H200

Against a one-step DMD2 distillation of the same teacher, iRDM also leads: GenEval 0.826 vs 0.804, PickScore 22.76 vs 22.36.

Head to head

Our one step vs the four-step teacher.

Four-step FLUX.2 [klein] is the distillation target. We post-train it into a single step, then set them side by side on the same epic and complex prompts. The four-step teacher averages a PickScore of 23.08; our one-step student holds that quality at a single forward pass, and on a number of prompts scores higher.

Each pair: 4-step · teacher beside 1-step · ours.
  • Sci-fi worlds · Towering mech silhouette in rainy neon city, magenta and cyan glow, volumetric fog, moody cinematic backlight. · our 1-step · PS 24.26
  • Space & cosmos · Swirling crimson nebula with a bright newborn star, volumetric glow, deep cosmic blacks, breathtaking scale. · our 1-step · PS 23.89
  • Mythical creatures · A blazing phoenix rising from glowing embers, fiery orange wings spread against a dark smoldering sky. · our 1-step · PS 23.80
  • Underwater · Giant manta ray soaring through turquoise god rays, glowing surface above, deep indigo abyss, epic scale. · our 1-step · PS 23.41
  • Vehicles in motion · Steam train crossing a misty stone viaduct, billowing smoke, volumetric god rays, moody blue valley fog. · our 1-step · PS 22.52
  • Epic fantasy · A lone dragon glides over jagged misty peaks at dawn, golden god rays piercing the fog, vast silhouette. · our 1-step · PS 22.64
See all 48 prompt pairs
Live demo

Generate in one step, yourself.

The post-trained one-step FLUX.2 [klein] runs live on HuggingFace Spaces. Type a prompt and get an image in a single forward pass.


Citation

BibTeX

@article{feng2026rdm,
  title   = {Representation Distribution Matching for One-Step Visual Generation},
  author  = {Feng, Lan and Li, Wuyang and Zablocki, {\'E}loi and Cord, Matthieu and Alahi, Alexandre},
  journal = {arXiv preprint arXiv:2607.02375},
  year    = {2026}
}