Five diffusion papers worth reading today (May 25, 2026)

Monday's batch covers the weekend gap: arXiv bundles Saturday, Sunday, and Monday submissions into a single listing. This window's five papers fall into two clusters: sampling, decoding, and inference efficiency for image and video models (PiD, Precise, VDE, SANA-SR) and a structural upgrade to discrete diffusion language models (Relay). All five address a concrete bottleneck rather than claiming a new capability from scratch.

1. PiD: NVIDIA's pixel diffusion decoder reaches 2048×2048 in under one second

ArXiv: 2605.23902 | Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren | cs.CV | NVIDIA Research (Toronto SIL Lab)

Peer-review status: Preprint (submitted 2026-05-22). Code released at github.com/nv-tlabs/PiD. Project page: research.nvidia.com/labs/sil/projects/pid/.

Standard latent diffusion pipelines treat decoding and upsampling as two separate stages: a VAE decoder converts the latent to a base-resolution image, then a super-resolution module (often another diffusion model) upsamples to target resolution. Both stages are reconstruction-oriented — they try to invert their respective processes rather than synthesize new detail. At megapixel scales the compute cost of running both stages in sequence becomes a hard deployment barrier. 1

PiD collapses both stages into a single conditional pixel diffusion model. The key mechanism is a sigma-aware adapter: the noisy latent at the current diffusion timestep is injected directly into a pixel-space backbone, which then generates the full-resolution output conditioned on that partial latent. Because the adapter conditions on the sigma level, PiD can receive a latent that has only been partially denoised — the latent diffusion process can terminate early, handing off to PiD before reaching a fully denoised latent. This early-exit property is what produces the speed gain. 1
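
The early-exit handoff can be pictured with a toy noise schedule. The sketch below is illustrative only: the sigma values, the `sigma_handoff` threshold, and the function name `split_schedule` are all hypothetical, and nothing here states how PiD actually picks its handoff point. It only counts how many latent denoiser calls are avoided when the decoder accepts a partially denoised latent.

```python
from typing import List, Tuple


def split_schedule(sigmas: List[float], sigma_handoff: float) -> Tuple[List[float], float]:
    """Split a decreasing sigma schedule at a hypothetical handoff level.

    Returns the sigma levels at which the latent denoiser is still called,
    plus the sigma of the partially denoised latent handed to the decoder.
    """
    if any(a <= b for a, b in zip(sigmas, sigmas[1:])):
        raise ValueError("sigmas must be strictly decreasing")
    latent_levels = [s for s in sigmas if s > sigma_handoff]
    remaining = [s for s in sigmas if s <= sigma_handoff]
    if not remaining:
        raise ValueError("schedule never reaches sigma_handoff")
    # The first level at or below the threshold is where the latent sampler stops.
    return latent_levels, remaining[0]


# Toy 8-level schedule (assumed values, not taken from the paper).
sigmas = [1.0, 0.8, 0.6, 0.45, 0.3, 0.2, 0.1, 0.0]
latent_levels, handoff_sigma = split_schedule(sigmas, sigma_handoff=0.3)

full_calls = len(sigmas) - 1            # one denoiser call per transition
early_exit_calls = len(latent_levels)   # calls made before the handoff
skipped = full_calls - early_exit_calls

print(f"latent denoiser calls: {early_exit_calls} of {full_calls}")
print(f"handoff at sigma = {handoff_sigma}, calls skipped = {skipped}")
```

In this toy setting the latent sampler makes four of seven calls and the decoder receives the latent at sigma 0.3. Whether the end-to-end pipeline is faster then depends on the decoder's own cost, which the sketch does not model.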

PiD supports 4× to 8× upsampling and is compatible with both conventional VAE latent representations and semantic latents (SigLIP, DINOv2), making it applicable to RAE-family models. A DMD2 distillation step reduces PiD to four inference steps without separate training of the backbone. 1

Benchmark results: On an RTX 5090 consumer GPU, decoding a 512×512 latent to 2048×2048 pixels takes under one second with peak memory at 13 GB. On an NVIDIA GB200, the same operation completes in 210 ms. Both figures correspond to an approximately 6× speedup over cascaded diffusion super-resolution pipelines at comparable visual quality. 1

Code/resources: github.com/nv-tlabs/PiD

Why read it: The sigma-aware injection design is the transferable idea here. It provides a principled way to condition a pixel-space model on a partially denoised latent — a coupling that other decoder architectures have not formalized. The 6× wall-clock speedup also matters for any team running high-resolution generation in production: the paper provides specific GPU-model comparisons (RTX 5090, GB200) rather than abstract FLOP estimates.

2. Precise: SDE-consistent stochastic sampling cuts RL training time by 13–53%

ArXiv: 2605.23522 | Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong | cs.CV, cs.LG | Tencent Hunyuan + Peking University

Peer-review status: Preprint (submitted 2026-05-22). No code repository confirmed at time of writing.

Reinforcement learning post-training for flow matching models (such as GRPO-style reward optimization applied to FLUX or Hunyuan) relies on a stochastic sampler to generate rollouts during training. Two properties of that sampler turn out to matter more than the community has acknowledged. The first is exploration-stability balance: how much noise the sampler injects controls whether the model explores diverse image regions or stays close to its current optimum. The second is SDE discretization consistency: whether the sampler's discrete-time transitions are consistent with the continuous-time SDE the model was trained under. Standard samplers optimize neither property systematically. 2

Figure 1 from 2605.23522: design space for RL samplers on the exploration-stability and SDE-consistency axes, with a double-ring toy example; Precise targets the upper-right quadrant. 2

Precise's fix is to freeze the clean-latent posterior mean during stochastic sampling steps. In standard samplers, each denoising step re-estimates the posterior mean from scratch, introducing discretization noise that accumulates across the rollout. Freezing the posterior mean preserves the mean trajectory established at the first step of each rollout, so the noise added between steps is genuinely stochastic exploration rather than numerical drift. 2

Benchmark results: Under a 10-NFE (number of function evaluations per rollout) / 3,000-iteration protocol, Precise achieves PickScore 23.745, CLIPScore 1.038, and HPSv2.1 0.391, against PickScores of 22.428 for Dance-GRPO, 23.242 for Flow-GRPO, and 23.670 for CPS. The 30-NFE protocol gives PickScore 23.421. Precise matches the best in-domain performance of competing methods in 13.1–53.2% less wall-clock training time, with 14.2–22.1% fewer training iterations to convergence. 2

Why read it: Teams currently running GRPO or similar RL post-training on flow matching models can drop Precise in as a sampler replacement without architecture changes. The training cost reduction (up to 53%) is the primary practical signal — it means the reward optimization budget goes further for the same compute spend. The theoretical framing (why SDE consistency matters for reward convergence) is also worth reading if you're designing RL pipelines for other generative model families.

3. VDE: training-free rectified flow acceleration, 3.22× on Flux, CVPR 2026

ArXiv: 2605.23381 | Junwen Tan, Jinglin Liang, Hongyuan Chen, Shuangping Huang | cs.CV | South China University of Technology

Peer-review status: Accepted at CVPR 2026. No code repository confirmed at time of writing.

Caching methods for rectified flow models (TeaCache, EasyCache, PAB) reuse attention or velocity computations from earlier denoising steps when the values have changed little. The implicit assumption is that cached values remain a good approximation as the input state evolves. VDE's argument is that this assumption breaks down: cached velocity estimates are static while the input at each step keeps changing, so the mismatch grows as sampling progresses and accounts for the quality drop visible at high speedup ratios. 3

The alternative VDE proposes is a decompose-and-estimate paradigm. Rather than caching and reusing full velocity vectors, VDE decomposes the model velocity at each step into two components: a parallel component (αₜxₜ, aligned with the current state) and an orthogonal component (βₜ‖xₜ‖uₜ, capturing the directional deviation). Each component has distinct temporal dynamics — the parallel component follows the state trajectory predictably, while the orthogonal component carries the semantic direction information and is more stable across steps. VDE estimates each component separately using its respective predictability structure. 3
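
The decomposition itself is ordinary vector projection. The sketch below is a minimal illustration, assuming uₜ is a unit vector orthogonal to xₜ and that αₜ and βₜ are one scalar each per sample. The function name `decompose_velocity` and the random stand-in tensors are hypothetical, and the paper's estimators for the two components are not reproduced here.

```python
import numpy as np


def decompose_velocity(x: np.ndarray, v: np.ndarray, eps: float = 1e-12):
    """Split v into alpha * x + beta * ||x|| * u, with u a unit vector orthogonal to x."""
    x_flat, v_flat = x.ravel(), v.ravel()
    x_norm_sq = float(x_flat @ x_flat)
    alpha = float(v_flat @ x_flat) / max(x_norm_sq, eps)   # parallel coefficient
    residual = v_flat - alpha * x_flat                     # part of v orthogonal to x
    res_norm = float(np.linalg.norm(residual))
    beta = res_norm / max(np.sqrt(x_norm_sq), eps)         # orthogonal magnitude relative to ||x||
    u = residual / max(res_norm, eps)                      # unit direction of the orthogonal part
    return alpha, beta, u.reshape(x.shape)


rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 8, 8))   # stand-in for a latent state
v_t = rng.standard_normal((4, 8, 8))   # stand-in for a model velocity

alpha, beta, u = decompose_velocity(x_t, v_t)
reconstructed = alpha * x_t + beta * np.linalg.norm(x_t) * u

print("max reconstruction error:", np.abs(reconstructed - v_t).max())
print("u . x (should be close to 0):", float(u.ravel() @ x_t.ravel()))
```

Under these assumptions the decomposition is exact, so any error in a VDE-style scheme would come from estimating αₜ, βₜ, and uₜ at skipped steps rather than from the split itself.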

To prevent error accumulation, VDE periodically resets the estimate by running a full forward pass — effectively re-anchoring the decomposition. The interval between resets is adaptive: tighter where the velocity changes faster, looser where the trajectory is smooth. 3

Figure 1 from 2605.23381: side-by-side output comparison of VDE at reduced NFE vs. the standard 50-step reference on Flux, Qwen-Image, and Wan2.1. 3

Benchmark results: On Flux.1 [dev], VDE-fast achieves 3.22× speedup (reducing to 16 NFE from 50) with LPIPS 0.1997 and CLIP score 0.3109; VDE-slow gives 2.21× speedup with LPIPS 0.1243. On Qwen-Image, VDE-slow LPIPS reaches 0.0691 vs. EasyCache-fast's 0.1445, a 52.2% improvement. On Wan2.1 video generation, VDE-slow LPIPS hits 0.0554 vs. TeaCache's 0.1277. These gains are consistent across all three model families tested. 3
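
The relative figures above can be rechecked from the reported raw numbers. The helper name `relative_reduction` is an assumption for illustration, and the Wan2.1 percentage is derived here rather than quoted from the paper.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline` (lower LPIPS is better)."""
    return (baseline - value) / baseline


qwen = relative_reduction(baseline=0.1445, value=0.0691)   # EasyCache-fast vs. VDE-slow
wan = relative_reduction(baseline=0.1277, value=0.0554)    # TeaCache vs. VDE-slow
nfe_ratio = 50 / 16                                        # nominal step ratio for VDE-fast on Flux

print(f"Qwen-Image LPIPS reduction: {qwen:.1%}")
print(f"Wan2.1 LPIPS reduction:     {wan:.1%}")
print(f"Flux NFE ratio:             {nfe_ratio:.3f}x")
```

The Qwen-Image result reproduces the stated 52.2%, and the Wan2.1 numbers work out to roughly 56.6%. The plain 50/16 step ratio is 3.125×, slightly below the reported 3.22×, so the headline speedup is presumably measured some other way (for example as wall-clock latency); the summary here does not say.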

Why read it: The decompose-and-estimate framing explains why caching methods degrade at high speedup: they assume the full velocity is reusable when only specific components are. The same reframing applies to any caching-based acceleration scheme for flow models. The CVPR 2026 acceptance also signals that the comparison methodology was independently reviewed.

4. Relay: forward-thinking discrete diffusion cuts inference latency by 32%

ArXiv: 2605.22967 | Benjamin Rozonoyer*, Jacopo Minniti*, Dhruvesh Patel* (equal contribution), Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum | cs.LG, cs.CL | UMass Amherst + University of Toronto + Stanford + Imperial College London + Mila

Peer-review status: Preprint (submitted 2026-05-21). Code released at github.com/jacopo-minniti/relay.

Masked Diffusion Models (MDMs) generate text by iteratively unmasking token positions across multiple denoising rounds. Each round is treated as independent: the model receives the current partially masked sequence, predicts tokens for the masked positions, and discards its internal activations before the next round starts. Every round therefore recomputes representations from scratch for all token positions, including positions that were already processed in the previous round. The wasted computation is structural, not incidental. 4

Relay addresses this by introducing a per-token relay channel: a set of learnable latent vectors that are passed forward from one denoising round to the next alongside the token sequence. The relay channels carry the model's internal state across the round boundary, giving each new round a head start rather than forcing it to rebuild context from scratch. Training uses truncated backpropagation through time (truncated BPTT) across rounds, so the relay channels are trained to carry information that is genuinely useful for subsequent denoising rather than arbitrary activations. 4
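
The data flow of a state-carrying decoding loop can be sketched with a toy model. Everything below is an assumption for illustration: the `MASK` sentinel, the `denoise_round` and `generate` functions, the random untrained weights, and the confidence-based unmasking rule are hypothetical and are not taken from the Relay codebase. Training with truncated BPTT is not shown, since it needs an autograd framework.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position (hypothetical)


def denoise_round(tokens, relay, params):
    """One toy denoising round: returns per-position logits and the new relay state."""
    embed, pos, w_relay, w_out = params
    # Masked positions get a zero token embedding; unmasked ones look theirs up.
    token_feats = np.where((tokens == MASK)[:, None], 0.0, embed[np.clip(tokens, 0, None)])
    # The relay from the previous round enters here alongside the token features.
    hidden = np.tanh(token_feats + pos + relay @ w_relay)
    return hidden @ w_out, hidden


def generate(seq_len, vocab, dim, per_round, carry_relay, seed=0):
    """Unmask `per_round` positions per round, with or without carried state."""
    rng = np.random.default_rng(seed)
    params = (
        rng.standard_normal((vocab, dim)),                # token embeddings
        rng.standard_normal((seq_len, dim)),              # positional features
        rng.standard_normal((dim, dim)) / np.sqrt(dim),   # relay projection
        rng.standard_normal((dim, vocab)),                # output head
    )
    tokens = np.full(seq_len, MASK)
    relay = np.zeros((seq_len, dim))
    rounds = 0
    while (tokens == MASK).any():
        logits, new_relay = denoise_round(tokens, relay, params)
        # Rank masked positions by their top logit; unmasked ones are excluded.
        confidence = np.where(tokens == MASK, logits.max(axis=1), -np.inf)
        chosen = np.argsort(confidence)[::-1][:per_round]
        chosen = chosen[tokens[chosen] == MASK]
        tokens[chosen] = logits[chosen].argmax(axis=1)
        # Stateless decoding discards activations; the relay variant passes them on.
        relay = new_relay if carry_relay else np.zeros_like(relay)
        rounds += 1
    return tokens, rounds


with_relay, rounds = generate(seq_len=8, vocab=5, dim=4, per_round=2, carry_relay=True)
stateless, _ = generate(seq_len=8, vocab=5, dim=4, per_round=2, carry_relay=False)
print("rounds:", rounds)
print("with relay:", with_relay)
print("stateless: ", stateless)
```

With untrained weights and a fixed `per_round`, both variants take the same number of rounds, so the sketch shows only where the per-token state crosses the round boundary. Any reduction in rounds would have to come from training the relay, which is the part the paper contributes.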

Figure 1 from 2605.22967: method overview. The learnable relay module threads per-token latent channels across denoising rounds, allowing the model to build on prior internal state rather than restarting from the masked sequence. 4

The design is compatible with block diffusion (generating the sequence block by block, with diffusion-style denoising inside each block) and KV caching, so it stacks on top of existing MDM efficiency work. The paper validates design choices on Sudoku-Extreme (a controlled reasoning task) before scaling to coding benchmarks. 4

Benchmark results: On Fast-dLLM v2 (a 1.5B MDM fine-tuned for code generation), Relay reduces inference latency by up to 32% by cutting the number of function evaluations (NFE) from 178.1 to 130.7. On HumanEval Base, Relay matches the vanilla SFT baseline accuracy at 38.4% while using fewer NFE. On MBPP Plus, Relay reaches 39.7% accuracy vs. 38.1% for the vanilla baseline at the same NFE budget (133.0). 4
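
A quick check of the reported NFE counts helps separate the NFE saving from the latency saving. The variable names are illustrative; the input numbers are the ones quoted above.

```python
nfe_baseline = 178.1   # reported NFE without Relay
nfe_relay = 130.7      # reported NFE with Relay

nfe_reduction = (nfe_baseline - nfe_relay) / nfe_baseline
mbpp_gain_points = round(39.7 - 38.1, 1)   # MBPP Plus accuracy gain at matched NFE

print(f"NFE reduction: {nfe_reduction:.1%}")
print(f"MBPP Plus gain: {mbpp_gain_points} points")
```

The NFE counts alone correspond to a reduction of about 26.6%, so the "up to 32%" latency figure is a best case that presumably reflects a particular setting or measurement; the summary here does not specify which.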

Code/resources: github.com/jacopo-minniti/relay

Why read it: The latency gain comes from a structural change — fewer denoising rounds needed to reach the same output quality — rather than from quantization, distillation, or speculative decoding. For researchers working on MDMs for coding or reasoning tasks, Relay is a training-side addition (no architecture constraints on the backbone) with an open codebase to experiment with. The Sudoku validation also provides a clean interpretability window: you can measure how much relay improves round-by-round decision quality on a problem with known ground truth.

5. SANA-SR: one-step diffusion super-resolution in 0.019 seconds

ArXiv: 2605.23451 | Bingtian Qiao, Yue Shi, Yingjie Zhou, Yong Guo, Guangtao Zhai, Jiezhang Cao | cs.CV | Shanghai Jiao Tong University + Fuzhou University + Shanghai AI Laboratory

Peer-review status: Preprint (submitted 2026-05-22). No code repository confirmed at time of writing.

Real-world image super-resolution (Real-ISR) has converged toward diffusion-based methods because they can hallucinate perceptually realistic texture that reconstruction methods cannot. But diffusion-based SR pipelines carry two costs: the token count n grows with the square of the image side length while standard attention scales as O(n²) in that count, and multi-step inference multiplies the per-step cost by the number of denoising steps. Existing one-step SR methods (OSEDiff, AdcSR) address inference cost but still operate in standard latent spaces with quadratic attention, which limits their deployment on resource-constrained hardware. 5

SANA-SR attacks both costs simultaneously with three components. First, a 32× deep compression autoencoder reduces spatial token count by 32× compared to a 4× VAE — the image enters the DiT as far fewer tokens, shrinking every subsequent attention operation. Second, a linear attention DiT replaces the standard quadratic self-attention with a linear approximation, reducing attention cost from O(n²) to O(n). Third, LoRA fine-tuning (Low-Rank Adaptation) adapts a pretrained LinearDiT prior to the SR task without retraining from scratch, reusing the generative prior while keeping parameter overhead small. Structured pruning then trims the adapted model further. 5
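
The O(n²) to O(n) change comes from reordering a matrix product. The sketch below is a generic kernelized linear attention, assuming a ReLU feature map and a single head; the function names are hypothetical and this is not claimed to be SANA-SR's exact layer. It checks that the linear-cost ordering gives the same output as the explicit n×n form of the same kernel.

```python
import numpy as np


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with a ReLU feature map (an assumed choice).

    phi(K)^T V is computed once as a (d, d) summary, so cost grows as O(n * d^2).
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = phi_k.T @ v               # (d, d) summary shared by every query
    k_sum = phi_k.sum(axis=0)      # (d,) term for the normalizer
    return (phi_q @ kv) / (phi_q @ k_sum + eps)[:, None]


def linear_attention_quadratic_form(q, k, v, eps=1e-6):
    """Same function, written with the explicit (n, n) weight matrix: O(n^2 * d)."""
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    weights = phi_q @ phi_k.T      # (n, n), the matrix the linear form avoids
    return (weights @ v) / (weights.sum(axis=-1) + eps)[:, None]


rng = np.random.default_rng(0)
n, d = 256, 32                     # n tokens, d channels (toy sizes)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

fast = linear_attention(q, k, v)
slow = linear_attention_quadratic_form(q, k, v)
print("outputs match:", np.allclose(fast, slow))
print("output shape:", fast.shape)
```

Both functions compute the same kernel attention; only the order of multiplication differs. The result is not numerically equal to softmax attention, which is why a pretrained linear-attention prior matters more here than a drop-in swap would.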

Figure 1 from 2605.23451: quality-efficiency trade-off. A DRealSR scatter plot shows SANA-SR's Pareto position against seven baselines, alongside qualitative crops for the 32× compression + linear attention + one-step configuration. 5

Benchmark results: Single-step inference runs in 0.019 seconds with 407.95G MACs and 344M parameters. On DIV2K-Val, RealSR, and DRealSR, SANA-SR performs on par with or better than OSEDiff and AdcSR while operating at substantially lower compute. No other published method combines one-step inference, 32× compression, linear attention complexity, and LoRA-based adaptation in a single SR model. 5

Why read it: The 32× compression ratio is the number to stress-test. Aggressive spatial compression preserves perceptual quality only if the compressed latent retains enough structure for the generative prior to reconstruct texture — the paper's contribution is showing that a sufficiently strong pretrained LinearDiT prior plus LoRA can close that gap for SR. For teams interested in mobile or edge deployment of generative SR, SANA-SR provides a concrete architecture template: 344M parameters and sub-20 ms inference is within range of on-device inference on current mobile hardware.

Quick reference

Paper	Core contribution	Institution	Peer-review status	Code
2605.23902 — PiD	Sigma-aware adapter unifies latent decoding + upsampling; <1 s for 2048×2048 on RTX 5090, 6× faster than cascaded SR	NVIDIA Research (Toronto SIL Lab)	Preprint	GitHub (open)
2605.23522 — Precise	Frozen clean-latent posterior mean ensures SDE consistency in RL rollouts; 13–53% less training wall-clock time	Tencent Hunyuan + Peking University	Preprint	Not confirmed
2605.23381 — VDE	Velocity decompose-and-estimate replaces caching; 3.22× on Flux, LPIPS −52% vs EasyCache on Qwen-Image	South China University of Technology	CVPR 2026	Not confirmed
2605.22967 — Relay	Per-token relay channels enable forward-thinking in MDMs; 32% inference latency reduction on Fast-dLLM v2	UMass Amherst / UofT / Stanford / ICL / Mila	Preprint	GitHub (open)
2605.23451 — SANA-SR	32× compression + linear attention + LoRA enables 0.019 s one-step SR at 344M params	SJTU + Fuzhou Univ + Shanghai AI Lab	Preprint	Not confirmed

The weekend batch has a clear axis: every paper here replaces a cheap approximation with a principled one. PiD replaces two sequential reconstruction stages with a single generative decoder. Precise replaces heuristic stochastic sampling with SDE-consistent rollouts. VDE replaces static velocity caching with per-component estimation. Relay replaces stateless round-by-round denoising with state-carrying relay channels. SANA-SR replaces quadratic token processing with a compressed-and-linear alternative. None of them claim a new capability — each one measures the cost of the approximation it removes.
