NVIDIA’s PiD decodes latents of 512×512 images into 2048×2048 pixels in under a second
NVIDIA Research has introduced PiD, a Pixel Diffusion Decoder that unifies latent decoding and upsampling in a single generative module for fast, high-resolution decoding. PiD synthesizes 4× and even 8× upscaled images with low latency, decoding 512×512 images into 2048×2048 pixels in under 1 second on an RTX 5090 and in as little as 210 ms on a GB200 GPU. NVIDIA reports improved visual fidelity and speeds up to 5.9× faster than cascaded diffusion-based super-resolution pipelines.
Key Takeaways
- PiD unifies latent decoding and upsampling into a single pixel diffusion module instead of a decode-then-super-resolve cascade.
- NVIDIA says PiD can decode 512×512 images into 2048×2048 pixels in under 1 second on an RTX 5090 with 13 GB peak memory.
- On a GB200 GPU, PiD reaches 210 ms for 512×512 to 2048×2048 decoding, about 5.9× faster than SeedVR2.
- The model uses a lightweight sigma-aware adapter and DMD2 distillation to reduce inference to 4 steps.
- PiD applies to both VAE latents and semantic latents such as SigLIP and DINOv2.
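To put the resolution and latency figures above in perspective, here is a minimal back-of-the-envelope sketch. The function names are hypothetical and not part of any NVIDIA code. The implied baseline latency is only an estimate: it assumes the "about 5.9×" speedup applies directly to the 210 ms GB200 figure, and it is not a number NVIDIA reports.

```python
def upscale_stats(src_side: int, dst_side: int) -> dict:
    """Summarize how much larger the output is than the input (square images)."""
    scale = dst_side / src_side
    return {
        "scale_per_side": scale,            # linear upscaling factor
        "pixel_multiplier": scale ** 2,     # output pixels per input pixel
        "output_megapixels": dst_side * dst_side / 1e6,
    }


def implied_baseline_ms(latency_ms: float, speedup: float) -> float:
    """Estimate a baseline's latency from a reported latency and speedup.

    Assumption: the speedup was measured in the same setting as latency_ms.
    """
    return latency_ms * speedup


stats = upscale_stats(512, 2048)
baseline = implied_baseline_ms(210, 5.9)

print(stats)
print(f"implied baseline latency: ~{baseline:.0f} ms")
```

Going from 512×512 to 2048×2048 is a 4× increase per side, so the decoder produces 16 output pixels for every input pixel, roughly 4.2 megapixels in total.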
Why It Matters
PiD shortens the path from latent to display-quality pixels by folding decoding and upsampling into one diffusion model, with NVIDIA reporting 2048×2048 output from 512×512 images in under 1 second on an RTX 5090. That matters for any pipeline doing high-resolution image generation or post-processing, because the decoder no longer just reconstructs the image; it synthesizes detail at multi-megapixel scale. NVIDIA’s comparisons also make the competitive frame clear: PiD is positioned against cascaded diffusion-based super-resolution systems. The next concrete signal to watch is whether the reported 4-step inference and 210 ms GB200 result hold up in the released model and code.
Read full article at research.nvidia.com