Core Concepts
The proposed image compression codec leverages foundation latent diffusion models to synthesize lost details and produce highly realistic reconstructions at low bitrates.
Summary
The authors propose a novel lossy image compression codec that uses foundation latent diffusion models to synthesize lost details, particularly at low bitrates. The key components of their approach, sketched in code after the list, are:
- An autoencoder from a foundation latent diffusion model (Stable Diffusion) to transform an input image to a lower-dimensional latent space.
- A learned adaptive quantization and entropy encoder, enabling inference-time control over bitrate within a single model.
- A learned method to predict the ideal denoising timestep, which allows for balancing between transmission cost and reconstruction quality.
- A diffusion decoding process to synthesize information lost during quantization.
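Taken together, these pieces form an encode/quantize/decode pipeline. The following is a minimal PyTorch-style sketch of how such components could fit together; the module names (`LatentQuantizer`, `TimestepPredictor`) and their internals are illustrative assumptions, not the authors' implementation, and the Stable Diffusion autoencoder and the entropy coder are omitted.

```python
import torch
import torch.nn as nn


class LatentQuantizer(nn.Module):
    """Hypothetical learned adaptive quantizer: a per-channel step size that can
    be rescaled at inference time to trade bitrate against quality."""

    def __init__(self, channels: int):
        super().__init__()
        self.log_step = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, z: torch.Tensor, rate_scale: float = 1.0) -> torch.Tensor:
        step = self.log_step.exp() * rate_scale   # inference-time bitrate control
        return torch.round(z / step) * step       # quantize the latent


class TimestepPredictor(nn.Module):
    """Hypothetical predictor of the starting denoising timestep from the
    quantized latent: coarser quantization loses more detail, so the diffusion
    decoder should start from a larger (noisier) timestep."""

    def __init__(self, channels: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z_hat: torch.Tensor) -> torch.Tensor:
        pooled = z_hat.mean(dim=(2, 3))            # global latent statistics
        return torch.sigmoid(self.head(pooled))    # normalized timestep in [0, 1]


# Toy usage with a random latent standing in for the Stable Diffusion encoder output.
z = torch.randn(1, 4, 32, 32)                              # e.g. SD latent of a 256x256 image
z_hat = LatentQuantizer(channels=4)(z, rate_scale=2.0)     # coarser quantization = lower bitrate
t_start = TimestepPredictor(channels=4)(z_hat)             # where the diffusion decoder begins
```

At decode time, the quantized latent would be refined with a few denoising steps starting at the predicted timestep and then mapped back to pixels by the Stable Diffusion decoder.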
Unlike previous work, their formulation requires only a fraction of the iterative diffusion steps and can be trained on a dataset of fewer than 100k images. The authors also directly optimize a distortion objective between the input and reconstructed images, enforcing coherence with the input while maintaining highly realistic reconstructions thanks to the diffusion backbone.
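In practice, a distortion objective of this kind could be as simple as a pixel-space (or perceptual) penalty between the input and the diffusion-decoded reconstruction, optionally combined with an estimated rate term. A minimal sketch, where the choice of MSE and the rate weight `lam` are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F


def rd_loss(x: torch.Tensor, x_hat: torch.Tensor,
            bits_estimate: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Illustrative rate-distortion objective: distortion between the input and
    the reconstruction plus a weighted estimate of the transmission cost.
    MSE could be swapped for a perceptual metric such as LPIPS."""
    distortion = F.mse_loss(x_hat, x)
    return distortion + lam * bits_estimate
```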
The authors extensively evaluate their method against state-of-the-art generative compression methods on several datasets. The experiments show that the approach achieves state-of-the-art visual quality as measured by FID, and that its reconstructions are subjectively preferred by end users even when competing methods use twice the bitrate.
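For context, an FID comparison of this kind can be reproduced with off-the-shelf tooling; below is a minimal sketch using `torchmetrics`, with random tensors standing in for real originals and reconstructions (this is not the authors' evaluation code).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# uint8 images of shape (N, 3, H, W); replace the random tensors with real data.
originals = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
reconstructions = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(originals, real=True)
fid.update(reconstructions, real=False)
print(f"FID: {fid.compute().item():.2f}")
```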
Statistics
The authors report the following key metrics (a sketch of a per-image timing harness follows the list):
- Average encoding/decoding time of 3.49 seconds per image on an NVIDIA RTX 3090 GPU, nearly twice as fast as the CDC baseline.
- Model size of 1.3B parameters, with the majority coming from the Stable Diffusion backbone.
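If one wanted to reproduce a per-image timing figure like this, the usual approach is to synchronize the GPU around the encode/decode calls and average over the dataset. A minimal sketch, where `codec.encode`/`codec.decode` are placeholders for whatever implementation is being measured:

```python
import time
import torch


def seconds_per_image(codec, images) -> float:
    """Average wall-clock encode + decode time per image on a CUDA device.
    `codec` is a placeholder object exposing encode() and decode()."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images:
        bitstream = codec.encode(img)
        _ = codec.decode(bitstream)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / len(images)
```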
Quotes
"Our novel image compression codec leverages foundation latent diffusion models to synthesize lost details and produce highly realistic reconstructions at low bitrates."
"Unlike previous work, our formulation requires only a fraction of iterative diffusion steps and can be trained on a dataset of fewer than 100k images."