The paper proposes Mamba-ST, a novel architecture for efficient image style transfer. The key contributions are:
Adaptation of the inner equations of the Mamba state space model to enable fusion of content and style information, without requiring additional modules like cross-attention or custom normalization layers.
Design of a Mamba-ST Decoder (MSTD) block that accepts two separate data streams (content and style) and combines them into a single output, mimicking the functionality of cross-attention with lower memory usage and time complexity (see the sketch after this list).
Comprehensive experiments showing that Mamba-ST outperforms state-of-the-art style transfer models on quality metrics such as ArtFID and FID while using less memory and running faster at inference.
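To make the memory argument concrete, here is a rough back-of-the-envelope comparison. The helper functions, head count, and token counts are illustrative assumptions rather than figures from the paper: cross-attention materializes a content-by-style score matrix whose size grows with both sequence lengths, whereas a recurrent state-space fusion only carries a fixed-size hidden state while scanning.

```python
def attention_map_bytes(n_content: int, m_style: int, heads: int = 8, dtype_bytes: int = 4) -> int:
    # Cross-attention scores every content token against every style token,
    # so the score tensor alone grows as O(N * M) per head and batch item.
    return heads * n_content * m_style * dtype_bytes


def ssm_state_bytes(channels: int, state_dim: int = 16, dtype_bytes: int = 4) -> int:
    # A recurrent state-space fusion carries only a fixed-size hidden state
    # across the scan, independent of how many tokens it has processed.
    return channels * state_dim * dtype_bytes


if __name__ == "__main__":
    # Illustrative sizes only: 1024 content tokens, 1024 style tokens, 512 channels.
    print(attention_map_bytes(1024, 1024))  # ~33.6 MB of attention scores
    print(ssm_state_bytes(512))             # ~32 KB of recurrent state
```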
The paper first provides background on Mamba, a state space model that can efficiently model long sequences. It then details the overall Mamba-ST architecture, which consists of two Mamba encoders (for content and style) and the novel MSTD block. The MSTD block is designed to fuse the content and style information by adapting the internal Mamba matrices to simulate cross-attention without requiring additional layers.
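One way to picture this fusion is a toy recurrence in which the SSM's input and output projections (B and C) are predicted from the style tokens while the state update consumes the content tokens, so the style stream plays a role analogous to keys/values in cross-attention and the content stream to queries. The PyTorch sketch below is a minimal sequential simplification under that assumption; the class and variable names are hypothetical, and it omits the parallel scan, gating, convolution, and normalization of a real Mamba block, so it should not be read as the paper's exact MSTD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyStyleFusionSSM(nn.Module):
    """Toy linear state-space fusion: the content stream drives the state update,
    while the input/output matrices B and C are predicted from the style stream."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.B_proj = nn.Linear(dim, state_dim)            # style -> B_t
        self.C_proj = nn.Linear(dim, state_dim)            # style -> C_t
        self.delta_proj = nn.Linear(dim, 1)                # content -> step size delta_t
        self.A_log = nn.Parameter(torch.zeros(state_dim))  # learned decay, A = -exp(A_log)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content, style: (batch, seq_len, dim); equal lengths assumed for simplicity.
        batch, seq_len, dim = content.shape
        A = -torch.exp(self.A_log)                    # (state_dim,), negative for stability
        B = self.B_proj(style)                        # (batch, seq_len, state_dim)
        C = self.C_proj(style)                        # (batch, seq_len, state_dim)
        delta = F.softplus(self.delta_proj(content))  # (batch, seq_len, 1)

        h = content.new_zeros(batch, dim, A.shape[0])  # fixed-size hidden state
        outputs = []
        for t in range(seq_len):
            # Discretized recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C_t * h_t
            A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A)          # (batch, 1, state_dim)
            B_bar = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)  # (batch, 1, state_dim)
            h = A_bar * h + B_bar * content[:, t].unsqueeze(-1)       # (batch, dim, state_dim)
            y_t = (h * C[:, t].unsqueeze(1)).sum(dim=-1)              # (batch, dim)
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)  # stylized content, (batch, seq_len, dim)
```

Because only the fixed-size state h is carried across steps, the cost stays linear in the number of tokens, which is the intuition behind the reported memory and runtime savings over cross-attention.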
Extensive experiments are conducted, comparing Mamba-ST against various state-of-the-art style transfer models. The results show that Mamba-ST outperforms the competitors in terms of ArtFID and FID, while also being more efficient in memory usage and inference time. Some qualitative examples are also provided, demonstrating the ability of Mamba-ST to effectively transfer style while preserving content.
Key insights distilled from: Filippo Bott..., arxiv.org, 09-17-2024
https://arxiv.org/pdf/2409.10385.pdf