Video-Guided Audio Generation and Adaptation with Mixture of Experts at Scale

Paper Appendix

Anonymous Authors

Abstract. Video-guided audio generation aims to add synchronized and realistic audio effects to videos and films, significantly benefiting video generative AI and video post-production. However, achieving high-fidelity generation remains challenging due to the complexity of modeling long audio sequences and learning audio-visual correspondence, both semantically and temporally. In this work, we propose VisualAudio, a scalable transformer with mixture-of-experts (MOE) for video-guided audio generation and adaptation. We approximate the velocity vector field with a flow-matching objective and generate samples by solving the probability flow ordinary differential equation (ODE). To improve video-audio alignment and synchronization, we scale VisualAudio with time and frequency MOE layers that specialize across noise levels and frequency bands, along with a cross-attention architecture that effectively injects visual conditions. To enable adaptation to unseen scenarios, we introduce a parameter-efficient module with LoRA and bias/norm tuning that requires minimal computational overhead. Experimental results demonstrate that VisualAudio achieves state-of-the-art performance in zero-shot video-to-audio generation. Furthermore, efficient adaptation with only 5% learnable parameters enables generalization to new data (e.g., dancing/music videos) or user-defined tasks (e.g., video-guided audio transfer and interpolation), empowering users to create rich and visually aligned audio content.
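As a rough illustration of the sampling step mentioned in the abstract (solving the probability flow ODE given a velocity field), the following is a minimal sketch, not the VisualAudio implementation. It assumes a fixed-step Euler solver and replaces the learned, video-conditioned network with a hypothetical toy velocity field (`make_toy_velocity`) whose data distribution is a single point; the function names, step count, and vector size are all illustrative assumptions.

```python
import random


def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a trained velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * target, the exact
    velocity at (x_t, t) is (target - x_t) / (1 - t).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target = [0.5, -1.0, 2.0]                       # toy "data" point (assumption)
noise = [rng.gauss(0.0, 1.0) for _ in target]   # x0 drawn from a standard Gaussian
sample = euler_sample(make_toy_velocity(target), noise, num_steps=50)
```

In this toy setting the trajectory is a straight line, so Euler integration carries the noise sample to the target up to floating-point error. A trained model would instead evaluate a neural network conditioned on video features at each step, and a higher-order solver could be substituted for Euler.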

Overview



Table of Contents

Video-to-Audio Generation (VGGSound)

GT (voc) | Diff-Foley | Ours

Video-to-Audio Generation (Landscape)

GT (voc) | Diff-Foley | MM-Diffusion | Ours (Zero-Shot)

Efficient Fine-tuning (AIST)

GT (voc) | LoRA | ALL | Bias/Norm

Efficient Fine-tuning (YT8M)

GT (voc) | LoRA | ALL | Bias/Norm

Efficient Fine-tuning (Landscape)

In this section, we provide audio samples generated on the Landscape dataset after efficient fine-tuning.

GT (voc) | LoRA | ALL | Bias/Norm

Ablation Studies

GT (voc) | L | XL | DDPM | Freq-MOE | Time-MOE | W/O MOE | Time Concat | Channel Concat