visual-audio-demo

Video-Guided Audio Generation and Adaptation with Mixture of Experts at Scale

Anonymous Authors

Abstract. Video-guided audio generation aims to add synchronized and realistic audio effects to videos and films, significantly benefiting video generative AI and video post-production. However, achieving high-fidelity generation remains challenging due to the complexity of modeling long audio sequences and of learning audio-visual correspondence, both semantically and temporally. In this work, we propose VisualAudio, a scalable transformer with mixture-of-experts (MOE) for video-guided audio generation and adaptation. We approximate the velocity vector field with a flow-matching objective and generate samples by solving the probability flow ordinary differential equation (ODE). To guarantee video-audio alignment and synchronization, we scale VisualAudio with time and frequency MOE layers that specialize across noise levels and frequency bands, along with a cross-attention architecture that effectively injects visual conditions. To enable adaptation to unseen scenarios, we introduce a parameter-efficient module with LoRA and bias/norm tuning that requires minimal computational overhead. Experimental results demonstrate that VisualAudio achieves state-of-the-art performance in zero-shot video-to-audio generation. Furthermore, efficient adaptation with only 5% learnable parameters enables generalization to new data (e.g., dancing/music videos) or user-defined tasks (e.g., video-guided audio transfer and interpolation), empowering users to create rich and visually aligned audio content.
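For readers unfamiliar with flow matching, the training target and the ODE-based sampling mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the VisualAudio implementation: the linear interpolation path, the fixed-step Euler solver, and all function names below are assumptions chosen for illustration, and the video conditioning is omitted entirely.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build one training pair on a linear path (an assumed path choice).

    x0: noise sample, x1: data sample, t: scalar in [0, 1].
    Returns the interpolated point x_t and the velocity the model should regress.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # d x_t / d t for the linear path
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: with the exact constant velocity of a single (noise, data) pair,
# Euler integration carries the noise sample onto the data sample.
noise = np.array([0.0, 1.0, -1.0])
data = np.array([2.0, 2.0, 2.0])
sample = euler_sample(lambda x, t: data - noise, noise, num_steps=10)
```

In a real system the lambda would be replaced by a trained network that takes the noisy audio latent, the time step, and the visual condition; a higher-order or adaptive ODE solver could be substituted for the Euler loop.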

Overview

Video-to-Audio Generation (VGGSound)
Video-to-Audio Generation (Landscape)
Efficient Fine-tuning (AIST)
Efficient Fine-tuning (YT8M)
Efficient Fine-tuning (Landscape)
Ablation studies

Video-to-Audio Generation (VGGSound)

GT (voc)	Diff-Foley	Ours

Video-to-Audio Generation (Landscape)

GT (voc)	Diff-Foley	MM-Diffusion	Ours (Zero-Shot)

Efficient Fine-tuning (AIST)

GT (voc)	LoRA	ALL	Bias/Norm

Efficient Fine-tuning (YT8M)

GT (voc)	LoRA	ALL	Bias/Norm

Efficient Fine-tuning (Landscape)

In this section, we provide audio samples generated on the Landscape dataset.

GT (voc)	LoRA	ALL	Bias/Norm

Ablation studies

GT (voc)	L	XL	DDPM	Freq-MOE	Time-MOE	W/O MOE	Time Concat	Channel Concat