Anonymous Authors
Abstract. Video-guided audio generation aims to add synchronized and realistic audio effects to videos and films, significantly benefiting video generative AI and video post-production. However, achieving high-fidelity generation remains challenging due to the complexity of modeling long audio sequences and of learning audio-visual correspondence, both semantically and temporally. In this work, we propose VisualAudio, a scalable mixture-of-experts (MoE) transformer for video-guided audio generation and adaptation. We approximate the velocity vector field with a flow-matching objective and generate samples by solving the probability flow ordinary differential equation (ODE). To ensure video-audio alignment and synchronization, we scale VisualAudio with time and frequency MoE layers that specialize across noise levels and frequency bands, along with a cross-attention architecture that effectively injects visual conditions. To enable adaptation to unseen scenarios, we introduce a parameter-efficient module based on LoRA and bias/norm tuning that requires minimal computational overhead. Experimental results demonstrate that VisualAudio achieves state-of-the-art performance in zero-shot video-to-audio generation. Furthermore, efficient adaptation with only 5\% of the parameters being learnable enables generalization to new data (e.g., dancing/music videos) or user-defined tasks (e.g., video-guided audio transfer and interpolation), empowering users to create rich and visually aligned audio content.
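For reference, a standard conditional flow-matching formulation takes the form below; the notation ($x_0$, $x_1$, $x_t$, $v_\theta$, $c$) is illustrative and the exact parameterization used by VisualAudio may differ:
\[
\mathcal{L}_{\mathrm{FM}}(\theta) \;=\; \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}}}\Big[\big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|_2^2\Big],
\qquad x_t = (1 - t)\, x_0 + t\, x_1,
\]
where $c$ denotes the visual conditioning features injected via cross-attention. Samples are then generated by integrating the probability flow ODE $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ (noise) to $t = 1$ (audio).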