CAFA - Controllable Automatic Foley Artist

CAFA (Controllable Automatic Foley Artist) is a text-video-to-audio model for controllable Foley sound generation. Given a short video and a text prompt, CAFA generates an audio waveform that is temporally synchronized with the visual content and matches the semantics described in the prompt. Changing the prompt lets users modify or override the sound the video would naturally suggest, which gives fine-grained control over the generated audio.
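
The input/output contract described above can be illustrated with a minimal sketch. It does not call CAFA and is not taken from its code: `FoleyRequest`, `expected_num_samples`, and the 44.1 kHz sample rate are all hypothetical names and values chosen for illustration. It only shows that the same clip can be paired with different prompts, and that a synchronized waveform must span the clip's duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoleyRequest:
    """Hypothetical container for the two inputs: a video clip and a prompt."""
    num_frames: int  # number of video frames in the clip
    fps: float       # video frame rate in frames per second
    prompt: str      # text describing the desired sound

    @property
    def duration_seconds(self) -> float:
        # Clip length implied by the frame count and frame rate.
        return self.num_frames / self.fps


def expected_num_samples(request: FoleyRequest, sample_rate: int = 44_100) -> int:
    """Audio samples needed to cover the clip (sample rate is an assumption)."""
    return round(request.duration_seconds * sample_rate)


# Same 10-second clip, two prompts: one matching the visible source,
# one overriding it with a different sound.
clip_natural = FoleyRequest(num_frames=250, fps=25.0, prompt="dog barking")
clip_override = FoleyRequest(num_frames=250, fps=25.0, prompt="lion roaring")

# The prompt changes the target semantics, not the audio length.
samples_natural = expected_num_samples(clip_natural)
samples_override = expected_num_samples(clip_override)
```

In this sketch the prompt affects only what the audio should sound like; the length of the output is fixed by the video, which is what keeps the two aligned in time.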