Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal gesture-text-speech representation to solve these tasks. By leveraging a combination of a global phrase contrastive loss and a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
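The abstract does not spell out the training objectives, but the "global phrase contrastive loss" is most plausibly an InfoNCE-style objective that pulls paired gesture-clip and phrase embeddings together in the shared space while pushing apart non-matching pairs from the batch. The sketch below is an illustrative assumption, not the authors' implementation: the function name, embedding shapes, and temperature value are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def phrase_contrastive_loss(gesture_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired gesture/phrase embeddings.

    Row i of `gesture_emb` and row i of `text_emb` form a positive pair;
    every other row in the batch serves as an in-batch negative.
    Shapes: (B, D) each. Returns a scalar loss.
    """
    g = l2_normalize(gesture_emb)
    t = l2_normalize(text_emb)
    logits = g @ t.T / temperature          # (B, B) cosine-similarity matrix
    idx = np.arange(len(logits))

    def xent(l):
        # cross-entropy with the matching pair on the diagonal as the target
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average the gesture->text and text->gesture directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

The local gesture-word coupling loss would operate analogously but at the token level (individual gestures vs. individual words within a phrase), which this batch-level sketch does not cover.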
Community
Project Page: https://www.robots.ox.ac.uk/~vgg/research/jegal/
Codebase: https://github.com/Sindhu-Hegde/jegal
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues (2025)
- MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization (2025)
- HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation (2025)
- I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue (2025)
- ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis (2025)
- Shushing! Let's Imagine an Authentic Speech from the Silent Video (2025)
- Audio-driven Gesture Generation via Deviation Feature in the Latent Space (2025)