Anna Deichler, Jim O'Regan, Jonas Beskow
KTH Royal Institute of Technology
ECCV Multimodal Agents Workshop
In this paper, we present a novel dataset of conversations between participants, captured with VR headsets inside a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information in referential settings. Participants engaged in a variety of conversational scenarios, all built around referential communication tasks. The dataset provides synchronized multimodal recordings, including motion capture, speech, gaze, and scene graphs. By offering diverse and contextually rich data, it aims to advance the understanding and development of gesture generation models in 3D scenes.
System overview
We record two-party conversations in a VR setup using the AI2-THOR simulator.
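To give a sense of the scene-graph side of the setup, the following is a minimal sketch of reading per-object metadata through AI2-THOR's Python API; the scene name and window size here are illustrative, not the actual recording configuration.

    from ai2thor.controller import Controller

    # Launch an AI2-THOR scene (the scene name is illustrative).
    controller = Controller(scene="FloorPlan10", width=640, height=480)

    # Advance one frame without acting; the returned event carries the
    # scene graph: per-object type, 3D position, visibility, etc., which
    # can be logged alongside speech, gaze, and motion capture.
    event = controller.step(action="Pass")
    for obj in event.metadata["objects"]:
        print(obj["objectType"], obj["position"], obj["visible"])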
Referential communication is a mode of communication that often arises within situated dialogue; it can involve identifying, describing, or giving instructions about objects, locations, or people, and it bridges the perceptual and conceptual understanding of one's surroundings. It relies on multimodal expression, combining spatial language with non-verbal behaviors such as gaze and pointing gestures. When discussing spatial contexts, pointing and gesturing become a crucial complement to spatial language, providing a more immediate and often clearer way of specifying locations or directing attention to particular objects or areas. For agents to participate effectively in referential communication within situated dialogue, they must be able to interpret and generate both verbal spatial references and non-verbal cues such as pointing gestures and gaze; this dual capability allows for a more nuanced and efficient exchange of information.
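As a concrete illustration of interpreting a pointing gesture, the sketch below resolves a pointing ray to the candidate object with the smallest angular offset. It uses plain NumPy; the function name, wrist position, direction, and object coordinates are all hypothetical, not part of the dataset tooling.

    import numpy as np

    def resolve_pointing_target(wrist, direction, objects):
        """Return the name of the object whose direction from the wrist
        makes the smallest angle with the pointing ray."""
        direction = np.asarray(direction, dtype=float)
        direction /= np.linalg.norm(direction)
        best_name, best_angle = None, np.inf
        for name, position in objects:
            to_obj = np.asarray(position, dtype=float) - np.asarray(wrist, dtype=float)
            to_obj /= np.linalg.norm(to_obj)
            angle = np.arccos(np.clip(direction @ to_obj, -1.0, 1.0))
            if angle < best_angle:
                best_name, best_angle = name, angle
        return best_name

    # Hypothetical wrist position, pointing direction, and scene objects.
    objects = [("Mug", (1.0, 0.9, 2.0)), ("Laptop", (-1.5, 0.8, 1.0))]
    print(resolve_pointing_target((0.0, 1.0, 0.0), (0.45, -0.05, 0.9), objects))  # -> Mug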
Data recording
Skeletal data is streamed from the motion capture system to the simulator in real time.
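The transport and message format of this streaming step are not detailed here, so the sketch below assumes a simple JSON-over-UDP scheme: one datagram per skeleton frame, received and handed to a placeholder that would retarget the joints onto the simulator avatar. The port number and frame layout are assumptions.

    import json
    import socket

    MOCAP_PORT = 9763  # assumed port; the mocap software would be configured to match

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", MOCAP_PORT))

    def apply_skeleton(frame):
        """Placeholder for pushing one skeleton frame to the simulator,
        e.g. by retargeting joint positions onto an avatar."""
        for joint in frame["joints"]:
            print(joint["name"], joint["position"])

    while True:
        data, _ = sock.recvfrom(65535)
        # Assumed layout, one datagram per frame:
        # {"timestamp": ..., "joints": [{"name": ..., "position": [x, y, z]}, ...]}
        frame = json.loads(data)
        apply_skeleton(frame)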