Core Concepts
The article proposes an adapted Temporal Graph Networks (TGN) model that represents social interaction dynamics from temporal multi-modal behavioral data: gaze interaction, voice activity, and environmental context. The resulting representation is practical to implement and outperforms baseline models on next gaze prediction and next speaker prediction, two tasks that are crucial for effective human-robot collaboration.
Abstract
The article presents a method for modeling social interaction dynamics using Temporal Graph Networks (TGN). The key highlights and insights are:
The authors formulate two problems: (A) scene perception and representation of social interaction dynamics using multi-modal inputs, and (B) the application of this representation for downstream tasks like next speaker prediction.
They represent social interaction as a graph in which nodes are subjects and directed edges are gaze interactions between subjects. Training is framed as a link prediction problem: the model estimates the probability of future gaze interactions.
The authors use the FUMI-MPF dataset, which contains 24 group discussion sessions with 110 participants and various facilitator types. They extract features like speaking status, relative seating position, and role of subjects to generate messages for the TGN model.
The authors compare two message encoding approaches, BERT and one-hot encoding, and find that one-hot encoding performs slightly better, especially in sessions without facilitators, while BERT performs better in sessions with diverse facilitator types.
The TGN model significantly outperforms a history-based baseline, achieving a 37.0% improvement in F1-score and 24.2% improvement in accuracy for the next gaze prediction task, and a 29.0% improvement in F1-score and 3.0% improvement in accuracy for the next speaker prediction task.
The authors conduct an ablation study to compare different TGN variants and baseline temporal graph models, finding that the TGN-attn and TGN-mean variants offer the best balance between accuracy and speed for modeling group interaction dynamics.
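The graph formulation and the one-hot message encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the category vocabularies (ROLES, SEATS), the GazeEvent structure, and the encode_message helper are assumptions made for illustration, and the dataset's actual label sets and feature layout may differ.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical category vocabularies (assumptions, not taken from the dataset).
ROLES = ["participant", "facilitator"]
SEATS = ["left", "opposite", "right"]  # seat of the target relative to the source


@dataclass
class GazeEvent:
    """One timestamped directed edge: subject `src` looks at subject `dst`."""
    src: int
    dst: int
    t: float          # timestamp in seconds
    msg: List[int]    # one-hot encoded edge message


def one_hot(value: str, vocab: List[str]) -> List[int]:
    """Return a vector with a single 1 at the index of `value` in `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec


def encode_message(src_speaking: bool, dst_speaking: bool,
                   seat: str, src_role: str, dst_role: str) -> List[int]:
    """Concatenate speaking status, relative seat, and roles into one vector."""
    return ([int(src_speaking), int(dst_speaking)]
            + one_hot(seat, SEATS)
            + one_hot(src_role, ROLES)
            + one_hot(dst_role, ROLES))


# A toy interaction stream with made-up values.
events = [
    GazeEvent(1, 0, 1.2, encode_message(False, True, "right", "participant", "facilitator")),
    GazeEvent(0, 1, 0.5, encode_message(True, False, "left", "facilitator", "participant")),
    GazeEvent(2, 0, 1.9, encode_message(False, True, "opposite", "participant", "facilitator")),
]

# Temporal graph models consume events in chronological order.
events.sort(key=lambda e: e.t)

for e in events:
    print(f"t={e.t:.1f}s  {e.src} -> {e.dst}  msg={e.msg}")
```

A temporal graph model would consume such a stream event by event and be asked, at a given time, to score candidate (src, dst) pairs. That scoring step is the link prediction task.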
Overall, the proposed approach demonstrates that Temporal Graph Networks can represent social interaction dynamics comprehensively and that this representation can be applied to improve human-robot collaboration tasks.
Stats
The following sentences contain the key metrics and figures that support the authors' main claims:
The F1-score outperformed the baseline model by 37.0%.
This improvement is consistent for a secondary task of next speaker prediction which achieves an improvement of 29.0%.
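For reference, accuracy and F1-score are standard classification metrics. The sketch below computes both for binary link predictions (1 = gaze edge occurs, 0 = it does not). The labels are made up for illustration and are unrelated to the reported results; the summary does not state whether the reported improvements are absolute or relative, so the sketch makes no assumption about that.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Made-up ground truth and predictions for six candidate gaze links.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  f1={f1:.3f}")
```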