Key Concept
A multi-modal approach that combines audio, video, and text data through self-attention and cross-modal attention mechanisms can effectively distinguish subjects with strong positive symptoms of schizophrenia from healthy controls.
Summary
This study focuses on developing a multi-modal system to identify subjects with strong positive symptoms of schizophrenia. The system utilizes audio, video, and text data as input modalities.
For the audio modality, vocal tract variables and voicing information were extracted as low-level features, which were then used to compute high-level coordination features. For the video modality, facial action units were extracted and used to compute coordination features. The text modality used context-independent word embeddings extracted from speech transcripts.
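One way to picture the coordination features described above is as a stack of correlation matrices between low-level feature channels (vocal tract variables or facial action units) at several time delays. The sketch below is a minimal illustration under that assumption, not the study's actual feature pipeline. The function name `delayed_correlation_matrix`, the choice of delays, and the input layout are all hypothetical.

```python
import numpy as np

def delayed_correlation_matrix(features, delays):
    """Correlation structure between channels at several time delays.

    features: array of shape (num_channels, num_frames), one row per
              low-level feature channel (assumed layout).
    delays:   iterable of non-negative frame delays (assumed values).
    Returns an array of shape (num_delays, num_channels, num_channels)
    where entry [k, i, j] is the Pearson correlation between channel i
    at time t and channel j at time t + delays[k].
    """
    features = np.asarray(features, dtype=float)
    num_channels, num_frames = features.shape
    delays = list(delays)
    out = np.zeros((len(delays), num_channels, num_channels))
    for k, d in enumerate(delays):
        if d < 0 or d >= num_frames - 1:
            raise ValueError("each delay must be in [0, num_frames - 2]")
        a = features[:, : num_frames - d]  # channels at time t
        b = features[:, d:]                # channels at time t + d
        a = a - a.mean(axis=1, keepdims=True)
        b = b - b.mean(axis=1, keepdims=True)
        cov = a @ b.T
        norm = np.outer(np.sqrt((a ** 2).sum(axis=1)),
                        np.sqrt((b ** 2).sum(axis=1)))
        # Guard against constant channels, which have zero variance.
        out[k] = cov / np.maximum(norm, 1e-12)
    return out
```

Each segment of a session would yield one such stack, which a downstream classifier could then consume in place of the raw frame-level features.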
The multi-modal system was developed by fusing a segment-to-session-level classifier for the audio and video modalities with a Hierarchical Attention Network (HAN) for the text modality. Self-attention and cross-modal attention mechanisms were incorporated to leverage the relationships between the different modalities.
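The interplay of self-attention and cross-modal attention can be sketched with plain scaled dot-product attention. This is a simplified illustration and not the architecture from the study: it uses only two modalities, omits learned projection matrices and the HAN text encoder, and pools by averaging over time. The names `attention` and `fuse` and the shared feature dimension are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (Tq, d), keys: (Tk, d), values: (Tk, dv).
    Returns the attended output (Tq, dv) and the weights (Tq, Tk).
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def fuse(audio, text):
    """Hypothetical two-modality fusion (no learned parameters)."""
    # Self-attention: each modality attends over its own time steps.
    audio_self, _ = attention(audio, audio, audio)
    text_self, _ = attention(text, text, text)
    # Cross-modal attention: queries come from one modality,
    # keys and values from the other.
    audio_from_text, _ = attention(audio_self, text_self, text_self)
    text_from_audio, _ = attention(text_self, audio_self, audio_self)
    # Mean-pool over time and concatenate into one session vector.
    return np.concatenate([audio_from_text.mean(axis=0),
                           text_from_audio.mean(axis=0)])
```

The only difference between the two attention types here is where the queries, keys, and values come from, which is why a single `attention` function serves both.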
The proposed multi-modal system outperformed the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score. An ablation study was also conducted to validate the contribution of the attention mechanisms used in the multi-modal architecture.
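For reference, the weighted average F1 score is the per-class F1 averaged with weights proportional to each class's support (its number of true instances). The sketch below implements that standard definition in plain Python. It is not the study's evaluation code, and the function name `weighted_f1` is chosen for illustration.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    total = len(pairs)
    score = 0.0
    for label in set(y_true):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score
```

Weighting by support matters for a dataset like this one, where the two classes are not equally represented.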
The key findings of this work are:
- Segment-to-session-level classifiers perform better on the video and audio modalities than previous approaches.
- The multi-modal system with self-attention and cross-modal attention mechanisms outperforms both the uni-modal systems and previous multi-modal systems.
- The text modality, despite having the lowest individual performance, contributes to the improved performance of the multi-modal system.
- The coherence of the text transcripts is a potential factor in the misclassifications made by the text-based model.
Statistics
"Schizophrenia affects around 24 million people worldwide."
"The database contains a total of 19.43 hours of 50 unique interview sessions belonging to 18 selected subjects (7 schizophrenia subjects, and 11 healthy controls)."
Quotes
"Attention mechanisms [8], [9] capture relationships between hidden states to characterize which aspects of a state's representation contribute more toward the final prediction."
"Previous studies have shown promising results in identifying the severity of mental health disorders like major depressive disorder and schizophrenia using the correlation structure of the movements of various articulators [6], [12]."