This study focuses on developing a multi-modal system to identify subjects with strong positive symptoms of schizophrenia. The system utilizes audio, video, and text data as input modalities.
For the audio modality, vocal tract variables and voicing information were extracted as low-level features, which were then used to compute high-level coordination features. For the video modality, facial action units were extracted and used to compute coordination features. The text modality used context-independent word embeddings extracted from speech transcripts.
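As a rough illustration of how such coordination features can be derived, one common approach builds a channel-delay correlation matrix over the low-level feature streams (e.g., vocal tract variables or facial action units) and uses its eigenvalue spectrum as the high-level feature vector. The sketch below follows that idea; it is a minimal, hypothetical implementation, and the delay settings and array shapes are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def coordination_features(low_level, delays=(1, 3, 7, 15)):
    """Hypothetical coordination features from a (time, channels) array of
    low-level features (e.g., vocal tract variables or facial action units).

    Stacks time-delayed copies of every channel, builds the channel-delay
    correlation matrix, and returns its sorted eigenvalue spectrum, which
    summarizes how tightly the channels co-vary over time.
    """
    t, _ = low_level.shape
    max_d = max(delays)
    # Time-delayed copies of the feature streams, all trimmed to equal length.
    stacked = np.hstack([low_level[max_d - d : t - d, :] for d in (0, *delays)])
    corr = np.corrcoef(stacked, rowvar=False)   # channel-delay correlation matrix
    eigvals = np.linalg.eigvalsh(corr)          # symmetric matrix -> real eigenvalues
    return np.sort(eigvals)[::-1]               # high-level coordination feature vector

# Example: 500 frames of 6 synthetic vocal-tract-variable channels.
features = coordination_features(np.random.randn(500, 6))
```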
The multi-modal system was developed by fusing a segment-to-session-level classifier for the audio and video modalities with a Hierarchical Attention Network (HAN) for the text modality. Self-attention and cross-modal attention mechanisms were incorporated to leverage the relationships between the different modalities.
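To make the fusion step concrete, the snippet below sketches how self-attention and cross-modal attention over per-modality session embeddings could be wired up with standard PyTorch multi-head attention. It is only an illustrative sketch: the embedding dimension, number of heads, and the way the HAN text embedding is combined with the audio and video embeddings are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion of audio, video, and text session embeddings
    with self-attention and cross-modal attention (dimensions assumed)."""

    def __init__(self, dim=128, heads=4, num_classes=2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, audio, video, text):
        # Each input is a (batch, dim) session-level embedding from its branch
        # (e.g., the text embedding comes from a Hierarchical Attention Network).
        tokens = torch.stack([audio, video, text], dim=1)    # (batch, 3, dim)
        tokens, _ = self.self_attn(tokens, tokens, tokens)   # self-attention across modalities
        # Cross-modal attention: the text token queries the audio/video tokens.
        text_q = tokens[:, 2:3, :]
        fused_text, _ = self.cross_attn(text_q, tokens[:, :2, :], tokens[:, :2, :])
        fused = torch.cat([tokens[:, 0], tokens[:, 1], fused_text[:, 0]], dim=-1)
        return self.classifier(fused)

# Example with random session embeddings for a batch of 8 sessions.
model = CrossModalFusion()
logits = model(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```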
The proposed multi-modal system outperformed the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score. An ablation study was also conducted to validate the contribution of the attention mechanisms used in the multi-modal architecture.
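For reference, the weighted average F1 score used in that comparison is the per-class F1 averaged with weights proportional to each class's support, as computed below with scikit-learn; the labels are made up purely to illustrate the metric.

```python
from sklearn.metrics import f1_score

# Hypothetical session-level labels: 1 = strong positive symptoms, 0 = control.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Per-class F1 averaged with weights equal to each class's support.
print(f1_score(y_true, y_pred, average="weighted"))
```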
The key findings of this work are:

- A multi-modal system combining audio, video, and text can identify subjects with strong positive symptoms of schizophrenia.
- Fusing a segment-to-session-level classifier (audio and video) with a Hierarchical Attention Network (text) through self-attention and cross-modal attention improves performance.
- The proposed system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1, and the ablation study confirms the contribution of the attention mechanisms.
Source: key insights extracted from the paper by Gowtham Prem... on arxiv.org, 04-22-2024, https://arxiv.org/pdf/2309.15136.pdf