Centrala begrepp
Machine learning models face challenges in domain generalization for failure detection through human reactions, impacting model performance across different datasets.
Sammanfattning
This study explores domain generalization in failure detection models trained on human facial expressions, highlighting the need for improved model robustness and real-life applicability. The research focuses on understanding how models perform when tested on datasets collected under different circumstances, emphasizing the importance of addressing issues related to data distribution and generalizability.
The study uses two distinct datasets of human reactions to videos of failures, one from a controlled lab setting and another collected online. Models trained on each dataset show significant performance drops when tested on the alternate dataset, indicating challenges in domain generalization. The findings underscore the complexities of human behavior and interaction across varied contexts, urging researchers to consider diverse and representative data during model training.
By analyzing model performance metrics like accuracy, precision, recall, and Cohen’s Kappa across different approaches and datasets, the study reveals insights into the limitations of current models for failure detection through human reactions. The discussion highlights the importance of robust feature engineering, regularization techniques, and clear use case definitions to enhance model generalizability in human-robot interaction applications.
Statistik
Performance almost always drops in out-of-distribution settings.
Two distinct datasets used: one from a controlled lab setting and another collected online.
Models show significant performance drop when tested on alternate dataset.
Best performing hyperparameters identified for mixed-participant and non-mixed participant approaches.
Evaluation metrics include accuracy, f1-score, precision, recall, and Cohen’s Kappa.
Citat
"We reflect on the causes for the observed model behavior and leave recommendations."
"This work emphasizes the need for HRI research focusing on improving model robustness."
"Evidence for this can be seen in how recall is the highest performance metric when testing the online-trained models on the lab dataset."