Core Concepts
Deep learning-based recommender systems often lack comprehensive evaluation from a human-centric perspective beyond simple interest matching. This study develops a robust human-centric evaluation framework to assess the quality of recommendations generated by five recent open-sourced deep learning models.
Abstract
The researchers developed a comprehensive human-centric evaluation framework that incorporates seven diverse metrics (novelty, diversity, serendipity, perceived accuracy, transparency, trustworthiness, and satisfaction) to assess the quality of recommendations generated by five recent open-sourced deep learning-based recommender system models.
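The metric formulas themselves are not reproduced in this summary. Of the seven metrics, the subjective ones (perceived accuracy, transparency, trustworthiness, satisfaction) come from user feedback, while novelty, diversity, and serendipity can also be computed objectively. A minimal Python sketch of the objective side, using standard definitions from the RecSys literature rather than the paper's exact formulation (the popularity-based "expected items" baseline and the embedding-based similarity are illustrative assumptions):

```python
import numpy as np

def novelty(recommended_items, item_popularity, n_users):
    """Mean self-information of the recommended items: rarer items score higher."""
    pops = np.array([item_popularity[i] for i in recommended_items])
    return float(np.mean(-np.log2(pops / n_users)))

def intra_list_diversity(recommended_items, item_embeddings):
    """Average pairwise (1 - cosine similarity) within the top-k list."""
    vecs = np.array([item_embeddings[i] for i in recommended_items])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    k = len(recommended_items)
    return float(np.mean(1.0 - sims[np.triu_indices(k, k=1)]))

def serendipity(recommended_items, relevant_items, expected_items):
    """Fraction of recommendations that are relevant but unexpected,
    i.e., not also produced by an obvious popularity baseline."""
    hits = [i for i in recommended_items
            if i in relevant_items and i not in expected_items]
    return len(hits) / len(recommended_items)
```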
The evaluation datasets consisted of both offline benchmark data and personalized online recommendation feedback collected from 445 real users. The key findings include:
Different deep learning models exhibit different pros and cons across the multi-dimensional metrics tested.
Users generally want a combination of accuracy with at least one other human value in the recommendations.
The degree to which different values are combined needs to be experimentally tuned to users' preferred levels.
The researchers also quantified the causal relationships between each pair of human-centric metrics and performed an impact-factor analysis. They found that, compared to objective metrics, subjective metrics such as transparency and trustworthiness are more strongly associated with the final recommender system optimization goals of accuracy and satisfaction. User-perceived recommendation diversity and serendipity, along with several user interaction features, emerged as strong impact factors on model trustworthiness, transparency, accuracy, and satisfaction.
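The exact statistical procedure is not detailed in this summary. One common way to run an impact-factor analysis of this kind is to regress an optimization target such as satisfaction on the remaining metrics and compare standardized coefficients; a hedged sketch, assuming per-user metric scores in a DataFrame with hypothetical column names:

```python
import pandas as pd
import statsmodels.api as sm

def impact_factors(df: pd.DataFrame, target: str) -> pd.Series:
    """Standardized OLS coefficients as a rough proxy for impact strength.

    This is an illustrative stand-in for the paper's analysis, not its
    actual procedure; column names and data source are hypothetical.
    """
    predictors = [c for c in df.columns if c != target]
    # Standardize all variables so coefficients are comparable across metrics.
    z = (df - df.mean()) / df.std(ddof=0)
    X = sm.add_constant(z[predictors])
    model = sm.OLS(z[target], X).fit()
    return model.params.drop("const").sort_values(key=abs, ascending=False)

# Hypothetical per-user survey scores for the seven metrics:
# df = pd.read_csv("user_study_scores.csv")
# print(impact_factors(df, target="satisfaction"))
```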
Based on the findings, the researchers proposed model-wise optimization strategies and ways of balancing accuracy with other important human values for future deep learning-based recommender system design and development.
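As one concrete (and hypothetical) way to operationalize such a balance, a greedy re-ranker in the spirit of maximal marginal relevance exposes a single trade-off weight that could be tuned to the user-preferred level noted above. This is a sketch under those assumptions, not the paper's proposed strategy:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(candidates, scores, item_embeddings, k=10, lambda_=0.8):
    """Greedily pick k items maximizing
    lambda_ * relevance - (1 - lambda_) * max-similarity to items already picked."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best, best_val = None, -np.inf
        for item in remaining:
            sim = max((cosine(item_embeddings[item], item_embeddings[s])
                       for s in selected), default=0.0)
            val = lambda_ * scores[item] - (1 - lambda_) * sim
            if val > best_val:
                best, best_val = item, val
        selected.append(best)
        remaining.remove(best)
    return selected
```

Here lambda_ = 1.0 recovers a pure accuracy ranking, while lower values trade accuracy for intra-list diversity; the study's findings suggest the right setting is user-dependent and should be found empirically.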
Stats
445 real users provided personalized online recommendation feedback; seven human-centric metrics (novelty, diversity, serendipity, perceived accuracy, transparency, trustworthiness, satisfaction); five recent open-sourced deep learning models evaluated.
Quotes
"Deep learning-based (DL) models in recommender systems (RecSys) have gained significant recognition for their remarkable accuracy in predicting user preferences."
"While DL-based models are often only evaluated under standard accuracy metrics in the literature, how well such standards transfer to end user-related values, such as recommendation interpretability, trustworthiness and user satisfaction is still an open question."
"We find that (1) different DL models have different pros and cons in the multi-dimensional metrics that we test with; (2) users generally want a combination of accuracy with at least one another human values in the recommendation; (3) the degree of combination of different values needs to be carefully experimented to user preferred level."