Core Concepts
Multilingual AI-generated fake hotel reviews can be effectively detected using fine-tuned XLM-RoBERTa models, with performance varying across sentiment, language, and location.
Abstract
The paper presents a novel dataset called MAIDE-UP, which contains 10,000 real and 10,000 AI-generated fake hotel reviews balanced across 10 languages (Chinese, English, French, German, Italian, Korean, Romanian, Russian, Spanish, Turkish) and 10 locations (capital cities).
The authors conduct extensive linguistic analyses to compare the AI-generated fake hotel reviews with the real human-written hotel reviews. They find that AI-generated reviews tend to be more complex and descriptive, and less readable, than real reviews. Topic modeling also reveals differences in word usage: AI-generated reviews contain more words about "service", "comfort", and "room", while real reviews mention more words related to "reception", "checking", and "bathroom".
The authors then explore the effectiveness of different models for multilingual deception detection in hotel reviews. They test a random baseline, a Naive Bayes classifier, and a fine-tuned XLM-RoBERTa model. XLM-RoBERTa performs best, reaching 94.8% accuracy on the default 80-20% train-test split and 76.6% in a few-shot setting with a 1-99% train-test split.
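The paper does not spell out the implementation of the Naive Bayes baseline; a minimal multinomial Naive Bayes sketch in pure Python, trained on hypothetical toy snippets (echoing the topic words above, not actual MAIDE-UP data), might look like:

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase whitespace split; a real pipeline would use a multilingual tokenizer.
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, with Laplace smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts for w in self.word_counts[c]}
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for c in self.class_counts:
            # Log prior plus smoothed log likelihood of each token.
            lp = math.log(self.class_counts[c] / total)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in tokenize(text):
                lp += math.log((self.word_counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy illustration (hypothetical snippets, not from the MAIDE-UP dataset):
train_texts = [
    "the reception was slow and the bathroom was small",
    "checking in took forever but the bathroom was clean",
    "impeccable service and remarkable comfort in every room",
    "the room offered exceptional comfort and attentive service",
]
train_labels = ["real", "real", "fake", "fake"]
clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("wonderful service and a comfortable room"))  # -> fake
```

Such a baseline only captures surface word-frequency differences, which is why a fine-tuned multilingual transformer like XLM-RoBERTa outperforms it.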
Further analysis shows that deception detection performance varies along several dimensions. Accuracy is lowest for Korean and English reviews, indicating that GPT-4 is better at generating deceptive, "human-like" reviews in these languages. It is also lower for reviews of hotels in Seoul, Rome, and Beijing, suggesting that GPT-4 is better at generating deceptive reviews for these locations. Additionally, the model is better at detecting deceptive negative reviews than deceptive positive ones.
Stats
The average analytic writing index is higher for AI-generated reviews than for real reviews in English; the difference is not statistically significant for Chinese, French, and Spanish.
The average descriptiveness (ratio of adjectives) is higher for AI-generated reviews compared to real reviews, except for German reviews where real reviews are more descriptive, and Korean reviews where the difference is not significant.
The average readability (Flesch Reading Ease) is lower for AI-generated reviews compared to real reviews, except for Russian reviews where the difference is not significant.
The average word count is higher for AI-generated reviews compared to real reviews, except for German and Russian reviews where the difference is not significant.
Quotes
"Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs."
"While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews."
"Most of the research so far has focused primarily on English, with very little work dedicated to other languages."