Core Concepts
SOTOPIA evaluates the social intelligence of AI systems, revealing disparities between models and humans and persistent challenges for current LLMs.
Abstract
The paper introduces SOTOPIA, an interactive environment for evaluating social intelligence in AI systems. It discusses the importance of social goals in human interactions and the limitations of existing benchmarks. SOTOPIA simulates diverse social scenarios in which agents role-play characters and interact, and it evaluates their performance with a multi-dimensional framework. The study compares models such as GPT-4 against humans on goal completion, believability, knowledge acquisition, secret keeping, relationship maintenance, adherence to social rules, and financial benefit. Results show that models and humans differ in performance across these dimensions.
The study also explores GPT-4's ability to act as an automated evaluator of social interactions, and analyzes how language models such as GPT-3.5, Llama-2-70b-chat, and MPT-30b-chat perform in simulated interactions. It highlights creative strategies the models employ, as well as their difficulties in maintaining a persona over the course of a conversation.
Furthermore, the research investigates how humans interact differently from models such as GPT-4 in the most challenging scenarios, the SOTOPIA-hard subset. Humans outperform GPT-4 on goal achievement and display more strategic negotiation than the models' responses.
Stats
Humans perform significantly better than GPT-4 on the GOAL dimension.
GPT-4 scores concentrate within one standard deviation of human scores.
All models have negative scores in SOC and SEC dimensions.
As an evaluator, GPT-4 often rates interactions higher than human judges do on the SOC and SEC dimensions.
Models struggle with maintaining persona and moving conversations forward.
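The negative SOC and SEC scores above follow from how those dimensions are scored: they act as penalties rather than rewards. A minimal sketch of such a multi-dimensional scorer is below; the dimension abbreviations follow the paper, but the exact ranges and the unweighted mean are illustrative assumptions, not the paper's definition.

```python
# Hypothetical per-dimension score ranges, modeled on SOTOPIA's
# multi-dimensional evaluation framework. The abbreviations (GOAL, BEL,
# KNO, SEC, REL, SOC, FIN) follow the paper; the ranges are assumptions.
RANGES = {
    "BEL": (0, 10),   # believability of the role-played persona
    "REL": (-5, 5),   # relationship maintenance
    "KNO": (0, 10),   # knowledge acquisition
    "SEC": (-10, 0),  # secret keeping: penalty-only, so scores are <= 0
    "SOC": (-10, 0),  # social-rule adherence: also penalty-only
    "FIN": (-5, 5),   # financial / material benefit
    "GOAL": (0, 10),  # social goal completion
}

def validate(scores: dict) -> dict:
    """Clamp each dimension's score into its assumed range,
    defaulting missing dimensions to 0."""
    return {
        dim: max(lo, min(hi, scores.get(dim, 0)))
        for dim, (lo, hi) in RANGES.items()
    }

def overall(scores: dict) -> float:
    """Aggregate with an unweighted mean (an illustrative choice,
    not necessarily the paper's aggregation)."""
    clamped = validate(scores)
    return sum(clamped.values()) / len(clamped)
```

Because SEC and SOC can only subtract from the total, any rule violation or leaked secret drags an agent's aggregate down, which matches the observation that all evaluated models score negatively on those two dimensions.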
Quotes
"Despite larger LLMs typically achieving higher social intelligence than smaller ones..."
"Our findings indicate that SOTOPIA has potential as a platform for assessing and enhancing the social skills of language-based agents."