Key Concepts
Language models need to be trained to identify and appropriately refuse a broad range of user requests, not only those deemed unsafe, in order to improve user experience and trust.
Statistics
GPT-4 incorrectly complies with as many as 30% of “incomplete” and “unsupported” requests.
The COCONOT evaluation set contains 1,000 noncompliance prompts.
The COCONOT training set contains ~11K prompt-response pairs.
Quotes
"Chat-based language models are designed to be helpful, yet they should not comply with every user request."
"We posit that the scope of noncompliance should be broadened [beyond safety]."
"By providing direct answers to such questions, these models not only compromise user experience but also risk perpetuating biases, contributing to AI hype, or propagating false information, eroding user trust."