Major AI developers should provide legal and technical safe harbors to protect public-interest safety research from account suspensions or legal reprisal.
LLMs struggle with the lateral-thinking puzzles in LatEval, highlighting a clear gap in this reasoning capability for current models.
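As a rough, hedged illustration of the kind of interactive, incomplete-information loop a lateral-thinking benchmark implies: the tested model sees only the puzzle's public "surface", asks yes/no questions of a host, and must eventually reconstruct the hidden "bottom" story. The function names (`ask_solver`, `answer_as_host`, `judge_guess`) and the turn protocol below are illustrative assumptions, not LatEval's published implementation.

```python
from typing import Callable, Dict, List, Tuple

def run_lateral_puzzle(
    surface: str,                                    # public part of the puzzle
    bottom: str,                                     # hidden full story
    ask_solver: Callable[[str], str],                # model under test: next question or guess (hypothetical)
    answer_as_host: Callable[[str, str, str], str],  # host/judge model: "yes" / "no" / "irrelevant" (hypothetical)
    judge_guess: Callable[[str, str], bool],         # judge: does the guess match the hidden story? (hypothetical)
    max_turns: int = 10,
) -> Dict:
    """Run one interactive puzzle and report whether the solver recovered the story."""
    history: List[Tuple[str, str]] = []
    for turn in range(1, max_turns + 1):
        prompt = (
            f"Puzzle: {surface}\n"
            f"Previous Q&A: {history}\n"
            "Ask one yes/no question, or reply 'GUESS: <your reconstruction of the story>'."
        )
        move = ask_solver(prompt)
        if move.strip().upper().startswith("GUESS:"):
            guess = move.split(":", 1)[1].strip()
            return {"solved": judge_guess(guess, bottom), "turns": turn, "history": history}
        history.append((move, answer_as_host(surface, bottom, move)))
    # Solver never committed to a guess within the turn budget.
    return {"solved": False, "turns": max_turns, "history": history}
```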
AutoDE provides a dynamic evaluation framework that closely mirrors human assessments, revealing deficiencies overlooked by static evaluations.
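To make the static/dynamic contrast concrete, here is a minimal sketch of one plausible form of dynamic evaluation, assuming an examiner agent adapts each follow-up question to the candidate's previous answers instead of replaying a fixed question list. The names `examiner`, `candidate`, and `grader` are hypothetical stand-ins, not AutoDE's actual API.

```python
from typing import Callable, List, Tuple

def dynamic_eval(
    seed_task: str,
    examiner: Callable[[str, List[Tuple[str, str]]], str],  # crafts the next probe from the history (hypothetical)
    candidate: Callable[[str], str],                         # model under evaluation (hypothetical)
    grader: Callable[[List[Tuple[str, str]]], float],        # scores the full dialogue (hypothetical)
    num_turns: int = 5,
) -> float:
    """Interview-style evaluation: each probe depends on the candidate's earlier answers."""
    dialogue: List[Tuple[str, str]] = []
    question = seed_task
    for _ in range(num_turns):
        answer = candidate(question)
        dialogue.append((question, answer))
        # Because the examiner reads the whole history, it can follow up on weak or
        # evasive answers -- behaviour a fixed (static) question list cannot probe.
        question = examiner(seed_task, dialogue)
    return grader(dialogue)
```

The design choice that matters is the feedback loop: the examiner's next question is a function of the dialogue so far, which is what lets this style of protocol surface deficiencies that a static benchmark would miss.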