Core Concepts
ITERALIGN is a data-driven constitution discovery and self-alignment framework for Large Language Models (LLMs), designed to improve their alignment with human values.
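The pipeline can be pictured as a loop of red teaming, constitution proposal, and constitution-guided self-correction. Below is a minimal sketch of one such round, assuming generic `base_generate` and `oracle_generate` callables stand in for the base LLM and a stronger oracle model; the function names and prompt wording are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of one IterAlign-style round (illustrative, not the paper's code).
from typing import Callable, List, Tuple

def iteralign_round(
    base_generate: Callable[[str], str],    # base LLM to be aligned (assumed interface)
    oracle_generate: Callable[[str], str],  # stronger oracle LLM (assumed interface)
    red_team_prompts: List[str],
) -> List[dict]:
    """One round: red-team, discover a constitution, self-correct."""
    # 1. Red teaming: collect the base model's responses to adversarial prompts.
    responses: List[Tuple[str, str]] = [(p, base_generate(p)) for p in red_team_prompts]

    # 2. Constitution discovery: ask the oracle to summarize the observed
    #    failures into a short list of new constitutional principles.
    failures = "\n".join(f"Prompt: {p}\nResponse: {r}" for p, r in responses)
    constitution = oracle_generate(
        "Summarize the problems in these responses as a short list of "
        f"guidelines (a 'constitution'):\n{failures}"
    )

    # 3. Constitution-guided self-correction: the base model revises its own
    #    answers under the discovered constitution.
    sft_data = []
    for prompt, response in responses:
        revised = base_generate(
            f"Constitution:\n{constitution}\n\n"
            "Revise the following response so that it follows the constitution.\n"
            f"Prompt: {prompt}\nOriginal response: {response}\nRevised response:"
        )
        sft_data.append({"prompt": prompt, "response": revised})

    # 4. (Not shown) Fine-tune the base LLM on sft_data, then repeat the loop.
    return sft_data
```

The revised responses serve as supervised fine-tuning data, after which the loop repeats with fresh red-team prompts, so the constitutions evolve with the model's remaining weaknesses.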
Statistics
Reinforcement learning from human feedback (RLHF) addresses alignment by integrating human feedback directly into the training process.
Constitutional AI (CAI) uses pre-defined guidelines called "constitutions" to ensure LLM outputs adhere to ethical standards.
ITERALIGN improves LLM harmlessness by up to 13.5%.
Quotes
"ITERALIGN leverages red teaming to unveil weaknesses of LLMs and automatically discovers new constitutions."
"Empirical results show that ITERALIGN successfully enhances truthfulness, helpfulness, harmlessness, and honesty in LLM alignment."