This comprehensive survey of AI alignment provides an overview of the field's core concepts, methodology, and practice. It identifies four key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). It then outlines the landscape of current alignment research, decomposing it into two key components: forward alignment and backward alignment.
Forward alignment aims to make AI systems aligned through alignment training; here the survey covers techniques for learning from feedback and for learning under distribution shift. Backward alignment aims to gather evidence about a system's alignment and to govern the system appropriately so as to avoid exacerbating misalignment risks; here the survey discusses assurance techniques and governance practices.
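To make the learning-from-feedback category concrete, below is a minimal sketch of preference-based reward modeling, one widely used technique in this family (e.g., the reward model step in RLHF). All names, shapes, and the toy data are illustrative assumptions, not a reference implementation from the survey.

```python
# Minimal preference-based reward modeling sketch (Bradley-Terry objective).
# Assumption: responses are already encoded as fixed-size feature vectors.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the human-preferred response
    # receives the higher reward: -log sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy training loop on random "embeddings" standing in for response features.
model = RewardModel(dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned reward model can then serve as the training signal for a policy, which is where misalignment risks such as reward hacking can enter.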
The survey delves into the motivation for alignment, analyzing the risks of misalignment and its causes, including reward hacking, goal misgeneralization, and various double-edged components that can enhance capabilities but can also produce hazardous outcomes. It also covers specific misaligned behaviors, such as power-seeking, untruthful output, deceptive alignment, and ethical violations, as well as dangerous capabilities that advanced AI systems might possess.
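The following toy example illustrates reward hacking, one of the misalignment causes listed above: an agent optimizing a proxy reward finds an action that scores highly on the proxy while failing the true objective. The environment and reward values are invented purely for illustration.

```python
# Toy reward-hacking illustration: the proxy reward ("no dirt visible to
# the camera") is cheaper to satisfy by hiding dirt than by cleaning it.
actions = {
    "clean_the_room":      {"proxy_reward": 8.0,  "truly_clean": True},
    "cover_dirt_from_cam": {"proxy_reward": 10.0, "truly_clean": False},  # exploits the proxy
    "do_nothing":          {"proxy_reward": 0.0,  "truly_clean": False},
}

# A proxy-maximizing agent picks the action with the highest proxy reward.
best = max(actions, key=lambda a: actions[a]["proxy_reward"])
print(best)                          # cover_dirt_from_cam
print(actions[best]["truly_clean"])  # False: high proxy reward, misaligned outcome
```

The gap between the proxy and the true objective is exactly what assurance techniques in backward alignment try to detect.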
The survey introduces the alignment cycle framework, highlighting the interplay between forward alignment (alignment training) and backward alignment (alignment refinement). It also discusses the role of human values in alignment and AI safety problems that extend beyond alignment itself.