The paper presents a systematization of existing approaches to embedding morality in artificial agents, ranging from fully top-down rule-based methods to fully bottom-up learning-based methods. It argues that a hybrid approach, combining the adaptability of learning algorithms with the interpretability of explicit top-down moral principles, offers a pragmatic route to AI agents that are both safe and flexible.
The paper discusses three case studies that implement this hybrid approach in different ways:
Morally-constrained Reinforcement Learning: Using formal logic to specify constraints on RL training, for example a normative supervisor that applies a set of norms to filter the agent's available actions down to the permissible ones (see the first code sketch after this list).
Constitutional AI: Fine-tuning large language models with reinforcement learning from AI feedback, where the feedback models are prompted to judge outputs against an explicit 'constitution' of principles (second sketch below).
Social Dilemmas: Encoding moral preferences, drawn from frameworks in moral philosophy, psychology, and other fields, as intrinsic rewards for RL agents playing iterated social dilemma games (third sketch below).
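A minimal sketch of how a normative supervisor can sit between a tabular Q-learning agent and its environment: norms are predicates over state-action pairs, and the supervisor removes forbidden actions before the agent chooses. The example norm, the state labels, and the fallback when every action is forbidden are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

HARMFUL = {"s0": {"left"}}  # illustrative state -> forbidden-actions table

# Norms are predicates over (state, action); this hypothetical one
# forbids any action the state labels as harmful.
def forbids_harm(state, action):
    return action in HARMFUL.get(state, set())

def permissible(state, actions, norms):
    """The normative supervisor: keep only actions no norm forbids here."""
    return [a for a in actions if not any(n(state, a) for n in norms)]

def choose_action(q, state, actions, norms, epsilon=0.1):
    """Epsilon-greedy Q-learning action selection over the filtered set."""
    allowed = permissible(state, actions, norms) or actions  # fall back if all forbidden
    if random.random() < epsilon:
        return random.choice(allowed)
    return max(allowed, key=lambda a: q[(state, a)])

q = defaultdict(float)
print(choose_action(q, "s0", ["left", "right", "up"], [forbids_harm]))
```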
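A minimal sketch of the Constitutional AI feedback step: an AI model, prompted with an explicit principle, picks the more principle-compliant of two candidate responses. The `generate` callable stands in for any LLM call and is an assumption, as is the single-principle prompt format.

```python
# One principle from a hypothetical constitution; real constitutions
# contain many such instructions.
PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def ai_preference(prompt, response_a, response_b, generate):
    """Ask a principle-prompted feedback model which response it prefers."""
    judge_prompt = (
        f"Consider this conversation:\n{prompt}\n\n"
        f"{PRINCIPLE}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    verdict = generate(judge_prompt).strip()
    return response_a if verdict.startswith("A") else response_b
```

The chosen/rejected pairs produced this way train a reward model, which then drives the RL fine-tuning of the assistant.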
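A minimal sketch of intrinsic moral rewards in an iterated Prisoner's Dilemma: the agent trains on the game payoff plus a moral term. The utilitarian and deontological terms shown, the penalty values, and the weight `beta` are illustrative choices under the general framing, not the paper's exact formulation.

```python
# (my_move, partner_move) -> (my_payoff, partner_payoff); standard PD values.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def intrinsic_reward(my_move, partner_last, my_payoff, partner_payoff,
                     framework="utilitarian"):
    """A moral preference encoded as an intrinsic reward term."""
    if framework == "utilitarian":
        return my_payoff + partner_payoff  # value collective welfare
    if framework == "deontological":
        # penalise violating the norm "do not defect on a cooperator"
        return -5.0 if (my_move == "D" and partner_last == "C") else 0.0
    return 0.0

def total_reward(my_move, partner_move, partner_last, beta=1.0,
                 framework="utilitarian"):
    """Extrinsic game payoff plus the weighted intrinsic moral reward."""
    mine, theirs = PAYOFFS[(my_move, partner_move)]
    return mine + beta * intrinsic_reward(my_move, partner_last,
                                          mine, theirs, framework)

print(total_reward("C", "C", partner_last="C"))                             # 3 + 6 = 9
print(total_reward("D", "C", partner_last="C", framework="deontological"))  # 5 - 5 = 0.0
```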
The paper also discusses strategies for evaluating the effectiveness of moral learning agents, and outlines open research questions and implications for the future of AI safety and ethics.