AI alignment should move beyond a purely preference-based approach and instead align AI systems with normative standards appropriate to their social roles, with those standards determined through stakeholder agreement to promote mutual benefit and limit harm.
Current Reinforcement Learning from Human Feedback (RLHF) methods, despite their popularity, fail to meet basic axioms of social choice theory, raising concerns about their fairness and reliability in aligning AI with human values.
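To make the social-choice concern concrete, here is a minimal sketch (my illustration, not taken from the paper) of a Condorcet cycle: three hypothetical annotators each have transitive rankings over three candidate responses, yet the majority aggregate is intransitive, so no single scalar reward model can faithfully represent the aggregated preference.

```python
# Minimal sketch (illustrative only): a Condorcet cycle showing why aggregating
# pairwise preferences -- as RLHF reward modeling implicitly does -- can violate
# basic social-choice axioms such as transitivity.
from itertools import combinations

# Three hypothetical annotators rank three candidate responses A, B, C.
rankings = [
    ["A", "B", "C"],  # annotator 1: A > B > C
    ["B", "C", "A"],  # annotator 2: B > C > A
    ["C", "A", "B"],  # annotator 3: C > A > B
]

def majority_prefers(x, y):
    """Return True if a majority of annotators rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    else:
        print(f"majority prefers {y} over {x}")
# Prints: A over B, C over A, B over C -- a cycle, so no consistent scalar
# reward over {A, B, C} can represent the aggregated preference.
```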
Current AI alignment methods, primarily focused on extrinsic behavior modification, are insufficient for ensuring AI safety; instead, fostering intrinsic motivations for empathetic kindness and developing AI's Theory of Mind are crucial for aligning AI with human values.
Integrating intrinsic motivations, in particular kindness defined as maximizing others' rewards, is crucial for aligning AI systems with human values, especially as AI becomes more autonomous and extrinsic rewards alone risk leaving it misaligned.
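As a rough illustration of that definition (my sketch, not the paper's formulation), the snippet below mixes another agent's reward into an agent's objective through a hypothetical `kindness_weight` parameter.

```python
# Minimal sketch (illustrative only): "kindness" as an intrinsic term that
# blends another agent's reward into the agent's own objective.
from dataclasses import dataclass

@dataclass
class KindAgentReward:
    kindness_weight: float = 0.5  # hypothetical trade-off parameter in [0, 1]

    def total_reward(self, own_reward: float, other_reward: float) -> float:
        """Blend the extrinsic (own) reward with an intrinsic term that values
        the other agent's reward: kindness as maximizing others' rewards."""
        return (1 - self.kindness_weight) * own_reward + self.kindness_weight * other_reward

reward_fn = KindAgentReward(kindness_weight=0.5)
print(reward_fn.total_reward(own_reward=1.0, other_reward=3.0))  # 2.0
```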
To mitigate the risk of value lock-in in AI systems, this paper proposes "progress alignment": by emulating the mechanisms of human moral progress, AI systems can dynamically adjust their values over time and interact with humans in safer, more beneficial ways.
Legal theory, particularly the interplay of rules and cases, can provide valuable insights for addressing the problems of pluralism and specification in AI alignment, fostering more inclusive and socially coordinated AI development.
Legal theory, specifically the interplay of rules and cases, offers a framework for more pluralistic and better-specified AI alignment, moving beyond simple majority-based approaches.
In the problem of aligning AI agents through human feedback, the information-theoretic Information-Directed Sampling (IDS) algorithm outperforms both existing "explore then exploit" approaches and Thompson sampling.
Traditional "explore then exploit" AI alignment methods, even those like Thompson sampling, prove insufficient in complex environments with unknown human preferences and environmental dynamics. Information-Directed Sampling (IDS), however, offers a more effective approach by balancing reward maximization with continuous, reward-sensitive exploration of both the environment and human preferences.
This study experimentally analyzes how humans and LLM agents respond to unfair situations, particularly how they differ in terms of social values, emotions, and beliefs, and draws implications for the AI alignment problem from these differences.