Efficient Alignment of Large Language Models with On-Policy Self-Judgment
The author presents SELF-JUDGE, an alignment framework that is both on-policy and parameter-efficient: a single model is trained to act as both the policy and the judge, which removes the need for a separate reward model.
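
The "one model, two roles" idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToySelfJudgeModel`, `PreferencePair`, and `collect_on_policy_pairs` are hypothetical names, the pairwise comparison format is assumed, and the "prefer the longer response" rule is only a placeholder for a learned judgment. The parameter update that would consume the collected pairs is omitted.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


class ToySelfJudgeModel:
    """Hypothetical stand-in for a single model used in two roles."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def generate(self, prompt: str) -> str:
        # Policy role: sample a response from the current model.
        # Toy behaviour: a random number of filler words.
        n_words = self._rng.randint(1, 8)
        return " ".join(["word"] * n_words)

    def judge(self, prompt: str, a: str, b: str) -> int:
        # Judge role: return 0 if `a` is preferred, 1 if `b` is.
        # Placeholder rule (an assumption): prefer the longer response.
        return 0 if len(a) >= len(b) else 1


def collect_on_policy_pairs(model: ToySelfJudgeModel,
                            prompts: List[str]) -> List[PreferencePair]:
    """Sample two responses per prompt and let the same model rank them."""
    pairs = []
    for prompt in prompts:
        a = model.generate(prompt)  # on-policy sample 1
        b = model.generate(prompt)  # on-policy sample 2
        winner = model.judge(prompt, a, b)  # same model, judge role
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```

Because both `generate` and `judge` belong to the same object, the sketch needs no second model to produce preference labels, which is the parameter-efficiency point made above; the responses being judged come from the current policy itself, which is what makes the data on-policy.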