Advantage-Leftover Lunch RL (A-LOL) is a class of offline policy gradient algorithms for training language models stably and efficiently.
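
To make the phrase "offline policy gradient" concrete, the sketch below computes an advantage-weighted surrogate loss with a clipped importance ratio over a fixed batch of logged sequences. It is a generic illustration under stated assumptions and may differ from the exact A-LOL objective: the function name `offline_pg_loss`, its arguments, the clipping range, and the rule of skipping samples with non-positive advantage are all hypothetical choices made for this example.

```python
import math


def offline_pg_loss(logp_new, logp_ref, rewards, values, clip=0.9):
    """Advantage-weighted offline policy-gradient surrogate (illustrative only).

    logp_new: sequence log-probabilities under the policy being trained
    logp_ref: sequence log-probabilities under a frozen reference policy
    rewards:  one scalar reward per logged sequence
    values:   one baseline value estimate per sequence (hypothetical input)
    """
    total, kept = 0.0, 0
    for lp_new, lp_ref, r, v in zip(logp_new, logp_ref, rewards, values):
        advantage = r - v
        if advantage <= 0:
            # Assumption: samples that do not beat the baseline are skipped.
            continue
        # Importance ratio between the trained and reference policies,
        # clipped to [1 - clip, 1 + clip] to limit the size of each update.
        ratio = math.exp(lp_new - lp_ref)
        ratio = max(1.0 - clip, min(1.0 + clip, ratio))
        # Negative sign: minimizing this loss raises the log-probability
        # of sequences with positive advantage.
        total += -advantage * ratio * lp_new
        kept += 1
    # Average over the kept samples; an empty batch contributes zero loss.
    return total / kept if kept else 0.0


# Two logged sequences; only the first has a positive advantage.
loss = offline_pg_loss(
    logp_new=[-1.0, -2.0],
    logp_ref=[-1.0, -2.0],
    rewards=[1.0, 0.0],
    values=[0.5, 0.5],
)
print(loss)  # 0.5
```

In a real training loop the log-probabilities would come from a language model and the ratio would typically be treated as a constant weight when differentiating; plain floats are used here only to keep the sketch self-contained.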