LLM-based Q-shaping overcomes the limitations of conventional reward-design approaches and can substantially improve the sample efficiency of reinforcement learning.
Q-shaping is a novel framework that leverages domain knowledge from large language models (LLMs) to directly shape Q-values, enabling rapid exploration and improved sample efficiency in reinforcement learning, while preserving optimality.
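As a minimal sketch of how such shaping could look (not the paper's exact mechanism): an LLM-derived prior table `Q_llm` biases the TD target, with its weight decaying as visit counts grow, so the unshaped optimum is preserved in the limit. The prior table, the decay schedule, and the tabular setting are all assumptions for illustration.

```python
import numpy as np

# Hedged sketch: tabular Q-learning whose targets are shaped by an
# LLM-provided prior Q_llm. The prior's weight w decays with visit
# counts, so learned Q-values converge to the unshaped optimum.
# Q_llm, the decay schedule, and the interface are hypothetical.
def q_shaped_update(Q, Q_llm, counts, s, a, r, s_next, alpha=0.1, gamma=0.99):
    counts[s, a] += 1
    w = 1.0 / counts[s, a]                # prior weight -> 0 over time
    target = r + gamma * np.max(Q[s_next])  # standard greedy bootstrap
    shaped_target = (1 - w) * target + w * Q_llm[s, a]
    Q[s, a] += alpha * (shaped_target - Q[s, a])
    return Q
```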
A novel Reinforcement Learning-based solution for Neural Architecture Search that learns to efficiently explore large search spaces, outperforming strong baselines such as local search and random search.
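As a toy illustration of the general idea (not the paper's controller), here is a REINFORCE-style sketch over a small discrete architecture space; `N_SLOTS`, `N_OPS`, and `evaluate()` are hypothetical placeholders standing in for the real search space and validation reward.

```python
import numpy as np

# Hypothetical sketch: a REINFORCE controller for architecture search.
# An architecture is a sequence of per-slot operation choices; logits
# are pushed toward choices with above-baseline reward.
rng = np.random.default_rng(0)
N_SLOTS, N_OPS = 4, 3                 # e.g. 4 layers, 3 candidate ops each
logits = np.zeros((N_SLOTS, N_OPS))
baseline, lr = 0.0, 0.1

def evaluate(arch):
    # placeholder reward (stands in for validation accuracy)
    return float(np.mean(arch == 2))

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    arch = np.array([rng.choice(N_OPS, p=probs[i]) for i in range(N_SLOTS)])
    reward = evaluate(arch)
    baseline = 0.9 * baseline + 0.1 * reward
    advantage = reward - baseline
    for i, op in enumerate(arch):
        grad = -probs[i]              # d/dlogits of log prob = onehot - probs
        grad[op] += 1.0
        logits[i] += lr * advantage * grad
```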
The authors propose an efficient estimation technique called Observation-Aware Spectral (OAS) to learn the transition model of a POMDP with a known observation model, and develop the OAS-UCRL algorithm, which achieves a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.
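For reference, a standard episodic-regret definition consistent with such a guarantee; the initial state $s_1$ and the episode indexing are assumed notation, not taken from the paper.

```latex
% Assumed episodic regret against the optimal policy over K episodes;
% the stated bound controls it as \widetilde{\mathcal{O}}(\sqrt{T}).
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Big( V^{\pi^{*}}(s_1) - V^{\pi_k}(s_1) \Big)
\;\le\; \widetilde{\mathcal{O}}\big(\sqrt{T}\big)
```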
Diffusion-based sampling from energy-based policies, represented by the exponentiated Q-function, enables more expressive policy representations that capture multimodal behaviors and improve the exploration-exploitation trade-off in continuous control tasks.
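To make the energy-based target concrete, here is a Langevin-dynamics sketch standing in for the diffusion sampler: it draws actions from $\pi(a) \propto \exp(Q(a)/\alpha)$ for a made-up bimodal `Q_toy`, showing that both modes are captured. This is an illustrative stand-in, not the paper's sampler.

```python
import numpy as np

# Hedged sketch: unadjusted Langevin dynamics targeting
# pi(a) ∝ exp(Q(a)/alpha). Q_toy is a made-up bimodal Q-function
# with peaks near a = -1 and a = +1.
rng = np.random.default_rng(0)

def Q_toy(a):
    return -np.minimum((a - 1.0) ** 2, (a + 1.0) ** 2)

def grad_Q(a, eps=1e-4):
    return (Q_toy(a + eps) - Q_toy(a - eps)) / (2 * eps)

def langevin_sample(n_steps=500, step=0.01, alpha=0.1):
    a = rng.normal()                  # start from noise
    for _ in range(n_steps):
        # ascend log-density grad_Q/alpha, plus injected noise
        a += step * grad_Q(a) / alpha + np.sqrt(2 * step) * rng.normal()
    return a

samples = np.array([langevin_sample() for _ in range(100)])
# samples cluster around both -1 and +1: a multimodal policy
```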
Large language models can learn to solve reinforcement learning problems and learn graph structures through temporal difference learning, even though they are only trained to predict the next token.
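For readers unfamiliar with the mechanism being probed for, here is plain tabular TD(0) on a toy three-state chain; the chain, rewards, and step sizes are made up purely to make the temporal-difference rule concrete.

```python
import numpy as np

# Illustrative tabular TD(0) on a 3-state chain (state 2 is terminal).
V = np.zeros(3)
alpha, gamma = 0.1, 0.9

for _ in range(1000):
    s = 0
    while s != 2:
        s_next = s + 1
        r = 1.0 if s_next == 2 else 0.0
        # TD(0): move V(s) toward the bootstrapped target r + gamma*V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
# V converges to [0.9, 1.0, 0.0]
```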
DAPO is a novel duality framework that incorporates general function approximation into policy mirror descent methods: alongside the mirror map used for policy projection, it uses the dual mirror map to measure the function approximation error, completing the duality.
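For context, a generic policy mirror descent step in assumed notation; DAPO's dual-map error measure sits on top of a primal update of this form.

```latex
% Generic policy mirror descent update (assumed form, not DAPO's exact
% statement): ascend on Q, projected via the Bregman divergence D_\Phi
% induced by the mirror map \Phi.
\pi_{k+1}(\cdot \mid s) \;=\; \arg\max_{\pi} \;
\eta \,\big\langle Q^{\pi_k}(s,\cdot),\, \pi \big\rangle
\;-\; D_{\Phi}\big(\pi,\, \pi_k(\cdot \mid s)\big)
```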
ASCPO can learn policies that satisfy state-wise constraints with high probability, even without model-based assumptions.
Absolute State-wise Constrained Policy Optimization (ASCPO) is a novel policy optimization algorithm that guarantees high-probability satisfaction of state-wise safety constraints in reinforcement learning, without assuming knowledge of the underlying system dynamics.
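In assumed notation, an absolute state-wise constraint of this kind can be written as a joint high-probability requirement over the entire trajectory, rather than a bound on expected cumulative cost.

```latex
% Assumed notation: every state along a horizon-H trajectory must lie
% in the safe set, jointly with probability at least 1 - \delta.
\Pr\!\Big( s_t \in \mathcal{S}_{\mathrm{safe}} \ \ \forall\, t \in \{0,\dots,H\} \Big)
\;\ge\; 1 - \delta
```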
This paper proposes a new policy iteration algorithm that uses reinforcement learning techniques to solve continuous-time stochastic linear-quadratic zero-sum differential game problems even when the system dynamics are unknown.
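As background in assumed notation, the deterministic analogue of this problem: the saddle-point value is characterized by a game algebraic Riccati equation, which policy iteration can solve without explicit knowledge of the dynamics matrices. The paper's stochastic setting adds diffusion-dependent terms not shown here.

```latex
% Deterministic LQ zero-sum game Riccati equation (assumed notation):
% dynamics \dot{x} = Ax + B_1 u + B_2 v, with minimizer u and maximizer v.
A^{\top} P + P A + Q
- P \big( B_1 R_1^{-1} B_1^{\top} - B_2 R_2^{-1} B_2^{\top} \big) P = 0
```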