Key Idea
A novel off-policy reinforcement learning algorithm that integrates an efficient online exploration policy to significantly improve the performance of multi-task fusion in large-scale recommender systems.
Abstract
The paper proposes a novel off-policy reinforcement learning (RL) algorithm customized for multi-task fusion (MTF) in large-scale recommender systems (RSs). The key highlights are:
Existing off-policy RL algorithms for MTF suffer from severe problems: their constraints are overly strict in order to avoid the out-of-distribution (OOD) problem, which damages performance; they are unaware of the exploration policy that generated the training data and cannot interact with the real environment, leading to suboptimal policies; and traditional exploration policies are inefficient and hurt user experience.
The proposed RL-MTF algorithm integrates the off-policy RL model with an efficient online exploration policy. This relaxes the overly strict constraints, significantly improving the RL model's performance. The exploration policy focuses on exploring potential high-value state-action pairs, exhibiting extremely high efficiency compared to traditional methods.
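The paper does not publish the exploration policy's implementation, but the idea of exploring only potential high-value state-action pairs can be sketched as follows. This is a hypothetical illustration: the Gaussian perturbation of fusion weights, the candidate count, and the value-based filtering are all assumptions, not the authors' actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def explore_fusion_weights(base_weights, sigma=0.05, n_candidates=8, value_fn=None):
    """Hypothetical sketch: perturb the current policy's MTF fusion weights
    and keep only candidates that a value estimate considers promising,
    so online exploration stays near the high-value region."""
    candidates = base_weights + rng.normal(0.0, sigma, size=(n_candidates, len(base_weights)))
    candidates = np.clip(candidates, 0.0, None)  # fusion weights stay non-negative
    if value_fn is None:
        return candidates
    scores = np.array([value_fn(c) for c in candidates])
    # discard the low-value half instead of exploring uniformly at random
    return candidates[scores >= np.median(scores)]
```

Filtering by an estimated value keeps exploration traffic away from actions likely to hurt user experience, which is the efficiency argument the paper makes against traditional (e.g. purely random) exploration.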
The RL-MTF algorithm also adopts a progressive training mode, where the learned policy is iteratively refined through multiple rounds of online exploration and offline model training, further enhancing the performance.
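The progressive training mode alternates online exploration with offline model training. A minimal sketch of that loop, assuming hypothetical `explore`, `collect_feedback`, and `train_offline` callables (none of these names come from the paper):

```python
def progressive_training(policy, explore, collect_feedback, train_offline, rounds=3):
    """Hypothetical sketch of the progressive training mode: each round
    deploys the current policy with exploration, collects real user
    feedback, and refines the policy offline on that fresh data."""
    for _ in range(rounds):
        logs = collect_feedback(explore(policy))  # online exploration data
        policy = train_offline(policy, logs)      # offline RL update
    return policy
```

Because each round's training data comes from the latest policy's own exploration, the distribution gap between the behavior policy and the learned policy shrinks round by round.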
Offline experiments show the proposed RL-MTF algorithm outperforms other methods on the weighted GAUC metric. Online A/B testing in the short video channel of Tencent News demonstrates the RL-MTF model improves user valid consumption by 4.64% and user duration time by 1.74% compared to the baseline.
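Weighted GAUC (group AUC) is commonly computed as a per-user AUC averaged with each user's impression count as the weight. A minimal sketch under that standard definition (the exact weighting in the paper may differ):

```python
def auc(labels, scores):
    """Plain pairwise AUC; returns None if the user has only one class."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_gauc(per_user):
    """per_user: list of (labels, scores) for each user; the weight is
    the user's impression count. Users with a single class are skipped."""
    num = den = 0.0
    for labels, scores in per_user:
        a = auc(labels, scores)
        if a is None:
            continue
        w = len(labels)
        num += w * a
        den += w
    return num / den if den else None
```

Averaging AUC within each user avoids the bias of a global AUC, where score comparisons across users with very different activity levels are not meaningful.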
The RL-MTF model has been fully deployed in the short video channel of Tencent News for about a year, and the solution has also been adopted in other large-scale RSs in Tencent.
Metrics
User valid consumption: the average over all users of their total valid consumptions (watching a video for more than 10 seconds) during a day.
User duration time: the average over all users of their total watching time within a day.
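Given per-user lists of watch durations in seconds, the two metric definitions above can be computed as follows. A minimal sketch; the helper names and the input layout are assumptions, not from the paper:

```python
def avg_valid_consumption(watch_seconds_by_user, threshold=10):
    """User valid consumption: average per-user count of videos watched
    for more than `threshold` seconds in a day."""
    counts = [sum(1 for s in views if s > threshold) for views in watch_seconds_by_user]
    return sum(counts) / len(counts)

def avg_duration_time(watch_seconds_by_user):
    """User duration time: average per-user total watching time in a day."""
    totals = [sum(views) for views in watch_seconds_by_user]
    return sum(totals) / len(totals)
```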