SortedRL: Accelerating Reinforcement Learning Training for Large Language Models through Online Length-Aware Scheduling
Scaling reinforcement learning has shown strong promise for improving the reasoning abilities of large language models, particularly on tasks that require long chain-of-thought generation. However, reinforcement learning training efficiency is often bottlenecked by the rollout phase. Because autoregressive generation is slow and rollout must synchronize with policy updates, this phase can account for up to 70% of total training time when generating long trajectories, such as 16,000 tokens. Generation lengths also vary widely across samples, and widely used on-policy reinforcement learning algorithms wait for every response in a batch to finish before updating the model, so hardware sits idle while the longest responses complete. This underutilization is commonly referred to as computational bubbles.
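To make the bubble idea concrete, here is a minimal sketch under a simplified model that is an assumption of this note, not the paper's exact metric: every sample in a batch occupies one decode slot, each slot produces one token per step, and the batch ends only when its longest sample finishes. The function name `bubble_ratio` and the example lengths are hypothetical.

```python
def bubble_ratio(lengths):
    """Fraction of slot-steps left idle when a batch waits for its longest sample.

    Simplified model (assumption): one token per slot per decode step, and a
    finished slot stays idle until the longest sample in the batch completes.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    longest = max(lengths)
    if longest <= 0:
        raise ValueError("lengths must contain a positive value")
    total_slot_steps = len(lengths) * longest   # capacity reserved by the batch
    used_slot_steps = sum(lengths)              # steps that produced a token
    return 1.0 - used_slot_steps / total_slot_steps


# Hypothetical output lengths, in tokens.
mixed = [1_000, 2_000, 4_000, 16_000]     # one long straggler
uniform = [4_000, 4_000, 4_000, 4_000]    # similar lengths batched together

print(f"mixed batch bubble ratio:   {bubble_ratio(mixed):.3f}")
print(f"uniform batch bubble ratio: {bubble_ratio(uniform):.3f}")
```

In this toy model a single long straggler leaves most of the reserved capacity idle, while a batch of similar lengths leaves none, which is the intuition behind batching by length.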
To address these challenges, the researchers propose SortedRL, an online length-aware scheduling strategy designed to reconcile sample efficiency and computational efficiency in reinforcement learning-based tuning of large language models. It reorders rollout samples by output length and prioritizes groups of short samples for early updates. This allows large rollout batches, flexible update batches, and a near on-policy micro-curriculum at the same time.
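The reordering step can be pictured with a short sketch. It assumes the output lengths are already known and groups purely by length; the `Rollout` class, the function name, and the batch size are hypothetical, and the actual system schedules online while generation is still in progress.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: int
    output_len: int  # number of generated tokens


def length_sorted_update_batches(rollouts, update_batch_size):
    """Sort rollouts by output length and cut them into update batches.

    The shortest samples land in the first batches, so they can be used
    for a policy update before the longest samples are consumed.
    """
    if update_batch_size <= 0:
        raise ValueError("update_batch_size must be positive")
    ordered = sorted(rollouts, key=lambda r: r.output_len)
    return [
        ordered[i:i + update_batch_size]
        for i in range(0, len(ordered), update_batch_size)
    ]


# Hypothetical output lengths for six prompts.
rollouts = [Rollout(i, n) for i, n in enumerate([900, 120, 4000, 300, 16000, 250])]
batches = length_sorted_update_batches(rollouts, update_batch_size=2)
for step, batch in enumerate(batches):
    print(step, [r.output_len for r in batch])
```

Because one large rollout batch is split into several smaller update batches ordered from short to long, the sequence of updates behaves like a small curriculum over response length.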
The system comprises three core components. The first is online length-aware scheduling, which dynamically batches samples with similar generation lengths and minimizes rollout bubbles through oversubscription and early termination. The second is a controllable off-policy sampling mechanism that supports both fully on-policy and partially off-policy modes. It uses a cache to control the degree of off-policy training: unfinished samples are cached, and newly completed generations are processed immediately. The third is a co-designed reinforcement learning infrastructure, featuring a length-aware controller and a stateful rollout buffer, that coordinates length scheduling, buffer management, and model interaction.
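The interplay of oversubscription, early termination, and caching can be sketched as a toy simulation. Everything here is an illustrative assumption rather than the authors' implementation: the function `run_rounds`, its parameters, the one-token-per-step decode model, and in particular the rule that the fully on-policy mode restarts cached samples from scratch while the partial mode resumes them.

```python
from collections import deque


def run_rounds(target_lens, update_batch_size, oversubscribe, keep_partial):
    """Toy scheduler: returns (update_batches, tokens_decoded).

    target_lens[i] is the hidden output length of prompt i (assumed >= 1).
    oversubscribe is the number of in-flight samples, larger than the update batch.
    keep_partial=True resumes cached samples (tokens from an older policy are kept);
    keep_partial=False regenerates them from scratch (assumed on-policy behavior).
    """
    pending = deque(range(len(target_lens)))  # prompts not yet started
    progress = {}                             # in-flight: prompt id -> tokens so far
    updates, tokens_decoded = [], 0
    while pending or progress:
        # Oversubscription: keep more samples in flight than one update needs.
        while pending and len(progress) < oversubscribe:
            progress[pending.popleft()] = 0
        want = min(update_batch_size, len(progress))
        finished = []
        # Decode one token per in-flight sample per step.
        while len(finished) < want:
            for pid in list(progress):
                progress[pid] += 1
                tokens_decoded += 1
                if progress[pid] >= target_lens[pid]:
                    finished.append(pid)
                    del progress[pid]
        # Early termination: stop the round once enough samples are done,
        # hand them to the trainer, and leave the rest in the cache.
        updates.append(finished)
        if not keep_partial:
            for pid in progress:
                progress[pid] = 0  # discard stale tokens, restart next round
    return updates, tokens_decoded


lens = [3, 10, 2, 8, 4, 1]  # hypothetical output lengths
print(run_rounds(lens, update_batch_size=2, oversubscribe=4, keep_partial=True))
print(run_rounds(lens, update_batch_size=2, oversubscribe=4, keep_partial=False))
```

In both modes the short samples reach the trainer first. Resuming cached samples avoids re-decoding their tokens at the cost of mixing in tokens from an older policy, which is the trade-off the cache is meant to control.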
The research team evaluated the method on diverse tasks, including logical puzzles and math challenges, using LLaMA-3.1-8B and Qwen-2.5-32B models. On logical reasoning tasks, LLaMA-3.1-8B-Instruct with SortedRL reached the same high score with 40.74% fewer samples. On competition-level math problems, including OlympiadBench, AIME 2024, and AMC 2023, the approach outperformed the baseline by 3.9% to 18.4% given the same amount of data. In particular, with Qwen-2.5-32B on AIME 2024, it improved accuracy by more than 18% over the baseline.
In end-to-end performance tests, SortedRL significantly reduced the bubble ratio in reinforcement learning training. The bubble ratio dropped sharply from a baseline of 74% to less than 5.81%, and as low as 3.37% in the partial mode. The method also increased rollout throughput over the baseline, by 7.57% in the fully on-policy mode and by 39.48% in the partially off-policy mode.
You are welcome to follow our shows, including Learn By Doing With Steven 数能生智, steven data talk, and steven数据漫谈. The steven data talk podcast is available on YouTube Music, Spotify, and other platforms, and related content is updated in parallel on Xiaohongshu, WeChat Official Accounts, YouTube, and Spotify. Everything covered in the podcast and video episodes can be found in the show notes, or you can visit the Linktree link below for all of our social media addresses.
https://arxiv.org/pdf/2603.23414 https://linktr.ee/learnbydoingwithsteven
#LargeLanguageModels #ReinforcementLearning #AlgorithmOptimization #ArtificialIntelligence #SortedRL

