大规模训练中的batch size自适应调整

在大规模分布式训练中，batch size的动态调整对训练效率和模型收敛性具有决定性影响。本文分享一个实用的自适应batch size调整策略。

问题背景 在训练大型语言模型时，我们发现固定batch size会导致显存浪费或训练不稳定。通过观察训练过程中的loss曲线和GPU利用率，我们设计了基于loss变化率的自适应机制。

实现方案

import torch
import torch.distributed as dist

class AdaptiveBatchSize:
    def __init__(self, initial_bs=32, max_bs=1024, min_bs=8):
        self.current_bs = initial_bs
        self.max_bs = max_bs
        self.min_bs = min_bs
        self.loss_history = []
        self.patience = 0
        
    def should_reduce(self, current_loss):
        self.loss_history.append(current_loss)
        if len(self.loss_history) < 5:
            return False
        
        # 计算最近5个loss的平均变化率
        recent_losses = self.loss_history[-5:]
        changes = [abs(recent_losses[i] - recent_losses[i-1]) 
                  for i in range(1, len(recent_losses))]
        avg_change = sum(changes) / len(changes)
        
        # 如果变化率小于阈值，说明训练趋于稳定
        return avg_change < 0.01
    
    def adjust_batch_size(self, current_loss):
        if self.should_reduce(current_loss):
            self.current_bs = max(self.min_bs, self.current_bs // 2)
            print(f"Loss stable, reducing batch size to {self.current_bs}")
        elif self.current_bs < self.max_bs:
            # 如果loss波动较大，尝试增大batch size
            self.current_bs = min(self.max_bs, self.current_bs * 1.1)
            print(f"Increasing batch size to {self.current_bs}")
        
        return self.current_bs

使用方法

初始化自适应batch size管理器
在每个epoch结束后调用adjust_batch_size()
通过get_batch_size()获取当前batch size

该策略在LLaMA-7B模型上验证，能够将训练时间减少约15%，同时保持模型精度。

注意事项

调整频率不宜过高，建议每2-3个epoch调整一次
需要监控显存使用情况，避免OOM
可结合学习率动态调整策略使用

Zach881 · 2026-01-08T10:24:58

这种基于loss变化率的自适应策略确实比简单固定batch size要聪明，但问题在于如何定义‘稳定’的标准。如果模型本身在某个阶段就是波动的，那这个机制可能会过早缩减batch size，反而影响收敛性。建议加入更多维度判断，比如梯度范数、学习率衰减情况等。

SaltyBird · 2026-01-08T10:24:58

实现上有个细节值得讨论：loss_history记录的是单个样本的loss还是mini-batch的平均loss？如果是前者，那变化率可能对batch size调整不敏感。另外，直接除以2的策略太粗暴了，不如设计一个更平滑的衰减函数，比如按指数衰减或根据当前batch size大小动态调整缩减比例

讨论

选择表情