Choosing a Batch Size for Large Language Model Inference

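In large language model inference, the choice of batch size directly affects system performance and resource utilization. Drawing on hands-on deployment experience, this post shares a reproducible tuning method.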

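Core principle: the batch size has to balance throughput against latency. A batch that is too small leaves compute resources idle; a batch that is too large can run out of memory or lengthen the time requests spend waiting.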

Tuning steps

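Baseline: monitor GPU memory usage with torch.cuda.memory_reserved(). This reports memory held by PyTorch's caching allocator, which can exceed what tensors actually occupy; torch.cuda.max_memory_allocated() gives the peak tensor usage.
Performance test: record the average inference time at each batch size.
Throughput: throughput = batch_size / average inference time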
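
The throughput formula in the last step can be sketched in a few lines. The timings below are hypothetical numbers chosen for illustration, not measurements, and the helper name `throughput` is an assumption of this sketch. It shows why the largest batch is not automatically the best: once the GPU is saturated, throughput stops growing while per-batch latency keeps rising.

```python
from statistics import mean


def throughput(batch_size: int, latencies_s: list[float]) -> float:
    """Samples per second: batch_size divided by the mean per-batch latency."""
    if not latencies_s:
        raise ValueError("need at least one latency measurement")
    return batch_size / mean(latencies_s)


# Hypothetical per-batch latencies in seconds (illustration only).
timings = {
    1: [0.020, 0.020],
    8: [0.040, 0.040],
    32: [0.160, 0.160],
}

for bs, lat in timings.items():
    print(f"BS={bs}: latency={mean(lat):.3f}s, throughput={throughput(bs, lat):.1f} samples/s")
```

With these made-up numbers, going from a batch of 8 to 32 quadruples the latency without improving throughput, which is the plateau the tuning steps are meant to find.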

Code example:

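import torch
from time import perf_counter

# Placeholder so the script runs end to end (an assumption, not the original
# setup); replace it with the model you are tuning.
model = torch.nn.Linear(512, 512).cuda()
model.eval()

results = []
with torch.no_grad():
    for bs in [1, 4, 8, 16, 32]:
        # Dummy input created once per batch size, so allocation is not timed.
        # For a real LLM, pass token IDs of the expected sequence length instead.
        x = torch.randn(bs, 512, device="cuda")

        # Warm-up: absorb one-time costs such as kernel selection
        for _ in range(3):
            model(x)
        torch.cuda.synchronize()

        # Timed runs. CUDA kernels launch asynchronously, so synchronize before
        # reading the clock; otherwise only the launch overhead is measured.
        times = []
        for _ in range(10):
            start = perf_counter()
            output = model(x)
            torch.cuda.synchronize()
            times.append(perf_counter() - start)

        avg_time = sum(times) / len(times)
        throughput = bs / avg_time
        results.append((bs, avg_time, throughput))
        print(f"BS={bs}: Avg Time={avg_time:.4f}s, Throughput={throughput:.2f} samples/s")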

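Recommendation: on a GPU with 8 GB of memory, a batch size of 16 to 32 usually performs best in our experience. The right value shifts with model size and sequence length, so in an actual deployment, test against your specific model and hardware configuration.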

Community tip: do not blindly chase the largest batch size that fits. Weigh your actual workload and resource constraints together.
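
One way to make that trade-off concrete is to pick the highest-throughput batch size that still meets a latency budget. The sketch below assumes entries shaped like the (bs, avg_time, throughput) tuples collected in `results` by the benchmark loop; the function name `pick_batch_size`, the 100 ms budget, and the sample numbers are all hypothetical.

```python
from typing import Optional

# (batch_size, avg_time_s, throughput_samples_per_s), as stored in `results`.
Result = tuple[int, float, float]


def pick_batch_size(results: list[Result], max_latency_s: float) -> Optional[Result]:
    """Return the highest-throughput entry within the latency budget, or None."""
    feasible = [r for r in results if r[1] <= max_latency_s]
    if not feasible:
        return None
    # On a throughput tie, prefer the smaller batch: lower latency and memory use.
    return max(feasible, key=lambda r: (r[2], -r[0]))


# Hypothetical benchmark results (illustration only, not measurements).
sample = [
    (1, 0.020, 50.0),
    (8, 0.040, 200.0),
    (16, 0.080, 200.0),
    (32, 0.200, 160.0),
]

best = pick_batch_size(sample, max_latency_s=0.100)
print(best)
```

Here the batch of 32 is excluded by the budget, and the tie between 8 and 16 goes to the smaller batch. A memory ceiling could be added as a second filter in the same way if peak memory is recorded alongside each result.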