Collecting Performance Data for Model Inference

Collecting benchmark data is a key step in optimizing a PyTorch model. This article shows how to measure inference performance accurately in code.

Core approach: torch.utils.benchmark

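import torch
import torch.utils.benchmark as benchmark

def benchmark_model(model, input_tensor, num_runs=100):
    # Run on the GPU; eval() switches off dropout and batch-norm updates
    model = model.cuda().eval()
    input_tensor = input_tensor.cuda()

    # Warm up so CUDA initialization and kernel selection are not timed
    with torch.no_grad():
        for _ in range(10):
            _ = model(input_tensor)

    # Benchmark; Timer synchronizes CUDA around the measurement
    timer = benchmark.Timer(
        stmt='model(input_tensor)',
        globals={'model': model, 'input_tensor': input_tensor},
        num_threads=torch.get_num_threads()
    )
    with torch.no_grad():
        measurement = timer.timeit(num_runs)
    # Measurement.mean is in seconds per call; convert to milliseconds
    return measurement.mean * 1000

# Example usage
# Assumes model.pth holds a fully pickled model from a trusted source;
# recent PyTorch releases need weights_only=False to load one
model = torch.load('model.pth', weights_only=False)
input_data = torch.randn(1, 3, 224, 224)

time_ms = benchmark_model(model, input_data)
print(f'Mean inference time: {time_ms:.4f} ms')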
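
A mean can be skewed by a few slow runs, so it may help to report a median as well. The following is a minimal sketch, not part of the original method: it uses Timer.blocked_autorange, which chooses the number of calls per block itself. The helper name benchmark_median_ms, the CPU fallback, and the toy nn.Linear model are assumptions made so the example runs without a checkpoint file or a GPU.

```python
import torch
import torch.utils.benchmark as benchmark


def benchmark_median_ms(model, input_tensor, min_run_time=0.2):
    """Return (median, IQR) of the per-call latency in milliseconds."""
    # Fall back to the CPU when no GPU is present (an assumption of this sketch)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    input_tensor = input_tensor.to(device)

    timer = benchmark.Timer(
        stmt="model(input_tensor)",
        globals={"model": model, "input_tensor": input_tensor},
    )
    with torch.no_grad():
        # blocked_autorange picks the number of calls per block and keeps
        # measuring until at least min_run_time seconds have elapsed
        measurement = timer.blocked_autorange(min_run_time=min_run_time)
    # Measurement reports seconds per call; convert to milliseconds
    return measurement.median * 1000, measurement.iqr * 1000


# Hypothetical stand-in model so the sketch runs without model.pth
toy_model = torch.nn.Linear(64, 8)
median_ms, iqr_ms = benchmark_median_ms(toy_model, torch.randn(1, 64))
print(f"median: {median_ms:.4f} ms, IQR: {iqr_ms:.4f} ms")
```

A large IQR relative to the median suggests the environment is noisy and the run is worth repeating.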

Detailed data collection:

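# Generate a full performance report
import time

import numpy as np
import torch

def detailed_benchmark(model, input_tensor, num_runs=50):
    model = model.cuda().eval()
    input_tensor = input_tensor.cuda()

    with torch.no_grad():
        # Warm-up phase
        for _ in range(5):
            _ = model(input_tensor)

        times = []
        for _ in range(num_runs):
            # CUDA kernels launch asynchronously, so synchronize before
            # starting and before stopping the clock
            torch.cuda.synchronize()
            start = time.perf_counter()
            output = model(input_tensor)
            torch.cuda.synchronize()
            end = time.perf_counter()
            times.append((end - start) * 1000)  # convert to milliseconds

    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'median': np.median(times)
    }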

Sample results (ResNet50 on a V100):

Mean: 12.45 ms
Standard deviation: 0.87 ms
Minimum: 10.23 ms
Maximum: 15.67 ms
Median: 12.34 ms
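
Tail percentiles are often reported next to statistics like these, because a single slow run moves the mean far more than the median. The sketch below uses only the standard library; the helper name summarize_latencies is hypothetical, and the sample values are made up for illustration rather than measured.

```python
import statistics


def summarize_latencies(times_ms):
    """Summarize per-run latencies (milliseconds), including tail percentiles."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    percentiles = statistics.quantiles(times_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(times_ms),
        "median": statistics.median(times_ms),
        "p95": percentiles[94],
        "p99": percentiles[98],
    }


# Made-up samples for illustration only: four typical runs and one slow outlier
samples_ms = [10.0, 12.0, 11.0, 13.0, 50.0]
summary = summarize_latencies(samples_ms)
print(summary)
```

In this made-up sample the outlier pulls the mean well above the median, which is why the two are worth reading together.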

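Measured this way, a model's performance can be compared across different hardware, which gives later optimization work a quantitative baseline.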

Paul383 · 2026-01-08T10:24:58

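Benchmarking with torch.utils.benchmark really is much more accurate than timing by hand with time, especially because its built-in warm-up and thread handling keep environmental noise out of the numbers. I usually run a few warm-up passes first and then take the median over many runs, which filters out the occasional outlier.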

DryBrain · 2026-01-08T10:24:58

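In real projects I have found that timing a single batch is not enough; you also need to look at how batch size affects latency. I recommend running the benchmark at several batch sizes, especially for mobile deployment, where the gap between batch=1 and batch=8 can be large.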
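
A sweep like this could be sketched as follows. The helper name sweep_batch_sizes, the chosen batch sizes, the CPU fallback, and the toy model are all assumptions for illustration; substitute the real model and input shape.

```python
import torch
import torch.utils.benchmark as benchmark


def sweep_batch_sizes(model, sample_shape, batch_sizes=(1, 2, 4, 8), num_runs=20):
    """Return {batch_size: mean latency in ms} for one forward pass."""
    # Fall back to the CPU when no GPU is present (an assumption of this sketch)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    results = {}
    for batch_size in batch_sizes:
        batch = torch.randn(batch_size, *sample_shape, device=device)
        timer = benchmark.Timer(
            stmt="model(batch)",
            globals={"model": model, "batch": batch},
        )
        with torch.no_grad():
            # Measurement.mean is in seconds per call; convert to milliseconds
            results[batch_size] = timer.timeit(num_runs).mean * 1000
    return results


# Hypothetical stand-in model; replace with the model under test
toy_model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
latencies = sweep_batch_sizes(toy_model, sample_shape=(64,))
for batch_size, latency_ms in latencies.items():
    # Dividing by the batch size gives the per-sample cost
    print(f"batch={batch_size}: {latency_ms:.4f} ms total, "
          f"{latency_ms / batch_size:.4f} ms/sample")
```

Comparing the per-sample figures shows whether larger batches are actually buying throughput or only adding latency.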

SmallBody · 2026-01-08T10:24:58

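Don't forget to check how memory usage changes before and after the test: use nvidia-smi or torch.cuda.memory_summary() to look at peak GPU memory. Sometimes an optimization speeds up inference but sends memory usage soaring, which ends up hurting overall throughput.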
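
One way to read the peak programmatically is sketched below, assuming a single CUDA device. The helper name peak_memory_mib and the toy model are hypothetical, and the function returns None when no GPU is available.

```python
import torch


def peak_memory_mib(model, input_tensor):
    """Return peak GPU memory (MiB) allocated by PyTorch during one forward
    pass, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    model = model.cuda().eval()
    input_tensor = input_tensor.cuda()
    # Finish pending work, then reset the peak counter so that only this
    # forward pass (plus tensors already resident, such as weights) is counted
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(input_tensor)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)


# Hypothetical stand-in model; replace with the model under test
toy_model = torch.nn.Linear(64, 8)
peak = peak_memory_mib(toy_model, torch.randn(1, 64))
print("peak memory (MiB):", peak)
```

This figure covers only tensors allocated through PyTorch's allocator, so nvidia-smi will normally show a larger number because it also includes cached blocks and the CUDA context.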
