TensorRT Quantization in Practice
Environment: TensorRT 8.5+ and PyTorch 1.12+.
INT8 quantization steps:
import torch
import tensorrt as trt

class QuantizationBuilder:
    def __init__(self):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.builder = trt.Builder(self.logger)

    def build_engine(self, model_path, calib_data):
        # The ONNX parser requires an explicit-batch network
        network = self.builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser = trt.OnnxParser(network, self.logger)
        # Parse the ONNX model and surface parser errors
        with open(model_path, 'rb') as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                raise RuntimeError("Failed to parse ONNX model")
        # Enable FP16 and INT8 modes; TensorRT falls back per layer
        config = self.builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.FP16)
        config.set_flag(trt.BuilderFlag.INT8)
        # Attach the INT8 calibrator. The calibration algorithm (min-max,
        # entropy, ...) is selected by the calibrator's base class; there is
        # no separate config call for it in the TensorRT Python API.
        calibrator = TensorCalibrator(calib_data, "calibration.cache")
        config.int8_calibrator = calibrator
        # Build the engine (build_engine is deprecated since TensorRT 8.x
        # in favor of build_serialized_network)
        serialized = self.builder.build_serialized_network(network, config)
        runtime = trt.Runtime(self.logger)
        return runtime.deserialize_cuda_engine(serialized)
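The TensorCalibrator referenced above is not defined in this section. Below is a minimal sketch of what it could look like, assuming calib_data is a list of NumPy batches matching the network input and that pycuda is available for host-to-device copies. Subclassing trt.IInt8MinMaxCalibrator is what selects the min-max calibration algorithm:

import os
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda

class TensorCalibrator(trt.IInt8MinMaxCalibrator):
    """Min-max INT8 calibrator; assumes calib_data is a re-iterable
    sequence (e.g. a list) of float32 NumPy batches."""

    def __init__(self, calib_data, cache_file):
        super().__init__()
        self.batches = iter(calib_data)
        self.cache_file = cache_file
        first = next(iter(calib_data))
        self.batch_size = first.shape[0]
        self.device_input = cuda.mem_alloc(first.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        # Reuse a previous calibration run if the cache file exists
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)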
Results: model size drops from 256 MB to 64 MB, inference is roughly 35% faster, and accuracy loss stays under 0.5%.
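In practice you build the engine once, persist it, and deserialize it at serving time instead of rebuilding. A minimal sketch (the file name model_int8.engine is illustrative; calib_data follows the class above):

# Build once, then persist the engine for deployment
builder = QuantizationBuilder()
engine = builder.build_engine("model.onnx", calib_data)
with open("model_int8.engine", "wb") as f:
    f.write(engine.serialize())

# At serving time, deserialize instead of rebuilding
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
with open("model_int8.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()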
ONNX Runtime Quantization
PyTorch-to-ONNX conversion:
# Export the FP32 model to ONNX; quantization is applied in a later step
with torch.no_grad():
    traced_model = torch.jit.trace(model, example_input)
    torch.onnx.export(
        traced_model,
        example_input,
        "model.onnx",
        export_params=True,
        opset_version=13,
        do_constant_folding=True,
    )
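Before quantizing, it is worth confirming that the exported graph is valid and numerically close to the PyTorch model. A minimal sanity check, reusing model and example_input from the snippet above (assumed to live on CPU):

import numpy as np
import onnx
import onnxruntime as ort
import torch

# Structural validation of the exported graph
onnx.checker.check_model(onnx.load("model.onnx"))

# Compare ONNX Runtime output against the original PyTorch model
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
ort_out = session.run(None, {input_name: example_input.numpy()})[0]
with torch.no_grad():
    torch_out = model(example_input).numpy()
np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-5)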
Quantization configuration: use the onnxruntime.quantization module:
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are stored as INT8,
# activations are quantized on the fly at runtime
quantize_dynamic(
    "model.onnx",
    "model_quant.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
)
Measured results: the model shrinks from 45 MB to 12 MB, with roughly a 30% performance gain.
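To check size and latency on your own model, a sketch along these lines can help (it reuses the file names above and example_input from the export step; the iteration counts are arbitrary):

import os
import time
import numpy as np
import onnxruntime as ort

print("original:", os.path.getsize("model.onnx") / 1e6, "MB")
print("quantized:", os.path.getsize("model_quant.onnx") / 1e6, "MB")

session = ort.InferenceSession("model_quant.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = example_input.numpy()

# Warm up, then time a batch of runs
for _ in range(10):
    session.run(None, {input_name: dummy})
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: dummy})
print("mean latency:", (time.perf_counter() - start) / 100 * 1e3, "ms")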
PyTorch Quantization Toolchain
Static post-training quantization with the torch.quantization module:
# Prepare the model: for post-training static quantization use
# get_default_qconfig, not the QAT variant
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
model_prepared = torch.quantization.prepare(model)
# Calibrate: run representative data through the model
# to collect activation statistics
with torch.no_grad():
    for data, target in calib_loader:
        model_prepared(data)
# Convert observers into actual quantized modules
model_quantized = torch.quantization.convert(model_prepared)
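Note that eager-mode static quantization also requires the model itself to mark where tensors cross the float/int8 boundary. A minimal sketch of a quantization-ready module (the architecture here is purely illustrative):

import torch
import torch.nn as nn

class QuantReadyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8 boundary
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = QuantReadyModel()
model.eval()
# Fusing conv+relu before prepare() usually improves quantized accuracy
model = torch.quantization.fuse_modules(model, [["conv", "relu"]])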
Deployment advice: TensorRT is the recommended choice for production serving, ONNX Runtime is convenient during development, and the native PyTorch toolchain fits best when you want quantization without leaving the framework.

Discussion