Applying Model Compression Techniques in Deep Learning Model Training

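While optimizing training for a 175B-parameter model recently, I hit a pitfall with model compression. Sharing it here so others can avoid it.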

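Background: To cut training memory usage, we tried quantization, taking the weights from FP32 down to INT8/INT4. In theory INT8 saves 75% of the weight memory and INT4 saves 87.5%, but in practice we ran into a strange problem.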
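To make the savings concrete, here is a back-of-the-envelope sketch for a 175B-parameter model. It counts weight storage only; gradients, optimizer state, and activations are deliberately left out, and the helper name weight_memory_gb is just an illustrative choice.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 175_000_000_000  # 175B parameters, as in this post

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)
for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bits)
    saving = 1 - gb / fp32_gb  # fraction of FP32 weight memory saved
    print(f"{name}: {gb:.1f} GB, saving vs FP32: {saving:.1%}")
```

So the 75% figure corresponds to 8-bit weights; 4-bit weights would halve the footprint again.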

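How we hit it:

First, we applied dynamic quantization with PyTorch's torch.quantization.quantize_dynamic()
Then we turned on the accelerate-backed auto_find_batch_size option to tune the batch size automatically
The result: the loss curve oscillated abnormally and accuracy collapsed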
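One way to catch this kind of accuracy drop before a long run is to compare the quantized model against the float original on the same inputs. The sketch below is only an illustration: the tiny nn.Sequential model and the random input are hypothetical stand-ins, not the setup from this post. Since quantize_dynamic produces inference-oriented modules, the comparison runs in eval mode under torch.no_grad().

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

torch.manual_seed(0)

# Hypothetical stand-in for the real model
float_model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 8),
).eval()

# quantize_dynamic returns a quantized copy; float_model is left untouched
quant_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(16, 64)  # random probe batch, purely illustrative
with torch.no_grad():
    ref = float_model(x)
    out = quant_model(x)

# Relative L2 error between quantized and float outputs
rel_err = ((out - ref).norm() / ref.norm()).item()
print(f"relative output error after quantization: {rel_err:.4f}")
```

If the relative error is already large on a probe batch like this, the quantization settings are worth revisiting before looking at anything else.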

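Root cause: the quantization parameters were set wrong! We had gone with the default quantization_config and never specified a qconfig. Only after reading the docs did we find that you need to build a torch.quantization.QConfig() yourself and hand it to quantize_dynamic() through its qconfig_spec argument.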

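Steps to reproduce:

import torch
from torch.ao.quantization import QConfig, quantize_dynamic
from torch.ao.quantization import default_dynamic_quant_observer, default_per_channel_weight_observer

# `model` is the float nn.Module you want to quantize (defined elsewhere)

# What not to do: no explicit qconfig, so the default per-tensor weight settings are used
model_default = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# What to do instead: map each module type to an explicit QConfig via qconfig_spec
qconfig = QConfig(
    activation=default_dynamic_quant_observer,   # activations are quantized on the fly at runtime
    weight=default_per_channel_weight_observer,  # one scale per output channel
)
model_per_channel = quantize_dynamic(model, qconfig_spec={torch.nn.Linear: qconfig})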
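Why the per-channel weight observer can matter: with a single scale for the whole tensor, rows with small weights get crushed by rows with large ones. The pure-Python sketch below illustrates the idea with symmetric int8 rounding on a made-up 2x4 weight matrix. It is a simplified model of the arithmetic, not PyTorch's actual observer code.

```python
def quantize_dequantize(values, scale):
    """Symmetric int8 round trip: quantize to [-127, 127], then map back to float."""
    restored = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_abs(values):
    return max(abs(v) for v in values)


def row_error(original, restored):
    """Largest absolute round-trip error within one row."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: row 0 is tiny, row 1 is large
weights = [
    [0.010, -0.020, 0.015, -0.005],
    [2.000, -1.500, 1.000, -0.500],
]

# Per-tensor: one scale shared by every row
tensor_scale = max_abs([v for row in weights for v in row]) / 127
per_tensor = [quantize_dequantize(row, tensor_scale) for row in weights]

# Per-channel: each row (output channel) gets its own scale
per_channel = [quantize_dequantize(row, max_abs(row) / 127) for row in weights]

small_row_err_tensor = row_error(weights[0], per_tensor[0])
small_row_err_channel = row_error(weights[0], per_channel[0])
print("small-row error, per-tensor :", small_row_err_tensor)
print("small-row error, per-channel:", small_row_err_channel)
```

In this toy case the small row's error drops by well over an order of magnitude once it gets its own scale, which is the effect the per-channel configuration is after.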

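Optimization tips:

Analyze the model structure before quantizing
For static quantization, call torch.quantization.prepare() first, run calibration data through the model, then call convert()
Pairing it with torch.compile() may speed things up further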
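The prepare-then-convert flow mentioned above can be sketched roughly as follows. This assumes eager-mode static quantization and a PyTorch build that still ships torch.ao.quantization. TinyNet, its layer sizes, and the random calibration batches are hypothetical stand-ins for a real model and real data.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, get_default_qconfig, prepare,
)


class TinyNet(nn.Module):
    """Hypothetical small model; the stubs mark where tensors enter and leave int8."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)


# Pick a quantized backend that this build supports
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().eval()
model.qconfig = get_default_qconfig(backend)

prepared = prepare(model)  # inserts observers, returns a copy
with torch.no_grad():
    for _ in range(8):  # calibration: observers record activation ranges
        prepared(torch.randn(32, 16))

quantized = convert(prepared)  # swaps observed modules for quantized ones

with torch.no_grad():
    out = quantized(torch.randn(2, 16))
print(type(quantized.fc1).__name__, tuple(out.shape))
```

In a real run the calibration loop should use representative samples from your dataset, since the observed ranges determine the activation scales.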

This one really hurt. When you do compression, be very careful with your configuration parameters!
