Model Compression Techniques: A Comparison of Pruning, Quantization, and Distillation

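In the era of large models, model compression has become a key way to improve inference efficiency and reduce compute cost. This article compares three approaches (pruning, quantization, and distillation) and gives a reproducible starting point for each.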

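Pruning. Pruning compresses a model by removing the weights that contribute least to its output. In PyTorch, the torch.nn.utils.prune module implements this. The call below performs unstructured L1 pruning, which masks individual weights by magnitude (structured pruning, which removes whole channels or rows, uses prune.ln_structured instead):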

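import torch.nn as nn
import torch.nn.utils.prune as prune
module = nn.Linear(128, 64)  # placeholder layer; substitute a layer from your own model
prune.l1_unstructured(module, name='weight', amount=0.3)  # mask the 30% of weights with the smallest |w|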

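Pruning ratios of 20-50% of the parameters are typical for this method. Two caveats apply. First, unstructured pruning only zeroes weights through a mask, so file size and latency do not shrink unless the model is stored and executed with sparse formats or kernels. Second, accuracy usually drops after pruning, so evaluate the pruned model and fine-tune it if needed.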
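To make the selection rule concrete, here is a minimal pure-Python sketch of magnitude-based masking. It illustrates the idea behind L1 unstructured pruning and is not PyTorch's implementation; the function name l1_prune_mask, the sample weights, and the tie-breaking order are assumptions made for illustration.

```python
def l1_prune_mask(weights, amount):
    """Return a 0/1 mask that zeroes the `amount` fraction of smallest-|w| entries."""
    n_prune = round(amount * len(weights))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


# Hypothetical weights of one small layer, flattened to a list.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08, 0.6, -0.15]
mask = l1_prune_mask(weights, amount=0.3)

# Applying the mask keeps the tensor shape; only the values change.
pruned = [w * m for w, m in zip(weights, mask)]
sparsity = mask.count(0) / len(mask)
```

The pruned list has the same length as the original, which is why masking alone does not reduce storage: the zeros still occupy space until a sparse representation is used.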

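Quantization. Quantization converts floating-point weights into a low-precision integer representation. PyTorch provides a quantization API; the eager-mode post-training flow looks like this: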

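import torch.quantization  # newer releases expose the same API under torch.ao.quantization
# Assumes `model` wraps its forward pass in QuantStub/DeQuantStub, as eager-mode static quantization requires.
model.eval()  # post-training quantization is done in eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 server backend
prepared_model = torch.quantization.prepare(model)  # insert observers that record value ranges
prepared_model(calibration_input)  # calibrate; calibration_input is a placeholder for representative data
quantized_model = torch.quantization.convert(prepared_model)  # swap in int8 modules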

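This yields 8-bit quantization. Relative to FP32, weight storage shrinks by roughly 4x (32 bits down to 8 bits per weight). Inference is often considerably faster as well, but the gain depends on whether the target hardware and backend provide int8 kernels.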
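The arithmetic behind 8-bit quantization can be sketched in a few lines of plain Python. This is an illustrative affine (scale and zero-point) scheme over a signed int8 range, not the exact procedure any particular PyTorch observer uses; the function names and sample weights are assumptions.

```python
def affine_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Compute (scale, zero_point) mapping [x_min, x_max] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0  # degenerate case: every value is zero
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest integer level, clamped to the int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical weights; the observed min/max stand in for a calibration step.
weights = [-0.5, -0.1, 0.0, 0.3, 1.5]
scale, zero_point = affine_qparams(min(weights), max(weights))
quantized = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, which is where the roughly 4x size reduction comes from, and the round-trip error per value stays on the order of one quantization step (scale).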

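Knowledge Distillation. Distillation trains a small student model to imitate the outputs of a large teacher model. The loss combines a KL-divergence term against the teacher's outputs with a cross-entropy term against the ground-truth labels: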

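import torch.nn.functional as F
student_logits = model_student(input)
teacher_logits = model_teacher(input).detach()  # the teacher is not trained
# T is a softening temperature (an assumption added here; T = 1 leaves the logits unchanged)
kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1), F.softmax(teacher_logits / T, dim=-1), reduction='batchmean') * T * T
loss = alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, target)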

The resulting student is smaller than the teacher while retaining a relatively high accuracy.
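For readers who want to see the loss without framework calls, the following pure-Python sketch computes the same kind of weighted objective for a single example. The temperature T, the T squared scaling of the KL term, and the sample logits are assumptions chosen for illustration, following the common formulation of distillation rather than anything specified above.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the unsoftened student prediction against the true label.
    ce = -math.log(softmax(student_logits)[target])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce


teacher = [2.0, 1.0, 0.1]        # hypothetical teacher logits, true class is 0
good_student = [2.0, 1.0, 0.1]   # agrees with the teacher
bad_student = [0.1, 1.0, 2.0]    # ranks the classes in reverse

loss_good = distillation_loss(good_student, teacher, target=0)
loss_bad = distillation_loss(bad_student, teacher, target=0)
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy part remains; a student that disagrees with both the teacher and the label is penalized by both terms.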

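Overall, pruning suits models with obvious parameter redundancy, quantization suits deployment environments where memory and latency are constrained, and distillation tends to do best at preserving accuracy.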