In engineering practice for LLM fine-tuning, the choice of evaluation metrics directly affects both fine-tuning quality and business value. This post shares practical evaluation strategies for LoRA and Adapter fine-tuning scenarios.
Core Evaluation Dimensions
1. Task-Specific Metrics
For dialogue systems, we use:
```python
from sklearn.metrics import f1_score, precision_score, recall_score

def conversation_metrics(y_true, y_pred):
    # Weighted F1, precision, and recall across label classes
    return {
        'f1': f1_score(y_true, y_pred, average='weighted'),
        'precision': precision_score(y_true, y_pred, average='weighted'),
        'recall': recall_score(y_true, y_pred, average='weighted'),
    }
```
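As a quick sanity check, here is a small worked example on hypothetical intent labels for six dialogue turns (the labels themselves are made up for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical intent labels for 6 dialogue turns (illustrative only)
y_true = ["book", "cancel", "book", "faq", "faq", "book"]
y_pred = ["book", "book", "book", "faq", "cancel", "book"]

metrics = {
    'f1': f1_score(y_true, y_pred, average='weighted'),
    'precision': precision_score(y_true, y_pred, average='weighted'),
    'recall': recall_score(y_true, y_pred, average='weighted'),
}
```

With `average='weighted'`, each per-class score is weighted by that class's support, so the weighted recall here equals plain accuracy (4 of 6 turns correct).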
2. Fine-Tuning Stability Monitoring
Detect overfitting by tracking the magnitude of LoRA weight changes between snapshots:
```python
import torch

def lora_weight_change(model, prev_state_dict):
    # Sum the L2 norms of per-tensor deltas, for LoRA weights only
    current_state = model.state_dict()
    change_norm = 0.0
    for key, tensor in current_state.items():
        if 'lora' in key and key in prev_state_dict:
            change_norm += torch.norm(tensor - prev_state_dict[key]).item()
    return change_norm
3. Adapter Fine-Tuning Metrics
Summary statistics over adapter-layer weights:
```python
# Summarize adapter-layer weight distributions; near-zero std or high
# sparsity can signal a dead or collapsed adapter.
def adapter_activation_stats(model):
    stats = {}
    for name, module in model.named_modules():
        if hasattr(module, 'adapter') and module.adapter is not None:
            weights = module.adapter.weight.data
            stats[name] = {
                'mean': weights.mean().item(),
                'std': weights.std().item(),
                'sparsity': (weights == 0).sum().item() / weights.numel(),
            }
    return stats
```
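To make the attribute-based lookup concrete, here is a self-contained toy model exposing an `adapter` submodule of the shape the stats helper expects (the `BlockWithAdapter` class is an assumption for illustration; real adapter implementations differ in structure):

```python
import torch
import torch.nn as nn

# Toy block exposing an `adapter` submodule, mirroring the attribute
# that the stats helper looks for (class name is an assumption).
class BlockWithAdapter(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.adapter = nn.Linear(dim, dim, bias=False)

model = nn.Sequential(BlockWithAdapter(), BlockWithAdapter())

# Same traversal as the helper above, inlined for a standalone run
stats = {}
for name, module in model.named_modules():
    if hasattr(module, 'adapter') and module.adapter is not None:
        w = module.adapter.weight.data
        stats[name] = {
            'mean': w.mean().item(),
            'std': w.std().item(),
            'sparsity': (w == 0).sum().item() / w.numel(),
        }
```

Only the two blocks that actually own an `adapter` attribute show up in `stats`; the container and the adapters themselves are skipped by the `hasattr` check.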
Recommended Evaluation Workflow
- Baseline testing: compute base metrics on a held-out validation set
- Progress monitoring: log the key change metrics every epoch
- Risk alerting: set thresholds that trigger a model reset or a learning-rate adjustment
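The three-stage flow above can be sketched as a simple alert rule; the threshold values and action names here are illustrative assumptions, not tuned recommendations:

```python
# Minimal sketch of the risk-alerting stage; thresholds and the
# action policy are illustrative assumptions.
def risk_check(weight_change, baseline_f1, current_f1,
               max_change=5.0, max_f1_drop=0.02):
    """Map monitored values to one of three actions."""
    if current_f1 < baseline_f1 - max_f1_drop:
        return "reset"      # quality regressed: restore a checkpoint
    if weight_change > max_change:
        return "reduce_lr"  # weights drifting too fast: damp updates
    return "continue"
```

Checking the quality drop before the weight drift means a genuine regression always wins over a mere drift warning when both fire at once.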
In LoRA fine-tuning, focus on finding the balance point between weight-change magnitude and task accuracy: over-tuning can degrade the base model's original capabilities.

Discussion