Data Balancing Techniques in Feature Engineering

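When training large models, class imbalance in the data is often a performance bottleneck. This post shares a few practical feature engineering techniques for dealing with it.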

Problem Scenario

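Suppose we have a classification task whose label distribution is [0: 80%, 1: 15%, 2: 5%]. A model can reach 80% accuracy on such data by always predicting class 0, so this imbalance severely hurts its ability to recognize the minority classes.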
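
Before picking a remedy, it helps to quantify the skew. The following is a minimal sketch, assuming the labels are available as a plain Python list; the helper name `label_distribution` and the toy labels are hypothetical and only mirror the 80/15/5 split described above.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels mirroring the 80% / 15% / 5% split (illustrative data only).
y_toy = [0] * 80 + [1] * 15 + [2] * 5

dist = label_distribution(y_toy)
# Ratio between the most and least frequent class, a rough measure of skew.
imbalance_ratio = max(dist.values()) / min(dist.values())

print(dist)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

An imbalance ratio of about 16 between the largest and smallest class is the kind of skew the techniques in this post are meant to address.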

Solutions

1. Sampling strategy

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

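# Split first so that synthetic samples never leak into the test set
# (X and y are assumed to be the feature matrix and label vector)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority classes with SMOTE, on the training split only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)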
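
For intuition about what SMOTE generates: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step under simplifying assumptions (the neighbour is given rather than searched for). `interpolate_sample` is a hypothetical helper, not part of imblearn, and the points are toy values.

```python
import random


def interpolate_sample(x, neighbor, rng):
    """Return a point at a random position on the segment from x to neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
x = [1.0, 2.0]         # a minority-class sample (toy values)
neighbor = [3.0, 6.0]  # one of its minority-class neighbours (toy values)

synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

Because the new points are interpolations of existing ones, they add no information that is not already in the training split, which is one reason to keep them out of the test set.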

2. Loss function weight adjustment

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

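# Compute class weights from the original (non-resampled) training labels
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
# 'balanced' uses n_samples / (n_classes * count); an 80/15/5 split gives
# roughly {0: 0.42, 1: 2.22, 2: 6.67}
class_weights = dict(zip(classes, weights))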
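
The 'balanced' mode of compute_class_weight weights each class by n_samples / (n_classes * count_of_class). As a rough check of the arithmetic, here is a pure-Python sketch of that formula applied to a toy 80/15/5 label list; `balanced_weights` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Toy labels mirroring the 80% / 15% / 5% split (illustrative data only).
y_toy = [0] * 80 + [1] * 15 + [2] * 5

toy_weights = balanced_weights(y_toy)
print({label: round(w, 3) for label, w in toy_weights.items()})
# → {0: 0.417, 1: 2.222, 2: 6.667}
```

The rarest class gets a weight 16 times that of the majority class, matching the 80:5 ratio of their frequencies.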

3. Feature importance re-weighting

from sklearn.feature_selection import SelectKBest, f_classif

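# Keep the top-k features by ANOVA F-test score (k must not exceed the number of features)
selector = SelectKBest(score_func=f_classif, k=50)
X_selected = selector.fit_transform(X_resampled, y_resampled)
# Apply the same feature selection to the held-out test set
X_test_selected = selector.transform(X_test)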