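Analyzing the Effect of Missing-Value Imputation Strategies in Feature Engineering
While preparing a training dataset for a large model recently, I ran into a large number of missing values. This post records how several common imputation strategies compare.
Data preparation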
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Simulated dataset
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000),
    'target': np.random.randint(0, 2, 1000)
})
# Randomly insert missing values: 200 distinct rows, one randomly chosen feature in each
missing_indices = np.random.choice(data.index, size=200, replace=False)
for idx in missing_indices:
    col = np.random.choice(['feature1', 'feature2', 'feature3'])
    data.loc[idx, col] = np.nan
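As a quick sanity check on the step above, the following sketch rebuilds the same kind of simulated frame and counts the gaps it contains. The helper name `inject_missing` and the use of `np.random.default_rng` are my own assumptions rather than part of the original code, so the positions of the missing values will differ from the ones generated above.

```python
import numpy as np
import pandas as pd


def inject_missing(df, cols, n_missing, rng):
    """Hypothetical helper: blank one randomly chosen column in n_missing distinct rows."""
    out = df.copy()
    rows = rng.choice(out.index.to_numpy(), size=n_missing, replace=False)
    chosen = rng.choice(cols, size=n_missing)
    for row, col in zip(rows, chosen):
        out.loc[row, col] = np.nan
    return out


rng = np.random.default_rng(42)
feature_cols = ['feature1', 'feature2', 'feature3']
demo = pd.DataFrame(rng.standard_normal((1000, 3)), columns=feature_cols)
demo = inject_missing(demo, feature_cols, n_missing=200, rng=rng)

# Count gaps per column and per row
missing_per_column = demo[feature_cols].isna().sum()
missing_per_row = demo[feature_cols].isna().sum(axis=1)

print(missing_per_column)
print('total missing:', int(missing_per_column.sum()))
print('max missing per row:', int(missing_per_row.max()))
```

Because the rows are drawn without replacement and each row loses exactly one feature, the total is 200 and no row has more than one gap.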
Comparison of common imputation strategies
1. Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
data_mean = pd.DataFrame(imputer_mean.fit_transform(data[['feature1', 'feature2', 'feature3']]),
                         columns=['feature1', 'feature2', 'feature3'])
2. Median imputation
imputer_median = SimpleImputer(strategy='median')
data_median = pd.DataFrame(imputer_median.fit_transform(data[['feature1', 'feature2', 'feature3']]),
                           columns=['feature1', 'feature2', 'feature3'])
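3. Mode imputation
# Note: these features are continuous, so nearly every value is unique;
# on ties SimpleImputer returns the smallest value in the column.
imputer_mode = SimpleImputer(strategy='most_frequent')
data_mode = pd.DataFrame(imputer_mode.fit_transform(data[['feature1', 'feature2', 'feature3']]),
                         columns=['feature1', 'feature2', 'feature3'])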
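Mode imputation is mainly meant for categorical or discrete columns. A minimal sketch of what happens on a continuous column, using a toy array that is purely illustrative: when every value occurs once, scikit-learn's `most_frequent` strategy returns the smallest of the tied values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with all-distinct values and one gap (illustrative data only)
X = np.array([[1.5], [2.5], [0.3], [np.nan]])

filled = SimpleImputer(strategy='most_frequent').fit_transform(X)

# Every observed value occurs exactly once, so the gap receives the column minimum
print(filled.ravel())
```

On the simulated features above this means each gap is filled with an extreme low value rather than a typical one, which is worth keeping in mind when reading any accuracy comparison.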
4. KNN imputation
from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=5)
data_knn = pd.DataFrame(imputer_knn.fit_transform(data[['feature1', 'feature2', 'feature3']]),
                        columns=['feature1', 'feature2', 'feature3'])
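To make the KNN behaviour concrete, here is a small sketch on a hand-made array (illustrative data, not from the dataset above). `KNNImputer` measures distance on the features that are present and, with its default uniform weights, fills the gap with the average of that feature over the nearest neighbours.

```python
import numpy as np
from sklearn.impute import KNNImputer

# The last row is missing its first feature; on the second feature it sits
# between rows 0 and 1 and far away from row 2.
X = np.array([
    [1.0, 1.0],
    [2.0, 2.0],
    [10.0, 10.0],
    [np.nan, 1.5],
])

filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Rows 0 and 1 are the two nearest neighbours, so the gap becomes (1.0 + 2.0) / 2
print(filled[3])
```

Each imputation needs distances to the other rows, which is where the extra computational cost on large datasets comes from.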
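Evaluation
# Split the data (only the mean-imputed features are evaluated here;
# swap in data_median, data_mode or data_knn to evaluate the other strategies)
X_train, X_test, y_train, y_test = train_test_split(
    data_mean, data['target'], test_size=0.2, random_state=42)
# Train the model and evaluate it
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')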
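The evaluation above scores a single strategy, and the imputers earlier were fitted on the full dataset before the split, so test rows influence the fill values. One possible way to compare all four strategies under the same split is sketched below. It wraps each imputer in a scikit-learn pipeline so it is fitted on the training rows only. The data is regenerated with `np.random.default_rng`, and `n_estimators=50` is chosen only to keep the run short; both are my assumptions, not the original setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
feature_cols = ['feature1', 'feature2', 'feature3']

# Simulated features with one random gap in each of 200 distinct rows
values = rng.standard_normal((1000, 3))
rows = rng.choice(1000, size=200, replace=False)
cols = rng.integers(0, 3, size=200)
values[rows, cols] = np.nan
X = pd.DataFrame(values, columns=feature_cols)
y = rng.integers(0, 2, size=1000)  # random labels, unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'knn': KNNImputer(n_neighbors=5),
}

results = {}
for name, imputer in imputers.items():
    # The pipeline fits the imputer on the training rows only
    pipeline = make_pipeline(
        imputer, RandomForestClassifier(n_estimators=50, random_state=42))
    pipeline.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, acc in results.items():
    print(f'{name}: {acc:.4f}')
```

Since the labels in this simulation are drawn independently of the features, every strategy should land near chance level, and differences between them are noise. A meaningful comparison needs a target that actually depends on the features.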
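Pitfalls encountered in practice
What I found in practice: mean imputation tends to distort the data distribution, because every filled value sits exactly at the mean and the column's variance shrinks; KNN imputation works well but is computationally expensive; for large-model training, I recommend choosing the strategy with domain knowledge in mind.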
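The distribution shift from mean imputation can be checked directly. The sketch below uses a freshly simulated column (my own illustrative data) and compares the mean and standard deviation of the observed values with those of the imputed column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
x[rng.choice(1000, size=200, replace=False)] = np.nan  # 20% missing

observed = x[~np.isnan(x)]
filled = SimpleImputer(strategy='mean').fit_transform(x.reshape(-1, 1)).ravel()

# The mean is preserved, but the spread is not
print(f'mean  observed: {observed.mean():.4f}  filled: {filled.mean():.4f}')
print(f'std   observed: {observed.std():.4f}  filled: {filled.std():.4f}')
print(f'std ratio: {filled.std() / observed.std():.4f}')
```

With 20% of the values replaced by the mean, the population standard deviation shrinks by a factor of sqrt(0.8), roughly 0.89, while the mean stays the same.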

Discussion