Analysis of Missing-Value Imputation Strategies in Feature Engineering

FalseShout · 2025-12-24T07:01:19 · Feature Engineering

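While working on a training dataset for a large model recently, I ran into a large number of missing values. This post records how several common imputation strategies compare.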

Data Preparation

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Simulated dataset
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000),
    'target': np.random.randint(0, 2, 1000)
})

# Randomly insert missing values: 200 distinct rows, one random feature per row
missing_indices = np.random.choice(data.index, size=200, replace=False)
for idx in missing_indices:
    col = np.random.choice(['feature1', 'feature2', 'feature3'])
    data.loc[idx, col] = np.nan
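
Before imputing, it can help to confirm how much is actually missing. The following is a minimal, self-contained sketch that rebuilds a comparable frame and counts missing cells per column and per row. The names `demo`, `rows` and `cols` are illustrative and not part of the original code, and the random generator differs from the one used above, so the per-column counts will not match exactly.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the simulated dataset above (names are assumptions).
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    rng.standard_normal((1000, 3)),
    columns=['feature1', 'feature2', 'feature3'],
)

# Same injection scheme: 200 distinct rows, each losing one randomly chosen feature.
rows = rng.choice(demo.index.to_numpy(), size=200, replace=False)
cols = rng.choice(demo.columns.to_numpy(), size=200)
for row, col in zip(rows, cols):
    demo.loc[row, col] = np.nan

missing_per_column = demo.isna().sum()    # number of NaN cells in each column
missing_rate = demo.isna().mean()         # fraction of NaN cells in each column
rows_with_missing = int(demo.isna().any(axis=1).sum())

print(missing_per_column)
print(missing_rate.round(3))
print(f'rows with at least one missing value: {rows_with_missing}')
```

Because each affected row loses exactly one feature, the per-column counts add up to 200 and exactly 200 rows contain a missing value.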

Comparison of Common Imputation Strategies

1. Mean Imputation

imputer_mean = SimpleImputer(strategy='mean')
data_mean = pd.DataFrame(imputer_mean.fit_transform(data[['feature1', 'feature2', 'feature3']]), 
                      columns=['feature1', 'feature2', 'feature3'])

2. Median Imputation

imputer_median = SimpleImputer(strategy='median')
data_median = pd.DataFrame(imputer_median.fit_transform(data[['feature1', 'feature2', 'feature3']]), 
                       columns=['feature1', 'feature2', 'feature3'])

3. Mode Imputation

imputer_mode = SimpleImputer(strategy='most_frequent')
data_mode = pd.DataFrame(imputer_mode.fit_transform(data[['feature1', 'feature2', 'feature3']]), 
                     columns=['feature1', 'feature2', 'feature3'])
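
One caveat worth checking on continuous features like these: when every observed value is unique, all values tie for "most frequent", and scikit-learn's `SimpleImputer` documents that it then returns the smallest of the tied values. The toy column below is made up purely to illustrate that behaviour.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny continuous column in which every observed value is unique (toy data).
x = np.array([[0.7], [-1.2], [3.4], [np.nan], [0.1]])

imputer = SimpleImputer(strategy='most_frequent')
filled = imputer.fit_transform(x)

# With all counts tied at 1, the learned fill value is the smallest observed value.
fill_value = imputer.statistics_[0]
print(f'fill value: {fill_value}')
print(filled.ravel())
```

So on continuous data the mode strategy effectively fills with the column minimum, which is why it is usually reserved for discrete or categorical features.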

4. KNN Imputation

from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=5)
data_knn = pd.DataFrame(imputer_knn.fit_transform(data[['feature1', 'feature2', 'feature3']]), 
                    columns=['feature1', 'feature2', 'feature3'])

Evaluation

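# Split the data (only the mean-imputed features are evaluated here;
# swap in data_median, data_mode or data_knn to evaluate the other strategies)
X_train, X_test, y_train, y_test = train_test_split(
    data_mean, data['target'], test_size=0.2, random_state=42)

# Train the model and evaluate it
# Note: 'target' was generated at random, so accuracy is expected to hover around 0.5
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')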
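
The imputers above are fit on the full dataset before the train/test split, so statistics from the test rows leak into training. A possible way to compare all four strategies without that leakage is to put each imputer inside a scikit-learn pipeline and cross-validate it. This is only a sketch under assumptions: the target below is a hypothetical one that depends on the features (so the scores are informative), and the forest size of 50 trees is an arbitrary choice to keep it fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = pd.DataFrame(
    rng.standard_normal((1000, 3)),
    columns=['feature1', 'feature2', 'feature3'],
)
# Hypothetical target that depends on the features (an assumption for illustration).
y = (X['feature1'] + 0.5 * X['feature2']
     + 0.3 * rng.standard_normal(1000) > 0).astype(int)

# Knock out one random feature in each of 200 distinct rows.
rows = rng.choice(1000, size=200, replace=False)
cols = rng.integers(0, 3, size=200)
X_missing = X.to_numpy().copy()
X_missing[rows, cols] = np.nan

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'knn': KNNImputer(n_neighbors=5),
}

cv_scores = {}
for name, imputer in imputers.items():
    # The imputer is re-fit on the training part of every fold only.
    pipeline = make_pipeline(
        imputer, RandomForestClassifier(n_estimators=50, random_state=42))
    scores = cross_val_score(pipeline, X_missing, y, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f'{name}: {scores.mean():.4f} (+/- {scores.std():.4f})')
```

Differences between strategies on synthetic data like this may be within the fold-to-fold noise, so the standard deviation is printed alongside the mean.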

Pitfalls Encountered in Practice

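In practice I found that mean imputation tends to distort the data distribution, because every missing cell receives the same value and the variance shrinks; KNN imputation works well but is computationally expensive; and for large-model training, the strategy is best chosen with domain knowledge in mind.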
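
The distribution distortion from mean imputation can be made concrete by comparing the standard deviation before and after filling. The sketch below uses synthetic data; the 30% missing rate and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# Remove 30% of the values completely at random (the rate is an arbitrary choice).
x_missing = x.copy()
x_missing[rng.choice(1000, size=300, replace=False)] = np.nan

x_mean = SimpleImputer(strategy='mean').fit_transform(
    x_missing.reshape(-1, 1)).ravel()

std_observed = np.nanstd(x_missing, ddof=1)  # spread of the values actually observed
std_imputed = np.std(x_mean, ddof=1)         # spread after mean imputation

print(f'std of observed values: {std_observed:.4f}')
print(f'std after mean fill:    {std_imputed:.4f}')
```

The filled cells sit exactly at the observed mean, so they add nothing to the sum of squared deviations while increasing the sample size, and the estimated spread always comes out smaller.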

Discussion

梦幻蝴蝶 · 2026-01-08T10:24:58
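Mean imputation tends to compress the distribution and introduces bias, especially when the feature is skewed. I'd suggest inspecting the feature distributions first; for heavily skewed features, use the median or mode, or consider KNN imputation.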
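
One way to act on this suggestion, as a rough sketch: measure each column's sample skewness and pick the median when it is large. The helper `choose_strategy` and the cutoff of 1.0 are hypothetical and would need tuning for real data.

```python
import numpy as np
import pandas as pd

SKEW_THRESHOLD = 1.0  # hypothetical cutoff, not a standard value


def choose_strategy(series, threshold=SKEW_THRESHOLD):
    """Return 'median' for strongly skewed columns, 'mean' otherwise."""
    # Series.skew() ignores NaN values by default.
    return 'median' if abs(series.skew()) > threshold else 'mean'


rng = np.random.default_rng(1)
df = pd.DataFrame({
    'symmetric': rng.standard_normal(2000),
    'skewed': rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})
df.loc[rng.choice(2000, size=200, replace=False), 'skewed'] = np.nan

strategies = {col: choose_strategy(df[col]) for col in df.columns}

# Fill each column with the statistic chosen for it.
filled = df.copy()
for col, strategy in strategies.items():
    fill_value = df[col].median() if strategy == 'median' else df[col].mean()
    filled[col] = df[col].fillna(fill_value)

print(strategies)
```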
DeadLaugh · 2026-01-08T10:24:58
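Median imputation is more robust to outliers, but it can break the correlation structure between variables. I usually pair it with model evaluation, for example using cross-validation to compare the predictive performance of different strategies.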
ColdWind · 2026-01-08T10:24:58
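Don't forget to handle categorical features! Use the mean or median for numeric columns, and fill categorical columns with 'unknown' or the mode; otherwise the model will treat missingness as a new feature.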
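
A minimal sketch of that split treatment using scikit-learn's `ColumnTransformer`: median for numeric columns and a constant 'unknown' for categorical ones. The frame, its column names and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type frame; column names and values are made up.
df = pd.DataFrame({
    'age': [25.0, np.nan, 41.0, 33.0],
    'income': [3200.0, 5100.0, np.nan, 4700.0],
    'city': pd.Series(['Beijing', np.nan, 'Shanghai', 'Beijing'], dtype=object),
})
numeric_cols = ['age', 'income']
categorical_cols = ['city']

preprocessor = ColumnTransformer([
    # Numeric columns: fill with the column median.
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    # Categorical columns: fill with an explicit 'unknown' category.
    ('cat', SimpleImputer(strategy='constant', fill_value='unknown'),
     categorical_cols),
])

# Output columns follow the transformer order: numeric first, then categorical.
filled = pd.DataFrame(
    preprocessor.fit_transform(df),
    columns=numeric_cols + categorical_cols,
)
print(filled)
```

Swapping `strategy='constant'` for `strategy='most_frequent'` gives the mode-based variant for the categorical column.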