Comparing Outlier Detection Methods in Feature Engineering

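When preparing training data for large models, outlier detection is a key step in feature engineering. This article compares several commonly used outlier detection methods and gives reproducible implementation steps for each.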

1. Statistical methods

Z-Score method: suited to data that is approximately normally distributed

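import numpy as np
from scipy import stats

# Generate example data: 1000 draws from a standard normal distribution
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
# Append two known outliers (they end up at indices 1000 and 1001)
outliers = np.array([5, -5])
data = np.append(data, outliers)

# Z-Score detection: flag points more than `threshold` standard deviations from the mean
z_scores = np.abs(stats.zscore(data))
threshold = 3
outlier_indices = np.where(z_scores > threshold)[0]
print(f"Z-Score outlier indices: {outlier_indices}")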
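As a rough sanity check on the threshold of 3 used above, the following sketch computes how many ordinary points a Z-Score rule would flag purely by chance. It assumes the clean data really is standard normal; the names `tail_prob` and `expected_false_positives` are illustrative and not part of the original example.

```python
from scipy import stats

threshold = 3
n_samples = 1000  # size of the clean sample in the example above

# Two-sided tail probability P(|Z| > threshold) for a standard normal variable
tail_prob = 2 * stats.norm.sf(threshold)

# Expected number of ordinary (non-outlier) points flagged purely by chance
expected_false_positives = n_samples * tail_prob

print(f"P(|Z| > {threshold}) = {tail_prob:.4f}")
print(f"Expected chance flags in {n_samples} points: {expected_false_positives:.1f}")
```

Under this assumption about 0.27% of normal points exceed the threshold, so a handful of flagged indices besides the two injected values is expected and does not by itself indicate bad data.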

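2. Quantile-based methods

IQR (interquartile range) method: makes no normality assumption, so it also suits skewed or otherwise non-normal data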

# IQR detection: flag points outside the 1.5 * IQR fences around the quartiles
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_indices = np.where((data < lower_bound) | (data > upper_bound))[0]
print(f"IQR outlier indices: {outlier_indices}")

3. Machine learning methods

Isolation Forest: well suited to high-dimensional data

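from sklearn.ensemble import IsolationForest

# Train the model. contamination=0.1 makes it flag roughly 10% of the points,
# so tune this value to the outlier rate you actually expect.
iso_forest = IsolationForest(contamination=0.1, random_state=42)
# scikit-learn expects a 2-D array, so the 1-D data is reshaped to (n_samples, 1)
outlier_labels = iso_forest.fit_predict(data.reshape(-1, 1))
# fit_predict returns -1 for outliers and 1 for inliers
outlier_indices = np.where(outlier_labels == -1)[0]
print(f"Isolation Forest outlier indices: {outlier_indices}")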
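The example above feeds one-dimensional data to Isolation Forest, while the method is described as suited to high-dimensional data. The sketch below applies it to a synthetic five-feature dataset to show the multi-feature case. The dataset, its shape, and the positions of the injected outliers are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 ordinary rows with 5 features each (synthetic, illustrative only)
X_inliers = rng.normal(0.0, 1.0, size=(500, 5))
# 5 injected outlier rows, placed far from the bulk in every feature
X_outliers = rng.choice([-8.0, 8.0], size=(5, 5))
X = np.vstack([X_inliers, X_outliers])

# contamination="auto" uses the library's default decision threshold
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = model.fit_predict(X)      # -1 = outlier, 1 = inlier
scores = model.score_samples(X)    # lower score = more anomalous

print(f"Rows flagged as outliers: {np.where(labels == -1)[0]}")
print(f"Mean score, inliers:  {scores[:500].mean():.3f}")
print(f"Mean score, injected: {scores[500:].mean():.3f}")
```

Looking at `score_samples` alongside the hard labels helps when the expected outlier rate is unknown, since the scores can be ranked or thresholded afterwards without refitting.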

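In practice, combine several methods and judge the results together rather than relying on any single detector. On large datasets in particular, weigh computational cost against detection accuracy. For large-model training, the way outliers are handled directly affects how well the model generalizes, so the detection strategy should be chosen with care.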
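One simple way to combine the three detectors discussed here is a majority vote, sketched below. The helper names (`zscore_flags`, `iqr_flags`, `iforest_flags`, `vote_outliers`), the `min_votes=2` rule, and the `contamination=0.01` setting are all assumptions for illustration, not a recommended configuration.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest


def zscore_flags(x, threshold=3.0):
    """Boolean mask: |z-score| above the threshold."""
    return np.abs(stats.zscore(x)) > threshold


def iqr_flags(x, k=1.5):
    """Boolean mask: outside the k * IQR fences around the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def iforest_flags(x, contamination=0.01, random_state=42):
    """Boolean mask: points an Isolation Forest labels as outliers (-1)."""
    model = IsolationForest(contamination=contamination, random_state=random_state)
    return model.fit_predict(x.reshape(-1, 1)) == -1


def vote_outliers(x, min_votes=2):
    """Indices flagged by at least `min_votes` of the three detectors."""
    votes = (
        zscore_flags(x).astype(int)
        + iqr_flags(x).astype(int)
        + iforest_flags(x).astype(int)
    )
    return np.where(votes >= min_votes)[0]


# Synthetic 1-D sample with two injected outliers at indices 1000 and 1001
rng = np.random.default_rng(42)
sample = np.append(rng.normal(0, 1, 1000), [5.0, -5.0])

consensus = vote_outliers(sample)
print(f"Indices flagged by at least two detectors: {consensus}")
```

Requiring agreement between detectors trades some recall for fewer false alarms. Lowering `min_votes` to 1 gives the union of all flags instead.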