A Dataset Construction Pipeline for Model Training
In large-model training, dataset construction is one of the key factors determining model performance. This post walks through a complete dataset construction pipeline, with reproducible engineering practice at each step.
1. Data Collection and Initial Assessment
import pandas as pd
import numpy as np

# Load the raw data and take a first look at its size, schema, and distributions.
df = pd.read_csv('raw_data.csv')
print(f'Dataset shape: {df.shape}')
df.info()               # column dtypes and non-null counts (prints directly)
print(df.describe())    # summary statistics for the numeric columns
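Before any cleaning, it is worth quantifying data quality per column rather than eyeballing the info() output. Below is a minimal sketch of such a report; it relies only on the df loaded above and prints whichever columns raw_data.csv happens to contain.

# Per-column quality report: missing rate, cardinality, and dtype.
quality = pd.DataFrame({
    'missing_rate': df.isnull().mean(),
    'n_unique': df.nunique(),
    'dtype': df.dtypes.astype(str),
})
print(quality.sort_values('missing_rate', ascending=False))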
2. Data Cleaning and Preprocessing
# Fill missing values: median for numeric columns, a sentinel value for the rest.
missing_cols = df.columns[df.isnull().any()]
for col in missing_cols:
    if df[col].dtype in ['int64', 'float64']:
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna('Unknown')

# Drop exact duplicate rows.
df.drop_duplicates(inplace=True)

# Clip outliers to the 1.5 * IQR fences. The target column is excluded so
# that labels are never clipped here or scaled along with the features in step 3.
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('target', errors='ignore')
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
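A quick sanity check after cleaning catches silent mistakes early. This is a minimal verification sketch, assuming the goal of the steps above is that no missing values or duplicate rows remain:

# Verify that the cleaning steps actually took effect.
assert df.isnull().sum().sum() == 0, 'missing values remain'
assert df.duplicated().sum() == 0, 'duplicate rows remain'
print(f'Rows after cleaning: {len(df)}')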
3. Feature Engineering
# Extract TF-IDF features from the text column.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
text_features = vectorizer.fit_transform(df['text_column'])

# Standardize the numeric feature columns (the target was already excluded above).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_features = scaler.fit_transform(df[numeric_cols])

# Combine sparse text features with dense numeric features. CSR format
# supports the row indexing that train_test_split performs in step 4.
from scipy.sparse import hstack
final_features = hstack([text_features, numeric_features], format='csr')
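The fitted vectorizer and scaler define the feature space, so they should be persisted alongside the data; otherwise the identical transform cannot be reproduced for new data at inference time. A short sketch using the same joblib library as step 4 (the file names here are illustrative):

import joblib
# Persist the fitted transformers. New data must go through
# .transform() with these exact objects, never a fresh .fit_transform().
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(scaler, 'numeric_scaler.pkl')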
4. Dataset Splitting and Saving
from sklearn.model_selection import train_test_split
# Hold out 20% as a test set; fix random_state for reproducibility.
# (For classification, adding stratify=df['target'] keeps class ratios balanced.)
X_train, X_test, y_train, y_test = train_test_split(
    final_features, df['target'], test_size=0.2, random_state=42
)

# Save the processed splits for training and later evaluation.
import joblib
joblib.dump(X_train, 'train_features.pkl')
joblib.dump(y_train, 'train_labels.pkl')
joblib.dump(X_test, 'test_features.pkl')
joblib.dump(y_test, 'test_labels.pkl')
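To confirm the saved artifacts are usable end to end, it helps to round-trip them through a simple baseline. A minimal sketch, assuming a classification target; LogisticRegression is an illustrative choice, not part of the pipeline above:

import joblib
from sklearn.linear_model import LogisticRegression

# Reload the persisted training split and fit a quick baseline.
X_train = joblib.load('train_features.pkl')
y_train = joblib.load('train_labels.pkl')
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print(f'Baseline train accuracy: {baseline.score(X_train, y_train):.3f}')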
This pipeline safeguards data quality and lays a solid foundation for the model training that follows.
