Continuous Integration Practices for Data Cleaning Pipelines
In large-model training, data quality directly affects model performance. This post shares a reproducible continuous integration setup for a data cleaning pipeline.
Cleaning Pipeline Design
The script below loads the raw dataset, flags outliers with the IQR rule, handles missing values, and standardizes the numeric features.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1. Load the data and run basic checks
raw_data = pd.read_csv('raw_dataset.csv')
print(f'Raw data shape: {raw_data.shape}')
raw_data.info()

# 2. Outlier detection with the 1.5 * IQR rule
numeric_columns = raw_data.select_dtypes(include=[np.number]).columns
for col in numeric_columns:
    Q1 = raw_data[col].quantile(0.25)
    Q3 = raw_data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = raw_data[(raw_data[col] < lower_bound) | (raw_data[col] > upper_bound)]
    print(f'{col} outlier count: {len(outliers)}')

# 3. Missing value handling: report, then impute numeric columns with the median
missing_data = raw_data.isnull().sum()
print(missing_data[missing_data > 0])
raw_data[numeric_columns] = raw_data[numeric_columns].fillna(raw_data[numeric_columns].median())

# 4. Standardize numeric features (StandardScaler rejects NaNs, hence the imputation above)
scaler = StandardScaler()
raw_data[numeric_columns] = scaler.fit_transform(raw_data[numeric_columns])

# 5. Persist the result under the name the CI workflow expects to upload
raw_data.to_csv('cleaned_dataset.csv', index=False)
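
To make data quality regressions fail the build rather than just appear in logs, the cleaning step can be paired with assertion-based checks. Below is a minimal sketch of such a test module, assuming a pytest-style runner; the file name test_data_quality.py, the tolerance values, and the specific checks are illustrative assumptions, not part of the original pipeline.

# test_data_quality.py - illustrative sketch; file name, tolerances,
# and checks are assumptions layered on top of the cleaning script above.
import numpy as np
import pandas as pd

def test_cleaned_dataset_quality():
    df = pd.read_csv('cleaned_dataset.csv')

    # The cleaned file must not be empty.
    assert len(df) > 0, 'cleaned dataset is empty'

    numeric_columns = df.select_dtypes(include=[np.number]).columns

    # Median imputation should have removed all numeric missing values.
    assert df[numeric_columns].isnull().sum().sum() == 0, 'missing values remain'

    # After StandardScaler, each numeric column should be roughly zero-mean
    # and unit-variance (loose tolerances, since pandas uses ddof=1 while
    # StandardScaler normalizes with the population standard deviation).
    assert df[numeric_columns].mean().abs().max() < 0.1, 'column means not near 0'
    assert (df[numeric_columns].std() - 1.0).abs().max() < 0.2, 'column stds not near 1'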
Continuous Integration Implementation
A GitHub Actions workflow runs the cleaning script on every push and pull request and publishes the cleaned dataset as a build artifact.
# .github/workflows/data_pipeline.yml
name: Data Cleaning Pipeline
on: [push, pull_request]
jobs:
  clean-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install pandas numpy scikit-learn
      - name: Run cleaning script
        run: python data_cleaning.py
      - name: Upload cleaned data
        uses: actions/upload-artifact@v3
        with:
          name: cleaned-dataset
          path: cleaned_dataset.csv
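
If quality checks should gate the artifact upload, one option (an assumption, not something the workflow above already does) is a small gate script invoked between the cleaning and upload steps, e.g. via an extra step such as run: python validate_cleaned.py; GitHub Actions fails the job on any non-zero exit code. A minimal sketch, assuming the hypothetical file name validate_cleaned.py:

# validate_cleaned.py - hypothetical gate script; the file name and the
# specific checks are illustrative assumptions, not part of the workflow above.
import sys
import pandas as pd

def main() -> int:
    df = pd.read_csv('cleaned_dataset.csv')

    errors = []
    if df.empty:
        errors.append('cleaned dataset is empty')
    if df.isnull().any().any():
        errors.append('cleaned dataset still contains missing values')

    for err in errors:
        print(f'ERROR: {err}', file=sys.stderr)

    # A non-zero exit code fails the GitHub Actions step, blocking the upload.
    return 1 if errors else 0

if __name__ == '__main__':
    sys.exit(main())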
With this setup, data cleaning runs automatically after every commit, which helps safeguard the quality of the training data.
