Notes on Using Data Cleaning Tools

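When training large models, data quality directly affects model performance. This post shares a few practical data cleaning tools and methods.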

1. Basic data cleaning with pandas

import pandas as pd
import numpy as np

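# Load the dataset
df = pd.read_csv('dataset.csv')
# Count missing values per column
print(df.isnull().sum())
# Option A: drop every row that contains a missing value
# df = df.dropna()
# Option B: fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))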
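
Dropping rows and filling values are alternatives: once the rows with missing values have been dropped, there is nothing left to fill. As a minimal sketch of the filling route, the helper below (fill_missing is a hypothetical name, and the toy DataFrame is made-up illustration data) fills numeric columns with the column mean and, as an added assumption beyond the snippet above, fills non-numeric columns with the most frequent value.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: column mean for numeric, mode for the rest."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # Numeric columns: fill each column with its own mean
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    # Non-numeric columns: fill with the most frequent value, if one exists
    for col in out.columns.difference(numeric_cols):
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
    return out


# Made-up data, for illustration only
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Paris", None, "Paris"],
})
filled = fill_missing(toy)
print(filled)
```

Working on a copy keeps the original frame untouched, so the cleaned result can be compared against the raw data before committing to a strategy.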

2. Text cleaning with regular expressions

import re

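def clean_text(text):
    # Missing values (NaN) are not strings; map them to an empty string
    if not isinstance(text, str):
        return ''
    # Keep only ASCII letters, digits, and whitespace (this also removes non-ASCII text such as Chinese)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text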

df['cleaned_text'] = df['raw_text'].apply(clean_text)
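
Note that the character class in clean_text keeps only ASCII letters and digits, so it would delete Chinese or other non-English text entirely. If the corpus is multilingual, a Unicode-aware variant may fit better. The sketch below is one possible approach, not the only one; clean_text_unicode is a hypothetical name, and the NFKC normalization step is an added assumption that folds full-width characters into their standard forms.

```python
import re
import unicodedata


def clean_text_unicode(text):
    """Strip punctuation and symbols while keeping letters and digits from any script."""
    # Non-string inputs (e.g. NaN from pandas) become an empty string
    if not isinstance(text, str):
        return ""
    # Fold compatibility forms, e.g. full-width letters and punctuation
    text = unicodedata.normalize("NFKC", text)
    # \w matches Unicode letters, digits, and underscore; drop everything else
    # except whitespace, and drop the underscore explicitly
    text = re.sub(r"[^\w\s]|_", "", text)
    # Collapse runs of whitespace into a single space and trim both ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_unicode("  Hello,   世界! 123 "))  # prints "Hello 世界 123"
```

It can be applied the same way as the original function, for example df['raw_text'].apply(clean_text_unicode).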

3. Standardizing data during feature engineering

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardize the numeric features (zero mean, unit variance per column)
numeric_features = ['age', 'income', 'score']
df[numeric_features] = scaler.fit_transform(df[numeric_features])
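
To make clear what standardization computes, here is a small standard-library sketch of the per-column formula z = (x - mean) / std, where std is the population standard deviation (the convention StandardScaler uses). The standardize helper and the sample scores are hypothetical, for illustration only; returning zeros for a constant column is a choice made here to avoid division by zero.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores for one column: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    # A constant column has zero spread; return zeros instead of dividing by zero
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up scores with mean 5 and population standard deviation 2
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(scores)
print(z)
```

After the transform, the column has mean 0 and standard deviation 1, so features measured on very different scales (such as age and income) become comparable.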

These methods can noticeably improve data quality in real projects. Choose the cleaning strategy that fits your specific scenario.