Data Cleaning Pipeline for Multimodal Large Model Training
In multimodal large model training, data quality directly determines model performance. The following is a reproducible data cleaning pipeline:
1. Data preprocessing stage
import os
import pandas as pd
import numpy as np
from PIL import Image

# Load the raw dataset
df = pd.read_csv('multimodal_dataset.csv')

# Image quality check: keep only images of at least 224x224 pixels
def check_image_quality(image_path):
    try:
        with Image.open(image_path) as img:
            return img.size[0] >= 224 and img.size[1] >= 224
    except Exception:  # unreadable or corrupted image files fail the check
        return False

# Text quality check: drop empty, missing, or very short captions
def check_text_quality(text):
    if not isinstance(text, str) or len(text.strip()) < 5:
        return False
    return True
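The two checks can then be applied column-wise to produce the boolean flags that the final export step filters on. This is a minimal sketch: the `image_path` column is taken from the dataset loading step above, while the `text` column name is an assumption, since the original does not name the caption column.

# Apply the quality checks (assumes captions live in a 'text' column)
df['image_quality'] = df['image_path'].apply(check_image_quality)
df['text_quality'] = df['text'].apply(check_text_quality)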
2. Data deduplication
# Exact-duplicate removal based on image file hashes
import hashlib

def calculate_image_hash(image_path):
    with open(image_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Compute a hash for every image
df['hash'] = df['image_path'].apply(calculate_image_hash)

# Drop rows whose image has already been seen
df_cleaned = df.drop_duplicates(subset=['hash'], keep='first')
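MD5 hashing only catches byte-identical files; a resized or re-encoded copy of the same image slips through. If near-duplicates matter, an extra pass with a perceptual hash can catch them. The sketch below uses the third-party imagehash package, which is an addition and not part of the original pipeline:

# Optional: near-duplicate removal with a perceptual hash (requires `pip install imagehash`)
import imagehash
from PIL import Image

def calculate_phash(image_path):
    with Image.open(image_path) as img:
        return str(imagehash.phash(img))  # 64-bit perceptual hash as a hex string

df_cleaned['phash'] = df_cleaned['image_path'].apply(calculate_phash)
df_cleaned = df_cleaned.drop_duplicates(subset=['phash'], keep='first')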
3. Multimodal consistency check
# Extract text features with a pretrained language model, to be aligned against image features
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def get_text_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # mean pooling over tokens

# Verify text-image agreement (see the CLIP-based sketch below)
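The BERT embedding above only covers the text side, so on its own it cannot score whether a caption actually matches its image. One way to perform the matching check is with a joint vision-language model such as CLIP. The following is an illustrative sketch under that assumption; the openai/clip-vit-base-patch32 checkpoint, the 0.2 threshold, and the text column name are not from the original pipeline.

# Score text-image agreement with CLIP (checkpoint and threshold are illustrative assumptions)
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

def text_image_similarity(text, image_path):
    image = Image.open(image_path).convert('RGB')
    inputs = clip_processor(text=[text], images=image,
                            return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # Cosine similarity between the (already normalized) text and image embeddings
    return torch.nn.functional.cosine_similarity(
        outputs.text_embeds, outputs.image_embeds
    ).item()

# Keep only pairs whose similarity clears a hypothetical threshold of 0.2
df_cleaned['clip_score'] = df_cleaned.apply(
    lambda row: text_image_similarity(row['text'], row['image_path']), axis=1
)
df_cleaned = df_cleaned[df_cleaned['clip_score'] >= 0.2]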
4. Final data export
# Filter out samples that failed either quality check
final_df = df_cleaned[
    df_cleaned['image_quality'] & df_cleaned['text_quality']
]
final_df.to_csv('cleaned_multimodal_data.csv', index=False)
This pipeline keeps the training data high-quality and consistent, laying the groundwork for the subsequent joint training.
