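Cleaning and Noise-Handling Pipeline for Multimodal Training Data
When training large multimodal models, data quality directly affects model performance. This article walks through a cleaning and noise-handling pipeline for image-text pairs.
Data Preprocessing Stage
Start with basic data cleaning: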
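    import pandas as pd
    from PIL import Image

    def clean_image_text_pairs(df):
        # Drop rows with a missing image path or caption
        df = df.dropna(subset=['image_path', 'text']).copy()

        # Filter out images that are too small or cannot be opened
        def check_image_size(path):
            try:
                # The context manager closes the file handle after reading the size
                with Image.open(path) as img:
                    width, height = img.size
                return width > 100 and height > 100
            except Exception:
                # Missing, unreadable, or corrupt file: treat as invalid
                return False

        df['valid_image'] = df['image_path'].apply(check_image_size)
        df = df[df['valid_image']].copy()

        # Keep captions longer than 10 and shorter than 500 characters
        df['text_length'] = df['text'].str.len()
        df = df[(df['text_length'] > 10) & (df['text_length'] < 500)]
        return df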
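The check above tests image dimensions only. Where the file format also matters, a cheap header check can run before any decoding. The following is a minimal sketch using only the standard library; the function names and the `allowed` policy are hypothetical and not part of the original pipeline.

```python
# Leading bytes ("magic numbers") of a few common image formats.
_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}

def sniff_image_format(path):
    """Return 'jpeg', 'png', 'gif', 'webp', or None, based on the file header."""
    try:
        with open(path, "rb") as f:
            header = f.read(12)
    except OSError:
        # Missing or unreadable file
        return None
    for name, magic in _SIGNATURES.items():
        if header.startswith(magic):
            return name
    # WebP is a RIFF container whose form type at bytes 8-11 is 'WEBP'.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None

def has_allowed_format(path, allowed=("jpeg", "png")):
    # `allowed` is a hypothetical policy; adjust it to what the training loader supports.
    return sniff_image_format(path) in allowed
```

One way to use it would be `df = df[df['image_path'].apply(has_allowed_format)]` ahead of the size check. A header match does not prove the file decodes correctly, so this complements the PIL-based check rather than replacing it.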
Noise Detection and Handling
Use the CLIP model to check semantic consistency between each image and its text:
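    from PIL import Image
    from transformers import CLIPProcessor, CLIPModel
    import torch

    class NoiseDetector:
        def __init__(self, model_name="openai/clip-vit-base-patch32"):
            self.model = CLIPModel.from_pretrained(model_name)
            self.processor = CLIPProcessor.from_pretrained(model_name)

        def detect_inconsistency(self, image_paths, texts, threshold=0.1):
            # Compute the image-text cosine similarity for each pair
            similarities = []
            for img_path, text in zip(image_paths, texts):
                with Image.open(img_path) as img:
                    # truncation keeps long captions within CLIP's 77-token limit
                    inputs = self.processor(text=[text], images=img.convert("RGB"),
                                            return_tensors="pt", padding=True, truncation=True)
                with torch.no_grad():
                    outputs = self.model(**inputs)
                # image_embeds and text_embeds are L2-normalized, so their dot product
                # is the cosine similarity. logits_per_image is this value scaled by a
                # learned temperature, so it is not comparable to a threshold like 0.1.
                similarity = (outputs.image_embeds * outputs.text_embeds).sum(dim=-1).item()
                similarities.append(similarity)
            # Threshold filter: pairs whose similarity is below the threshold are treated as noise
            return [i for i, sim in enumerate(similarities) if sim < threshold]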
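The threshold of 0.1 is a fixed value, and a suitable cutoff likely depends on the dataset and the model. A minimal sketch for seeing how many pairs each candidate threshold would flag is shown below. It assumes the per-pair scores are kept, which would require a small change to `detect_inconsistency` since it returns indices only. The helper name `sweep_thresholds` and the example scores are made up for illustration.

```python
def sweep_thresholds(similarities, thresholds):
    """For each candidate threshold, count the pairs that would be flagged as noise."""
    total = len(similarities)
    report = []
    for t in thresholds:
        # Same rule as the detector: a score below the threshold counts as noise
        flagged = sum(1 for s in similarities if s < t)
        fraction = flagged / total if total else 0.0
        report.append((t, flagged, fraction))
    return report

# Made-up scores for illustration only; real values would come from the detector's loop.
example_scores = [0.05, 0.12, 0.22, 0.27, 0.31, 0.08]
for t, flagged, fraction in sweep_thresholds(example_scores, [0.1, 0.2, 0.3]):
    print(f"threshold={t:.2f} flagged={flagged} ({fraction:.0%})")
```

Inspecting a handful of flagged pairs at each candidate value is a simple way to judge whether the cutoff removes genuine mismatches or valid but unusual captions.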
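Reproducible steps:
- Load the data and apply basic cleaning
- Check image quality (size, format)
- Filter by text length
- Check semantic consistency
- Remove noisy samples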
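To keep these steps reproducible, it can help to run them in a fixed order and record how many samples survive each one. The sketch below shows one possible shape for that, using plain dictionaries and stand-in predicates. `run_pipeline`, the step names, and the precomputed `clip_score` field are all hypothetical; the image-quality step is omitted because it needs real files.

```python
def run_pipeline(records, steps):
    """Apply named filter steps in order and log how many records each one keeps.

    records: list of dicts such as {"image_path": ..., "text": ...}
    steps:   list of (name, predicate) pairs; a record is kept when predicate(record) is True
    """
    log = [("loaded", len(records))]
    for name, keep in steps:
        records = [r for r in records if keep(r)]
        log.append((name, len(records)))
    return records, log

# Stand-in predicates for illustration; a real run would plug in the image
# and CLIP checks described in this article.
steps = [
    ("non_empty", lambda r: bool(r.get("image_path")) and bool(r.get("text"))),
    ("text_length", lambda r: 10 < len(r["text"]) < 500),
    ("semantic_consistency", lambda r: r.get("clip_score", 0.0) >= 0.1),
]

# Made-up records; clip_score is assumed to be computed beforehand.
records = [
    {"image_path": "a.jpg", "text": "a brown dog running on the beach", "clip_score": 0.29},
    {"image_path": "b.jpg", "text": "dog", "clip_score": 0.31},
    {"image_path": None, "text": "a caption without an image", "clip_score": 0.0},
    {"image_path": "c.jpg", "text": "quarterly revenue chart for 2020", "clip_score": 0.04},
]

cleaned, log = run_pipeline(records, steps)
for name, count in log:
    print(f"{name}: {count}")
```

Saving the per-step counts alongside the cleaned dataset makes it easier to compare runs and to spot a step that removes far more samples than expected.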
The final output is the cleaned dataset, ready for the subsequent joint training.

Discussion