PyTorch Distributed Training: Environment Setup and Performance Testing
I recently hit quite a few snags setting up a PyTorch distributed training environment, so here is a record of the full workflow and the benchmark results.
Environment Setup
Hardware and software: PyTorch 2.0 + CUDA 11.8 on 4x RTX 3090 GPUs. First, install the dependencies:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Distributed Training Code Example
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # NCCL rendezvous: with mp.spawn, every process must agree on the
    # master address/port, so set them before init_process_group.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)
    model = torch.nn.Linear(1000, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # Dummy data
    data = torch.randn(1000, 1000).to(rank)
    target = torch.randint(0, 10, (1000,)).to(rank)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()
    # Training loop
    for epoch in range(5):
        optimizer.zero_grad()
        output = ddp_model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item()}")
    cleanup()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
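As a side note, the same script can usually be launched with torchrun instead of mp.spawn: torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in each process's environment, so setup() no longer needs to export them. The file name train_ddp.py below is a placeholder for wherever you saved the code above; this is a sketch of the launch command, not something tested in this post:

```shell
# Single-node launch across 4 GPUs; "train_ddp.py" is a hypothetical
# file name for the DDP script shown above.
torchrun --standalone --nproc_per_node=4 train_ddp.py
```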
Benchmark Results
Under the same configuration, single-GPU training took about 120 s, while 4-GPU distributed training took about 35 s, a speedup of roughly 3.4x. Keep in mind that communication overhead grows with model complexity, so the speedup will shrink for larger models.
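The speedup figure follows directly from the two timings above; a quick back-of-the-envelope check, using the measured numbers:

```python
# Speedup and parallel efficiency from the timings reported above.
single_gpu_s = 120.0   # measured single-GPU wall time
multi_gpu_s = 35.0     # measured 4-GPU wall time
num_gpus = 4

speedup = single_gpu_s / multi_gpu_s   # ~3.43x
efficiency = speedup / num_gpus        # ~86% of ideal linear scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

An efficiency of roughly 86% is typical for a small model on a single node, where the all-reduce of gradients is cheap relative to compute.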
Pitfalls
- Always use torch.nn.parallel.DistributedDataParallel, not DataParallel
- Split the training data by rank (e.g. with DistributedSampler)
- Distributed initialization must run inside every process
- Make sure all GPUs have the same amount of free memory to avoid OOM
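On the data-splitting point: torch.utils.data.DistributedSampler is the standard tool, and conceptually it just deals dataset indices out across ranks. A minimal pure-Python sketch of that partitioning (simplified: no shuffling, and no padding of the last incomplete round, both of which the real sampler handles):

```python
# Simplified sketch of rank-based index sharding, i.e. the core idea
# behind torch.utils.data.DistributedSampler.
def shard_indices(num_samples, rank, world_size):
    # Each rank takes every world_size-th index, starting at its rank,
    # so the shards are disjoint and cover the whole dataset.
    return list(range(rank, num_samples, world_size))

# With 10 samples and 4 ranks:
for r in range(4):
    print(f"rank {r}: {shard_indices(10, r, 4)}")
# rank 0 gets [0, 4, 8], rank 1 gets [1, 5, 9], etc.
```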

Discussion