Visualizing the PyTorch Training Process
In distributed training, real-time monitoring of the training process is essential for performance tuning. This article shows how to use TensorBoard and torch.utils.tensorboard to visualize a PyTorch distributed training run.
Environment Setup
pip install torch torchvision tensorboard
Core Configuration Code
import torch
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter
import os

class DistributedTrainer:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size
        # Create the TensorBoard writer only on the main process (rank 0)
        if rank == 0:
            self.writer = SummaryWriter('./logs/run_{}'.format(os.getpid()))
        else:
            self.writer = None

    def log_metrics(self, metrics_dict, step):
        # Non-zero ranks have no writer, so this is a no-op for them
        if self.writer is not None:
            for key, value in metrics_dict.items():
                self.writer.add_scalar(key, value, step)

    def close(self):
        if self.writer is not None:
            self.writer.close()
# Use inside the training loop
trainer = DistributedTrainer(rank=0, world_size=4)
for epoch in range(100):
    # training code ...
    metrics = {
        'loss': loss.item(),
        'accuracy': accuracy,
        'lr': optimizer.param_groups[0]['lr']
    }
    trainer.log_metrics(metrics, epoch)
trainer.close()  # release the writer when training finishes
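The snippet above hard-codes rank=0 for brevity. In an actual torch.distributed job the rank should come from the process group; the sketch below is an assumed setup (not part of the original code) for a job launched with torchrun, which provides RANK and WORLD_SIZE through the environment:

import torch.distributed as dist

def create_trainer():
    # torchrun exports RANK/WORLD_SIZE/MASTER_ADDR, so the default env:// init works
    dist.init_process_group(backend='nccl')
    return DistributedTrainer(rank=dist.get_rank(),
                              world_size=dist.get_world_size())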
Horovod Integration Example
import horovod.torch as hvd
from torch.utils.tensorboard import SummaryWriter

hvd.init()
rank = hvd.rank()
if rank == 0:
    writer = SummaryWriter('./logs/horovod_run')

# Inside the training loop
if hvd.rank() == 0:
    writer.add_scalar('loss', loss.item(), global_step)
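Because only rank 0 writes to TensorBoard, the logged loss reflects a single worker. If the curve should represent the whole job, one option (an assumption, not part of the original example) is to average the metric across workers with hvd.allreduce before logging:

# Average the loss over all Horovod workers (allreduce defaults to averaging)
avg_loss = hvd.allreduce(loss.detach(), name='avg_loss')
if hvd.rank() == 0:
    writer.add_scalar('loss/avg_across_workers', avg_loss.item(), global_step)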
Launching the Visualization
# Run on the node(s) where the log directory is written (rank 0 in the setup above)
tensorboard --logdir=./logs --port=6006
Performance Tuning Tips
- Avoid frequent writes: log roughly once every 100 batches instead of every step, since frequent small writes add I/O overhead (see the sketch after this list)
- Separate by rank: if more than one process logs, give each rank its own log directory so curves do not overwrite each other
- Resource monitoring: combine nvidia-smi with torch.cuda.memory_stats() to track GPU usage (also illustrated below)
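To make the first and third tips concrete, here is a minimal sketch; the helper name maybe_log and the 100-batch interval are illustrative assumptions, and it reuses the DistributedTrainer defined earlier:

import torch

LOG_EVERY = 100  # illustrative interval; tune for your workload

def maybe_log(trainer, metrics, global_step):
    # Throttle writes: skip all but every LOG_EVERY-th batch to limit I/O overhead
    if global_step % LOG_EVERY != 0:
        return
    if torch.cuda.is_available():
        # Log GPU memory (bytes -> MiB) alongside the training metrics
        metrics['gpu/mem_allocated_mb'] = torch.cuda.memory_allocated() / 1024**2
        metrics['gpu/mem_reserved_mb'] = torch.cuda.memory_reserved() / 1024**2
    trainer.log_metrics(metrics, global_step)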
