Problem description:
When running Python on Windows, the call dist.init_process_group(backend, rank, world_size) fails with "RuntimeError: Distributed package doesn't have NCCL built in". The full traceback is:
File "D:\Software\Anaconda\Anaconda3\envs\segmenter\lib\site-packages\torch\distributed\distributed_c10d.py", line 531, in init_process_group
timeout=timeout)
File "D:\Software\Anaconda\Anaconda3\envs\segmenter\lib\site-packages\torch\distributed\distributed_c10d.py", line 625, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
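Cause analysis:
The NCCL backend is not supported on Windows, so the PyTorch build installed there does not include it.
Solution:
Set backend='gloo' before the dist.init_process_group call and pass it in, i.e. use Gloo instead of NCCL on Windows.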
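The fix above can be sketched roughly as follows. This is a minimal illustration, not the original script: the helper names pick_backend and init_distributed are hypothetical, and the sketch assumes the default env:// initialization, which expects MASTER_ADDR and MASTER_PORT to be set in the environment.

```python
import sys


def pick_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Hypothetical helper: return 'nccl' only when it can be used, else 'gloo'."""
    # Fall back to Gloo on Windows, and on any build that was compiled without NCCL.
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def init_distributed(rank: int, world_size: int) -> str:
    """Hypothetical wrapper around dist.init_process_group with a safe backend."""
    # Imported lazily so pick_backend can be used without PyTorch installed.
    import torch.distributed as dist

    backend = pick_backend(nccl_available=dist.is_nccl_available())
    # Assumes env:// initialization: MASTER_ADDR and MASTER_PORT must be set.
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return backend
```

Selecting the backend at runtime, rather than hard-coding 'gloo', lets the same script keep using NCCL on Linux machines with GPUs while still starting on Windows.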




