torch.distributed.elastic: unstable PyTorch training

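While training a deep learning model, the run died with the error log below. Epoch 229 had just finished (total time 0:17:21) and the test phase had started, with top-1 and top-5 accuracy of 78.08% and 95.21% on the first test batch. The elastic agent then logged "Received 1 death signal" and shut down the workers. The warnings come from the torch.distributed.elastic module: each worker process was sent SIGHUP (signal number 1 on Linux) and terminated. The fixes suggested online mostly concern process management and error-handling strategy.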

Error log:

Epoch: [229] Total time: 0:17:21
Test:   [ 0/49]  eta: 0:05:00  loss: 1.7994 (1.7994)  acc1: 78.0822 (78.0822)  acc5: 95.2055 (95.2055)  time: 6.1368  data: 5.9411  max mem: 10624
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44348 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44349 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44354 closing signal SIGHUP
Traceback (
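
One process-management approach can be sketched as follows. This is a minimal sketch, assuming the SIGHUP came from a closed terminal or SSH session, which the log is consistent with but does not confirm. The helper name `launch_detached` and the log path are hypothetical. The idea is to start the launcher in its own session, so that a hangup on the original terminal is not delivered to it, and to send its output to a file instead of the terminal.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in a new session (POSIX only) with output appended to `log_path`.

    Hypothetical helper: a process in its own session has no controlling
    terminal, so closing the terminal that started it does not send it SIGHUP.
    """
    with open(log_path, "ab") as log_file:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,      # do not read from the terminal
            stdout=log_file,               # keep the training log in a file
            stderr=subprocess.STDOUT,      # merge stderr into the same file
            start_new_session=True,        # child calls setsid() before exec
        )
    # The child holds its own copy of the file descriptor, so closing ours is safe.
    return proc
```

A call might look like `launch_detached(["torchrun", "--nproc_per_node=8", "train.py"], "train.log")`, where the script name and process count are placeholders. Running the job under `nohup`, `tmux`, or `screen` is the more common way to get the same effect from the shell.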