Error log:
Epoch: [229] Total time: 0:17:21
Test: [ 0/49] eta: 0:05:00 loss: 1.7994 (1.7994) acc1: 78.0822 (78.0822) acc5: 95.2055 (95.2055) time: 6.1368 data: 5.9411 max mem: 10624
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44348 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44349 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44354 closing signal SIGHUP
Traceback (

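While training a deep learning model, the run died with the log above. Epoch 229 had just finished (total epoch time 0:17:21) and evaluation had started: on the first of 49 test batches, top-1 accuracy (acc1) was 78.08% and top-5 accuracy (acc5) was 95.21%. The torch.distributed.elastic agent then received a death signal and sent SIGHUP to each worker process, shutting them down. SIGHUP is typically delivered when the controlling terminal or SSH session closes, so the fixes suggested online mostly concern process management and error handling, for example keeping the job alive independently of the terminal with a tool such as tmux or screen, rather than changes to the training code itself.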
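One possible process-management approach, sketched below, is to start the training command in its own session so that a hangup of the launching terminal does not reach it. This assumes the SIGHUP came from the terminal or SSH session closing, which the log itself does not confirm. `launch_detached` and the log path are hypothetical names chosen for illustration, and `start_new_session=True` is POSIX-only.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in a new session and send its output to `log_path`.

    start_new_session=True makes the child call setsid() before exec, so it
    no longer shares the controlling terminal of the launching shell.
    """
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no input from the terminal
            stdout=log_file,            # output goes to a file, not the terminal
            stderr=subprocess.STDOUT,   # merge stderr into the same log
            start_new_session=True,     # child becomes leader of a new session
        )
    finally:
        # The child holds its own copy of the descriptor; close the parent's.
        log_file.close()
    return proc
```

A call might look like `launch_detached(["torchrun", "train.py"], "train.log")`, where the command line is a placeholder for whatever actually starts the job. Running the job inside tmux or screen achieves a similar separation without any extra code.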




