Explaining the Keras (tf) error: "BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm"

Latest recommended article published on 2023-09-19 22:55:25

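This article describes the "BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm" error encountered when running the VGG16 model with Keras (on TensorFlow). The error is usually caused by poor GPU memory allocation. The fix is to limit, at the start of the program, the fraction of GPU memory that Keras may claim, so that it does not grab too much at once and later operations still have enough resources. From experiments, the author found that with the limit set, the model ran to completion without the error, even though its memory use briefly exceeded the limit during the run.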

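Today I hit this error while running a model with Keras's built-in VGG16. After ruling out version problems with CUDA and the rest of the environment, suspicion fell on badly handled GPU memory allocation. (My machine has just one low-end card with 4 GB of memory.)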

2020-05-08 00:59:24.206906: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-05-08 00:59:24.207493: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-05-08 00:59:24.207802: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. 
This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
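
The telling line in this log is `CUDNN_STATUS_ALLOC_FAILED`: cuDNN could not allocate the resources for its handle, and the convolution error that follows is a consequence of that. As a rough sketch only (the helper name `looks_like_gpu_memory_problem` and the shortened `sample_log` are made up for illustration and are not part of TensorFlow), a captured log can be checked for that marker like this:

```python
def looks_like_gpu_memory_problem(log_text: str) -> bool:
    """Return True if the log contains the cuDNN allocation-failure marker."""
    return "CUDNN_STATUS_ALLOC_FAILED" in log_text


# Shortened copy of the log lines above (timestamps dropped)
sample_log = (
    "E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] "
    "Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED\n"
    "W tensorflow/core/common_runtime/base_collective_executor.cc:217] "
    "BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm."
)

print(looks_like_gpu_memory_problem(sample_log))
```

If the marker is absent, the cause is more likely elsewhere (for example a CUDA/cuDNN version mismatch), and the memory limit described in this post may not help.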

Solution:

Add this code at the start of the program:

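import tensorflow as tf
# Let TensorFlow fall back to another device when an op cannot run on the requested one
config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
# Limit this process to about 30% of each GPU's memory
config.gpu_options.per_process_gpu_memory_fraction = 0.3
# Make Keras use a session created from this config
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))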
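
For code that does not go through the `compat.v1` session API, a similar cap can probably be set with the TF 2.x `tf.config` functions. The sketch below is not what I used above; it assumes a reasonably recent TensorFlow 2.x where `tf.config.set_logical_device_configuration` is available, and the helpers `fraction_to_mb` and `limit_gpu_memory`, along with the default of 4096 MB (my 4 GB card), are illustrative choices. Note that `memory_limit` is given in megabytes and acts as a hard cap, so it may behave differently from the fraction setting.

```python
def fraction_to_mb(total_mb: int, fraction: float) -> int:
    """Convert a memory fraction into a whole number of megabytes."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_mb * fraction)


def limit_gpu_memory(total_mb: int = 4096, fraction: float = 0.3) -> None:
    """Cap TensorFlow's memory on every visible GPU.

    Must be called before any model or tensor is created, because the
    configuration cannot be changed once the GPUs are initialized.
    """
    try:
        # Imported lazily so fraction_to_mb can be used without TensorFlow
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed; nothing to configure")
        return

    limit = fraction_to_mb(total_mb, fraction)
    for gpu in tf.config.list_physical_devices("GPU"):
        # One logical device per physical GPU, limited to `limit` MB
        tf.config.set_logical_device_configuration(
            gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=limit)]
        )
```

Calling `limit_gpu_memory()` as the first statement of the program is the intended usage. Another option is `tf.config.experimental.set_memory_growth(gpu, True)`, which asks TensorFlow to allocate GPU memory on demand instead of applying a fixed cap.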