openvino:https://github.com/openvinotoolkit/openvino
openvino version:[2023.3.0]
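Background
After upgrading from Ubuntu 18 to Ubuntu 22, we found that one model hangs. It reproduces in several environments, including gtest, onenode planner, and orion. It is a CPU model and it hangs during inference. Attaching gdb to the process shows the stack below, but there is no symbol table, so nothing useful can be read from it.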

Unable to reproduce
First, we build openvino inside voy-sdk. voy-sdk handles the tbb dependency: it builds tbb first and then openvino.
The build options used in voy-sdk are:
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DDNNL_GPU_RUNTIME="NONE" -DENABLE_CLDNN=OFF -DENABLE_CPPLINT=OFF -DENABLE_DATA=OFF -DENABLE_INTEL_GNA=OFF -DENABLE_INTEL_GPU=OFF -DENABLE_INTEL_MYRIAD_COMMON=OFF -DENABLE_INTEL_MYRIAD=OFF -DENABLE_NCC_STYLE=OFF -DENABLE_ONEDNN_FOR_GPU=OFF -DENABLE_OPENCV=OFF -DENABLE_PYTHON=OFF -DENABLE_SAMPLES=OFF -DENABLE_TEMPLATE=OFF -DENABLE_TESTS=OFF -DENABLE_OV_PYTORCH_FRONTEND=OFF -DENABLE_OV_TF_FRONTEND=OFF -DENABLE_OV_TF_LITE_FRONTEND=OFF -DENABLE_OV_PADDLE_FRONTEND=OFF -DENABLE_OV_ONNX_FRONTEND=OFF -DTREAT_WARNING_AS_ERROR=OFF -DENABLE_SYSTEM_TBB=ON -DENABLE_INTEL_CPU=ON -DENABLE_JS=OFF ..
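The build type here is Release, and the binaries produced by this build carry no symbol information.
After changing the build type to Debug, the problem mysteriously disappeared.
Likewise, when we installed voy-sdk in the build environment and then built openvino on its own, the problem did not reproduce either.
At this point the investigation was getting frustrating.
The Debug build has symbols but does not reproduce the problem; the Release build reproduces the problem but does not show a usable stack.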
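Hacking the compiler
We searched the openvino build flow for a long time without finding where strip is invoked.
Building with -DCMAKE_EXPORT_COMPILE_COMMANDS=ON produces compile_commands.json, which lists the compile command for every file. Those commands contain "-s". This option is passed to the linker, which then omits the symbol table at link time, so strip is never called and replacing the strip command has no effect.
The solution deepseek finally suggested was:
hack the /usr/bin/c++ command by replacing it with our own bash script that filters the -s argument out of the argument list before calling the real compiler.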

By hacking the compiler this way we got a symbol table in a Release build, and afterwards we also added the -g option (debug info).
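The original wrapper was a bash script and is not reproduced here. Purely as an illustration of the filtering step, and to keep the examples in this post in one language, the same idea could be sketched in C++ as follows. The function name drop_strip_flag is made up for this sketch, and it only handles a standalone -s argument.

```cpp
#include <string>
#include <vector>

// Returns a copy of the compiler arguments with every standalone "-s"
// removed. Flags that merely start with "-s" (for example "-shared" or
// "-std=c++17") are kept, because only exact matches are dropped.
std::vector<std::string> drop_strip_flag(const std::vector<std::string>& args) {
    std::vector<std::string> kept;
    kept.reserve(args.size());
    for (const auto& arg : args) {
        if (arg != "-s")
            kept.push_back(arg);
    }
    return kept;
}
```

A real wrapper would then hand the filtered list to the actual compiler binary; that part is left out of the sketch.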
Locating the problem
With symbols available, the problem could still be reproduced:

We then located the code that corresponds to this stack:

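Here divisor_0 is a very large number while divisor_1 == 0, so the break condition is never met and the for loop just keeps decrementing divisor_0. This is the loop the process is stuck in.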
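To make the symptom concrete, here is a small model of a loop with the shape described above. It is a hypothetical sketch written for this post, not the OpenVINO source: the names search_divisor, SplitSearch, and max_steps are made up, and max_steps exists only so that the demo terminates.

```cpp
#include <cstdint>

struct SplitSearch {
    bool found;             // true if an exact divisor was reached
    std::uint64_t divisor;  // last divisor_0 examined
    std::uint64_t steps;    // loop iterations executed
};

// Hypothetical model of the loop described above, not OpenVINO's code.
// Counts down from upper_bound - 1 looking for a divisor_0 that divides
// m_dim exactly. max_steps caps the work so the demo always terminates.
SplitSearch search_divisor(std::uint64_t m_dim, std::uint64_t upper_bound,
                           std::uint64_t max_steps) {
    SplitSearch r{false, 0, 0};
    for (std::uint64_t divisor_0 = upper_bound - 1; divisor_0 > 1; --divisor_0) {
        if (r.steps == max_steps)
            break;  // demo-only safety cap
        ++r.steps;
        r.divisor = divisor_0;
        // Integer division yields 0 for as long as divisor_0 > m_dim.
        const std::uint64_t divisor_1 = m_dim / divisor_0;
        if (divisor_1 * divisor_0 == m_dim) {
            r.found = true;
            break;
        }
    }
    return r;
}
```

With a sane bound such as search_divisor(64, 128, 1000) the loop stops after a few dozen steps. With a garbage bound such as 1ULL << 47 it would need on the order of 10^14 decrements before divisor_1 becomes non-zero, which looks exactly like a hang.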

The next question is why the variable ends up with such a value. It looks somewhat like uninitialized memory.
Call stack
Following the call stack upwards:
Transformations::MainSnippets
snippetsManager.run_passes(model);
snippets::pass::TokenizeMHASnippets lambda
is_unsupported_parallel_work_amount(lambda)
ov::snippets::pass::SplitDimensionM::can_be_optimized
ov::snippets::pass::SplitDimensionM::split
ov::snippets::pass::SplitDimensionM::get_splited_dimensions
The variable involved, tokenization_config.concurrency, is passed down this path level by level, so we need to find out where along the way it goes wrong.

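At the very beginning of MainSnippets the variable is correct: it equals the number of CPUs on the machine. By the time execution reaches can_be_optimized it is already wrong.
We finally traced it to is_unsupported_parallel_work_amount, which at first looked like a corrupted function pointer. Its address is 0x7fff1b2e44a0; the value at that address was 64 at first and had changed later.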


再查看一遍代码:
void Transformations::MainSnippets(void) {
if (snippetsMode == Config::SnippetsMode::Disable ||
!dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx2)) // snippets are implemented only for relevant platforms (avx2+ extensions)
return;
ov::snippets::pass::SnippetsTokenization::Config tokenization_config;
// [111813]: At the moment Snippets supports Transpose on output of MHA pattern only if it is an one node between MatMul and Result.
// However there may be Convert [f32->bf16] before Result since:
// - bf16 Brgemm has f32 output;
// - CPU Node Subgraph requires bf16 on output when inference precision is bf16.
// To avoid sitations when Transpose is not alone node between MatMul and Result,
// Plugin disables Transpose tokenization on output
tokenization_config.mha_token_enable_transpose_on_output = (inferencePrecision == ov::element::f32);
tokenization_config.concurrency = config.streamExecutorConfig._threadsPerStream;
if (tokenization_config.concurrency == 0)
tokenization_config.concurrency = parallel_get_max_threads();
// The optimization "SplitDimensionM" depends on target machine (thread count).
// To avoid uncontrolled behavior in tests, we disabled the optimization when there is Config::SnippetsMode::IgnoreCallback
tokenization_config.split_m_dimension = snippetsMode != Config::SnippetsMode::IgnoreCallback;
// [122706] Some 3D MHA Patterns have perf regressions when Transpose op is tokenized
tokenization_config.mha_supported_transpose_ranks = { 4 };
ov::pass::Manager snippetsManager;
snippetsManager.set_per_pass_validation(false);
if (snippetsMode != Config::SnippetsMode::IgnoreCallback)
CPU_REGISTER_PASS_X64(snippetsManager, SnippetsMarkSkipped, inferencePrecision != ov::element::f32);
CPU_REGISTER_PASS_X64(snippetsManager, snippets::pass::SnippetsTokenization, tokenization_config);
// - MHA has BRGEMM that is supported only on AVX512 platforms
// - CPU Plugin Subgraph supports only f32, bf16 (and quantized) BRGEMM
// [122494] Need to add support of f16
const bool isMHASupported =
dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core) &&
one_of(inferencePrecision, ov::element::bf16, ov::element::f32);
if (!isMHASupported) {
CPU_DISABLE_PASS_X64(snippetsManager, snippets::pass::TokenizeMHASnippets);
CPU_DISABLE_PASS_X64(snippetsManager, snippets::pass::ExtractReshapesFromMHA);
}
if (snippetsMode != Config::SnippetsMode::IgnoreCallback) {
#if defined(OPENVINO_ARCH_X86_64)
auto is_supported_matmul = [this](const std::shared_ptr<const ov::Node>& n) {
const auto matmul = ov::as_type_ptr<const ov::op::v0::MatMul>(n);
if (!matmul)
return false;
const auto in_type0 = matmul->get_input_element_type(0);
const auto in_type1 = matmul->get_input_element_type(1);
if (in_type0 == ov::element::f32 && in_type1 == ov::element::f32 && inferencePrecision == ov::element::f32)
return true;
// [114487] brgemm kernel in oneDNN requires brgemm_copy_b kernel if MatMul node has transposed_b=True
// The current solution with ExtractExplicitMatMulTranspose pass is slower for non-f32 cases than using of brgemm_copy_b kernel
if (matmul->get_transpose_a() || matmul->get_transpose_b())
return false;
// [115165] At the moment Quantized and BF16 Brgemm doesn't support blocking by K and N.
// Big shapes may lead to perf degradation
const auto K = *(matmul->get_input_partial_shape(0).rbegin());
const auto N = *(matmul->get_input_partial_shape(1).rbegin());
if ((K.is_static() && K.get_length() > 512) || // heuristic values
(N.is_static() && N.get_length() > 256))
return false;
if (in_type0 == ov::element::i8)
return dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core_vnni);
if ((in_type0 == ov::element::bf16 && in_type1 == ov::element::bf16) ||
((in_type0 == element::f32 && in_type1 == ov::element::f32 && inferencePrecision == ov::element::bf16))) {
// Implementation calls AMX BF16 brgemm only for tensors with K and N aligned on 2, otherwise fallbacks on vector impl
// Vector madd BF16 instruction on SPR has reduced performance on HW level, which results in overall perf degradation
size_t bf16Factor = 2;
if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core_amx)) {
return K.is_static() && (K.get_length() % bf16Factor == 0) &&
N.is_static() && (N.get_length() % bf16Factor == 0);
}
return dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core_bf16);
}
return true;
};
auto is_unsupported_parallel_work_amount = [&](const std::shared_ptr<const ov::Node>& n, const ov::Shape& shape) {
const size_t parallel_work_amount = std::accumulate(shape.rbegin() + 2, shape.rend(), 1, std::multiplies<size_t>());
const auto is_unsupported_parallel_work_amount =
parallel_work_amount < tokenization_config.concurrency &&
!ov::snippets::pass::SplitDimensionM::can_be_optimized(n, tokenization_config.concurrency);
return is_unsupported_parallel_work_amount;
};
#endif // OPENVINO_ARCH_X86_64
CPU_SET_CALLBACK_X64(snippetsManager, [&](const std::shared_ptr<const ov::Node>& n) -> bool {
// Tranformation callback is called on MatMul0
if (!is_supported_matmul(n))
return true;
// Search for MatMul1
auto child = n->get_output_target_inputs(0).begin()->get_node()->shared_from_this();
while (!ov::is_type<const ov::op::v0::MatMul>(child)) {
child = child->get_output_target_inputs(0).begin()->get_node()->shared_from_this();
}
if (!is_supported_matmul(child))
return true;
const auto& shape = child->get_input_shape(0);
return is_unsupported_parallel_work_amount(n, shape);
}, snippets::pass::TokenizeMHASnippets);
CPU_SET_CALLBACK_X64(snippetsManager, [&](const std::shared_ptr<const ov::Node>& n) -> bool {
return !is_supported_matmul(n) || is_unsupported_parallel_work_amount(n, n->get_output_shape(0));
}, snippets::pass::ExtractReshapesFromMHA);
CPU_SET_CALLBACK_X64(snippetsManager,
[](const std::shared_ptr<const ov::Node>& n) -> bool {
if (n->is_dynamic())
return true;
// CPU Plugin support Swish in Subgraph via conversion to SwichCPU which assumes second input to be constant
const bool is_unsupported_swish =
ov::is_type<const ov::op::v4::Swish>(n) && n->inputs().size() > 1 &&
!ov::is_type<const ov::op::v0::Constant>(n->get_input_node_shared_ptr(1));
if (is_unsupported_swish)
return true;
// todo: general tokenization flow is not currently supported for these operations.
// they can be tokenized only as a part of complex patterns
const bool is_disabled_tokenization = (ov::is_type<const ov::op::v1::Softmax>(n) ||
ov::is_type<const ov::op::v8::Softmax>(n) ||
ov::is_type<const ov::op::v0::MatMul>(n) ||
ov::is_type<const ov::op::v1::Transpose>(n) ||
ov::is_type<const ov::op::v1::Broadcast>(n) ||
ov::is_type<const ov::op::v3::Broadcast>(n));
if (is_disabled_tokenization)
return true;
const auto& inputs = n->inputs();
// todo: clarify whether we can evaluate snippets on const paths
const bool has_only_const_inputs = std::all_of(inputs.begin(), inputs.end(),
[](const ov::Input<const ov::Node>& in) {
return ov::is_type<ov::op::v0::Constant>(
in.get_source_output().get_node_shared_ptr());
});
if (has_only_const_inputs)
return true;
// todo: clarify whether we can evaluate snippets on inputs with larger ranks
auto rank_is_too_large = [](const ov::descriptor::Tensor& t) {
// callback is called has_supported_in_out(), so it's safe to assume that the shapes are static
return t.get_partial_shape().rank().get_length() > 6;
};
const bool bad_input_rank = std::any_of(inputs.begin(), inputs.end(),
[&](const ov::Input<const ov::Node>& in) {
return rank_is_too_large(in.get_tensor());
});
if (bad_input_rank)
return true;
const auto& outputs = n->outputs();
const bool bad_output_rank = std::any_of(outputs.begin(), outputs.end(),
[&](const ov::Output<const ov::Node>& out) {
return rank_is_too_large(out.get_tensor());
});
if (bad_output_rank)
return true;
return false;
},
snippets::pass::TokenizeSnippets);
}
snippetsManager.run_passes(model);
}
The is_unsupported_parallel_work_amount lambda is defined inside an if block, but it is invoked outside that block, from snippetsManager.run_passes(model);
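Root cause
A lambda expression produces an unnamed closure object. The variables it captures become member variables of that class/struct, and an operator() is generated to perform the call.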
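As a rough illustration of that desugaring, the sketch below places a hand-written functor next to the equivalent lambda. TokenizationConfig here is a simplified stand-in, and IsBelowConcurrency, call_via_struct, and call_via_lambda are names invented for this example. The real closure type is generated by the compiler and may differ in detail.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for the real config type.
struct TokenizationConfig {
    std::size_t concurrency;
};

// Hand-written counterpart of the lambda
//   [&](std::size_t work) { return work < cfg.concurrency; }
struct IsBelowConcurrency {
    const TokenizationConfig& cfg;  // a by-reference capture acts like a reference member
    bool operator()(std::size_t work) const { return work < cfg.concurrency; }
};

bool call_via_struct(const TokenizationConfig& cfg, std::size_t work) {
    IsBelowConcurrency closure{cfg};
    return closure(work);
}

bool call_via_lambda(const TokenizationConfig& cfg, std::size_t work) {
    auto closure = [&](std::size_t w) { return w < cfg.concurrency; };
    return closure(work);
}
```

Both functions behave the same. The point is that the closure is an ordinary object with ordinary storage, so it follows the usual lifetime rules of the scope it is declared in.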
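In this case, is_unsupported_parallel_work_amount is a local variable that holds the closure object on the stack. A stack variable is destroyed when its scope ends, so when the if block ends, the closure object is destroyed and its stack storage is released for reuse.
The callbacks registered on snippetsManager capture it by reference ([&]). When snippetsManager.run_passes(model) later reaches it through two levels of lambda calls, the data of the is_unsupported_parallel_work_amount closure object has already been destroyed. The code itself has not gone anywhere, so the call still goes through, but it operates on invalid data, which leads to the endless loop seen earlier.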
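The ordering of events can be made visible without triggering undefined behaviour. In the sketch below, ScopeTracer is a made-up type that stands in for the closure object and records when it is destroyed; trace_if_block_lifetime and the log messages are likewise invented for illustration.

```cpp
#include <string>
#include <vector>

// Logs a message when it goes out of scope, standing in for the closure
// object whose lifetime we want to observe.
struct ScopeTracer {
    std::vector<std::string>& log;
    ~ScopeTracer() { log.push_back("closure destroyed"); }
};

std::vector<std::string> trace_if_block_lifetime(bool enable_callbacks) {
    std::vector<std::string> log;
    if (enable_callbacks) {
        ScopeTracer closure{log};              // lives only inside this block
        log.push_back("callback registered");  // a callback would keep a reference to `closure`
    }                                          // `closure` is destroyed here
    log.push_back("run_passes called");        // any reference to `closure` dangles by now
    return log;
}
```

The log shows the object being destroyed between registration and the later call, which is the window in which the real callbacks end up reading dead stack memory.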
Fix
Move the definition of is_unsupported_parallel_work_amount out of the if block, so that it lives in the same scope as the snippetsManager.run_passes(model) call.
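A minimal sketch of the fixed shape, under simplifying assumptions: MiniManager is a hypothetical stand-in for the pass manager (it only stores one callback and calls it later), and the lambda body is reduced to a single comparison.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for a pass manager that stores a callback and
// invokes it later, as snippetsManager.run_passes(model) does.
class MiniManager {
public:
    void set_callback(std::function<bool(std::size_t)> cb) { callback_ = std::move(cb); }
    bool run(std::size_t work) const { return callback_ ? callback_(work) : false; }

private:
    std::function<bool(std::size_t)> callback_;
};

bool run_with_outer_scope_lambda(std::size_t concurrency, std::size_t work,
                                 bool enable_callbacks) {
    MiniManager manager;
    // Defined in the same scope as manager.run(), so it outlives every call.
    auto is_unsupported_parallel_work_amount =
        [&](std::size_t w) { return w < concurrency; };
    if (enable_callbacks) {
        manager.set_callback(
            [&](std::size_t w) { return is_unsupported_parallel_work_amount(w); });
    }
    return manager.run(work);  // the captured reference is still valid here
}
```

Another option, not the one used in this fix, would be for the registered callback to capture the inner lambda by value, so that it carries its own copy of the closure.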
The fix was verified to work.
