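Problem and conclusion:
Today one of our services crashed with "Illegal instruction (core dumped)".
The cause turned out to be that the build machine and the runtime machine have different CPUs. The binary was compiled with -march=native on a CPU that supports AVX-512, so the compiler emitted AVX-512 instructions that the runtime CPU cannot execute.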
Runtime machine CPU: AMD Ryzen 5 5600GT
Supported instruction-set flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Build machine CPU: AMD Ryzen 9 9950X
Supported instruction-set flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d
The difference that matters here: the build machine reports avx512f, avx512bw, avx512vl and the other avx512* flags, while the runtime machine reports none of them.
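Comparing two flag lists this long by eye is error-prone. Below is a minimal sketch of the comparison as a set difference. The function names and the two shortened flag strings are illustrative assumptions (each is an excerpt of the lists above); in practice you would paste in the full lines.

```python
def flag_set(flags_line):
    """Split a whitespace-separated CPU flag string into a set of flag names."""
    return set(flags_line.split())


def only_on_build(build_flags, run_flags):
    """Return the flags the build machine has but the runtime machine lacks."""
    return sorted(flag_set(build_flags) - flag_set(run_flags))


# Shortened excerpts of the two lists above (assumption: for illustration only).
RUN_FLAGS = "sse4_2 avx avx2 bmi1 bmi2 fma sha_ni vaes"
BUILD_FLAGS = "sse4_2 avx avx2 bmi1 bmi2 fma sha_ni vaes avx512f avx512bw avx512vl gfni"

if __name__ == "__main__":
    # Flags listed here are candidates for instructions that -march=native
    # may emit on the build machine but that will fault on the runtime machine.
    print(only_on_build(BUILD_FLAGS, RUN_FLAGS))
```

Any flag in the result is something the compiler is allowed to use when targeting the build CPU, so it is a good starting point for narrowing down a SIGILL.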
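Fix:
If the compile command contains -march=native, remove it and pass -march=haswell instead. Haswell is an AVX2-level target without AVX-512, and every feature it needs (avx2, bmi1, bmi2, fma, and so on) appears in the runtime machine's flag list above.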
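Before settling on a -march value, it can help to confirm that the runtime CPU really covers that target. The sketch below does this for haswell. The feature set is taken from what GCC documents for -march=haswell, rewritten with the names Linux uses in its flag list (sse3 appears as pni, lzcnt as abm, bmi as bmi1, sahf as lahf_lm); that name mapping and the helper name are assumptions of this example. HLE, which GCC also lists for haswell, is deliberately left out because the runtime list above does not report it.

```python
# Assumed mapping of the -march=haswell feature set to Linux flag names.
HASWELL_FLAGS = {
    "mmx", "fxsr", "sse", "sse2", "pni", "ssse3", "sse4_1", "sse4_2",
    "popcnt", "cx16", "lahf_lm", "avx", "xsave", "pclmulqdq", "fsgsbase",
    "rdrand", "f16c", "avx2", "bmi1", "bmi2", "abm", "fma", "movbe",
}


def missing_for_target(required, cpu_flags_line):
    """Return the required flags that the given CPU flag string does not contain."""
    return sorted(required - set(cpu_flags_line.split()))


# Excerpt of the runtime machine's flag list above (assumption: shortened).
RUN_FLAGS_EXCERPT = (
    "mmx fxsr sse sse2 pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe "
    "popcnt xsave avx f16c rdrand lahf_lm abm fsgsbase bmi1 avx2 bmi2"
)

if __name__ == "__main__":
    # An empty list means the runtime CPU covers everything in HASWELL_FLAGS.
    print(missing_for_target(HASWELL_FLAGS, RUN_FLAGS_EXCERPT))
```

The same check works for any other target: swap in that target's feature set and the flag line of the oldest CPU you need to run on.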
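Investigation:
Running the program under gdb gives the stack at the crash, but the backtrace by itself does not explain much: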
gdb ./my_exec
(gdb) run
Thread 1 "index" received signal SIGILL, Illegal instruction.
0x00005555566b9e14 in std::__recursive_mutex_base::__recursive_mutex_base (
this=0x555557c95470 <spdlog::details::registry::instance()::s_instance+80>) at /usr/include/c++/13/mutex:81
81 __recursive_mutex_base() = default;
(gdb) bt
#0 0x00005555566b9e14 in std::__recursive_mutex_base::__recursive_mutex_base (
this=0x555557c95470 <spdlog::details::registry::instance()::s_instance+80>) at /usr/include/c++/13/mutex:81
#1 std::recursive_mutex::recursive_mutex (this=0x555557c95470 <spdlog::details::registry::instance()::s_instance+80>)
at /usr/include/c++/13/mutex:111
#2 spdlog::details::registry::registry (this=0x555557c95420 <spdlog::details::registry::instance()::s_instance>)
at thirdparty/spdlog-1.14.1/include/spdlog/details/registry-inl.h:34
#3 0x000055555670ef35 in spdlog::details::registry::instance ()
at thirdparty/spdlog-1.14.1/include/spdlog/details/registry-inl.h:237
#4 0x000055555670efa8 in spdlog::default_logger_raw () at thirdparty/spdlog-1.14.1/include/spdlog/spdlog-inl.h:81
#5 0x0000555556713cac in spdlog::info<char const*> (fmt=...) at thirdparty/spdlog-1.14.1/include/spdlog/spdlog.h:168
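Next, find the instruction at the crash address. The first couple of lines in the dump may be decoded incorrectly, because $pc-32 does not necessarily fall on an instruction boundary: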
(gdb) disas $pc-32, $pc+32
Dump of assembler code from 0x5555566b9df4 to 0x5555566b9e34:
0x00005555566b9df4 <_ZN6spdlog7details8registryC2Ev+36>: add %cl,-0x77(%rax)
0x00005555566b9df7 <_ZN6spdlog7details8registryC2Ev+39>: rex.R and $0x68,%al
0x00005555566b9dfa <_ZN6spdlog7details8registryC2Ev+42>: xor %eax,%eax
0x00005555566b9dfc <_ZN6spdlog7details8registryC2Ev+44>: movq $0x0,0x20(%rdi)
0x00005555566b9e04 <_ZN6spdlog7details8registryC2Ev+52>: movq $0x0,0x48(%rdi)
0x00005555566b9e0c <_ZN6spdlog7details8registryC2Ev+60>: movq $0x0,0x70(%rdi)
=> 0x00005555566b9e14 <_ZN6spdlog7details8registryC2Ev+68>: vmovdqu8 %ymm0,0x50(%rdi)
0x00005555566b9e1e <_ZN6spdlog7details8registryC2Ev+78>: movq $0x1,0x80(%rdi)
0x00005555566b9e29 <_ZN6spdlog7details8registryC2Ev+89>: movl $0x1,0x60(%rdi)
0x00005555566b9e30 <_ZN6spdlog7details8registryC2Ev+96>: movq $0x0,0x88(%rdi)
End of assembler dump.
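=> 0x5555566b9e14 <_ZN6spdlog7details8registryC2Ev+68>: vmovdqu8 %ymm0,0x50(%rdi)

| Part | Meaning |
|---|---|
| 0x5555566b9e14 | Memory address of the instruction (its location in the virtual address space) |
| <_ZN6...C2Ev+68> | Symbolic form: the spdlog::details::registry constructor plus a 68-byte offset |
| vmovdqu8 %ymm0,0x50(%rdi) | The instruction that actually caused the crash |
| => | GDB marker: the instruction the program counter currently points to |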
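Faulting instruction: vmovdqu8 %ymm0,0x50(%rdi)
- Mnemonic: vmovdqu8
- Operands:
  - %ymm0: a 256-bit YMM register (the low half of the corresponding 512-bit AVX-512 ZMM register)
  - 0x50(%rdi): memory address = value of the RDI register + 0x50 (an 80-byte offset)
- Effect: stores the 32 bytes (256 bits) in YMM0 to the memory at [RDI + 80], with no alignment requirement
- Hardware requirements:
  - AVX-512VL: allows the instruction to operate on 256-bit YMM registers (instead of only 512-bit ZMM registers)
  - AVX-512BW: provides byte-granularity operations (the 8 in vmovdqu8 means 8-bit, that is, byte elements)
  - Required: a CPU that supports the AVX-512 instruction set (for example Intel Skylake-X or AMD Zen4)
- What happens when the CPU does not support AVX-512:
  - The CPU encounters the vmovdqu8 encoding, which belongs to the AVX-512 extensions
  - The CPU treats it as an illegal instruction and raises an invalid-opcode exception
  - The kernel delivers that exception to the process as the SIGILL signal
  - The operating system terminates the program and writes a core dump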
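The reasoning above (spot an instruction that only exists with AVX-512) can be partly automated. The sketch below is a rough heuristic, not a real decoder: it flags a disassembly line in AT&T syntax if it uses one of a few mnemonics that only exist in AVX-512 form, or if it references a ZMM or mask (%k0 to %k7) register. The mnemonic list and the function name are illustrative assumptions, and the list is far from complete.

```python
import re

# A few mnemonics that exist only in AVX-512 (EVEX-encoded) form.
# Illustrative and incomplete (assumption of this sketch).
AVX512_ONLY_MNEMONICS = {
    "vmovdqu8", "vmovdqu16", "vmovdqu32", "vmovdqu64",
    "vmovdqa32", "vmovdqa64",
}

# ZMM registers and opmask registers are only available with AVX-512.
ZMM_OR_MASK_REG = re.compile(r"%(zmm\d+|k[0-7])\b")


def looks_like_avx512(disasm_line):
    """Heuristically decide whether one disassembly line uses AVX-512."""
    tokens = disasm_line.split()
    if any(tok in AVX512_ONLY_MNEMONICS for tok in tokens):
        return True
    return bool(ZMM_OR_MASK_REG.search(disasm_line))


if __name__ == "__main__":
    crash_line = ("=> 0x00005555566b9e14 <_ZN6spdlog7details8registryC2Ev+68>: "
                  "vmovdqu8 %ymm0,0x50(%rdi)")
    print(looks_like_avx512(crash_line))
```

Feeding it the lines of `objdump -d ./my_exec` gives a quick, approximate answer to whether a binary contains AVX-512 code before it is deployed to a machine without AVX-512.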
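Key gdb commands:
(gdb) run # run the program until it crashes
(gdb) bt # show the call stack
(gdb) x/i $pc # show the faulting instruction <-- the key step!
(gdb) info registers rdi # show the register used in the operand address
(gdb) disas $pc-40, $pc+10 # disassemble the code around the crash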
