System environment
Ubuntu 22.04
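Problem description
A service on an Ubuntu host recently went down, and we then found that the host had been rebooted. Asking the people with access turned up no one who had rebooted it.
Early the next morning the service went down again, once more because the host had rebooted. At that point it was fairly clear that the host was rebooting on its own, and the cause needed to be tracked down.
Troubleshooting
Check the host's boot time.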
# uptime -s
2025-11-21 02:28:39
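This confirms that the current boot started at 02:28:39 in the early morning, so the host had restarted shortly before then.
List the host's boot records.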
# journalctl --list-boots
-2 51980491207e42678388da7a4967fee1 Tue 2025-11-18 11:26:58 CST—Wed 2025-11-19 17:53:17 CST
-1 e7f20916162549f58957533f6f964a39 Wed 2025-11-19 17:55:29 CST—Fri 2025-11-21 02:26:33 CST
0 6a43800d80e54d25b1665ff94825add0 Fri 2025-11-21 02:28:50 CST—Fri 2025-11-21 09:37:50 CST
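The list shows boots starting on the 18th, the 19th and the 21st. The boots on the 19th and the 21st each begin about two minutes after the last log entry of the boot before them, so the host restarted on both days.
View the log of the most recent boot.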
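To tell an abrupt reset from an orderly reboot, it can help to look at how the previous boot's journal ends. The sketch below is only a heuristic: the function name `ended_cleanly` is made up for this example, and it assumes that an orderly shutdown leaves `systemd-shutdown[...]` or `Journal stopped` lines near the end of the journal, which should be checked against a known clean reboot on your own host.

```shell
#!/usr/bin/env bash
# Heuristic sketch: does a boot's journal end with clean-shutdown markers?
# ASSUMPTION: an orderly shutdown logs "systemd-shutdown[PID]" and/or
# "Journal stopped" near the end; confirm the wording on your own system.
ended_cleanly() {
    # Reads journal lines on stdin; returns 0 if a marker is found, 1 if not.
    grep -qE 'systemd-shutdown\[[0-9]+\]|Journal stopped'
}

# Usage on a live system (last 50 lines of the previous boot):
#   journalctl -b -1 -n 50 --no-pager | ended_cleanly && echo clean || echo abrupt
```

If the markers are missing, the host most likely went down without a normal shutdown sequence, which fits a hardware-triggered reset.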
# journalctl -b
Nov 21 02:28:50 localhost kernel: Linux version 5.15.0-83-generic (buildd@lcy02-amd64-027) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11>
Nov 21 02:28:50 localhost kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=204dd389-47b1-4375-9d7a-f803>
Nov 21 02:28:50 localhost kernel: KERNEL supported cpus:
......
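The log is long. Press the space bar to page through it and look for the error records left over from the previous boot (BERT: Error records from previous boot). BERT (Boot Error Record Table) is an ACPI table, part of APEI, through which the platform firmware hands the operating system records of fatal or severe hardware errors (such as PCIe link failures, CPU faults and memory errors) that occurred before the reboot. This keeps the full error information available for later troubleshooting instead of losing it with the restart. We therefore focus on the BERT content: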
Nov 21 02:28:50 localhost kernel: BERT: Error records from previous boot:
Nov 21 02:28:50 localhost kernel: [Hardware Error]: event severity: fatal
Nov 21 02:28:50 localhost kernel: [Hardware Error]: Error 0, type: fatal
Nov 21 02:28:50 localhost kernel: [Hardware Error]: fru_text: PcieError
Nov 21 02:28:50 localhost kernel: [Hardware Error]: section_type: PCIe error
Nov 21 02:28:50 localhost kernel: [Hardware Error]: port_type: 4, root port
Nov 21 02:28:50 localhost kernel: [Hardware Error]: version: 0.2
Nov 21 02:28:50 localhost kernel: [Hardware Error]: command: 0x0003, status: 0x0010
Nov 21 02:28:50 localhost kernel: [Hardware Error]: device_id: 0000:20:01.1
Nov 21 02:28:50 localhost kernel: [Hardware Error]: slot: 5
Nov 21 02:28:50 localhost kernel: [Hardware Error]: secondary_bus: 0x21
Nov 21 02:28:50 localhost kernel: [Hardware Error]: vendor_id: 0x1022, device_id: 0x14ab
Nov 21 02:28:50 localhost kernel: [Hardware Error]: class_code: 060400
Nov 21 02:28:50 localhost kernel: [Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0000
Nov 21 02:28:50 localhost kernel: [Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x04100000
Nov 21 02:28:50 localhost kernel: [Hardware Error]: aer_uncor_severity: 0x00476030
Nov 21 02:28:50 localhost kernel: [Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
The record shows that the device with device_id: 0000:20:01.1 reported a fatal error.
View the log of the previous boot.
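Instead of paging through the whole log, the BERT header and the `[Hardware Error]` lines can be filtered out directly. This is a minimal sketch; the function name `extract_hw_errors` is made up for this example, and the pattern assumes the `kernel: ` prefix seen in the log lines above.

```shell
#!/usr/bin/env bash
# Sketch: keep only the BERT header and [Hardware Error] lines of a kernel log.
# ASSUMPTION: lines look like "... kernel: BERT: ..." or
# "... kernel: [Hardware Error]: ...", as in the output shown above.
extract_hw_errors() {
    # Reads log lines on stdin and prints the matching ones.
    grep -E 'kernel: (BERT:|\[Hardware Error\]:)'
}

# Usage on a live system (kernel messages of the current boot):
#   journalctl -k -b 0 --no-pager | extract_hw_errors
```

Changing `-b 0` to `-b -1` applies the same filter to the previous boot.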
# journalctl -b -1
Page through it with the space bar in the same way and look for the error records left over from the boot before it (BERT: Error records from previous boot).
Nov 19 17:55:30 localhost kernel: BERT: Error records from previous boot:
Nov 19 17:55:30 localhost kernel: [Hardware Error]: event severity: fatal
Nov 19 17:55:30 localhost kernel: [Hardware Error]: Error 0, type: fatal
Nov 19 17:55:30 localhost kernel: [Hardware Error]: fru_text: PcieError
Nov 19 17:55:30 localhost kernel: [Hardware Error]: section_type: PCIe error
Nov 19 17:55:30 localhost kernel: [Hardware Error]: port_type: 4, root port
Nov 19 17:55:30 localhost kernel: [Hardware Error]: version: 0.2
Nov 19 17:55:30 localhost kernel: [Hardware Error]: command: 0x0003, status: 0x0010
Nov 19 17:55:30 localhost kernel: [Hardware Error]: device_id: 0000:20:01.1
Nov 19 17:55:30 localhost kernel: [Hardware Error]: slot: 5
Nov 19 17:55:30 localhost kernel: [Hardware Error]: secondary_bus: 0x21
Nov 19 17:55:30 localhost kernel: [Hardware Error]: vendor_id: 0x1022, device_id: 0x14ab
Nov 19 17:55:30 localhost kernel: [Hardware Error]: class_code: 060400
Nov 19 17:55:30 localhost kernel: [Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0000
Nov 19 17:55:30 localhost kernel: [Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x04100000
Nov 19 17:55:30 localhost kernel: [Hardware Error]: aer_uncor_severity: 0x00476030
Nov 19 17:55:30 localhost kernel: [Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
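Again the fatal error was reported by the device with device_id: 0000:20:01.1.
Look up the failing device.
We need the device that corresponds to device_id: 0000:20:01.1 (drop the PCI domain prefix 0000: when filtering with grep, because the lspci output here does not include it).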
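Both records carry `aer_uncor_status: 0x00000020`. That value is a bit mask of the PCIe AER Uncorrectable Error Status register, and decoding it shows which kind of error the root port logged. The sketch below uses the bit positions defined for that register; the function name `decode_aer_uncor` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: decode a PCIe AER Uncorrectable Error Status/Severity value.
# Bit positions follow the AER Uncorrectable Error Status register layout.
decode_aer_uncor() {
    local value=$(( $1 ))   # accepts hex input such as 0x00000020
    local entry bit name
    for entry in \
        "4:Data Link Protocol Error" \
        "5:Surprise Down Error" \
        "12:Poisoned TLP" \
        "13:Flow Control Protocol Error" \
        "14:Completion Timeout" \
        "15:Completer Abort" \
        "16:Unexpected Completion" \
        "17:Receiver Overflow" \
        "18:Malformed TLP" \
        "19:ECRC Error" \
        "20:Unsupported Request" \
        "21:ACS Violation" \
        "22:Uncorrectable Internal Error"
    do
        bit=${entry%%:*}    # text before the first colon
        name=${entry#*:}    # text after the first colon
        if (( value & (1 << bit) )); then
            printf 'bit %d: %s\n' "$bit" "$name"
        fi
    done
}

# Usage:
#   decode_aer_uncor 0x00000020
```

For `0x00000020` only bit 5 is set, which is Surprise Down Error: the link below the root port went down unexpectedly. Bit 5 is also set in `aer_uncor_severity: 0x00476030`, so this error type is classed as fatal, which matches `event severity: fatal` in the records.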
# lspci | grep '20:01.1'
20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01)
The device that reported the error is an AMD PCI bridge.
View the device details:
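The domain prefix can also be stripped in a script rather than by hand. This is a small sketch with a made-up helper name, `bdf_without_domain`; it leaves addresses that are already in `bus:device.function` form unchanged.

```shell
#!/usr/bin/env bash
# Sketch: convert a BERT-style address (domain:bus:device.function)
# into the bus:device.function form that plain lspci output uses.
bdf_without_domain() {
    case "$1" in
        *:*:*) printf '%s\n' "${1#*:}" ;;   # 0000:20:01.1 -> 20:01.1
        *)     printf '%s\n' "$1" ;;        # already without a domain
    esac
}

# Usage on a live system:
#   lspci | grep "$(bdf_without_domain 0000:20:01.1)"
```

Alternatively, `lspci -s` accepts the full address including the domain, and `lspci -D` always prints the domain, so `lspci -D -s 0000:20:01.1` avoids the conversion altogether.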
# lspci -s '20:01.1' -v
20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 43, NUMA node 2
Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
I/O behind bridge: 00003000-00003fff [size=4K]
Memory behind bridge: c8000000-cc0fffff [size=65M]
Prefetchable memory behind bridge: 0000197800000000-00001980140fffff [size=33089M]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Port (Slot+), MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [270] Secondary PCI Express
Capabilities: [380] Downstream Port Containment
Capabilities: [400] Data Link Feature <?>
Capabilities: [410] Physical Layer 16.0 GT/s <?>
Capabilities: [440] Lane Margining at the Receiver <?>
Capabilities: [4d0] Native PCIe Enclosure Management <?>
Capabilities: [500] Extended Capability ID 0x2a
Capabilities: [530] Extended Capability ID 0x2b
Kernel driver in use: pcieport
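This is an AMD PCIe root port (a PCI bridge). It sits at the upstream end of the PCIe topology and connects the processor to downstream PCIe devices such as graphics cards, network cards and SSDs.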
Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
The line above shows that its downstream (secondary) bus number is 21. List the devices on bus 21:
# lspci -s '21:' -v
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1) (prog-if 00 [VGA controller])
Subsystem: InnoVISION Multimedia Ltd. Device 2059
Physical Slot: 5
Flags: bus master, fast devsel, latency 0, IRQ 582, NUMA node 2
Memory at c8000000 (32-bit, non-prefetchable) [size=64M]
Memory at 197800000000 (64-bit, prefetchable) [size=32G]
Memory at 198012000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
Expansion ROM at cc000000 [virtual] [disabled] [size=512K]
Capabilities: [40] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
Capabilities: [60] Express Legacy Endpoint, MSI 00
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [100] Secondary PCI Express
Capabilities: [12c] Latency Tolerance Reporting
Capabilities: [134] Physical Resizable BAR
Capabilities: [140] Virtual Resizable BAR
Capabilities: [14c] Data Link Feature <?>
Capabilities: [158] Physical Layer 16.0 GT/s <?>
Capabilities: [188] Extended Capability ID 0x2a
Capabilities: [1b8] Advanced Error Reporting
Capabilities: [200] Lane Margining at the Receiver <?>
Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
Capabilities: [290] L1 PM Substates
Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Capabilities: [2bc] Power Budgeting <?>
Capabilities: [2f4] Device Serial Number 0c-78-dd-26-96-2d-b0-48
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
Subsystem: NVIDIA Corporation Device 0000
Physical Slot: 5
Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 2
Memory at cc080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Data Link Feature <?>
Capabilities: [10c] Advanced Error Reporting
Capabilities: [154] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
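This is an NVIDIA GPU (21:00.0/21:00.1): the graphics card, which also exposes an audio function, is attached to the system through the AMD PCIe bridge (20:01.1) and sits in motherboard PCIe Slot 5. The PCIe errors above relate directly to this particular link.
From the information above, the fault can be narrowed down to two likely candidates:
A. The graphics card at PCI address 21:00.0 is faulty and needs repair or replacement.
B. The PCIe slot behind the root port at PCI address 20:01.1 is faulty and needs repair or replacement.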
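The step from the bridge to the devices behind it can be scripted by reading the secondary bus number out of the `lspci -v` output. This is a minimal sketch; the function name `secondary_bus` is made up for this example, and it assumes the `Bus: primary=..., secondary=...` line format shown above.

```shell
#!/usr/bin/env bash
# Sketch: print the secondary bus number from "lspci -v" output of a bridge.
# ASSUMPTION: the output contains a line of the form
#   Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
secondary_bus() {
    sed -n 's/.*Bus: primary=[0-9a-f]*, secondary=\([0-9a-f]*\),.*/\1/p'
}

# Usage on a live system:
#   bus=$(lspci -v -s 20:01.1 | secondary_bus)
#   lspci -s "${bus}:"
```

`lspci -tv` prints the same parent and child relationship as a tree, which is convenient for a quick visual check.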
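Temporary workaround
If the fault cannot be fixed right away, the affected graphics card can be temporarily taken out of use to reduce the chance of triggering another reboot.
Use the nvidia-smi command to find the GPU index that corresponds to the card's PCI address; assume it is 4.
Temporarily block compute work on it. This sets the GPU's compute mode to PROHIBITED, which stops new compute contexts from being created on it; it does not power the card off or remove it from the bus:
# nvidia-smi -i 4 -c PROHIBITED
To enable it again, run:
# nvidia-smi -i 4 -c DEFAULT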
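The GPU index for a given PCI address can be looked up from the `index` and `pci.bus_id` query fields of nvidia-smi. The sketch below parses that CSV output; the function name `gpu_index_for_bdf` is made up for this example, and the index values in the sample rows are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find the nvidia-smi GPU index for a PCI bus:device.function.
# Expects rows on stdin as produced by
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
# e.g. "4, 00000000:21:00.0" (the index 4 here is a hypothetical value).
gpu_index_for_bdf() {
    awk -F', ' -v want=":$1" '{
        id = toupper($2)
        w  = toupper(want)
        # Compare the tail of the bus id (":bus:dev.fn") with the wanted address.
        if (substr(id, length(id) - length(w) + 1) == w) print $1
    }'
}

# Usage on a live system:
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader \
#       | gpu_index_for_bdf 21:00.0
```

The comparison is case-insensitive because nvidia-smi prints hexadecimal digits in upper case while lspci prints them in lower case.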
