Troubleshooting Unexpected Linux Reboots

System environment

Ubuntu 22.04

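Problem description

A service on an Ubuntu host was recently interrupted, and the host turned out to have been rebooted. Nobody we asked had rebooted it.

In the early hours of the next day the service went down again, once more because the host had rebooted. At that point it was fairly clear that the host was rebooting on its own, and the cause needed to be tracked down.

Troubleshooting

Check when the host last booted.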

# uptime -s

2025-11-21 02:28:39

This confirms that the host came back up at 02:28:39, so the reboot happened shortly before that.

List the host's boot records.

# journalctl --list-boots

-2 51980491207e42678388da7a4967fee1 Tue 2025-11-18 11:26:58 CST—Wed 2025-11-19 17:53:17 CST

-1 e7f20916162549f58957533f6f964a39 Wed 2025-11-19 17:55:29 CST—Fri 2025-11-21 02:26:33 CST

 0 6a43800d80e54d25b1665ff94825add0 Fri 2025-11-21 02:28:50 CST—Fri 2025-11-21 09:37:50 CST

The host booted on the 18th, the 19th, and the 21st.
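
Before looking at hardware, it can help to check whether the previous boot ended with an orderly shutdown or simply stopped. The sketch below assumes that systemd-journald writes a "Journal stopped" line during a normal shutdown, so its absence from the tail of that boot's log hints at an abrupt reset. The function name ended_cleanly is made up for illustration.

```shell
# Reads journal lines on stdin and reports how that boot appears to have ended.
# Assumption: an orderly shutdown leaves a "Journal stopped" line from
# systemd-journald near the end of the log.
ended_cleanly() {
    if grep -q 'systemd-journald.*Journal stopped'; then
        echo "clean shutdown"
    else
        echo "abrupt stop"
    fi
}

# Usage on the affected host (tail of the previous boot's log):
#   journalctl -b -1 -n 50 --no-pager | ended_cleanly
```

An "abrupt stop" result is only a hint; a truncated journal can have other causes, so it should be read together with the hardware error records.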

View the log of the most recent boot.

# journalctl -b

Nov 21 02:28:50 localhost kernel: Linux version 5.15.0-83-generic (buildd@lcy02-amd64-027) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11>

Nov 21 02:28:50 localhost kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=204dd389-47b1-4375-9d7a-f803>

Nov 21 02:28:50 localhost kernel: KERNEL supported cpus:

......

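The log is long; page through it with the space bar and look for the error records left over from the previous boot (BERT: Error records from previous boot). BERT here is the ACPI Boot Error Record Table, a firmware-provided mechanism for preserving fatal or severe hardware errors (PCIe link failures, CPU faults, memory errors, and so on) so that they are still available after the reset instead of being lost with it. The kernel prints these records early in the next boot, so that is what we look for: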

Nov 21 02:28:50 localhost kernel: BERT: Error records from previous boot:

Nov 21 02:28:50 localhost kernel: [Hardware Error]: event severity: fatal

Nov 21 02:28:50 localhost kernel: [Hardware Error]:  Error 0, type: fatal

Nov 21 02:28:50 localhost kernel: [Hardware Error]:  fru_text: PcieError

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   section_type: PCIe error

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   port_type: 4, root port

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   version: 0.2

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   command: 0x0003, status: 0x0010

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   device_id: 0000:20:01.1

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   slot: 5

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   secondary_bus: 0x21

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   vendor_id: 0x1022, device_id: 0x14ab

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   class_code: 060400

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0000

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x04100000

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   aer_uncor_severity: 0x00476030

Nov 21 02:28:50 localhost kernel: [Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000

The device with device_id: 0000:20:01.1 reported a fatal error.
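
Paging through the whole log by hand is slow, so the BERT block can also be pulled out with a filter. This is a minimal sketch: extract_bert is a hypothetical helper name, and it assumes the record lines carry the "[Hardware Error]" tag exactly as in the output above.

```shell
# Reads kernel log lines on stdin and prints only the BERT header plus the
# "[Hardware Error]" lines that immediately follow it.
extract_bert() {
    awk '
        /BERT: Error records from previous boot/ { in_bert = 1; print; next }
        in_bert && /\[Hardware Error\]/          { print; next }
        in_bert                                   { in_bert = 0 }
    '
}

# Usage on the affected host (kernel messages of the current boot):
#   journalctl -k -b 0 --no-pager | extract_bert
```

Replacing `-b 0` with `-b -1` applies the same filter to the previous boot.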

View the log of the previous boot.

# journalctl -b -1

As before, page through with the space bar and look for the error records left over from the boot before it (BERT: Error records from previous boot).

Nov 19 17:55:30 localhost kernel: BERT: Error records from previous boot:

Nov 19 17:55:30 localhost kernel: [Hardware Error]: event severity: fatal

Nov 19 17:55:30 localhost kernel: [Hardware Error]:  Error 0, type: fatal

Nov 19 17:55:30 localhost kernel: [Hardware Error]:  fru_text: PcieError

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   section_type: PCIe error

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   port_type: 4, root port

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   version: 0.2

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   command: 0x0003, status: 0x0010

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   device_id: 0000:20:01.1

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   slot: 5

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   secondary_bus: 0x21

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   vendor_id: 0x1022, device_id: 0x14ab

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   class_code: 060400

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0000

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x04100000

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   aer_uncor_severity: 0x00476030

Nov 19 17:55:30 localhost kernel: [Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000

Once again, the device with device_id: 0000:20:01.1 reported a fatal error.
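
Both records carry aer_uncor_status: 0x00000020. As a rough aid, the sketch below decodes such a value using the bit layout of the PCIe AER Uncorrectable Error Status register; the table is an assumption taken from the PCIe specification and is worth checking against the revision that applies to your platform. The function name decode_aer_uncor is hypothetical.

```shell
# Prints the name of every bit that is set in an AER uncorrectable status value.
# Assumption: bit positions follow the PCIe AER Uncorrectable Error Status layout.
decode_aer_uncor() {
    value=$(( $1 ))
    while read -r bit name; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $name"
        fi
    done <<'EOF'
4 Data Link Protocol Error
5 Surprise Down Error
12 Poisoned TLP
13 Flow Control Protocol Error
14 Completion Timeout
15 Completer Abort
16 Unexpected Completion
17 Receiver Overflow
18 Malformed TLP
19 ECRC Error
20 Unsupported Request
21 ACS Violation
22 Uncorrectable Internal Error
EOF
}

decode_aer_uncor 0x00000020   # prints: bit 5: Surprise Down Error
```

For the value in the logs this reports Surprise Down Error, meaning the link below the root port dropped unexpectedly. Bit 5 is also set in aer_uncor_severity (0x00476030), which is consistent with the event being classified as fatal.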

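Identify the failing device.

Look up the device behind device_id: 0000:20:01.1. When filtering with grep, drop the PCI domain prefix, because lspci leaves the domain out by default on a machine that only has domain 0000.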

# lspci | grep '20:01.1'

20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01)

The device reporting the error is an AMD PCI bridge.
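
The address conversion for grep can be scripted, as in the sketch below (pci_short_addr is a hypothetical helper name). As an alternative, lspci -s accepts the domain-qualified address directly, and -D makes lspci print the domain, so no conversion is needed in that case.

```shell
# Strips the leading PCI domain from an address of the form domain:bus:dev.fn.
# An address that is already in bus:dev.fn form is returned unchanged.
pci_short_addr() {
    case "$1" in
        *:*:*) echo "${1#*:}" ;;   # domain:bus:dev.fn -> bus:dev.fn
        *)     echo "$1" ;;        # already short
    esac
}

pci_short_addr 0000:20:01.1   # prints 20:01.1

# On the affected host, either form can then be used:
#   lspci | grep "$(pci_short_addr 0000:20:01.1)"
#   lspci -D -nn -s 0000:20:01.1
```

The -nn option adds the numeric vendor and device IDs, which can be compared with vendor_id: 0x1022, device_id: 0x14ab in the BERT record.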

View the device details:

# lspci -s '20:01.1' -v

20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])

        Flags: bus master, fast devsel, latency 0, IRQ 43, NUMA node 2

        Bus: primary=20, secondary=21, subordinate=21, sec-latency=0

        I/O behind bridge: 00003000-00003fff [size=4K]

        Memory behind bridge: c8000000-cc0fffff [size=65M]

        Prefetchable memory behind bridge: 0000197800000000-00001980140fffff [size=33089M]

        Capabilities: [48] Vendor Specific Information: Len=08 <?>

        Capabilities: [50] Power Management version 3

        Capabilities: [58] Express Root Port (Slot+), MSI 00

        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+

        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453

        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+

        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>

        Capabilities: [150] Advanced Error Reporting

        Capabilities: [270] Secondary PCI Express

        Capabilities: [380] Downstream Port Containment

        Capabilities: [400] Data Link Feature <?>

        Capabilities: [410] Physical Layer 16.0 GT/s <?>

        Capabilities: [440] Lane Margining at the Receiver <?>

        Capabilities: [4d0] Native PCIe Enclosure Management <?>

        Capabilities: [500] Extended Capability ID 0x2a

        Capabilities: [530] Extended Capability ID 0x2b

        Kernel driver in use: pcieport

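This is an AMD PCIe root port. It sits at the upstream end of the PCIe topology and connects the processor to downstream PCIe devices such as GPUs, NICs, and SSDs.

Bus: primary=20, secondary=21, subordinate=21, sec-latency=0

This line shows that its downstream bus is 21. List the devices on bus 21: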

# lspci -s '21:' -v

21:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1) (prog-if 00 [VGA controller])

        Subsystem: InnoVISION Multimedia Ltd. Device 2059

        Physical Slot: 5

        Flags: bus master, fast devsel, latency 0, IRQ 582, NUMA node 2

        Memory at c8000000 (32-bit, non-prefetchable) [size=64M]

        Memory at 197800000000 (64-bit, prefetchable) [size=32G]

        Memory at 198012000000 (64-bit, prefetchable) [size=32M]

        I/O ports at 3000 [size=128]

        Expansion ROM at cc000000 [virtual] [disabled] [size=512K]

        Capabilities: [40] Power Management version 3

        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+

        Capabilities: [60] Express Legacy Endpoint, MSI 00

        Capabilities: [9c] Vendor Specific Information: Len=14 <?>

        Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-

        Capabilities: [100] Secondary PCI Express

        Capabilities: [12c] Latency Tolerance Reporting

        Capabilities: [134] Physical Resizable BAR

        Capabilities: [140] Virtual Resizable BAR

        Capabilities: [14c] Data Link Feature <?>

        Capabilities: [158] Physical Layer 16.0 GT/s <?>

        Capabilities: [188] Extended Capability ID 0x2a

        Capabilities: [1b8] Advanced Error Reporting

        Capabilities: [200] Lane Margining at the Receiver <?>

        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)

        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)

        Capabilities: [290] L1 PM Substates

        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>

        Capabilities: [2bc] Power Budgeting <?>

        Capabilities: [2f4] Device Serial Number 0c-78-dd-26-96-2d-b0-48

        Kernel driver in use: nvidia

        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

21:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)

        Subsystem: NVIDIA Corporation Device 0000

        Physical Slot: 5

        Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 2

        Memory at cc080000 (32-bit, non-prefetchable) [size=16K]

        Capabilities: [40] Power Management version 3

        Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+

        Capabilities: [60] Express Endpoint, MSI 00

        Capabilities: [9c] Vendor Specific Information: Len=14 <?>

        Capabilities: [100] Data Link Feature <?>

        Capabilities: [10c] Advanced Error Reporting

        Capabilities: [154] Alternative Routing-ID Interpretation (ARI)

        Kernel driver in use: snd_hda_intel

        Kernel modules: snd_hda_intel

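These are the two functions of an NVIDIA GPU (21:00.0 and 21:00.1, the latter being the card's audio function). The card sits in motherboard PCIe slot 5 and is attached to the system through the AMD PCIe root port 20:01.1, so the PCIe errors above belong to this specific link.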
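
The step from the bridge to its child devices can be scripted as well. The sketch below reads the secondary bus number out of the "Bus:" line that lspci -v prints for a bridge; secondary_bus is a hypothetical helper name, and the parsing assumes the line format shown above.

```shell
# Reads "lspci -v" output for a bridge on stdin and prints its secondary bus.
secondary_bus() {
    sed -n 's/.*secondary=\([0-9a-fA-F]\{1,\}\).*/\1/p'
}

# Usage on the affected host:
#   bus=$(lspci -s 20:01.1 -v | secondary_bus)
#   lspci -s "${bus}:" -v
```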

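From the information above, the fault can be narrowed down to two candidates:
A. The graphics card at PCI address 21:00.0 is faulty and needs to be repaired or replaced.
B. The PCIe slot behind the root port at PCI address 20:01.1 is faulty and needs to be repaired or replaced.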
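
One more data point that may help choose between the two candidates is whether the NVIDIA driver logged anything before the reset. The sketch below counts the driver's Xid messages in a log stream; it assumes they contain the string "NVRM: Xid", and count_xid is a hypothetical helper name. Since the host reset abruptly, such messages may never have reached disk, so a count of 0 does not clear the card.

```shell
# Reads kernel log lines on stdin and prints how many NVIDIA Xid messages appear.
# Assumption: the driver tags these messages with "NVRM: Xid".
count_xid() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
    grep -c 'NVRM: Xid' || true
}

# Usage on the affected host (kernel messages of the boot that crashed):
#   journalctl -k -b -1 --no-pager | count_xid
```

If the hardware allows it, moving the card to a different slot and checking which PCI address appears in the next BERT record is another way to separate the card from the slot.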

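Temporary workaround

If the fault cannot be fixed right away, the affected GPU can be taken out of use temporarily to reduce the chance of triggering a reboot.
Use nvidia-smi to find the GPU index that corresponds to the card's PCI address; assume it is 4.
Set its compute mode to PROHIBITED so that no new compute workloads can start on it:
# nvidia-smi -i 4 -c PROHIBITED

To re-enable it, run:
# nvidia-smi -i 4 -c DEFAULT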
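
Finding the GPU index for a PCI address can be scripted with nvidia-smi's CSV query output. This is a minimal sketch: gpu_index_for_bus is a hypothetical helper name, and it assumes each row looks like "index, bus_id" with the bus ID ending in the bus:dev.fn address.

```shell
# Reads "index, pci.bus_id" CSV rows on stdin and prints the index of the GPU
# whose bus ID ends with the given address (for example 21:00.0).
gpu_index_for_bus() {
    awk -F', *' -v addr="$1" '
        {
            n = length(addr)
            tail = substr($2, length($2) - n + 1)
            if (tolower(tail) == tolower(addr)) print $1
        }
    '
}

# Usage on the affected host:
#   idx=$(nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader \
#         | gpu_index_for_bus 21:00.0)
#   nvidia-smi -i "$idx" -c PROHIBITED
```

Changing the compute mode requires root and does not persist across reboots, so on a host that keeps rebooting the setting has to be reapplied after each boot.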
