本文为墨天轮数据库管理服务团队第135期技术分享,内容原创,作者为技术顾问罗海鸥,如需转载请联系小墨(VX:modb666)并注明来源。如需查看更多文章可关注【墨天轮】公众号。
适用范围
Oracle RAC 11G、12C、19C
问题概述
客户一套RAC节点,版本为12.2.0.1。2025-07-31 07:39:21 节点2被驱逐。
以下是节点2alter日志。
2025-07-31T07:20:49.680473+08:00
Thread 2 advanced to log sequence 116120 (LGWR switch)
Current log# 7 seq# 116120 mem# 0: +DATA/FDCDB/ONLINELOG/group_7.271.1000847113
Current log# 7 seq# 116120 mem# 1: +ARC/FDCDB/ONLINELOG/group_7.269.1000847113
2025-07-31T07:20:51.220283+08:00
Archived Log entry 193846 added for T-2.S-116119 ID 0xeac4c440 LAD:1
2025-07-31T07:38:00.059963+08:00
minact-scn: got error during useg scan e:12751 usn:1
minact-scn: useg scan erroring out with error e:12751
2025-07-31T07:39:00.979683+08:00
IPC Receiver dump detected. Sender instance 1 Receiver pnum 570 ospid 41888 [oracle@fdcdb2 (PPA7)], pser 12
2025-07-31T07:39:00.979980+08:00
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_ppa7_41888.trc:
2025-07-31T07:39:21.277326+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc (incident=1987597):
ORA-29740: evicted by instance number 1, group incarnation 14
Incident details in: /oracle/app/diag/rdbms/fdcdb/fdcdb2/incident/incdir_1987597/fdcdb2_lmon_14788_i1987597.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-07-31T07:39:22.156952+08:00
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc:
ORA-29740: evicted by instance number 1, group incarnation 14
Errors in file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_lmon_14788.trc (incident=1987598):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/diag/rdbms/fdcdb/fdcdb2/incident/incdir_1987598/fdcdb2_lmon_14788_i1987598.trc
2025-07-31T07:39:22.367538+08:00
CJQ0 (ospid: 17003): terminating the instance due to error 481
2025-07-31T07:39:23.851520+08:00
System state dump requested by (instance=2, osid=17003 (CJQ0)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/diag/rdbms/fdcdb/fdcdb2/trace/fdcdb2_diag_14756_20250731073923.trc
2025-07-31T07:39:24.449532+08:00
节点1alter日志显示:
2025-07-31T07:38:33 节点1的LMS进程无法和节点2连接
07:39:00 等待超时,
07:39:21 节点2被驱逐
2025-07-31T07:13:53.757631+08:00
Thread 1 advanced to log sequence 77598 (LGWR switch)
Current log# 4 seq# 77598 mem# 0: +DATA/FDCDB/ONLINELOG/group_4.267.1000847077
Current log# 4 seq# 77598 mem# 1: +ARC/FDCDB/ONLINELOG/group_4.260.1000847077
2025-07-31T07:13:55.322692+08:00
Archived Log entry 193845 added for T-1.S-77597 ID 0xeac4c440 LAD:1
2025-07-31T07:28:14.833306+08:00
Thread 1 advanced to log sequence 77599 (LGWR switch)
Current log# 2 seq# 77599 mem# 0: +DATA/FDCDB/ONLINELOG/group_2.263.1000846993
Current log# 2 seq# 77599 mem# 1: +ARC/FDCDB/ONLINELOG/group_2.258.1000846993
2025-07-31T07:28:16.360189+08:00
Archived Log entry 193847 added for T-1.S-77598 ID 0xeac4c440 LAD:1
2025-07-31T07:38:33.029246+08:00
LMS1 (ospid: 23514) has detected no messaging activity from instance 2
USER (ospid: 23514) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.035568+08:00
LMS4 (ospid: 23521) has detected no messaging activity from instance 2
2025-07-31T07:38:33.035590+08:00
Communications reconfiguration: instance_number 2 by ospid 23514
2025-07-31T07:38:33.035740+08:00
LMS3 (ospid: 23519) has detected no messaging activity from instance 2
2025-07-31T07:38:33.036206+08:00
LMS0 (ospid: 23512) has detected no messaging activity from instance 2
2025-07-31T07:38:33.043106+08:00
USER (ospid: 23521) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.043264+08:00
USER (ospid: 23519) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.043562+08:00
USER (ospid: 23512) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.047480+08:00
LMS2 (ospid: 23516) has detected no messaging activity from instance 2
USER (ospid: 23516) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2025-07-31T07:38:33.077660+08:00
LMON (ospid: 23508) drops the IMR request from LMS4 (ospid: 23521) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS3 (ospid: 23519) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS0 (ospid: 23512) because IMR is in progress and inst 2 is marked bad.
LMON (ospid: 23508) drops the IMR request from LMS2 (ospid: 23516) because IMR is in progress and inst 2 is marked bad.
2025-07-31T07:39:00.964588+08:00
IPC Send timeout detected. Sender: ospid 41916 [oracle@fdcdb1]
Receiver: inst 2 binc 2 ospid 41888
2025-07-31T07:39:21.290230+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2025-07-31T07:39:21.444234+08:00
IPC Send timeout to 2.3 inc 12 for msg type 65518 from opid 26
2025-07-31T07:39:21.444310+08:00
IPC Send timeout to 2.3 inc 12 for msg type 151 from opid 26
2025-07-31T07:39:21.444366+08:00
IPC Send timeout to 2.3 inc 12 for msg type 73 from opid 26
2025-07-31T07:39:21.444401+08:00
IPC Send timeout to 2.2 inc 12 for msg type 65521 from opid 25
问题原因
根本原因:网络拥塞导致节点2被驱逐,以下是分析过程。
1、lmon分析
节点1 lmon分析
节点1的lmon进程trc文件fdcdb2\_lmon\_14788.trc表明故障原因是网络无法通信。(根据MOS解释kjxgrrcfgchk: Initiating reconfig, reason 3表示网络无法连接。)
*** 2025-07-31T07:38:33.046462+08:00
2025-07-31 07:38:33.046 : kjxgrrcfg: done (device ok) - ret = 3 hist 0x1679a (initial rsn: 3, cachd rsn: 0)
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2025-07-31 07:38:33.046 : kjxgrnetchk: start 0x44281b3b, end 0x4428d185
2025-07-31 07:38:33.046 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x3cff
kjxgrrcfgchk: prev pstate 6 mapsz 1024
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: cnct bmp: 1 2
kjxgrrcfgchk: disc bmp:
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2

节点2 lmon分析
07:32:52开始知道节点2被驱逐,一直存在网络等待。
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
2025-07-31 07:29:29.457 : GSIPC:BSEND: deq msg 0x2a5f756e0 type 44 qtm 30049 us seq 0.514174518 from opid 38 to 1.2
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
IPCLW:[0.913]{E}[WAIT]:PROTO: [1753918372165022]PT DMP_REQ at seq 1292873920 from peer 192.168.2.10:14404
*** 2025-07-31T07:32:52.165931+08:00 IPCLW:[0.914]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c64fdcf78 State: 1 MSN: 392301845 Seq: 1292873920 Last Rcv 0:0:0.116.116424 Last valid Rcv 0:0:7.273.273813 IPCLW:[0.915]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.23514 AckSeq: 1292873920. # Coalesced: 0 IPCLW:[0.916]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 6284 IMPT: 29019 RMPT: 17464 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.917]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000001 THint: 0x44cab1bf00000226 IHint: 0x705ddeb40000001f IPCLW:[0.918]{-}[WAIT]:UTIL: [1753918372165022] IPCLW State Dump IPCLW:[0.919]{-}[WAIT]:UTIL: [1753918372165022] UDP PORT 0x7f6c6dd93ad8, [LMS1.KSMSQ_dlm] Dump IPCLW:[0.920]{-}[WAIT]:UTIL: [1753918372165022] ID: 0x3b5d2e10 Type: UDP PT Opts: [] Flags: [], Address: 192.168.2.20:9149, MPT: No IPCLW:[0.921]{-}[WAIT]:UTIL: [1753918372165022] Last DataReady time 0:0:0.67.67549 Last Unrel msg rcvd time 487199:32:52.165.165022 IPCLW:[0.922]{-}[WAIT]:UTIL: [1753918372165022] ACNH Dump IPCLW:[0.923]{-}[WAIT]:UTIL: [1753918372165022] PCNH Dump IPCLW:[0.924]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff79d8 State: 1 MSN: 2057350729 Seq: 2099703066 Last Rcv 8062:40:44.501.501764 Last valid Rcv 8062:40:44.501.501764 IPCLW:[0.925]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.9072 AckSeq: 2099703066. # Coalesced: 0 IPCLW:[0.926]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 60222 IMPT: 29019 RMPT: 10109 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.927]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000000 THint: 0x44cab0510000001f IHint: 0x47d663e50000001f IPCLW:[0.928]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff7448 State: 1 MSN: 595742835 Seq: 14107475 Last Rcv 8241:8:1.748.748540 Last valid Rcv
2、lms分析
节点1 lms分析
07:32:51开始,lms进程出现网络相关等待
*** 2025-07-31T07:32:51.968793+08:00
IPCLW:[0.618]{E}[WAIT]:PROTO: [1753918371968513]RETRANS DBG local acnh 0x7f71db3b3ea8 dump:
IPCLW:[0.619]{-}[WAIT]:UTIL: [1753918371968513] ACNH 0x7f71db3b3ea8 State: 1 MSN: 1776967435 Seq: 461160960 # Pending: 53
IPCLW:[0.620]{-}[WAIT]:UTIL: [1753918371968513] Peer: LMS0.KSMSQ_dlm.14795 AckSeq: 0
IPCLW:[0.621]{-}[WAIT]:UTIL: [1753918371968513] Flags: 0x00000000 IHint: 0x64d73e5e0000001f THint: 0x2457a2fa000000fd
IPCLW:[0.622]{-}[WAIT]:UTIL: [1753918371968513] Local Address: 192.168.2.10:44600 Remote Address: 192.168.2.20:44440
IPCLW:[0.623]{-}[WAIT]:UTIL: [1753918371968513] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 2dd3 xdata2 4fada3f2
IPCLW:[0.624]{-}[WAIT]:UTIL: [1753918371968513] : mmsz 8472 mmr 9200 mms 2 xdata 1e1dd769
IPCLW:[0.625]{-}[WAIT]:UTIL: [1753918371968513] IVPort: 26295 TVPort: 55145 IMPT: 14510 RMPT: 11731 Pending Sends: Yes Unacked Sends: Yes
IPCLW:[0.626]{-}[WAIT]:UTIL: [1753918371968513] Send Engine Queued: No sshdl -1 ssts 0 rtts 1753918371968771 snderrchk 8 creqcnt 1
IPCLW:[0.627]{-}[WAIT]:UTIL: [1753918371968513] Unackd Messages 1776967382 -> 1776967434. SSEQ 461160907 Send Time: 0:0:0.210.210159 SMSN # Xmits: 64 EMSN 0:0:0.210.210159
IPCLW:[0.628]{-}[WAIT]:UTIL: [1753918371968513] Pending send queue:
IPCLW:[0.629]{-}[WAIT]:UTIL: [1753918371968513] [0] Mbuf 0x7f71db165af0 MSN 1776967382 Seq 461160907 -> 461160908 # XMits: 64
IPCLW:[0.630]{-}[WAIT]:UTIL: [1753918371968513] [1] Mbuf 0x7f71d9a7b5f0 MSN 1776967383 Seq 461160908 -> 461160909 # XMits: 64
IPCLW:[0.631]{-}[WAIT]:UTIL: [1753918371968513] [2] Mbuf 0x7f71db11ba50 MSN 1776967384 Seq 461160909 -> 461160910 # XMits: 63
IPCLW:[0.632]{-}[WAIT]:UTIL: [1753918371968513] [3] Mbuf 0x7f71db165f10 MSN 1776967385 Seq 461160910 -> 461160911 # XMits: 62
IPCLW:[0.633]{-}[WAIT]:UTIL: [1753918371968513] [4] Mbuf 0x7f71db40f7f0 MSN 1776967386 Seq 461160911 -> 461160912 # XMits: 62
IPCLW:[0.634]{-}[WAIT]:UTIL: [1753918371968513] [5] Mbuf 0x7f71db404ed0 MSN 1776967387 Seq 461160912 -> 461160913 # XMits: 61
IPCLW:[0.635]{-}[WAIT]:UTIL: [1753918371968513] [6] Mbuf 0x7f71db3fd730 MSN 1776967388 Seq 461160913 -> 461160914 # XMits: 60
IPCLW:[0.636]{-}[WAIT]:UTIL: [1753918371968513] [7] Mbuf 0x7f71db1377d0 MSN 1776967389 Seq 461160914 -> 461160915 # XMits: 60
IPCLW:[0.637]{-}[WAIT]:UTIL: [1753918371968513] [8] Mbuf 0x7f71db404ab0 MSN 1776967390 Seq 461160915 -> 461160916 # XMits: 59
IPCLW:[0.638]{-}[WAIT]:UTIL: [1753918371968513] [9] Mbuf 0x7f71db400490 MSN 1776967391 Seq 461160916 -> 461160917 # XMits: 59
节点2 lms分析
07:32:52 出现网络等待
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
2025-07-31 07:29:29.457 : GSIPC:BSEND: deq msg 0x2a5f756e0 type 44 qtm 30049 us seq 0.514174518 from opid 38 to 1.2
2025-07-31 07:29:29.457 : GSIPC:BSEND: last proc to post me: pid 38.14864 in kjc.h LINE:2308 ID:KJCS Post snd proxy to flush msg
IPCLW:[0.913]{E}[WAIT]:PROTO: [1753918372165022]PT DMP_REQ at seq 1292873920 from peer 192.168.2.10:14404
*** 2025-07-31T07:32:52.165931+08:00 IPCLW:[0.914]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c64fdcf78 State: 1 MSN: 392301845 Seq: 1292873920 Last Rcv 0:0:0.116.116424 Last valid Rcv 0:0:7.273.273813 IPCLW:[0.915]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.23514 AckSeq: 1292873920. # Coalesced: 0 IPCLW:[0.916]{-}[WAIT]:UTIL: [1753918372165022] IVPort: 45836 TVPort: 6284 IMPT: 29019 RMPT: 17464 Ongoing Recv: No, last retrans dmp seq 0 IPCLW:[0.917]{-}[WAIT]:UTIL: [1753918372165022] Flags: 0x00000001 THint: 0x44cab1bf00000226 IHint: 0x705ddeb40000001f IPCLW:[0.918]{-}[WAIT]:UTIL: [1753918372165022] IPCLW State Dump IPCLW:[0.919]{-}[WAIT]:UTIL: [1753918372165022] UDP PORT 0x7f6c6dd93ad8, [LMS1.KSMSQ_dlm] Dump IPCLW:[0.920]{-}[WAIT]:UTIL: [1753918372165022] ID: 0x3b5d2e10 Type: UDP PT Opts: [] Flags: [], Address: 192.168.2.20:9149, MPT: No IPCLW:[0.921]{-}[WAIT]:UTIL: [1753918372165022] Last DataReady time 0:0:0.67.67549 Last Unrel msg rcvd time 487199:32:52.165.165022 IPCLW:[0.922]{-}[WAIT]:UTIL: [1753918372165022] ACNH Dump IPCLW:[0.923]{-}[WAIT]:UTIL: [1753918372165022] PCNH Dump IPCLW:[0.924]{-}[WAIT]:UTIL: [1753918372165022] PCNH 0x7f6c65ff79d8 State: 1 MSN: 2057350729 Seq: 2099703066 Last Rcv 8062:40:44.501.501764 Last valid Rcv 8062:40:44.501.501764 IPCLW:[0.925]{-}[WAIT]:UTIL: [1753918372165022] Peer: LMS1.KSMSQ_dlm.9072 AckSeq: 2099703066. # Coalesced: 0
3、等待事件分析
故障期间等待事件情况:
07:30:00至07:50:00 数据库存在大量row cache lock和GC等待,这应该是故障引起的。

对比30号相同时间段,30号没有异常等待事件。

说明故障期间,私网存在问题。
4、网络分析
客户这边数据库每天07:30都有一波定时任务,并且多套数据库都经过一个交换机,导致私网延迟30s发生了节点驱逐。

建议定时任务错开事件,避免网络拥塞。
解决方案
客户自行重启了异常节点,故障已自动恢复。

墨天轮从乐知乐享的数据库技术社区蓄势出发,全面升级,提供多类型数据库管理服务。墨天轮数据库管理服务旨在为用户构建信赖可托付的数据库环境,并为数据库厂商提供中立的生态支持。
墨天轮数据库服务官网:https://www.modb.pro/service

125

被折叠的 条评论
为什么被折叠?



