故障背景:项目负责人找到我说2018-08-28 11:40:00-2018-08-28 11:45:00左右收到频繁告警短信 数据库节点1宕机,节点2 SESSION数一直在增加。这是一个数据仓库系统,正常情况下SESSION一般是90左右,现在将近500,而且还在增长
select *
from dba_hist_active_sess_history
where to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
'2018-08-28 11:40:00'
and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
'2018-08-28 11:45:00'
and instance_number = 1
group by sql_id order by 1 desc
发现果然没有数据返回,说明这时候节点1已经无法不提供服务了
再次向接口人确认得知从2018-08-28 11:00:00就开始收到告警短信了……于是发出如下查询
select max(to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss')),
min(to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss'))
from dba_hist_active_sess_history
where instance_number = 1
and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
'2018-08-28 11:00:00'
and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
'2018-08-28 11:45:00'
'2018-08-28 11:00:09' '2018-08-28 11:21:59'
说明节点一在2018-08-28 11:21:59的时候 就已经GG了
一、查一下这最后的20分钟里面节点1经历了什么?首先查SQL语句,有哪些执行时间很长的SQL语句
select count(*),sql_id
from dba_hist_active_sess_history
where to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
'2018-08-28 11:00:09'
and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
'2018-08-28 11:21:59'
and instance_number = 1
group by sql_id order by 1 desc
6903 5cj4kmp6aa9ux
5873 9zn6bkn0kcgv7
3032 cxqmbbffgdbhx
1374 22wxyvttg0xr3
1240 g1q9a31rmfhh9
1095 8bp70fca7gbuk
…………省略
发现节点1的SQL执行时间很过分,将近7w秒,20个小时。
(为什么我们采集时间是20分钟,能够采集到将近7000个snapshot?其实很容易理解,同一个SQL_ID执行了很多次……)
产生大批量的SQL"堵塞"的原因 一般是1.有锁,2.SQL本身的执行时间很长。
如果产生锁 一般也是holder的语句执行时间太长,导致后面大量的waiter在排队等待。我一般先排查SQL问题
查看SQL语句历史执行计划以及相关表信息
select * from table(dbms_xplan.display_awr('5cj4kmp6aa9ux'))
SQL_ID 5cj4kmp6aa9ux
--------------------
SELECT trd.ID as id,
trd.BILL_CODE as billCode,
trd.BAR_CODE as barCode,
trd.PRINT_COUNT as printCount,
trd.MONEY_VALUE as moneyValue,
trd.IMPORT_TIME as importTime,
trd.TYPE_CODE as typeCode,
trdd.XML_VALUE as xmlValue
FROM T_AAAAIAA_DDDD trd
LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
ON trdd.BILL_CODE = trd.BILL_CODE
WHERE QUERY_FLAG = 0
AND ID IN (:1)
Plan hash value: 477462738
-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4649K(100)| | | |
| 1 | NESTED LOOPS OUTER | | 1 | 1138 | 4649K (1)| 15:29:55 | | |
| 2 | PARTITION RANGE ALL | | 1 | 216 | 4649K (1)| 15:29:54 | 1 | 46 |
| 3 | TABLE ACCESS FULL | T_AAAAIAA_DDDD | 1 | 216 | 4649K (1)| 15:29:54 | 1 | 46 |
| 4 | PARTITION RANGE ALL | | 1 | 922 | 140 (0)| 00:00:02 | 1 | 46 |
| 5 | TABLE ACCESS BY LOCAL INDEX ROWID| T_AAAAIAA_DDDD_DETAIL | 1 | 922 | 140 (0)| 00:00:02 | 1 | 46 |
| 6 | INDEX RANGE SCAN | UK_DATA_DETAIL | 1 | | 138 (0)| 00:00:02 | 1 | 46 |
-----------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1496664323
-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 103 (100)| | | |
| 1 | NESTED LOOPS OUTER | | 1 | 1098 | 103 (0)| 00:00:02 | | |
| 2 | TABLE ACCESS BY GLOBAL INDEX ROWID| T_AAAAIAA_DDDD | 1 | 214 | 3 (0)| 00:00:01 | ROWID | ROWID |
| 3 | INDEX UNIQUE SCAN | SYS_C009064 | 1 | | 2 (0)| 00:00:01 | | |
| 4 | PARTITION RANGE ALL | | 1 | 884 | 100 (0)| 00:00:02 | 1 | 49 |
| 5 | TABLE ACCESS BY LOCAL INDEX ROWID| T_AAAAIAA_DDDD_DETAIL | 1 | 884 | 100 (0)| 00:00:02 | 1 | 49 |
| 6 | INDEX RANGE SCAN | UK_DATA_DETAIL | 1 | | 98 (0)| 00:00:02 | 1 | 49 |
-----------------------------------------------------------------------------------------------------------------------------
相关表信息如下:
select segment_name,sum(bytes/1024/1024/1024) as gb from dba_segments where segment_name in('T_AAAAIAA_DDDD','T_AAAAIAA_DDDD_DETAIL') group by segment_name;
T_AAAAIAA_DDDD 133.2734375
T_AAAAIAA_DDDD_DETAIL 299.7568359375
select table_name,num_rows from dba_tables where table_name in ('T_AAAAIAA_DDDD','T_AAAAIAA_DDDD_DETAIL');
T_AAAAIAA_DDDD 297728148
T_AAAAIAA_DDDD_DETAIL 278554082
两个表都是‘大’表,执行计划的区别在于Id=3 ,一个走了全表扫描,一个走了索引唯一扫描。对于一个133G的表来说,他们的差距是136468:5 具体计算方法 全表: 133.2734375*1024*1024/(8*128),索引:root+branch+branch+leaf+回表block=5,直接导致了一个毫秒级响应的SQL执行了上千秒
二、为什么明明有索引,却不走索引?
1.索引的统计信息过期
2.索引失效了
执行如下查询验证:
select a.LAST_ANALYZED,a.STALE_STATS from dba_ind_statistics a where index_name ='SYS_C009064';
2018/8/22 3:44:40 NO
select index_name,table_name,status from dba_indexes where index_name ='SYS_C009064';
SYS_C009064 T_AAAAIAA_DDDD UNUSABLE
索引的状态是UNUSABLE ,失效!
还有一种简单方法验证索引是失效还是统计信息过期:
如果使用INDEX(T,I) HINT报错则说明索引失效,反之则是过期
三、索引为什么会失效?
一般都是对分区表删除分区 忘记维护全局索引会导致全局索引失效!
通过确认是自动清理的脚本运行失败,开发人员手动删分区忘记维护索引所致
对于索引失效的处理rebuild
alter index xxxx.SYS_C009064 rebuild parallel 8;
ORA-00054: 资源正忙, 但指定以 NOWAIT 方式获取资源, 或者超时失效
这里有一个细节,在对业务不能完全把控情况下,维护索引最好不要加online。有错能及时报出来,这是我的一个人习惯。如果我这里加了online 估计会等很长时间。而这里报错能让我知道表上面有DML正在发生,加online会等待所有DML执行完成,结果 你懂得!
查一下执行时间排在前面的其他几条SQL有没有DML语句
select *
from dba_hist_sqltext
where sql_id in ('5cj4kmp6aa9ux',
'9zn6bkn0kcgv7',
'cxqmbbffgdbhx',
'22wxyvttg0xr3',
'g1q9a31rmfhh9',
'8bp70fca7gbuk');
------------------------------------------
SELECT trd.ID as id,
trd.BILL_CODE as billCode,
trd.BAR_CODE as barCode,
trd.PRINT_COUNT as printCount,
trd.MONEY_VALUE as moneyValue,
trd.IMPORT_TIME as importTime,
trd.TYPE_CODE as typeCode,
trdd.XML_VALUE as xmlValue
FROM T_AAAAIAA_DDDD trd
LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
ON trdd.BILL_CODE = trd.BILL_CODE
WHERE QUERY_FLAG = 0
AND ID IN (:1)
------------------------------------------
SELECT trd.ID as id,
trd.BILL_CODE as billCode,
trd.BAR_CODE as barCode,
trd.PRINT_COUNT as printCount,
trd.MONEY_VALUE as moneyValue,
trd.IMPORT_TIME as importTime,
trd.TYPE_CODE as typeCode,
trdd.XML_VALUE as xmlValue
FROM T_AAAAIAA_DDDD trd
LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
ON trdd.BILL_CODE = trd.BILL_CODE
WHERE QUERY_FLAG = 0
AND ID IN (:1, :2)
------------------------------------------
SELECT trd.ID as id,
trd.BILL_CODE as billCode,
trd.BAR_CODE as barCode,
trd.PRINT_COUNT as printCount,
trd.MONEY_VALUE as moneyValue,
trd.IMPORT_TIME as importTime,
trd.TYPE_CODE as typeCode,
trdd.XML_VALUE as xmlValue
FROM T_AAAAIAA_DDDD trd
LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
ON trdd.BILL_CODE = trd.BILL_CODE
WHERE QUERY_FLAG = 0
AND ID IN (:1, :2, :3)
------------------------------------------
SELECT trd.ID as id,
trd.BILL_CODE as billCode,
trd.BAR_CODE as barCode,
trd.PRINT_COUNT as printCount,
trd.MONEY_VALUE as moneyValue,
trd.IMPORT_TIME as importTime,
trd.TYPE_CODE as typeCode,
trdd.XML_VALUE as xmlValue
FROM T_AAAAIAA_DDDD trd
LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
ON trdd.BILL_CODE = trd.BILL_CODE
WHERE QUERY_FLAG = 0
AND ID IN (:1, :2, :3, :4)
------------------------------------------
select bill0_.ID as ID7_0_,
bill0_.BAR_CODE as BAR2_7_0_,
bill0_.PACKAGE_NAME as PACKAGE3_7_0_,
bill0_.FILE_NAME as FILE4_7_0_,
bill0_.REPORT_TIME as REPORT5_7_0_,
bill0_.IMPORT_TIME as IMPORT6_7_0_,
bill0_.PRINT_COUNT as PRINT7_7_0_,
bill0_.MONEY_VALUE as MONEY8_7_0_,
bill0_.BILL_CODE as BILL9_7_0_,
bill0_.TRADE_TIME as TRADE10_7_0_,
bill0_.COUNTER_CODE as COUNTER11_7_0_,
bill0_.ACCOUNT_CODE as ACCOUNT12_7_0_,
bill0_.QUERY_FLAG as QUERY13_7_0_,
bill0_.ORG_CODE as ORG14_7_0_,
bill0_.TYPE_CODE as TYPE15_7_0_
from T_AAAAIAA_DDDD bill0_
where bill0_.ID = :1
------------------------------------------
UPDATE T_AAAAIAA_DDDD SET PRINT_COUNT = PRINT_COUNT + 1 WHERE ID = :1
我们查到一条UPDATE语句,而这个UPDATE语句正常的执行计划也是通过INDEX UNIQUE SCAN毫秒级响应……
这里全表扫描需要执行上千秒,而且从PLSQL的SESSION信息里面可以查出后面还有很多的UPDATE语句在执行……
四、解决问题:KILL掉所有SESSION (注意要先停止应用,要不然有些会话会被重新调起执行)
SELECT 'ALTER SYSTEM DISCONNECT SESSION ' || '''' || SID || ',' || SERIAL# || '''' ||
' IMMEDIATE;'
FROM GV$SESSION
WHERE STATUS = 'ACTIVE'
AND USERNAME = 'xxxx'
AND OSUSER = 'was'
ALTER SYSTEM DISCONNECT SESSION '2,35' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '3,19' IMMEDIATE;
…中间省略400行…
ALTER SYSTEM DISCONNECT SESSION '4,75' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '5,17' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '6,15' IMMEDIATE;
五、KILL掉所有会话之后再rebuild索引
alter index xxxx.SYS_C009064 rebuild parallel 8;
alter index xxxx.SYS_C009064 noparallel;
注意rebuild开了并行一定要把索引的并行去掉,要不然只要索引被使用就会自动开并行8
常识了解一下 3亿行数据133GB的表 rebuild索引开并行8需要4分钟

在2018年8月28日的一个数据仓库系统中,由于SQL执行时间过长,导致数据库节点宕机和SESSION数量激增。故障分析发现,SQL执行计划错误,全表扫描而非使用索引,可能是由于索引统计信息过期或失效。索引失效原因是删除分区后未维护全局索引。解决方案包括KILL掉长时间运行的SESSION并重建失效索引,同时强调在维护索引时避免使用ONLINE选项以避免等待DML操作完成。

4832

被折叠的 条评论
为什么被折叠?



