大量慢SQL导致节点宕机的故障分析

在2018年8月28日的一个数据仓库系统中,由于SQL执行时间过长,导致数据库节点宕机和SESSION数量激增。故障分析发现,SQL执行计划错误,全表扫描而非使用索引,可能是由于索引统计信息过期或失效。索引失效原因是删除分区后未维护全局索引。解决方案包括KILL掉长时间运行的SESSION并重建失效索引,同时强调在维护索引时避免使用ONLINE选项以避免等待DML操作完成。

故障背景:项目负责人找到我说2018-08-28 11:40:00-2018-08-28 11:45:00左右收到频繁告警短信 数据库节点1宕机,节点2 SESSION数一直在增加。这是一个数据仓库系统,正常情况下SESSION一般是90左右,现在将近500,而且还在增长

select *
  from dba_hist_active_sess_history
 where to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
       '2018-08-28 11:40:00'
   and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
       '2018-08-28 11:45:00'
   and instance_number = 1
       group by sql_id order by 1 desc

发现果然没有数据返回,说明这时候节点1已经无法不提供服务了

再次向接口人确认得知从2018-08-28 11:00:00就开始收到告警短信了……于是发出如下查询

select max(to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss')),
       min(to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss'))
  from dba_hist_active_sess_history
 where instance_number = 1
   and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
       '2018-08-28 11:00:00'
   and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
       '2018-08-28 11:45:00'

 '2018-08-28 11:00:09'  '2018-08-28 11:21:59'
说明节点一在2018-08-28 11:21:59的时候 就已经GG了

一、查一下这最后的20分钟里面节点1经历了什么?首先查SQL语句,有哪些执行时间很长的SQL语句

select count(*),sql_id
  from dba_hist_active_sess_history
 where to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') >
       '2018-08-28 11:00:09'
   and to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss') <
       '2018-08-28 11:21:59'
       and  instance_number = 1
       group by sql_id order by 1 desc

6903  5cj4kmp6aa9ux
5873  9zn6bkn0kcgv7
3032  cxqmbbffgdbhx
1374  22wxyvttg0xr3
1240  g1q9a31rmfhh9
1095  8bp70fca7gbuk
…………省略

发现节点1的SQL执行时间很过分,将近7w秒,20个小时。

(为什么我们采集时间是20分钟,能够采集到将近7000个snapshot?其实很容易理解,同一个SQL_ID执行了很多次……)

产生大批量的SQL"堵塞"的原因 一般是1.有锁,2.SQL本身的执行时间很长。

如果产生锁 一般也是holder的语句执行时间太长,导致后面大量的waiter在排队等待。我一般先排查SQL问题

查看SQL语句历史执行计划以及相关表信息

select * from table(dbms_xplan.display_awr('5cj4kmp6aa9ux'))
SQL_ID 5cj4kmp6aa9ux
--------------------
SELECT trd.ID          as id,
       trd.BILL_CODE   as billCode,
       trd.BAR_CODE    as barCode,
       trd.PRINT_COUNT as printCount,
       trd.MONEY_VALUE as moneyValue,
       trd.IMPORT_TIME as importTime,
       trd.TYPE_CODE   as typeCode,
       trdd.XML_VALUE  as xmlValue
  FROM T_AAAAIAA_DDDD trd
  LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
    ON trdd.BILL_CODE = trd.BILL_CODE
 WHERE QUERY_FLAG = 0
   AND ID IN (:1)
 
Plan hash value: 477462738
 
-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                       |       |       |  4649K(100)|          |       |       |
|   1 |  NESTED LOOPS OUTER                 |                       |     1 |  1138 |  4649K  (1)| 15:29:55 |       |       |
|   2 |   PARTITION RANGE ALL               |                       |     1 |   216 |  4649K  (1)| 15:29:54 |     1 |    46 |
|   3 |    TABLE ACCESS FULL                | T_AAAAIAA_DDDD        |     1 |   216 |  4649K  (1)| 15:29:54 |     1 |    46 |
|   4 |   PARTITION RANGE ALL               |                       |     1 |   922 |   140   (0)| 00:00:02 |     1 |    46 |
|   5 |    TABLE ACCESS BY LOCAL INDEX ROWID| T_AAAAIAA_DDDD_DETAIL |     1 |   922 |   140   (0)| 00:00:02 |     1 |    46 |
|   6 |     INDEX RANGE SCAN                | UK_DATA_DETAIL        |     1 |       |   138   (0)| 00:00:02 |     1 |    46 |
-----------------------------------------------------------------------------------------------------------------------------
 
 
Plan hash value: 1496664323
 
-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                       |       |       |   103 (100)|          |       |       |
|   1 |  NESTED LOOPS OUTER                 |                       |     1 |  1098 |   103   (0)| 00:00:02 |       |       |
|   2 |   TABLE ACCESS BY GLOBAL INDEX ROWID| T_AAAAIAA_DDDD        |     1 |   214 |     3   (0)| 00:00:01 | ROWID | ROWID |
|   3 |    INDEX UNIQUE SCAN                | SYS_C009064           |     1 |       |     2   (0)| 00:00:01 |       |       |
|   4 |   PARTITION RANGE ALL               |                       |     1 |   884 |   100   (0)| 00:00:02 |     1 |    49 |
|   5 |    TABLE ACCESS BY LOCAL INDEX ROWID| T_AAAAIAA_DDDD_DETAIL |     1 |   884 |   100   (0)| 00:00:02 |     1 |    49 |
|   6 |     INDEX RANGE SCAN                | UK_DATA_DETAIL        |     1 |       |    98   (0)| 00:00:02 |     1 |    49 |
-----------------------------------------------------------------------------------------------------------------------------


相关表信息如下:
select segment_name,sum(bytes/1024/1024/1024) as gb from dba_segments where segment_name in('T_AAAAIAA_DDDD','T_AAAAIAA_DDDD_DETAIL') group by segment_name;
T_AAAAIAA_DDDD           133.2734375
T_AAAAIAA_DDDD_DETAIL    299.7568359375

select table_name,num_rows from dba_tables where table_name in ('T_AAAAIAA_DDDD','T_AAAAIAA_DDDD_DETAIL');
T_AAAAIAA_DDDD           297728148
T_AAAAIAA_DDDD_DETAIL    278554082

两个表都是‘大’表,执行计划的区别在于Id=3 ,一个走了全表扫描,一个走了索引唯一扫描。对于一个133G的表来说,他们的差距是136468:5 具体计算方法 全表: 133.2734375*1024*1024/(8*128),索引:root+branch+branch+leaf+回表block=5,直接导致了一个毫秒级响应的SQL执行了上千秒

二、为什么明明有索引,却不走索引?

1.索引的统计信息过期

2.索引失效了

执行如下查询验证:

select a.LAST_ANALYZED,a.STALE_STATS from dba_ind_statistics a where index_name ='SYS_C009064';
2018/8/22 3:44:40	NO

select index_name,table_name,status from dba_indexes where index_name ='SYS_C009064';
SYS_C009064	T_AAAAIAA_DDDD	UNUSABLE

索引的状态是UNUSABLE ,失效!

还有一种简单方法验证索引是失效还是统计信息过期:

如果使用INDEX(T,I) HINT报错则说明索引失效,反之则是过期

三、索引为什么会失效?

一般都是对分区表删除分区 忘记维护全局索引会导致全局索引失效!

通过确认是自动清理的脚本运行失败,开发人员手动删分区忘记维护索引所致

对于索引失效的处理rebuild

alter index xxxx.SYS_C009064 rebuild parallel 8;
ORA-00054: 资源正忙, 但指定以 NOWAIT 方式获取资源, 或者超时失效

这里有一个细节,在对业务不能完全把控情况下,维护索引最好不要加online。有错能及时报出来,这是我的一个人习惯。如果我这里加了online 估计会等很长时间。而这里报错能让我知道表上面有DML正在发生,加online会等待所有DML执行完成,结果 你懂得!

查一下执行时间排在前面的其他几条SQL有没有DML语句

select *
  from dba_hist_sqltext
 where sql_id in ('5cj4kmp6aa9ux',
                  '9zn6bkn0kcgv7',
                  'cxqmbbffgdbhx',
                  '22wxyvttg0xr3',
                  'g1q9a31rmfhh9',
                  '8bp70fca7gbuk');
------------------------------------------
 SELECT trd.ID          as id,
       trd.BILL_CODE   as billCode,
       trd.BAR_CODE    as barCode,
       trd.PRINT_COUNT as printCount,
       trd.MONEY_VALUE as moneyValue,
       trd.IMPORT_TIME as importTime,
       trd.TYPE_CODE   as typeCode,
       trdd.XML_VALUE  as xmlValue
  FROM T_AAAAIAA_DDDD trd
  LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
    ON trdd.BILL_CODE = trd.BILL_CODE
 WHERE QUERY_FLAG = 0
   AND ID IN (:1)
------------------------------------------
SELECT trd.ID          as id,
       trd.BILL_CODE   as billCode,
       trd.BAR_CODE    as barCode,
       trd.PRINT_COUNT as printCount,
       trd.MONEY_VALUE as moneyValue,
       trd.IMPORT_TIME as importTime,
       trd.TYPE_CODE   as typeCode,
       trdd.XML_VALUE  as xmlValue
  FROM T_AAAAIAA_DDDD trd
  LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
    ON trdd.BILL_CODE = trd.BILL_CODE
 WHERE QUERY_FLAG = 0
   AND ID IN (:1, :2)
------------------------------------------
SELECT trd.ID          as id,
       trd.BILL_CODE   as billCode,
       trd.BAR_CODE    as barCode,
       trd.PRINT_COUNT as printCount,
       trd.MONEY_VALUE as moneyValue,
       trd.IMPORT_TIME as importTime,
       trd.TYPE_CODE   as typeCode,
       trdd.XML_VALUE  as xmlValue
  FROM T_AAAAIAA_DDDD trd
  LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
    ON trdd.BILL_CODE = trd.BILL_CODE
 WHERE QUERY_FLAG = 0
   AND ID IN (:1, :2, :3)
------------------------------------------
SELECT trd.ID          as id,
       trd.BILL_CODE   as billCode,
       trd.BAR_CODE    as barCode,
       trd.PRINT_COUNT as printCount,
       trd.MONEY_VALUE as moneyValue,
       trd.IMPORT_TIME as importTime,
       trd.TYPE_CODE   as typeCode,
       trdd.XML_VALUE  as xmlValue
  FROM T_AAAAIAA_DDDD trd
  LEFT JOIN T_AAAAIAA_DDDD_DETAIL trdd
    ON trdd.BILL_CODE = trd.BILL_CODE
 WHERE QUERY_FLAG = 0
   AND ID IN (:1, :2, :3, :4)
------------------------------------------
select bill0_.ID           as ID7_0_,
       bill0_.BAR_CODE     as BAR2_7_0_,
       bill0_.PACKAGE_NAME as PACKAGE3_7_0_,
       bill0_.FILE_NAME    as FILE4_7_0_,
       bill0_.REPORT_TIME  as REPORT5_7_0_,
       bill0_.IMPORT_TIME  as IMPORT6_7_0_,
       bill0_.PRINT_COUNT  as PRINT7_7_0_,
       bill0_.MONEY_VALUE  as MONEY8_7_0_,
       bill0_.BILL_CODE    as BILL9_7_0_,
       bill0_.TRADE_TIME   as TRADE10_7_0_,
       bill0_.COUNTER_CODE as COUNTER11_7_0_,
       bill0_.ACCOUNT_CODE as ACCOUNT12_7_0_,
       bill0_.QUERY_FLAG   as QUERY13_7_0_,
       bill0_.ORG_CODE     as ORG14_7_0_,
       bill0_.TYPE_CODE    as TYPE15_7_0_
  from T_AAAAIAA_DDDD bill0_
 where bill0_.ID = :1
------------------------------------------
UPDATE T_AAAAIAA_DDDD   SET PRINT_COUNT = PRINT_COUNT + 1   WHERE ID = :1  

我们查到一条UPDATE语句,而这个UPDATE语句正常的执行计划也是通过INDEX UNIQUE SCAN毫秒级响应……

这里全表扫描需要执行上千秒,而且从PLSQL的SESSION信息里面可以查出后面还有很多的UPDATE语句在执行……

四、解决问题:KILL掉所有SESSION  (注意要先停止应用,要不然有些会话会被重新调起执行)

SELECT 'ALTER SYSTEM DISCONNECT SESSION ' || '''' || SID || ',' || SERIAL# || '''' ||
       ' IMMEDIATE;'
  FROM GV$SESSION
 WHERE STATUS = 'ACTIVE'
   AND USERNAME = 'xxxx'
   AND OSUSER = 'was'

ALTER SYSTEM DISCONNECT SESSION '2,35' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '3,19' IMMEDIATE;
…中间省略400行…
ALTER SYSTEM DISCONNECT SESSION '4,75' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '5,17' IMMEDIATE;
ALTER SYSTEM DISCONNECT SESSION '6,15' IMMEDIATE;

五、KILL掉所有会话之后再rebuild索引

alter index xxxx.SYS_C009064 rebuild parallel 8;
alter index xxxx.SYS_C009064 noparallel;
注意rebuild开了并行一定要把索引的并行去掉,要不然只要索引被使用就会自动开并行8
常识了解一下 3亿行数据133GB的表 rebuild索引开并行8需要4分钟

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值