Notes on Tracking Down a GreenPlum Deadlock

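This post records a deadlock encountered on GreenPlum 5.0. It occurred when a custom stored procedure ran while gpcrondump was taking a backup. Querying pg_stat_activity, pg_locks, and related tables showed that two connections on the Master node were each waiting for a lock held by the other. Possible fixes are to avoid running DDL and DML operations at the same time, or to use an extra table to make the operations mutually exclusive.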

Problem description

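  • On GreenPlum 5.0, if a stored procedure we wrote ourselves runs while gpcrondump is taking a backup, the database is very likely to deadlock.
  • The stored procedure creates and drops tables, imports data, and performs similar operations.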

Analysis

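  • Not being very familiar with either GPDB or PostgreSQL, I first searched online for "GreenPlum 死锁" (GreenPlum deadlock) and, sure enough, found a write-up of a troubleshooting process. The author was very thorough, walking through his GPDB deadlock investigation almost step by step, with screenshots. Following that article, the steps were:
    1. When the problem reproduces, look in pg_stat_activity for tasks that are waiting on a lock: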
      select * from pg_stat_activity where waiting_reason='lock';
template1=# select * from pg_stat_activity where waiting_reason='lock';
 datid | datname | procpid | sess_id | usesysid | usename |                           current_query                           | waiting |          query_start          |         backend_start         | client_addr | client_port | application_name |          xact_start           | waiting_reason | rsgid | rsgname | rsgqueueduration 
-------+---------+---------+---------+----------+---------+-------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------+------------------+-------------------------------+----------------+-------+---------+------------------
 97128 | dsst    |     923 |     244 |       10 | gpadmin | truncate c_picrecord_1_prt_extra ;                                | t       | 2017-12-11 15:10:18.840394+08 | 2017-12-08 16:57:48.826455+08 |             |          -1 | psql             | 2017-12-11 15:10:18.840394+08 | lock           |     0 |         | 
 97128 | dsst    |   20528 |     393 |       10 | gpadmin | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | t       | 2017-12-11 15:10:25.10717+08  | 2017-12-11 15:10:24.877717+08 |             |          -1 |                  | 2017-12-11 15:10:24.890674+08 | lock           |     0 |         | 
(2 rows)
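  2. This shows that the two tasks truncate c_picrecord_1_prt_extra and SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) are deadlocked.
    The former is a step in our own stored procedure; the latter is a task generated automatically by the database backup.
  3. Join pg_locks, pg_class, and pg_stat_activity to find the locks related to these tasks. Here we filter by the names of the tables used in the stored procedure, keeping the rows whose relname matches %picrecord%: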
    select a.locktype,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and b.relname like '%picrecord%';
select a.locktype,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and b.relname like '%picrecord%';
 locktype |         relname         |                             substring                             |          xact_start           |  pid  |        mode         | granted 
----------+-------------------------+-------------------------------------------------------------------+-------------------------------+-------+---------------------+---------
 relation | c_picrecord_id_seq      | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock     | t
 relation | c_picrecord             | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock     | t
 relation | c_picrecord_1_prt_extra | truncate c_picrecord_1_prt_extra ;                                | 2017-12-11 15:10:18.840394+08 |   923 | AccessExclusiveLock | t
 relation | c_picrecord_1_prt_extra | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock     | f
 relation | c_picrecord_bak         | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock     | t
 relation | c_picrecord_bak1        | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock     | t
(6 rows)
- PS: If no matching rows turn up, try matching on other keywords.
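To read the output above more quickly, the rows can be grouped by relation so that granted and ungranted requests sit side by side. The following is a minimal Python sketch, not part of the original investigation: the tuples are copied by hand from the query result, and the helper name contended_relations is made up for illustration.

```python
from collections import defaultdict

# Rows copied from the query output above: (relname, pid, mode, granted)
LOCK_ROWS = [
    ("c_picrecord_id_seq", 20528, "AccessShareLock", True),
    ("c_picrecord", 20528, "AccessShareLock", True),
    ("c_picrecord_1_prt_extra", 923, "AccessExclusiveLock", True),
    ("c_picrecord_1_prt_extra", 20528, "AccessShareLock", False),
    ("c_picrecord_bak", 20528, "AccessShareLock", True),
    ("c_picrecord_bak1", 20528, "AccessShareLock", True),
]


def contended_relations(rows):
    """Return {relname: (holders, waiters)} for relations with an ungranted request."""
    holders = defaultdict(list)
    waiters = defaultdict(list)
    for relname, pid, mode, granted in rows:
        (holders if granted else waiters)[relname].append((pid, mode))
    # Only relations that somebody is waiting on are interesting here.
    return {rel: (holders[rel], waiters[rel]) for rel in waiters}


if __name__ == "__main__":
    for rel, (held, waiting) in contended_relations(LOCK_ROWS).items():
        print(rel, "held by", held, "awaited by", waiting)
```

With these rows the only contended relation is c_picrecord_1_prt_extra: pid 923 holds it and pid 20528 waits for it. That explains one direction of the wait but not yet why pid 923 is itself stuck.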

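4. In the write-up found online, the author then ran the same query on each Segment and tabulated the locks each connection held and was waiting for. His conclusion was that two connections were waiting for each other's locks across the Master node and the Segment nodes, which caused the deadlock. Following the same approach, we looked for the segment nodes involved in the tasks above.
5. pg_locks already contains a gp_segment_id column and a pid column, so we changed the SQL statement to select them directly: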

# select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and b.relname like '%picrecord%';
 locktype |  pid  | gp_segment_id |         relname         |                             substring                             |          xact_start           |  pid  |        mode         | granted 
----------+-------+---------------+-------------------------+-------------------------------------------------------------------+-------------------------------+-------+---------------------+---------
 relation | 27227 |            -1 | c_picrecord_id_seq      | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock     | t
 relation | 27227 |            -1 | c_picrecord             | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock     | t
 relation | 15628 |            -1 | c_picrecord_1_prt_extra | TRUNCATE c_picrecord_1_prt_extra ;                                | 2017-12-11 17:09:27.829242+08 | 15628 | AccessExclusiveLock | t
 relation | 27227 |            -1 | c_picrecord_1_prt_extra | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock     | f
 relation | 27227 |            -1 | c_picrecord_bak         | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock     | t
 relation | 27227 |            -1 | c_picrecord_bak1        | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock     | t
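  6. Every gp_segment_id is -1 (the value that denotes the Master), which suggests that no segment node is involved. To confirm, we queried every segment node by hand and indeed found no related rows on any of them. (PS: for how to connect psql directly to a segment, see the reference "Greenplum 如何直连segment节点".)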

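  7. Thinking it over: the result of step 5 alone does not amount to a deadlock, and no held or awaited locks show up on the segments. Why?
    Perhaps the two tasks are already deadlocked on the Master. Because we filtered by our own business table names, there may be other locks that the query did not return.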

  8. Query the locks of the TRUNCATE c_picrecord_1_prt_extra task directly, this time matching on its pid (15628). Sure enough, it is still waiting for a RowExclusiveLock on pg_class.
    select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and a.pid='15628';
select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and a.pid='15628';
 locktype |  pid  | gp_segment_id |         relname         |             substring              |          xact_start           |  pid  |        mode         | granted 
----------+-------+---------------+-------------------------+------------------------------------+-------------------------------+-------+---------------------+---------
 relation | 15628 |            -1 | pg_class                | TRUNCATE c_picrecord_1_prt_extra ; | 2017-12-11 17:09:27.829242+08 | 15628 | RowExclusiveLock    | f
 relation | 15628 |            -1 | c_picrecord_1_prt_extra | TRUNCATE c_picrecord_1_prt_extra ; | 2017-12-11 17:09:27.829242+08 | 15628 | AccessExclusiveLock | t
  9. Next, query the locks related to SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true), whose pid is 27227. It already holds an AccessShareLock on pg_class:
select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and a.pid='27227';
 locktype |  pid  | gp_segment_id |                            relname                             |                             substring                             |          xact_start           |  pid  |      mode       | granted 
----------+-------+---------------+----------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------+-------+-----------------+---------
 relation | 27227 |            -1 | pg_authid                                                      | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
 ……
 relation | 27227 |            -1 | pg_class                                                       | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
 ……
 relation | 27227 |            -1 | c_picrecord_1_prt_extra                                        | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true)  | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | f
 ……
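The rows from the two per-pid queries can be combined into a wait-for graph and checked for a cycle. The sketch below is an illustration in Python under stated assumptions: the tuples are copied by hand from the output above (rows elided with "……" are not included), the function names are invented, and the graph is deliberately coarse. It treats any granted lock on the same relation as a potential blocker and ignores PostgreSQL's lock-mode conflict table, so an edge marks a candidate blocker rather than a proven one.

```python
from collections import defaultdict

# (pid, relname, mode, granted), copied from the two per-pid queries above.
# Rows elided with "……" in the output are not reproduced here.
LOCK_ROWS = [
    (15628, "pg_class", "RowExclusiveLock", False),
    (15628, "c_picrecord_1_prt_extra", "AccessExclusiveLock", True),
    (27227, "pg_class", "AccessShareLock", True),
    (27227, "c_picrecord_1_prt_extra", "AccessShareLock", False),
]


def build_wait_graph(rows):
    """Map each waiting pid to the other pids holding a lock on the same relation.

    Simplification: lock modes are ignored, so edges are candidates only.
    """
    holders = defaultdict(set)
    for pid, relname, _mode, granted in rows:
        if granted:
            holders[relname].add(pid)
    graph = defaultdict(set)
    for pid, relname, _mode, granted in rows:
        if not granted:
            graph[pid] |= holders.get(relname, set()) - {pid}
    return dict(graph)


def find_cycle(graph):
    """Return one cycle as a list of pids (start pid repeated at the end), or None."""
    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for nxt in sorted(graph.get(node, ())):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in sorted(graph):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(build_wait_graph(LOCK_ROWS)))
```

On these four rows the graph has an edge from 15628 to 27227 and one back again, which is the mutual wait the analysis is looking for. Confirming each edge still requires checking the lock modes involved.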
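  10. At this point the direct cause of the deadlock is clear:
    • TRUNCATE c_picrecord_1_prt_extra holds the AccessExclusiveLock on c_picrecord_1_prt_extra and is waiting for a RowExclusiveLock on pg_class;
    • SELECT pg_get_partition_def holds an AccessShareLock on pg_class and is waiting for an AccessShareLock on c_picrecord_1_prt_extra. Hence the deadlock. (AccessShareLock on its own does not conflict with RowExclusiveLock in PostgreSQL's lock conflict table, so the backup session presumably also holds a stronger lock on pg_class among the rows elided above.)
    • PS: Reportedly, exporting data and DDL operations conflict in GPDB 5.0, which is probably a bug. A workaround is to create an extra table and have every operation try to lock it first, so that the operations become mutually exclusive.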
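The extra-table idea from the PS can be sketched roughly as follows. This is only an illustration, not something tested against GPDB in the original post. It assumes a Python DB-API 2.0 connection with autocommit off (for example one obtained from psycopg2), and a hypothetical empty table named op_mutex created once beforehand. The function name run_exclusively is likewise invented.

```python
import re

# Hypothetical empty table, created once beforehand, used only as a mutex.
MUTEX_TABLE = "op_mutex"


def run_exclusively(conn, statements, mutex_table=MUTEX_TABLE):
    """Run statements in one transaction while holding an exclusive lock on mutex_table.

    conn is assumed to be a DB-API 2.0 connection with autocommit off, so the
    LOCK TABLE below stays in force until commit() or rollback().
    """
    # The table name is interpolated into SQL, so accept plain identifiers only.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", mutex_table):
        raise ValueError("unexpected mutex table name: %r" % mutex_table)
    cur = conn.cursor()
    try:
        # Blocks until every other session holding a lock on the mutex table has finished.
        cur.execute("LOCK TABLE %s IN ACCESS EXCLUSIVE MODE" % mutex_table)
        for sql in statements:
            cur.execute(sql)
        conn.commit()  # committing releases the table lock
    except Exception:
        conn.rollback()  # rolling back releases it as well
        raise
    finally:
        cur.close()
```

A stored-procedure step would then be issued as, say, run_exclusively(conn, ["TRUNCATE c_picrecord_1_prt_extra"]). For this to help, the backup side would also have to take the same lock for the duration of its run, for instance from a wrapper script that keeps a transaction open around gpcrondump. Whether that is practical depends on the deployment.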

References:

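greenplum在执行vacuum和insert产生死锁问题定位及解决方案 (Locating and resolving a deadlock between vacuum and insert in Greenplum)
Greenplum 如何直连segment节点 (How to connect directly to a segment node in Greenplum)
PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁 (PostgreSQL lock-wait monitoring: a keeper SQL for finding who is blocking whom)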
