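Problem description
- On Greenplum 5.0, if a stored procedure of our own is running while gpcrondump is taking a backup, there is a high probability that the database deadlocks.
- The stored procedure creates and drops tables, imports data, and performs similar operations.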
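Analysis
- Since I am not very familiar with either GPDB or PostgreSQL, I first searched online for "GreenPlum deadlock" and did find a write-up of a troubleshooting session. The author is very thorough and walks step by step, with screenshots, through how he tracked down a GPDB deadlock. Following that article, the steps were as follows: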
- Once the problem reproduces, look in pg_stat_activity for tasks that are waiting on a lock:
template1=# select * from pg_stat_activity where waiting_reason='lock';
datid | datname | procpid | sess_id | usesysid | usename | current_query | waiting | query_start | backend_start | client_addr | client_port | application_name | xact_start | waiting_reason | rsgid | rsgname | rsgqueueduration
-------+---------+---------+---------+----------+---------+-------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------+------------------+-------------------------------+----------------+-------+---------+------------------
97128 | dsst | 923 | 244 | 10 | gpadmin | truncate c_picrecord_1_prt_extra ; | t | 2017-12-11 15:10:18.840394+08 | 2017-12-08 16:57:48.826455+08 | | -1 | psql | 2017-12-11 15:10:18.840394+08 | lock | 0 | |
97128 | dsst | 20528 | 393 | 10 | gpadmin | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | t | 2017-12-11 15:10:25.10717+08 | 2017-12-11 15:10:24.877717+08 | | -1 | | 2017-12-11 15:10:24.890674+08 | lock | 0 | |
(2 rows)
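- The two tasks truncate c_picrecord_1_prt_extra and SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) are both stuck waiting on locks. The former is one step of our own stored procedure; the latter is a task generated automatically by the backup.
- Join pg_locks, pg_class and pg_stat_activity to find the locks related to these tasks. Here we filter by the table name used in the stored procedure, matching rows whose relname is like %picrecord%: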
select a.locktype,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and b.relname like '%picrecord%';
locktype | relname | substring | xact_start | pid | mode | granted
----------+-------------------------+-------------------------------------------------------------------+-------------------------------+-------+---------------------+---------
relation | c_picrecord_id_seq | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock | t
relation | c_picrecord | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock | t
relation | c_picrecord_1_prt_extra | truncate c_picrecord_1_prt_extra ; | 2017-12-11 15:10:18.840394+08 | 923 | AccessExclusiveLock | t
relation | c_picrecord_1_prt_extra | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock | f
relation | c_picrecord_bak | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock | t
relation | c_picrecord_bak1 | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 15:10:24.890674+08 | 20528 | AccessShareLock | t
(6 rows)
- PS: if no related rows turn up, try matching on other keywords.
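The lock rows above can be read mechanically: every row with granted = f is a waiter, and any other pid holding a granted lock on the same relation is a candidate blocker. The following is a minimal Python sketch of that reading. The rows are copied by hand from the output above, and the names LOCK_ROWS and find_waiters are made up for illustration. It matches only on relation name and ignores lock modes.

```python
from collections import defaultdict

# (relname, pid, mode, granted), copied from the pg_locks query output above.
LOCK_ROWS = [
    ("c_picrecord_id_seq", 20528, "AccessShareLock", True),
    ("c_picrecord", 20528, "AccessShareLock", True),
    ("c_picrecord_1_prt_extra", 923, "AccessExclusiveLock", True),
    ("c_picrecord_1_prt_extra", 20528, "AccessShareLock", False),
    ("c_picrecord_bak", 20528, "AccessShareLock", True),
    ("c_picrecord_bak1", 20528, "AccessShareLock", True),
]


def find_waiters(rows):
    """Return (waiting_pid, relname, holder_pids) for each ungranted lock."""
    # First pass: collect the pids that hold a granted lock on each relation.
    holders = defaultdict(set)
    for relname, pid, _mode, granted in rows:
        if granted:
            holders[relname].add(pid)

    # Second pass: pair each ungranted request with the other holders.
    waiters = []
    for relname, pid, _mode, granted in rows:
        if not granted:
            waiters.append((pid, relname, sorted(holders[relname] - {pid})))
    return waiters


for pid, relname, blockers in find_waiters(LOCK_ROWS):
    print(f"pid {pid} waits on {relname}, held by {blockers}")
```

With these rows the only waiter is pid 20528 (the backup query), which waits on c_picrecord_1_prt_extra while pid 923 (the TRUNCATE) holds it. That explains one direction of the wait, but not why the TRUNCATE itself is stuck.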
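4. In the article I found, the author went on to run the same query on each segment and then worked out which locks each connection held and which it was waiting for. His conclusion was that two connections were waiting for each other's locks across the master and a segment, which caused the deadlock. Following the same approach, I looked for the segments involved in the tasks above.
5. pg_locks already contains a segment column (gp_segment_id) and a pid column, so I changed the SQL to select them directly: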
# select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and b.relname like '%picrecord%';
locktype | pid | gp_segment_id | relname | substring | xact_start | pid | mode | granted
----------+-------+---------------+-------------------------+-------------------------------------------------------------------+-------------------------------+-------+---------------------+---------
relation | 27227 | -1 | c_picrecord_id_seq | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
relation | 27227 | -1 | c_picrecord | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
relation | 15628 | -1 | c_picrecord_1_prt_extra | TRUNCATE c_picrecord_1_prt_extra ; | 2017-12-11 17:09:27.829242+08 | 15628 | AccessExclusiveLock | t
relation | 27227 | -1 | c_picrecord_1_prt_extra | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | f
relation | 27227 | -1 | c_picrecord_bak | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
relation | 27227 | -1 | c_picrecord_bak1 | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
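Every gp_segment_id is -1, which in Greenplum denotes the master, so no segment appears to be involved. To confirm, I ran the query by hand on every segment and indeed found no related rows there. (PS: for how to connect psql directly to a segment, see the reference "Greenplum 如何直连segment节点".)
Question: the result of step 5 alone does not add up to a deadlock, and the segments show no rows for held or awaited locks. Why?
Perhaps the two tasks are already deadlocked on the master, and because we filtered by our own table name, there may be other locks that the query did not return.
- Query the locks of the TRUNCATE c_picrecord_1_prt_extra task directly, this time filtering by its pid (15628). Sure enough, it is also waiting for a RowExclusiveLock on pg_class: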
select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and a.pid='15628';
locktype | pid | gp_segment_id | relname | substring | xact_start | pid | mode | granted
----------+-------+---------------+-------------------------+------------------------------------+-------------------------------+-------+---------------------+---------
relation | 15628 | -1 | pg_class | TRUNCATE c_picrecord_1_prt_extra ; | 2017-12-11 17:09:27.829242+08 | 15628 | RowExclusiveLock | f
relation | 15628 | -1 | c_picrecord_1_prt_extra | TRUNCATE c_picrecord_1_prt_extra ; | 2017-12-11 17:09:27.829242+08 | 15628 | AccessExclusiveLock | t
- Next, query the locks related to SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true), whose pid is 27227. It already holds an AccessShareLock on pg_class:
select a.locktype,a.pid,a.gp_segment_id,b.relname,substring(c.current_query,1,100),c.xact_start,a.pid,a.mode,a.granted from pg_locks a,pg_class b,pg_stat_activity c where a.relation = b.oid and a.pid = c.procpid and a.pid='27227';
locktype | pid | gp_segment_id | relname | substring | xact_start | pid | mode | granted
----------+-------+---------------+----------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------+-------+-----------------+---------
relation | 27227 | -1 | pg_authid | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
……
relation | 27227 | -1 | pg_class | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | t
……
relation | 27227 | -1 | c_picrecord_1_prt_extra | SELECT pg_get_partition_def('97147'::pg_catalog.oid, true, true) | 2017-12-11 17:09:31.269146+08 | 27227 | AccessShareLock | f
……
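- At this point the immediate cause of the hang is clear: TRUNCATE c_picrecord_1_prt_extra holds the AccessExclusiveLock on c_picrecord_1_prt_extra and is waiting for a RowExclusiveLock on pg_class, while SELECT pg_get_partition_def holds an AccessShareLock on pg_class and is waiting for an AccessShareLock on c_picrecord_1_prt_extra. Neither can proceed. Note that under the standard PostgreSQL lock rules an AccessShareLock does not by itself block a RowExclusiveLock request, so some session is presumably holding a stronger lock on pg_class that the pid-filtered queries above do not show.
- PS: I have heard that in GPDB 5.0 data export and DDL operations conflict with each other, which is probably a bug. One workaround is to create an extra table and have each operation try to lock that table first, so that the operations become mutually exclusive.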
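To double-check which held locks can actually block which requests, the wait-for edges can be rebuilt with lock modes taken into account. The sketch below hard-codes the table-level lock conflict table from the standard PostgreSQL documentation and assumes Greenplum 5 applies the same rules. The rows are copied from the two pid-filtered outputs above (elided rows are left out), and CONFLICTS, ROWS and wait_edges are illustrative names only.

```python
# Table-level lock conflicts as documented for standard PostgreSQL
# (assumed, not verified here, to be unchanged in Greenplum 5).
_ALL = {
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
}
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowShareLock": {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock": {"ShareLock", "ShareRowExclusiveLock",
                         "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": _ALL - {"AccessShareLock", "RowShareLock",
                                        "RowExclusiveLock"},
    "ShareLock": _ALL - {"AccessShareLock", "RowShareLock", "ShareLock"},
    "ShareRowExclusiveLock": _ALL - {"AccessShareLock", "RowShareLock"},
    "ExclusiveLock": _ALL - {"AccessShareLock"},
    "AccessExclusiveLock": set(_ALL),
}

# (pid, relname, mode, granted), copied from the pid-filtered outputs above.
ROWS = [
    (15628, "pg_class", "RowExclusiveLock", False),
    (15628, "c_picrecord_1_prt_extra", "AccessExclusiveLock", True),
    (27227, "pg_class", "AccessShareLock", True),
    (27227, "c_picrecord_1_prt_extra", "AccessShareLock", False),
]


def wait_edges(rows):
    """Return (waiter_pid, holder_pid, relname) where the held mode
    really conflicts with the requested mode."""
    edges = []
    for w_pid, w_rel, w_mode, w_granted in rows:
        if w_granted:
            continue
        for h_pid, h_rel, h_mode, h_granted in rows:
            if (h_granted and h_pid != w_pid and h_rel == w_rel
                    and h_mode in CONFLICTS[w_mode]):
                edges.append((w_pid, h_pid, w_rel))
    return edges


print(wait_edges(ROWS))
```

With only these rows, the sketch finds the edge from pid 27227 to pid 15628 on c_picrecord_1_prt_extra, but no listed holder that conflicts with the RowExclusiveLock request on pg_class. If the hang reproduces, querying pg_locks for every lock on pg_class without a pid filter would show whether a third session holds a stronger lock there.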
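References:
Locating and resolving a deadlock between vacuum and insert in Greenplum (original title: greenplum在执行vacuum和insert产生死锁问题定位及解决方案)
Greenplum: how to connect directly to a segment node (original title: Greenplum 如何直连segment节点)
PostgreSQL lock wait monitoring, a keeper SQL query: who is blocking whom (original title: PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁)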
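Summary: this post records a deadlock encountered on Greenplum 5.0, triggered by running a custom stored procedure while a gpcrondump backup was in progress. Querying pg_stat_activity, pg_locks and related tables showed two connections on the master waiting for locks held by each other. Possible remedies are to avoid running DDL and DML operations at the same time, or to use an extra table to make the operations mutually exclusive.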




